Kartikeya Sharma

Optimizational Paradigm for Supervised Machine Learning

Let there be Training Data (covariates, labels) \((x_{i}, y_{i}) \text{ for } i \in \{1, 2, \dots, n\}.\) Let \(f_{\theta}(\cdot)\) be our model, where \(\theta\) is the parameter vector, and let \(L(y, \hat{y})\) be the loss function.

We minimize the empirical risk: \(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i}))\)

Predictions are then made as \(\hat{y} = f_{\hat{\theta}}(x)\).
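For concreteness, a minimal NumPy sketch of this setup, assuming a linear model \(f_{\theta}(x) = \theta^\top x\), squared-error loss, synthetic data, and plain gradient descent as the optimizer (all illustrative choices):

```python
import numpy as np

# A minimal sketch of empirical risk minimization, assuming a linear model
# f_theta(x) = theta . x, squared-error loss, and synthetic data.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                      # covariates x_i
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)    # labels y_i

def empirical_risk(theta):
    """(1/n) * sum_i L(y_i, f_theta(x_i)) with squared-error L."""
    return np.mean((y - X @ theta) ** 2)

# Minimize with plain gradient descent (one concrete choice of optimizer).
theta = np.zeros(d)
eta = 0.1                                        # step size
for _ in range(500):
    grad = -2.0 * X.T @ (y - X @ theta) / n      # gradient of the empirical risk
    theta -= eta * grad

theta_hat = theta                                # the minimizer hat(theta)
y_hat = X @ theta_hat                            # predictions hat(y) = f_hat(theta)(x)
print(empirical_risk(theta_hat))
```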

Goal: do well on unseen data drawn from the true distribution, i.e., make the true risk \(\mathbb{E}_{(X, Y) \sim P}\left[L(Y, f_{\hat{\theta}}(X))\right]\) small.


Complications

1. Don’t Have Access to \(P(X, Y)\)

Solution: Collect a test set \((x_{\text{test},i}, y_{\text{test},i}) \text{ for } i \in \{1, \dots, n_{\text{test}}\}\), which we never touch after collection, except to calculate: \(\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} L(y_{\text{test},i}, f_{\hat{\theta}}(x_{\text{test},i}))\)
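A minimal sketch of this protocol, again assuming a linear model and squared-error loss: split the data once, fit only on the training part, and touch the test part exactly once at the end.

```python
import numpy as np

# A sketch of the test-set protocol: the test set is used only to compute
# the final average loss. Model and loss are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=150)

n_test = 50
X_train, X_test = X[:-n_test], X[-n_test:]
y_train, y_test = y[:-n_test], y[-n_test:]

# Fit on training data only (closed-form least squares for brevity).
theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# The one permitted use of the test set: estimate the risk of f_hat(theta).
test_risk = np.mean((y_test - X_test @ theta_hat) ** 2)
print(test_risk)
```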


2. The Loss We Care About Is Not Compatible with the Optimizer

Example: The optimizer requires derivatives, but the loss is not differentiable, or its derivative is zero almost everywhere (as with the 0-1 classification loss).

Solution: Use a surrogate loss that the optimizer can handle, such as the logistic (cross-entropy) loss or squared error in place of the 0-1 loss.

Warning: Only change the training loss function, not the test loss.
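A sketch of this idea for binary classification with labels \(y \in \{-1, +1\}\): train on the differentiable logistic loss, but still report the 0-1 test loss we actually care about (data, model, and step size are illustrative assumptions).

```python
import numpy as np

# Train on the logistic surrogate; evaluate with the 0-1 loss on held-out data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sign(X @ np.array([2.0, -1.0]))
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

def zero_one_loss(theta, X, y):
    """The loss we actually care about; its derivative is useless."""
    return np.mean(np.sign(X @ theta) != y)

def logistic_grad(theta, X, y):
    """Gradient of the surrogate: mean_i log(1 + exp(-y_i * x_i . theta))."""
    margins = y * (X @ theta)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)

theta = np.zeros(2)
for _ in range(1000):
    theta -= 0.5 * logistic_grad(theta, X_train, y_train)  # train on surrogate

print(zero_one_loss(theta, X_test, y_test))  # report the 0-1 test loss
```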


3. Huge Values in \(\hat{\theta}\) (Overfitting)

Solution A: Add a regularizer during training: \(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i})) + \lambda R(\theta)\), where the strength \(\lambda\) is a hyperparameter.
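A sketch of Solution A with the common choice \(R(\theta) = \|\theta\|_2^2\) (ridge regression), an assumed but standard regularizer with a closed-form solution; the penalty visibly shrinks \(\|\hat{\theta}\|\):

```python
import numpy as np

# Ridge regression: penalize ||theta||^2, weighted by a hyperparameter lambda.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                    # few samples, many features
y = X[:, 0] + 0.1 * rng.normal(size=30)
n, d = X.shape
lam = 1.0                                        # regularization strength

# Closed form of argmin (1/n)||y - X theta||^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
theta_unreg, *_ = np.linalg.lstsq(X, y, rcond=None)

# The penalty shrinks the parameter vector.
print(np.linalg.norm(theta_unreg), np.linalg.norm(theta_ridge))
```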

Solution B: Perform a hyperparameter search: hold out a validation set from the training data, train once per candidate value (e.g., of \(\lambda\)), and keep the value with the lowest validation loss, as in the sketch below.
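```python
import numpy as np

# A sketch of Solution B for the ridge lambda above. The model and the
# candidate grid are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + 0.1 * rng.normal(size=60)
X_tr, X_val = X[:40], X[40:]                     # hold out a validation set
y_tr, y_val = y[:40], y[40:]

def fit_ridge(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

best_lam, best_loss = None, np.inf
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:        # candidate hyperparameters
    theta = fit_ridge(X_tr, y_tr, lam)
    val_loss = np.mean((y_val - X_val @ theta) ** 2)
    if val_loss < best_loss:
        best_lam, best_loss = lam, val_loss

print(best_lam, best_loss)                       # keep the best candidate
```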


4. Optimizer Might Have Its Own Hyperparameters

Example: Gradient Descent Learning Rate \(\eta\): \(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L_{\text{train}}(\theta_t)\)
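A tiny sketch on a 1-D quadratic showing why \(\eta\) itself needs tuning: too small crawls, too large diverges (the loss and the candidate values are illustrative assumptions).

```python
# Gradient descent theta_{t+1} = theta_t - eta * grad L_train(theta_t)
# on L_train(theta) = (theta - 3)^2, whose minimizer is theta = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

for eta in [0.01, 0.1, 1.1]:       # candidate learning rates
    theta = 0.0
    for _ in range(50):
        theta = theta - eta * grad(theta)
    # eta = 0.01 crawls toward 3, eta = 0.1 converges, eta = 1.1 diverges.
    print(eta, theta)
```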