Kartikeya Sharma

Optimizational Paradigm for Supervised Machine Learning

Let there be Training Data (covariates, labels) \((x_{i}, y_{i}) \text{ for } i \in \{1, 2, \dots, n\}.\) Let \(f_{\theta}(\cdot)\) be our model, where \(\theta\) is the parameter vector, and let \(L(y, \hat{y})\) be the loss function.

We minimize the empirical risk: \(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i}))\)

Predictions are then made as \(\hat{y} = f_{\hat{\theta}}(x)\).
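For concreteness, a minimal NumPy sketch of this setup, assuming a linear model \(f_{\theta}(x) = \theta^\top x\), squared-error loss, synthetic data, and plain gradient descent as the optimizer (all illustrative choices):

```python
import numpy as np

# A minimal sketch of empirical risk minimization, assuming a linear model
# f_theta(x) = theta . x, squared-error loss, and synthetic data.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                      # covariates x_i
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)    # labels y_i

def empirical_risk(theta):
    """(1/n) * sum_i L(y_i, f_theta(x_i)) with squared-error L."""
    return np.mean((y - X @ theta) ** 2)

# Minimize with plain gradient descent (one concrete choice of optimizer).
theta = np.zeros(d)
eta = 0.1                                        # step size
for _ in range(500):
    grad = -2.0 * X.T @ (y - X @ theta) / n      # gradient of the empirical risk
    theta -= eta * grad

theta_hat = theta                                # the minimizer hat(theta)
y_hat = X @ theta_hat                            # predictions hat(y) = f_hat(theta)(x)
print(empirical_risk(theta_hat))
```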

Goal: do well on unseen data drawn from the true distribution, i.e., make the true risk \(\mathbb{E}_{(X, Y) \sim P}\left[L(Y, f_{\hat{\theta}}(X))\right]\) small.


Complications

1. Don’t Have Access to \(P(X, Y)\)

Solution: Collect a test set \((x_{\text{test},i}, y_{\text{test},i}) \text{ for } i \in \{1, \dots, n_{\text{test}}\}\), which we never touch after collection, except to calculate: \(\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} L(y_{\text{test},i}, f_{\hat{\theta}}(x_{\text{test},i}))\)
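A minimal sketch of this protocol, again assuming a linear model and squared-error loss: split the data once, fit only on the training part, and touch the test part exactly once at the end.

```python
import numpy as np

# A sketch of the test-set protocol: the test set is used only to compute
# the final average loss. Model and loss are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=150)

n_test = 50
X_train, X_test = X[:-n_test], X[-n_test:]
y_train, y_test = y[:-n_test], y[-n_test:]

# Fit on training data only (closed-form least squares for brevity).
theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# The one permitted use of the test set: estimate the risk of f_hat(theta).
test_risk = np.mean((y_test - X_test @ theta_hat) ** 2)
print(test_risk)
```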


2. The Loss We Care About Is Not Compatible with the Optimizer

Example: The optimizer requires derivatives, but the loss is not differentiable, or its derivative is zero almost everywhere (as with the 0-1 classification loss).

Solution: Use a surrogate loss that the optimizer can handle, such as the logistic (cross-entropy) loss or squared error in place of the 0-1 loss.

Warning: Only change the training loss function, not the test loss.
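A sketch of this idea for binary classification with labels \(y \in \{-1, +1\}\): train on the differentiable logistic loss, but still report the 0-1 test loss we actually care about (data, model, and step size are illustrative assumptions).

```python
import numpy as np

# Train on the logistic surrogate; evaluate with the 0-1 loss on held-out data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sign(X @ np.array([2.0, -1.0]))
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

def zero_one_loss(theta, X, y):
    """The loss we actually care about; its derivative is useless."""
    return np.mean(np.sign(X @ theta) != y)

def logistic_grad(theta, X, y):
    """Gradient of the surrogate: mean_i log(1 + exp(-y_i * x_i . theta))."""
    margins = y * (X @ theta)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)

theta = np.zeros(2)
for _ in range(1000):
    theta -= 0.5 * logistic_grad(theta, X_train, y_train)  # train on surrogate

print(zero_one_loss(theta, X_test, y_test))  # report the 0-1 test loss
```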


3. Huge Values in \(\hat{\theta}\) (Overfitting)

Solution A: Add a regularizer during training: \(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i})) + \lambda R(\theta)\), where the strength \(\lambda\) is a hyperparameter.
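A sketch of Solution A with the common choice \(R(\theta) = \|\theta\|_2^2\) (ridge regression), an assumed but standard regularizer with a closed-form solution; the penalty visibly shrinks \(\|\hat{\theta}\|\):

```python
import numpy as np

# Ridge regression: penalize ||theta||^2, weighted by a hyperparameter lambda.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                    # few samples, many features
y = X[:, 0] + 0.1 * rng.normal(size=30)
n, d = X.shape
lam = 1.0                                        # regularization strength

# Closed form of argmin (1/n)||y - X theta||^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
theta_unreg, *_ = np.linalg.lstsq(X, y, rcond=None)

# The penalty shrinks the parameter vector.
print(np.linalg.norm(theta_unreg), np.linalg.norm(theta_ridge))
```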

Solution B: Perform a hyperparameter search: hold out a validation set from the training data, train once per candidate value (e.g., of \(\lambda\)), and keep the value with the lowest validation loss, as in the sketch below.
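```python
import numpy as np

# A sketch of Solution B for the ridge lambda above. The model and the
# candidate grid are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + 0.1 * rng.normal(size=60)
X_tr, X_val = X[:40], X[40:]                     # hold out a validation set
y_tr, y_val = y[:40], y[40:]

def fit_ridge(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

best_lam, best_loss = None, np.inf
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:        # candidate hyperparameters
    theta = fit_ridge(X_tr, y_tr, lam)
    val_loss = np.mean((y_val - X_val @ theta) ** 2)
    if val_loss < best_loss:
        best_lam, best_loss = lam, val_loss

print(best_lam, best_loss)                       # keep the best candidate
```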


4. Optimizer Might Have Its Own Hyperparameters

Example: Gradient Descent Learning Rate \(\eta\): \(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L_{\text{train}}(\theta_t)\)
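A tiny sketch on a 1-D quadratic showing why \(\eta\) itself needs tuning: too small crawls, too large diverges (the loss and the candidate values are illustrative assumptions).

```python
# Gradient descent theta_{t+1} = theta_t - eta * grad L_train(theta_t)
# on L_train(theta) = (theta - 3)^2, whose minimizer is theta = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

for eta in [0.01, 0.1, 1.1]:       # candidate learning rates
    theta = 0.0
    for _ in range(50):
        theta = theta - eta * grad(theta)
    # eta = 0.01 crawls toward 3, eta = 0.1 converges, eta = 1.1 diverges.
    print(eta, theta)
```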