Let there be Training Data (covariates, labels) \((x_{i}, y_{i}) \text{ for } i \in \{1, 2, \dots, n\}\).
Let \(f_{\theta}(\cdot)\) be our model, where \(\vec{\theta}\) is the parameter vector, and let \(L(y, \hat{y})\) be the loss function.
We minimize the empirical risk:
\(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i}))\)
where predictions are then made with \(\hat{y} = f_{\hat{\theta}}(x)\).
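To make the setup concrete, here is a minimal sketch of empirical risk minimization for a linear model with squared loss; the model choice, the toy data, and names like `empirical_risk` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def f(theta, x):
    """An assumed linear model for illustration: f_theta(x) = x . theta."""
    return x @ theta

def squared_loss(y, y_hat):
    """L(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def empirical_risk(theta, X, y):
    """(1/n) * sum_i L(y_i, f_theta(x_i))."""
    return np.mean(squared_loss(y, f(theta, X)))

# Toy training data: n = 100 examples, d = 3 covariates.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# theta_hat = argmin_theta of the empirical risk; for squared loss and a
# linear model this is ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Predictions on a new x use the fitted parameters: y_hat = f_{theta_hat}(x).
print("training risk:", empirical_risk(theta_hat, X_train, y_train))
```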
Goal:
- Good performance in the real world on new \(x\) (i.e., \(x\) we didn't see).
- Low generalization error: we assume the \(x\)'s we didn't see are drawn from some distribution, so the quantity we want to be small is \(E_{X, Y}[L(y, f_{\hat{\theta}}(x))]\).
- We believe the distribution of \(X\) and \(Y\) exists.
Complications
1. Don't Have Access to \(P(X, Y)\)
Solution: Collect a test set \((x_{\text{test},i}, y_{\text{test},i}) \text{ for } i \in \{1, \dots, n_{\text{test}}\}\), which we never touch after collection, except to calculate:
\(\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} L(y_{\text{test},i}, f_{\hat{\theta}}(x_{\text{test},i}))\)
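A minimal sketch of this test-set estimate, continuing the assumed linear-model/squared-loss example; the point is that the test data enters only this one average.

```python
import numpy as np

def model(theta, X):
    # Same assumed linear model as in the training sketch above.
    return X @ theta

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def test_risk(theta_hat, X_test, y_test):
    """(1/n_test) * sum_i L(y_test_i, f_{theta_hat}(x_test_i)):
    an estimate of the generalization error, valid only if the test set
    was never used to choose theta_hat."""
    return np.mean(squared_loss(y_test, model(theta_hat, X_test)))
```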
2. The Loss We Care About Is Not Compatible with the Optimizer
Example: The optimizer requires derivatives, but the loss we care about is not differentiable or has zero derivative almost everywhere (e.g., the 0-1 classification loss).
Solution: Use a surrogate loss that works, such as:
- Logistic Loss or Hinge Loss for binary classification.
- Cross Entropy Loss for multi-class classification.
Warning: Only change the training loss function, not the test loss.
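As an illustration of the surrogate-loss idea for binary classification with labels in \(\{-1, +1\}\) (a sketch under these assumptions, not the notes' own code): train with the differentiable logistic loss, but evaluate on the test set with the 0-1 loss you actually care about.

```python
import numpy as np

def logistic_loss(y, score):
    """Surrogate training loss for y in {-1, +1}: log(1 + exp(-y * score)).
    Smooth and convex, so a gradient-based optimizer can handle it."""
    return np.log1p(np.exp(-y * score))

def zero_one_loss(y, score):
    """The loss we actually care about: 1 if the predicted sign is wrong, else 0.
    Its derivative is zero almost everywhere, so it is useless to a
    gradient-based optimizer -- use it only for test evaluation."""
    return (np.sign(score) != y).astype(float)
```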
3. Huge Values in \(\hat{\theta}\) (Overfitting)
Solution A: Add a regularizer during training (a sketch follows the list below):
\(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i})) + \lambda R(\theta)\)
- Example: Ridge Regularization.
- Transition from Maximum Likelihood Estimation (MLE) to Maximum A Posteriori Estimation (MAP).
- Introduces a hyperparameter.
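A minimal sketch of Solution A with ridge regularization, \(R(\theta) = \|\theta\|^2\) scaled by the hyperparameter \(\lambda\); the closed-form fit below is specific to the assumed linear model with squared loss.

```python
import numpy as np

def ridge_objective(theta, X, y, lam):
    """(1/n) * sum_i (y_i - x_i . theta)^2 + lam * ||theta||^2."""
    residuals = y - X @ theta
    return np.mean(residuals ** 2) + lam * np.sum(theta ** 2)

def ridge_fit(X, y, lam):
    """argmin of the ridge objective (closed form for the linear /
    squared-loss case). Larger lam shrinks theta_hat toward zero,
    discouraging the huge values associated with overfitting."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

The choice of \(\lambda\) is exactly the hyperparameter that Solution B addresses.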
Solution B: Perform hyperparameter search:
- Hold out additional data (validation set) to evaluate how well you’re adjusting the hyperparameter.
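A minimal sketch of Solution B: refit on the training set for each candidate value, score each fit on the held-out validation set, and keep the best. The candidate grid and function names are illustrative.

```python
import numpy as np

def validation_search(fit, risk, X_train, y_train, X_val, y_val, candidates):
    """Pick the hyperparameter whose fit has the lowest validation risk.
    `fit(X, y, lam)` and `risk(theta, X, y)` stand for whatever training and
    evaluation routines are in use (e.g., ridge_fit and a mean squared error)."""
    best_lam, best_theta, best_risk = None, None, np.inf
    for lam in candidates:
        theta = fit(X_train, y_train, lam)
        r = risk(theta, X_val, y_val)
        if r < best_risk:
            best_lam, best_theta, best_risk = lam, theta, r
    return best_lam, best_theta

# e.g. candidates = [0.0, 1e-3, 1e-2, 1e-1, 1.0]
# The test set stays untouched throughout; it is only used once, at the end.
```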
4. Optimizer Might Have Its Own Hyperparameters
Example: Gradient Descent Learning Rate:
\(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L_{\text{train}}(\theta_t)\)
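A minimal sketch of the update above; both the learning rate \(\eta\) and the number of steps are optimizer hyperparameters, and `grad_train_loss` is a hypothetical function returning \(\nabla_{\theta} L_{\text{train}}(\theta)\).

```python
import numpy as np

def gradient_descent(grad_train_loss, theta0, eta=0.1, num_steps=1000):
    """Repeat theta_{t+1} = theta_t - eta * grad L_train(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_train_loss(theta)
    return theta

# Too large an eta can make the iterates diverge; too small makes progress
# painfully slow. Like lambda, eta is typically tuned on the validation set.
```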