Let the training data (covariates, labels) be
\((x_{i}, y_{i}) \text{ for } i \in \{1, 2, \dots, n\}.\)
Let \(f_{\theta}\) be our model, where \(\theta\) is the parameter vector, and let
\(L(y, \hat{y})\)
be the loss function.
We minimize the empirical risk:
\(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i}))\)
and predict with the fitted model:
\(\hat{y} = f_{\hat{\theta}}(x)\)
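A minimal sketch of empirical risk minimization, assuming a one-parameter model \(f_{\theta}(x) = \theta x\) and squared loss (neither is fixed by the notes). The data values and the names `squared_loss`, `empirical_risk`, and `fit` are made up for illustration.

```python
def squared_loss(y, y_hat):
    """Assumed loss L(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def empirical_risk(theta, xs, ys):
    """Average loss of the assumed model f_theta(x) = theta * x."""
    return sum(squared_loss(y, theta * x) for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys):
    """Closed-form minimizer of the empirical risk for this model and loss."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_hat = fit(xs, ys)   # the arg min over theta
y_hat = theta_hat * 5.0   # prediction for a new x = 5
```

For most models there is no closed form, and the arg min is found by an iterative optimizer instead.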
What we want: good performance in the real world on new \(x\) (i.e., \(x\) we didn't see).
Low generalization error: We assume the \((x, y)\) pairs we didn’t see are drawn from some distribution \(P(X, Y)\), and we want the expected loss under it to be low:
\(E_{X, Y}[L(Y, f_{\hat{\theta}}(X))]\)
We believe the distribution of
\(X \text{ and } Y\)
1. Don’t Have Access to \(P(X, Y)\)
Solution: Collect a test set:
\((x_{\text{test } i}, y_{\text{test } i}) \text{ for } i \in \{1, 2, \dots, n_{\text{test}}\}\)
which we never touch after collection, except to calculate:
\(\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} L(y_{\text{test } i}, f_{\hat{\theta}}(x_{\text{test } i}))\)
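A rough sketch of estimating the generalization error with a held-out test set. The synthetic data (\(y = 2x\) plus noise), the 80/20 split, and the one-parameter squared-loss model are all assumptions made for the example.

```python
import random

random.seed(0)

# Hypothetical data standing in for draws from P(X, Y).
inputs = [random.uniform(-1.0, 1.0) for _ in range(100)]
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in inputs]
random.shuffle(data)
train, test = data[:80], data[80:]   # the test set is set aside here

def fit(pairs):
    """Least-squares fit of the assumed model f_theta(x) = theta * x."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

def risk(theta, pairs):
    """Average squared loss of f_theta over the given pairs."""
    return sum((y - theta * x) ** 2 for x, y in pairs) / len(pairs)

theta_hat = fit(train)               # training only sees `train`
test_risk = risk(theta_hat, test)    # the single use of `test`
```

Because `test` plays no part in choosing `theta_hat`, `test_risk` is an unbiased estimate of the expected loss under the assumed distribution.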
2. The Loss We Care About Is Not Compatible with the Optimizer
Example: The optimizer requires derivatives, but the loss is not differentiable or has zero derivatives.
Solution: Use a surrogate loss that works, such as:
- Logistic Loss or Hinge Loss for binary classification.
- Cross Entropy Loss for multi-class classification.
Warning: Only change the training loss function, not the test loss.
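To illustrate, here is a small sketch that trains with the logistic loss but reports the 0-1 loss on test data. The one-parameter linear scorer, the labels in \(\{-1, +1\}\), the data points, and the step settings are assumptions chosen for the example.

```python
import math

def zero_one_loss(y, score):
    """The loss we care about: 1 for a wrong sign, 0 otherwise."""
    prediction = 1 if score > 0 else -1
    return 0.0 if prediction == y else 1.0

def logistic_loss_grad(y, x, theta):
    """Derivative in theta of the surrogate log(1 + exp(-y * theta * x))."""
    return -y * x / (1.0 + math.exp(y * theta * x))

def train(xs, ys, lr=0.5, steps=100):
    """Gradient descent on the average logistic loss (training loss only)."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(logistic_loss_grad(y, x, theta) for x, y in zip(xs, ys)) / len(xs)
        theta -= lr * grad
    return theta

def zero_one_risk(theta, xs, ys):
    """Average 0-1 loss; used for evaluation, never for training."""
    return sum(zero_one_loss(y, theta * x) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the label is the sign of x.
train_xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
train_ys = [-1, -1, -1, -1, 1, 1, 1, 1]
test_xs = [-1.2, -0.3, 0.7, 1.8]
test_ys = [-1, -1, 1, 1]

theta_hat = train(train_xs, train_ys)
test_error = zero_one_risk(theta_hat, test_xs, test_ys)
```

The 0-1 loss is piecewise constant in `theta`, so its derivative is zero wherever it exists and gives gradient descent nothing to follow. The logistic loss supplies a usable gradient.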
3. Huge Values in \(\theta\)
Solution A: Add a regularizer during training:
\(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i})) + R(\theta)\)
- Example: Ridge Regularization.
- Transition from Maximum Likelihood Estimation (MLE) to Maximum A Posteriori Estimation (MAP).
- Introduces a hyperparameter.
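A minimal ridge sketch, assuming the one-parameter model \(f_{\theta}(x) = \theta x\), squared loss, and \(R(\theta) = \lambda \theta^{2}\). The hyperparameter name `lam` and the data are illustrative. Setting the derivative of the regularized objective to zero gives \(\hat{\theta} = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda)\).

```python
def fit_ridge(xs, ys, lam):
    """Minimizer of (1/n) * sum (y - theta*x)^2 + lam * theta^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger lam pulls theta toward zero; lam = 0 recovers the unregularized fit.
thetas = [fit_ridge(xs, ys, lam) for lam in (0.0, 1.0, 10.0)]
```

Under the additional assumptions of Gaussian noise and a zero-mean Gaussian prior on \(\theta\), this estimate matches the MAP estimate, and \(\lambda = 0\) gives the MLE.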
Solution B: Perform hyperparameter search:
- Hold out additional data (validation set) to evaluate how well you’re adjusting the hyperparameter.
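A sketch of a hyperparameter search with a validation set, assuming the same one-parameter ridge model as a stand-in. The synthetic data, the 40/20 split, and the candidate grid are arbitrary choices for illustration.

```python
import random

random.seed(1)

# Hypothetical data: y = 2x plus noise.
inputs = [random.uniform(-1.0, 1.0) for _ in range(60)]
data = [(x, 2.0 * x + random.gauss(0.0, 0.5)) for x in inputs]
train, val = data[:40], data[40:]   # validation data is held out from training

def fit_ridge(pairs, lam):
    """Minimizer of (1/n) * sum (y - theta*x)^2 + lam * theta^2."""
    n = len(pairs)
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / (sxx + n * lam)

def risk(theta, pairs):
    """Average squared loss, without the regularizer."""
    return sum((y - theta * x) ** 2 for x, y in pairs) / len(pairs)

grid = [0.0, 0.01, 0.1, 1.0]
# Fit on the training set for each candidate, score on the validation set.
val_risks = {lam: risk(fit_ridge(train, lam), val) for lam in grid}
best_lam = min(val_risks, key=val_risks.get)
```

The validation risk leaves out \(R(\theta)\) because the regularizer belongs to training only. The test set stays untouched throughout the search.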
4. Optimizer Might Have Its Own Hyperparameters
Example: Gradient Descent Learning Rate:
\(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L_{\text{train}}(\theta_t)\)
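A small sketch of why the learning rate \(\eta\) needs tuning, assuming the one-parameter squared-loss model \(f_{\theta}(x) = \theta x\). The data and the two values of `eta` are picked for illustration. Here the gradient works out to \(15(\theta - 1.99)\), so the updates contract only when \(\eta < 2/15\).

```python
def grad_train_loss(theta, xs, ys):
    """Gradient in theta of (1/n) * sum (y - theta*x)^2."""
    return sum(2.0 * x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, eta, steps=50, theta0=0.0):
    """Repeat theta <- theta - eta * gradient for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_train_loss(theta, xs, ys)
    return theta

# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_small_eta = gradient_descent(xs, ys, eta=0.05)  # converges to the minimizer
theta_large_eta = gradient_descent(xs, ys, eta=0.2)   # overshoots and diverges
```

With `eta=0.05` each step shrinks the distance to the minimizer by a factor of 0.25. With `eta=0.2` each step doubles it and flips its sign.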