ML 07 Ensemble

Quick Review

The Bias/Variance Trade-off

Sam

A model’s generalization error is the sum of 3 different errors:

  1. Bias: Error due to wrong assumptions, e.g. about the functional form. High bias ---> underfit.

  2. Variance: Error due to model’s sensitivity to small variations in the training data. High variance ---> overfit.

  3. Irreducible error: Error due to data noise.

Trade-off: Increasing a model’s complexity will typically reduce bias, but increase variance.
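
A minimal sketch of the trade-off using only NumPy (the synthetic quadratic data, noise level, and polynomial degrees are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy quadratic data: the noise term is the irreducible error.
x = np.linspace(-3, 3, 60)
y = 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# Interleave train/test points so both cover the same range.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def poly_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train_lo, test_lo = poly_mse(1)  # simple model: high bias (underfits)
train_hi, test_hi = poly_mse(9)  # complex model: lower bias, higher variance

# Raising complexity always drives *training* error down...
print(train_lo, train_hi)
# ...but test error reflects the bias/variance trade-off.
print(test_lo, test_hi)
```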

Ensembling methods:

  • Parallel: Train models in parallel on different subsets of the data. Mainly reduces variance.

    • Bagging: with replacement (Bootstrapped aggregating)

    • Pasting: without replacement

    • Random Subspaces/Patches: randomize features and/or instances

    • Random Forests: Bagging + random feature selection at each split

  • Sequential (Boosting): Train models sequentially, each correcting its predecessor’s errors. Mainly reduces bias.

    • AdaBoost: reweights misclassified instances

    • Gradient Boosting: fits to residual errors

    • XGBoost: an optimized gradient-boosting implementation with parallel processing, tree pruning, built-in handling of missing values, and regularization to curb overfitting.

  • Stacking: Train diverse base models in parallel, then combine predictions with a meta-model trained on their outputs.

Bagging and Pasting

  • Use the same algorithm, but train on different subsets of the training data (at the same time).

  • Typically helps reduce variance without adding much bias

  • Works well when each run is making mistakes on different observations

Bagging vs Pasting:

  • Bagging: samples with replacement.

    • Higher bias, lower variance than pasting. Usually performs better, but use cross-validation to check.

  • Pasting: samples without replacement.
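
In scikit-learn, both variants come from the same `BaggingClassifier`; only the `bootstrap` flag differs. A minimal sketch (the moons dataset and hyperparameter values are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        max_samples=100,       # each tree trains on 100 instances
        bootstrap=bootstrap,   # True = with replacement, False = without
        random_state=42,
    )
    # Cross-validation is the notes' suggested way to compare the two.
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```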

Random Patches

  • Samples both rows (instances) and columns (features).

  • Reduces variance because predictions are averaged over many models rather than taken from a single one. Helps keep accuracy high both in-sample and out-of-sample.

Hyperparameters

  • feature: max_features < 1.0 (fraction of features sampled per estimator)

  • feature: bootstrap_features=True (features drawn with replacement)

  • instance: max_samples < 1.0 (defaults to 1.0, i.e. the full training-set size)

  • instance: bootstrap=False (pasting; set True for bagging)
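
These map onto scikit-learn's `BaggingClassifier` arguments. A minimal random-patches sketch (the dataset and the exact sampling fractions are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Random patches: sample both instances (rows) and features (columns).
patches = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    max_samples=0.7,          # instance sampling: 70% of rows per estimator
    bootstrap=True,           # rows drawn with replacement (bagging-style)
    max_features=0.5,         # feature sampling: 50% of columns per estimator
    bootstrap_features=True,  # columns drawn with replacement
    random_state=42,
).fit(X, y)

train_acc = patches.score(X, y)
print(f"training accuracy: {train_acc:.3f}")
```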

Random Forest

Randomness: lets us grow each tree fully (no pruning); averaging over many randomized trees keeps the variance in check.

  • Data: Different random sample

  • Features: For each tree, selects best feature to split on from a random subset of features.

Extremely randomized trees (Extra-Trees) also use random thresholds for each feature when splitting, rather than searching for the best threshold.
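
Both variants are available directly in scikit-learn; a small comparison sketch (the dataset and settings are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Random Forest: bagging + best split over a random feature subset.
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
# Extra-Trees: additionally uses random split thresholds.
et = ExtraTreesClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

print("random forest:", rf.score(X_te, y_te))
print("extra-trees:  ", et.score(X_te, y_te))
```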

Boosting

Pg 205 for hyperparameters

Boosting Process

  1. Train a model (a weak learner)

  2. Measure its errors (e.g. residuals)

  3. Train the next model to correct them* (see next card)

  4. Repeat

  5. Result: the combined models form a strong learner
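
The steps above can be sketched as hand-rolled gradient boosting, where each tree is fit to the current residuals and the final prediction is the sum of the trees (synthetic data; the tree depth and number of learners are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

trees, residual = [], y.copy()
for _ in range(3):                   # 3 weak learners
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)            # train on the current residuals
    residual -= tree.predict(X)      # update residuals for the next round
    trees.append(tree)

# Final prediction: sum of each tree's prediction.
y_pred = sum(tree.predict(X) for tree in trees)
mse = np.mean((y - y_pred) ** 2)
print(f"training MSE: {mse:.4f}")
```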

AdaBoost: Increases weight of misclassified instances.

  • Final prediction: Each model makes a prediction; the ensemble takes a weighted vote, with each model's weight based on its accuracy on the weighted training set.

Gradient Boosting: Train on predecessor's residuals.

  • Final prediction: Sum of each model's prediction.
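
Both flavors have ready-made scikit-learn implementations; a minimal sketch (the dataset and settings are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# AdaBoost: reweights misclassified instances between rounds.
ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
# Gradient boosting: fits each new tree to the residual errors.
gb = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

print("AdaBoost:         ", ada.score(X_te, y_te))
print("Gradient Boosting:", gb.score(X_te, y_te))
```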

Stacking

Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation?

Components

  • Instance: Row of data.

  • Predictors: Each base model.

  • Predictions: A matrix of shape (instances × predictors), holding each predictor's output for each instance.

  • Blender: Takes Predictions as input, outputs final prediction.

Process: (diagram not preserved in these notes)

Common Approach: hold-out set (assume we're using 3 predictors).

Components:

  • Training data ---> 1st subset (for training each base predictor)

  • Training data ---> 2nd subset (hold-out set for training the blender)

Process: (diagram not preserved in these notes)

  1. Split: Split training set into 1st/2nd subsets

  2. Train: Use the 1st subset to train the base predictors.

  3. Predict: Make predictions on the holdout set.

  4. Assemble new training set: Take the predicted values ---> use them as input features in a new training set (3 features, one per predictor).

  5. Train/Blend: Train new model based on only these 3 features. (Called a "meta-model" or blender.)

  6. Predict: Make final predictions
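
scikit-learn's `StackingClassifier` automates this process, though it builds the blender's training set from cross-validated predictions rather than a single hold-out split. A sketch with 3 base predictors (the specific models and dataset are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[  # three diverse base predictors
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
        ("lr", LogisticRegression(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the blender / meta-model
    cv=5,  # out-of-fold predictions serve as the blender's training features
).fit(X_tr, y_tr)

acc = stack.score(X_te, y_te)
print(f"stacking accuracy: {acc:.3f}")
```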