ML 07 Ensemble
Quick Review
The Bias/Variance Trade-off¶
Sam
A model's generalization error is the sum of three different errors:

- Bias: error due to wrong assumptions, e.g. about the functional form. High bias ---> underfit.
- Variance: error due to the model's sensitivity to small variations in the training data. High variance ---> overfit.
- Irreducible error: error due to noise in the data itself.

Trade-off: Increasing a model's complexity will typically reduce bias but increase variance.
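The trade-off shows up when you sweep model complexity. A minimal sketch with scikit-learn (the data, noise level, and polynomial degrees are arbitrary illustrative choices):

```python
# Bias/variance sketch: underfit (degree 1) vs. reasonable (degree 4)
# vs. overfit (degree 15) polynomial regression on noisy sine data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noise = irreducible error

cv_mse = {}
for degree in (1, 4, 15):  # underfit / balanced / overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse[degree] = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"degree={degree:2d}  CV MSE={cv_mse[degree]:.3f}")
```

The degree-1 model has high bias (it cannot represent a sine), while the degree-15 model has high variance (it chases the noise), so cross-validated error is lowest in between.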
Ensembling methods:

- Parallel: train models in parallel on different subsets of the data. Mainly reduces variance.
  - Bagging (bootstrap aggregating): sampling with replacement
  - Pasting: sampling without replacement
  - Random Subspaces/Patches: randomize features and/or instances
  - Random Forests: bagging + random feature selection at each split
- Sequential (Boosting): train models sequentially, each correcting its predecessor's errors. Mainly reduces bias.
  - AdaBoost: reweights misclassified instances
  - Gradient Boosting: fits each new model to the residual errors
  - XGBoost: optimized gradient boosting via parallel processing, tree pruning, missing-value handling, and regularization to avoid overfitting
- Stacking: train diverse base models in parallel, then combine their predictions with a meta-model trained on their outputs.
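Each family above maps onto a scikit-learn estimator. A sketch of the instantiations (hyperparameter values are arbitrary; XGBoost lives in the separate `xgboost` package, so it is omitted here):

```python
# One scikit-learn class per ensemble family from the list above.
from sklearn.ensemble import (
    BaggingClassifier,           # bagging / pasting / random subspaces & patches
    RandomForestClassifier,      # bagging + random feature selection per split
    AdaBoostClassifier,          # sequential: reweights misclassified instances
    GradientBoostingClassifier,  # sequential: fits residual errors
    StackingClassifier,          # meta-model trained on base-model outputs
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)  # with replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=False)                             # without replacement
forest = RandomForestClassifier(n_estimators=100)
ada = AdaBoostClassifier(n_estimators=100)
gbrt = GradientBoostingClassifier(n_estimators=100)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000))
```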
Bagging and Pasting¶
- Use the same algorithm, but train on different random subsets of the training data (at the same time).
- Typically helps reduce variance without adding much bias.
- Works well when each predictor makes mistakes on different observations.
Bagging vs Pasting:

- Bagging: samples with replacement. Slightly higher bias, lower variance; usually performs better, but use cross-validation to check.
- Pasting: samples without replacement.
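The cross-validation check suggested above can be sketched with `BaggingClassifier`, where `bootstrap` toggles bagging vs. pasting (toy data; `n_estimators` and `max_samples` are illustrative):

```python
# Compare bagging (with replacement) and pasting (without replacement)
# using the same base learner, scored by 5-fold cross-validation.
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=42)   # with replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=False, max_samples=0.8,  # without replacement
                            random_state=42)

scores = {}
for name, clf in [("bagging", bagging), ("pasting", pasting)]:
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: CV accuracy = {scores[name]:.3f}")
```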
Random Patches¶
- Samples both rows (instances) and columns (features).
- Reduces variance because the final answer averages many predictions rather than relying on a single one. Helps keep accuracy high both in-sample and out-of-sample.

Hyperparameters (BaggingClassifier):

- Feature sampling: max_features < 1.0, bootstrap_features=True
- Instance sampling: max_samples=1.0 with bootstrap=False keeps all training instances (this is the Random Subspaces method, which samples features only)
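These hyperparameters distinguish the two variants. A sketch (sampling fractions are arbitrary choices):

```python
# Random Subspaces vs. Random Patches via BaggingClassifier hyperparameters.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Subspaces: keep every training instance, sample only features.
subspaces = BaggingClassifier(
    DecisionTreeClassifier(),
    max_samples=1.0, bootstrap=False,            # all instances, no resampling
    max_features=0.5, bootstrap_features=True)   # random feature subsets

# Random Patches: sample both instances and features.
patches = BaggingClassifier(
    DecisionTreeClassifier(),
    max_samples=0.7, bootstrap=True,             # bootstrap the rows too
    max_features=0.5, bootstrap_features=True)
```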
Random Forest¶
Randomness (each tree is kept full-grown, not pruned):

- Data: each tree trains on a different random sample.
- Features: at each split, the best feature is selected from a random subset of features.

Extremely Randomized Trees (Extra-Trees) also use random thresholds for each feature when splitting.
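Both are available directly in scikit-learn; a comparison sketch on toy data (hyperparameters are illustrative):

```python
# Random Forest vs. Extra-Trees: both bag full-grown trees with random
# feature subsets per split; Extra-Trees additionally use random thresholds.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=42)

for name, clf in [("random forest", rf), ("extra-trees", et)]:
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```

Extra-Trees trade a bit more bias for lower variance and train faster, since they skip the search for the best threshold.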
Boosting¶
Pg 205 for hyperparameters
Boosting process:

- Train a model (a weak learner)
- Get the residuals
- Train the next model* (see next card)
- Repeat
- Result: a strong learner is formed
AdaBoost: increases the weight of misclassified instances.

- Final prediction: each model makes a prediction; weighted vote based on each model's accuracy on the weighted training set.

Gradient Boosting: trains each model on its predecessor's residuals.

- Final prediction: sum of each model's prediction.
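The gradient-boosting loop can be sketched "by hand" with three regression trees, each fit to the previous residuals (1-D toy data; depths and sizes are arbitrary):

```python
# Manual gradient boosting: each tree fits the residuals of the ensemble
# so far, and the final prediction is the sum of all trees' predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
res1 = y - tree1.predict(X)                      # residuals of model 1
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, res1)
res2 = res1 - tree2.predict(X)                   # residuals of models 1+2
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, res2)

# Final prediction: sum of each model's prediction.
X_new = np.array([[0.4]])
y_pred = sum(t.predict(X_new) for t in (tree1, tree2, tree3))
print(y_pred)
```

Each added tree should lower the training error, since it corrects what the ensemble so far gets wrong; `GradientBoostingRegressor` wraps this loop (plus a learning rate).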
Stacking¶
Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation?
Components

- Instance: a row of data.
- Predictors: the base models in the ensemble.
- Predictions: a matrix of shape (instances × predictors).
- Blender: takes the Predictions as input and outputs the final prediction.
Common approach: hold-out set (assume we're using 3 predictors).

Components:

- Training data ---> 1st subset (for training each base predictor)
- Training data ---> 2nd subset (hold-out set for training the blender)

Process:
- Split: split the training set into the 1st and 2nd subsets.
- Train: use the 1st subset to train the weak learners.
- Predict: make predictions on the hold-out set.
- Assemble a new training set: use the predicted values as input features (3 features, one per predictor).
- Train/Blend: train a new model on only these 3 features (called the "meta-model" or blender).
- Predict: make final predictions.
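The hold-out recipe above can be sketched directly with three base predictors (model choices and the 50/50 split are arbitrary):

```python
# Manual stacking with a hold-out set: base models train on subset 1,
# the blender trains on their predictions for the hold-out subset.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=1000, noise=0.30, random_state=42)

# Split: 1st subset for the base predictors, 2nd (hold-out) for the blender.
X1, X_hold, y1, y_hold = train_test_split(X, y, test_size=0.5, random_state=42)

predictors = [RandomForestClassifier(random_state=42),
              SVC(random_state=42),
              DecisionTreeClassifier(random_state=42)]
for p in predictors:
    p.fit(X1, y1)                                   # Train on the 1st subset

# Predict on the hold-out set -> new training set with 3 features.
blend_X = np.column_stack([p.predict(X_hold) for p in predictors])

blender = LogisticRegression().fit(blend_X, y_hold)  # Train the meta-model

# Final predictions for new instances go through base models, then blender.
X_new = X[:5]
final = blender.predict(np.column_stack([p.predict(X_new) for p in predictors]))
print(final)
```

scikit-learn's `StackingClassifier` automates this (using cross-validated rather than single hold-out predictions to train the blender).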