ML 07 Ensemble
Quick Review
The Bias/Variance Trade-off
Sam
A model’s generalization error is the sum of three different errors:
- Bias: Error due to wrong assumptions, e.g. the wrong functional form. High bias ⟶ underfit.
- Variance: Error due to the model’s sensitivity to small variations in the training data. High variance ⟶ overfit.
- Irreducible error (data noise)
Increasing model complexity typically reduces bias but increases variance.
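For squared-error loss, the three-term sum above can be written out explicitly. The notation is an assumption introduced here for illustration: $f$ is the true function, $\hat f$ is the fitted model (random because the training set is random), and the data follow $y = f(x) + \varepsilon$ with noise variance $\sigma^2$.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Note that under this loss it is the squared bias, not the bias itself, that enters the sum.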
Ensembling methods:
- Parallel: Train models in parallel on different subsets of the data. Mainly reduces variance.
  - Bagging: sample rows with replacement (bootstrap aggregating)
  - Pasting: sample rows without replacement
  - Random Subspaces/Patches: randomize features and/or instances
  - Random Forests: bagging + random feature selection at each split
- Sequential (Boosting): Train models sequentially, each correcting its predecessor’s errors. Mainly reduces bias.
  - AdaBoost: reweights misclassified instances
  - Gradient Boosting: fits each new model to the residual errors
  - XGBoost: optimized gradient boosting with parallel processing, tree pruning, missing-value handling, and regularization to reduce overfitting
- Stacking: Train diverse base models in parallel, then combine their predictions with a meta-model trained on their outputs.
Parallel Methods
Bagging and Pasting
Bagging = row sampling with replacement; Pasting = without replacement.
- These are sampling strategies, not actual algorithms.
- Train multiple copies of the same model on different random subsets of the training data, then aggregate predictions.
- Goal: reduce variance
- Sample the training data multiple times ⟶ create different subsets
- Train 1 predictor per subset (same algorithm each time)
- Repeat to build many predictors
- Aggregate predictions:
  - Classification ⟶ majority vote
  - Regression ⟶ average
- Feature sampling extensions:
  - Random Subspaces: sample features only
  - Random Patches: sample features AND rows (useful for high-dimensional data)
- Bagging vs Pasting:
  - Bagging ⟶ more diversity (bootstrap), slightly higher bias, lower variance ⟶ usually better
  - Pasting ⟶ less diversity
  - Feature sampling (subspaces/patches) ⟶ even more diversity ⟶ further ↓ variance, slight ↑ bias
Extra:
- When sampling rows with replacement, each predictor never sees about 37% of the training instances. These out-of-bag (OOB) samples can be used for validation without a separate dataset.
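The ~37% figure can be checked with a short calculation. As an assumption for this sketch, let $m$ be the training set size and suppose each bootstrap sample draws $m$ rows with replacement; the symbols are introduced here, not taken from the notes.

```latex
P(\text{row } i \text{ is never drawn})
  = \left(1 - \frac{1}{m}\right)^{m}
  \;\xrightarrow{\; m \to \infty \;}\; e^{-1} \approx 0.368
```

If max_samples is set below the training set size, fewer rows are drawn and the out-of-bag fraction is larger than 37%.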
- n_estimators: more models ⟶ lower variance
- Features:
  - max_features: controls feature sampling
  - bootstrap_features: whether to sample features with replacement
- Instances:
  - max_samples: controls row sampling (normally set to the size of the training set)
  - bootstrap: True (bagging) vs False (pasting)
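The hyperparameters above map onto scikit-learn's BaggingClassifier roughly as follows. This is a minimal sketch, assuming scikit-learn is the library in use; the dataset (make_moons), the sizes, and the variable names are illustrative choices, not part of the notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative assumption).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: rows drawn WITH replacement, sample size = training set size.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=1.0,   # fraction of the training set drawn per predictor
    bootstrap=True,    # True = bagging
    oob_score=True,    # evaluate each predictor on the rows it never saw
    random_state=42,
)
bag_clf.fit(X_train, y_train)

# Pasting: rows drawn WITHOUT replacement, so each predictor gets a
# strict subset (here half the rows) to keep the predictors diverse.
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=0.5,
    bootstrap=False,   # False = pasting (no OOB estimate available)
    random_state=42,
)
paste_clf.fit(X_train, y_train)

print("bagging OOB accuracy :", bag_clf.oob_score_)
print("bagging test accuracy:", bag_clf.score(X_test, y_test))
print("pasting test accuracy:", paste_clf.score(X_test, y_test))
```

The OOB accuracy is usually close to the test accuracy, which is why it can stand in for a separate validation set.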
Random Forest
- Ensemble of decision trees (DTs) trained with bagging (sometimes pasting).
- Goal: reduce variance vs a single tree while maintaining similar bias.
- Adds feature randomness at each split to increase diversity.
- Sample the training data (typically with replacement)
- Train many DTs in parallel
- At each split, a tree only considers a random subset of features
- Aggregate predictions:
  - Classification ⟶ majority vote
  - Regression ⟶ average
- Diversity:
  - Comes from row sampling (bagging) & feature sampling
- Extra Trees variant:
  - Uses random thresholds instead of searching for the best split threshold
- Pros:
  - Handles nonlinear patterns well
  - Robust to overfitting vs single trees
- Cons:
  - Less interpretable than a single tree
  - Can still overfit if the individual trees are too deep (adding more trees mainly costs compute rather than causing overfitting)
- n_estimators: number of trees
- max_features: number of features considered at each split (controls randomness)
- max_leaf_nodes/max_depth: tree size (controls overfitting)
- bootstrap: True (bagging) vs False (pasting)
- n_jobs: parallelization
- Extra Trees: in scikit-learn, use ExtraTreesClassifier/ExtraTreesRegressor, whose trees are built with splitter="random"
Randomness: lets us keep the full, unpruned trees
- Data: Each tree is trained on a different random sample
- Features: At each split, the tree selects the best feature from a random subset of features.
Extremely randomized trees also use random thresholds for each feature when splitting
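The two sources of randomness can be compared side by side. The sketch below assumes scikit-learn; the synthetic dataset, the hyperparameter values, and the names models and cv_scores are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with 20 features so that feature subsampling matters.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=42
)

models = {
    # Bootstrap rows + best split among a random subset of features.
    "random forest": RandomForestClassifier(
        n_estimators=100,
        max_features="sqrt",  # features considered at each split
        max_leaf_nodes=16,    # limits tree size
        bootstrap=True,
        random_state=42,
    ),
    # Same idea, but split thresholds are drawn at random as well.
    "extra trees": ExtraTreesClassifier(
        n_estimators=100,
        max_features="sqrt",
        max_leaf_nodes=16,
        random_state=42,
    ),
}

# Mean 5-fold cross-validated accuracy per model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Which variant scores higher depends on the data, so comparing both with cross-validation is a reasonable default.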
Sequential Methods (Boosting)
Boosting Process
- Train model (weak learner)
- Get its errors (residuals)
- Train next model to correct them (AdaBoost reweights the misclassified instances; Gradient Boosting fits the residuals)
- Repeat
- Result: a strong learner is formed
AdaBoost
- Sequential ensemble; focuses on hard (misclassified) instances
- Goal: reduce bias
- Start with equal weights for all training instances
- Train model (weak learner)
- Compute its weighted error and find the misclassified instances
- Train next model with increased weights on the misclassified instances
- Repeat
- Final prediction = weighted vote of the models (each model's weight is based on its accuracy)
- Pros:
  - Strong performance with weak learners
  - Focuses on difficult observations
- Cons:
  - Sensitive to noise/outliers (they get high weight)
  - Cannot parallelize (sequential dependency)
- Behavior:
  - Similar to gradient descent, but adds models instead of updating parameters
- n_estimators: number of learners
- learning_rate: controls influence of each model
- base_estimator (renamed estimator in scikit-learn 1.2+): typically shallow trees (stumps)
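The reweighting loop can be sketched by hand for binary classification. This is a simplified illustration of the classic discrete AdaBoost update with labels in {-1, +1}, not the exact routine of any library; the dataset and the names w, alphas, stumps and adaboost_predict are assumptions made for the example, and there is no learning_rate here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1  # relabel {0, 1} as {-1, +1}

n_rounds = 25
w = np.full(len(y), 1.0 / len(y))  # start with equal instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree (stump) trained on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error rate (weights sum to 1); clipped to avoid log(0).
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)

    # Model weight: more accurate stumps get a larger say in the vote.
    alpha = 0.5 * np.log((1 - err) / err)

    # Increase weights of misclassified instances, decrease the rest.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)


def adaboost_predict(X_new):
    """Weighted vote of all stumps; returns labels in {-1, +1}."""
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)


train_acc = np.mean(adaboost_predict(X) == y)
print("training accuracy:", train_acc)
```

In practice sklearn.ensemble.AdaBoostClassifier wraps this kind of loop; the manual version is only meant to make the weight update visible.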
Gradient Boosting
- Sequential ensemble that fits models to residual errors
- Goal: reduce bias via additive error correction
- Train model (weak learner)
- Get residuals
- Train next model on its predecessor's residuals
- Repeat
- Final prediction = sum of all model outputs (each scaled by the learning rate)
- Pros:
  - Very flexible (can optimize different loss functions)
  - Strong predictive performance
- Cons:
  - Prone to overfitting if too many trees
  - Slower (sequential)
- Key ideas:
  - Shrinkage: small learning rate ⟶ better generalization
  - Early stopping prevents overfitting
  - Subsampling ⟶ stochastic gradient boosting (↓ variance, ↑ bias)
- n_estimators: number of trees
- learning_rate: shrinkage factor
- max_depth: tree complexity
- subsample: fraction of data per tree
- loss: objective function
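For squared-error loss, the residual-fitting loop can be written out in a few lines. This is a hand-rolled sketch for illustration, assuming scikit-learn regression trees as weak learners; the synthetic sine data and the names f0, trees, mse_history and gb_predict are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve (illustrative data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1  # shrinkage factor
n_trees = 100

f0 = y.mean()                # initial prediction: a constant
pred = np.full_like(y, f0)   # running ensemble prediction on the training set
trees, mse_history = [], []

for _ in range(n_trees):
    residuals = y - pred                      # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                    # next model fits the residuals
    pred += learning_rate * tree.predict(X)   # add its shrunken correction
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))


def gb_predict(X_new):
    """Constant start plus the sum of all shrunken tree outputs."""
    return f0 + learning_rate * sum(t.predict(X_new) for t in trees)


print("training MSE after first tree:", mse_history[0])
print("training MSE after last tree :", mse_history[-1])
```

Training error keeps falling as trees are added, which is exactly why a validation set and early stopping are needed to decide when to stop.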
XGBoost
- Optimized implementation of Gradient Boosting
- Goal: faster, scalable, regularized boosting
Same core idea as Gradient Boosting.
Differences:
- Uses optimized tree-building + system-level improvements
- Supports early stopping using a validation set
- Regularization applied to control complexity
- Pros:
  - Very fast and scalable
  - Built-in regularization ⟶ reduces overfitting
  - Strong performance in practice (common in competitions)
- Cons:
  - More complex tuning
  - Less interpretable
- n_estimators
- learning_rate
- max_depth
- subsample
- colsample_bytree: feature sampling
- reg_lambda/reg_alpha: regularization
- early_stopping_rounds
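The hyperparameters listed above could be collected into a configuration like the one below. The values are illustrative starting points, not recommendations from the notes, and the block only builds the model if the optional xgboost package is installed. Passing early_stopping_rounds to the constructor is an assumption that holds for recent xgboost releases.

```python
# Illustrative values only (assumptions, not tuned settings).
xgb_params = {
    "n_estimators": 500,          # upper bound; early stopping may use fewer
    "learning_rate": 0.05,        # shrinkage factor
    "max_depth": 4,               # tree complexity
    "subsample": 0.8,             # fraction of rows per tree
    "colsample_bytree": 0.8,      # fraction of features per tree
    "reg_lambda": 1.0,            # L2 regularization on leaf weights
    "reg_alpha": 0.0,             # L1 regularization on leaf weights
    "early_stopping_rounds": 20,  # stop after 20 rounds without improvement
}

try:
    from xgboost import XGBRegressor

    model = XGBRegressor(**xgb_params)
except ImportError:
    model = None
    print("xgboost is not installed; parameters shown for reference only")
```

Early stopping only takes effect when a validation set is supplied at fit time, for example through the eval_set argument of fit.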
Stacking
Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation?
Components
- Instance: Row of data.
- Predictors: Each base model.
- Predictions: Matrix of instances × predictors (one column per base model).
- Blender: Takes Predictions as input, outputs final prediction.
Common Approach: hold-out set (assume we're using 3 predictors)
Components:
- Training data ⟶ 1st subset (for training each base predictor)
- Training data ⟶ 2nd subset (hold-out set for training the blender)
Process:
- Split: Split the training set into the 1st and 2nd subsets.
- Train: Use the 1st subset to train the base predictors.
- Predict: Use the base predictors to make predictions on the hold-out set.
- Assemble new training set: Use the predicted values as input features (3 features per instance, one per predictor), keeping the original targets.
- Train/Blend: Train a new model (the "meta-model" or blender) on only these 3 features.
- Predict: Make final predictions by passing new instances through the base predictors, then through the blender.
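The hold-out procedure above can be sketched with three base predictors and a logistic-regression blender. The choice of models, the synthetic dataset, the 50/50 split, and the helper name to_meta_features are assumptions for illustration. Using each model's class-1 probability as its "prediction" is also a choice made here; hard class labels would work too.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split: 1st subset trains the base predictors, 2nd (hold-out) trains the blender.
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42
)

# Train: fit 3 diverse base predictors on the 1st subset only.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
]
for model in base_models:
    model.fit(X_sub1, y_sub1)


def to_meta_features(X_new):
    """One column per base predictor: its predicted class-1 probability."""
    return np.column_stack([m.predict_proba(X_new)[:, 1] for m in base_models])


# Predict + assemble: hold-out predictions become a 3-feature training set.
meta_train = to_meta_features(X_sub2)

# Train/Blend: the blender sees only those 3 features.
blender = LogisticRegression()
blender.fit(meta_train, y_sub2)

# Predict: new instances go through the base predictors, then the blender.
final_pred = blender.predict(to_meta_features(X_test))
stack_acc = np.mean(final_pred == y_test)
print("stacked test accuracy:", stack_acc)
```

scikit-learn also provides StackingClassifier, which builds the blender's training set from cross-validated predictions instead of a single hold-out split.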