ML 07 Ensemble

Quick Review

The Bias/Variance Trade-off

Sam

A model’s generalization error can be expressed as the sum of three different errors:

Bias: Error due to wrong assumptions, e.g., an incorrect functional form. High bias ⟶ underfit.
Variance: Error due to the model’s sensitivity to small variations in the training data. High variance ⟶ overfit.
Irreducible error: Noise in the data itself; no model can remove it.

Increasing model complexity typically reduces bias but increases variance.
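
For squared-error loss the decomposition can be written out explicitly. The symbols are assumptions of this sketch, since the notes do not define any: the data follow $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and $\hat f$ is the model fitted on a randomly drawn training set. Note that the bias term enters squared.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```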

Ensembling methods:

Parallel: Train models in parallel on different subsets of the data. Mainly reduces variance.
- Bagging: with replacement (bootstrap aggregating)
- Pasting: w/o replacement
- Random Subspaces/Patches: randomize features and/or instances
- Random Forests: Bagging + random feature selection at each split
Sequential (Boosting): Train models sequentially, each correcting its predecessor’s errors. Mainly reduces bias.
- AdaBoost: reweights misclassified instances
- Gradient Boosting: fits to residual errors
- XGBoost: optimized gradient boosting implementation with parallel processing, tree pruning, missing-value handling, and regularization to reduce overfitting.
Stacking: Train diverse base models in parallel, then combine predictions with a meta-model trained on their outputs.

Parallel Methods

Bagging and Pasting

Bagging = row sampling with replacement; Pasting = without replacement.

These are sampling strategies, not actual algorithms.

What it does | How it works | Tradeoffs | Hyperparameters

Train multiple copies of the same model on different random subsets of the training data, then aggregate predictions.
Goal: reduce variance

Sample training data multiple times ⟶ create different subsets
Train 1 predictor per subset (same algorithm each time)
Repeat to build many predictors
Aggregate predictions:
- Classification ⟶ majority vote
- Regression ⟶ average
Feature sampling extensions:
- Random Subspaces: sample features only
- Random Patches: sample features AND rows
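
The steps above can be sketched in plain Python. This is a toy illustration, not a library implementation: the base model is a hypothetical 1-D decision stump (`train_stump`), and the names `bagging_fit` and `bagging_predict` are made up for this example.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a 1-D decision stump: pick the threshold/label pair with the fewest training errors."""
    best = None
    for threshold, _ in rows:
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label

def bagging_fit(rows, n_estimators=25, bootstrap=True, max_samples=None, seed=0):
    """Train one stump per random subset of the rows (same algorithm each time)."""
    rng = random.Random(seed)
    n = max_samples or len(rows)
    models = []
    for _ in range(n_estimators):
        if bootstrap:
            sample = [rng.choice(rows) for _ in range(n)]  # bagging: with replacement
        else:
            sample = rng.sample(rows, n)                   # pasting: needs max_samples < len(rows) to add diversity
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Classification: aggregate by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): label is 1 when x >= 10.
rows = [(x, int(x >= 10)) for x in range(20)]
models = bagging_fit(rows, n_estimators=25, seed=0)
accuracy = sum(bagging_predict(models, x) == y for x, y in rows) / len(rows)
print(f"training accuracy: {accuracy:.2f}")
```

For regression the only change would be the aggregation step: average the predictions instead of voting.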

Bagging vs Pasting:
- Bagging ⟶ more diversity (bootstrap), slightly higher bias, lower variance ⟶ usually better
- Pasting ⟶ less diversity
- Feature sampling (subspaces/patches) ⟶ even more diversity ⟶ further ↓ variance, slight ↑ bias
Feature sampling variants:
- Random Subspaces ⟶ sample features only
- Random Patches ⟶ sample rows + features (useful for high-dimensional data)
Extra:
- Row sampling with replacement leaves about 37% of the training instances out of each predictor's sample; these out-of-bag (OOB) instances can be used for validation without a separate dataset
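
The ~37% figure comes from the chance that a given row is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e as n grows. A quick check, using a hypothetical helper `oob_fraction` and an arbitrary n:

```python
import math
import random

def oob_fraction(n, seed=0):
    """Fraction of the n training rows that one bootstrap sample of size n never draws."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

n = 10_000
theoretical = (1 - 1 / n) ** n   # probability that a given row is never drawn
simulated = oob_fraction(n)

print(f"limit 1/e     = {1 / math.e:.4f}")
print(f"(1 - 1/n)^n   = {theoretical:.4f}")
print(f"simulated OOB = {simulated:.4f}")
```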

n_estimators: number of predictors; more models ⟶ lower variance (with diminishing returns)
Features:
- max_features: controls feature sampling
- bootstrap_features: whether to sample features with replacement
Instances:
- max_samples: controls row sampling (normally set to size of training set)
- bootstrap: True (bagging) vs False (pasting)
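
These names match the parameters of scikit-learn's `BaggingClassifier`, so a minimal sketch could look like the following. The dataset (`make_moons`) and the specific values are assumptions chosen for illustration, not recommendations. The base model is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset for illustration.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),   # base model, trained once per subset
    n_estimators=100,           # number of predictors
    max_samples=1.0,            # float = fraction of rows drawn per predictor
    bootstrap=True,             # True = bagging, False = pasting
    max_features=1.0,           # float = fraction of features per predictor
    bootstrap_features=False,   # sample features without replacement
    oob_score=True,             # evaluate on out-of-bag rows (requires bootstrap=True)
    random_state=42,
)
bag_clf.fit(X, y)
print(f"OOB accuracy estimate: {bag_clf.oob_score_:.3f}")
```

Setting `max_features` below 1.0 while keeping all rows would give Random Subspaces; sampling both rows and features gives Random Patches.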

Random Forest

What it does | How it works | Tradeoffs | Hyperparameters

Ensemble of DTs trained with bagging (sometimes pasting).
Goal: reduce variance vs a single tree while maintaining similar bias.
Adds feature randomness at each split to increase diversity.

Sample training data (typically with replacement)
Train many DTs in parallel
At each split, consider only a random subset of features
Aggregate predictions:
- Classification ⟶ majority vote
- Regression ⟶ average

Diversity:
- Comes from row sampling (bagging) & feature sampling
Extra Trees variant:
- Uses random thresholds instead of best split
Pros:
- Handles nonlinear patterns well
- Robust to overfitting vs single trees
Cons:
- Less interpretable than a single tree
- Can still overfit if trees are too deep (adding more trees mainly costs compute; it does not by itself cause overfitting)

n_estimators: number of trees
max_features: number of features considered at each split (controls randomness)
max_leaf_nodes / max_depth: tree size (controls overfitting)
bootstrap: True (bagging) vs False (pasting)
n_jobs: parallelization
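(Extra Trees): in scikit-learn, a separate estimator (ExtraTreesClassifier / ExtraTreesRegressor); splitter="random" is the matching option on a single DecisionTreeClassifier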

Randomness: Lets us keep full, unpruned trees, because averaging many decorrelated trees offsets each tree's overfitting.

Data: Each tree is trained on a different random sample.
Features: At each split, the best feature is selected from a random subset of features.

Extremely randomized trees (Extra Trees) also use random thresholds for each feature when splitting.
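
The two sources of split randomness can be sketched as a single split-selection function. This is a simplified illustration with hypothetical names (`choose_split`, `split_score`), using Gini impurity as the split criterion; real implementations are considerably more optimized.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_score(X, y, feature, threshold):
    """Weighted Gini impurity after splitting on X[feature] <= threshold."""
    left = [label for row, label in zip(X, y) if row[feature] <= threshold]
    right = [label for row, label in zip(X, y) if row[feature] > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def choose_split(X, y, max_features, rng, random_thresholds=False):
    """Pick a split while looking at only a random subset of the features.

    random_thresholds=False: random-forest style, search every candidate threshold.
    random_thresholds=True:  Extra-Trees style, try one random threshold per feature.
    """
    candidates = rng.sample(range(len(X[0])), max_features)
    best = None
    for feature in candidates:
        values = [row[feature] for row in X]
        if random_thresholds:
            thresholds = [rng.uniform(min(values), max(values))]
        else:
            # Drop the largest value: splitting there would leave the right side empty.
            thresholds = sorted(set(values))[:-1]
        for threshold in thresholds:
            score = split_score(X, y, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best  # (impurity, feature index, threshold), or None if no split is possible

# Hypothetical toy data: feature 0 separates the classes perfectly.
X = [[0, 5, 1], [1, 3, 0], [2, 4, 1], [3, 6, 0]]
y = [0, 0, 1, 1]
print(choose_split(X, y, max_features=3, rng=random.Random(0)))
print(choose_split(X, y, max_features=2, rng=random.Random(0), random_thresholds=True))
```

Skipping the exhaustive threshold search is what makes Extra Trees faster to train, at the cost of individually weaker splits.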

Sequential Methods (Boosting)

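Boosting Process

Train model (weak learner)
Measure its errors (residuals or misclassified instances)
Train next model to correct those errors (how this is done depends on the method; see AdaBoost and Gradient Boosting)
Repeat
Result: the combined models form a strong learner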

AdaBoost

What it does | How it works | Tradeoffs | Hyperparameters

Sequential ensemble; focuses on hard (misclassified) instances
Goal: reduce bias

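Start with equal weights for all training instances

Train model (weak learner) on the weighted training set
Compute its weighted error rate and note which instances it misclassified
Increase the weights of the misclassified instances, then train the next model
Repeat
Final prediction = weighted vote of all models (each model's weight is based on its weighted accuracy)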
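
A minimal sketch of this loop, assuming labels in {-1, +1}, 1-D decision stumps as weak learners, and the classic discrete AdaBoost model weight alpha = 0.5 * ln((1 - err) / err). Library implementations (for example scikit-learn's) use a variant of this formula plus a learning rate, so treat the details as illustrative. All function names and the toy data are hypothetical.

```python
import math

def fit_stump(xs, ys, weights):
    """Weighted 1-D decision stump: returns (weighted_error, threshold, left_label)."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for left_label in (+1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (left_label if x < threshold else -left_label) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, left_label)
    return best

def stump_predict(stump, x):
    _, threshold, left_label = stump
    return left_label if x < threshold else -left_label

def adaboost_fit(xs, ys, n_estimators):
    n = len(xs)
    weights = [1 / n] * n                              # equal instance weights to start
    ensemble = []                                      # list of (alpha, stump)
    for _ in range(n_estimators):
        stump = fit_stump(xs, ys, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)   # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)    # more accurate stump -> larger vote
        # y * prediction is -1 for misclassified instances, so their weights grow.
        weights = [
            w * math.exp(-alpha * y * stump_predict(stump, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]         # renormalize to sum to 1
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can fit.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    ensemble = adaboost_fit(xs, ys, rounds)
    correct = sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys))
    print(f"{rounds} stump(s): {correct}/10 correct")
```

On this toy set a single stump gets 7 of 10 points right, while three reweighted stumps classify all 10 correctly.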

Pros:
- Strong performance with weak learners
- Focuses on difficult observations
Cons:
- Sensitive to noise/outliers (they get high weight)
- Cannot parallelize (sequential dependency)
Behavior:
- Similar to gradient descent but adds models instead of updating parameters

n_estimators: number of learners
learning_rate: controls influence of each model
estimator (base_estimator before scikit-learn 1.2): typically shallow trees (stumps)

Gradient Boosting

What it does | How it works | Tradeoffs | Hyperparameters

Sequential ensemble that fits models to residual errors
Goal: reduce bias via additive error correction

Train model (weak learner)
Get residuals (actual minus predicted)
Train next model on the predecessor's residuals
Repeat
Final prediction = sum of all model outputs (each scaled by the learning rate when shrinkage is used)
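
The residual-fitting loop can be sketched for squared-error regression as follows. Assumptions of this sketch: the first "model" is simply the mean of the targets, weak learners are 1-D regression stumps, and every stump's output is scaled by a learning rate. Function names and the toy data are hypothetical.

```python
def fit_regression_stump(xs, targets):
    """Least-squares stump: split once and predict the mean target on each side."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def gradient_boost_fit(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                     # initial prediction: the mean target
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)   # next model fits the residuals
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps

def gradient_boost_predict(model, x):
    """Final prediction = base value + scaled sum of all stump outputs."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump(x) for stump in stumps)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: y = x^2 on [0, 1.9].
xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]
mean_y = sum(ys) / len(ys)

model = gradient_boost_fit(xs, ys, n_estimators=50, learning_rate=0.1)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(lambda x: gradient_boost_predict(model, x), xs, ys)
print(f"training MSE, mean only: {baseline_mse:.4f}")
print(f"training MSE, boosted:   {boosted_mse:.4f}")
```

For squared error the residuals are the negative gradient of the loss, which is why the same loop generalizes to other loss functions by fitting each new model to the negative gradient instead.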

Pros:
- Very flexible (can optimize different loss functions)
- Strong predictive performance
Cons:
- Prone to overfitting if too many trees
- Slower (sequential)
Key ideas:
- Shrinkage: small learning rate ⟶ usually better generalization, but more trees are needed
- Early stopping prevents overfitting
- Subsampling ⟶ stochastic gradient boosting (↓ variance, ↑ bias)

n_estimators: number of trees
learning_rate: shrinkage factor
max_depth: tree complexity
subsample: fraction of data per tree
loss: objective function

XGBoost

What it does | How it works | Tradeoffs | Hyperparameters

Optimized implementation of Gradient Boosting
Goal: faster, scalable, regularized boosting

Same core idea as Gradient Boosting.

Differences:

Uses optimized tree-building + system-level improvements
Supports early stopping using validation set
Regularization applied to control complexity

Pros:
- Very fast and scalable
- Built-in regularization ⟶ reduces overfitting
- Strong performance in practice (common in competitions)
Cons:
- More complex tuning
- Less interpretable

n_estimators
learning_rate
max_depth
subsample
colsample_bytree: feature sampling
reg_lambda / reg_alpha: regularization
early_stopping_rounds

Stacking

Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation?

Components

Instance: Row of data.
Predictors: Each base model.
Predictions: Matrix of instances × predictors (one column per base model).
Blender: Takes Predictions as input, outputs final prediction.

Common approach: hold-out set (assume we're using 3 predictors).

Components:

Training data ⟶ 1st subset (for training each base predictor)
Training data ⟶ 2nd subset (hold-out set for training the blender)

Process:

Split: Split the training set into 1st/2nd subsets.
Train: Use the 1st subset to train the base predictors.
Predict: Have each base predictor make predictions on the hold-out set.
Assemble new training set: Use the predicted values as input features of a new training set (3 features, one per predictor), keeping the original targets.
Train/Blend: Train a new model on only these 3 features. (This model is called a "meta-model" or blender.)
Predict: For a new instance, feed the base predictors' outputs to the blender to make the final prediction.
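
The hold-out procedure can be sketched end to end in plain Python. Everything here is a simplified assumption for illustration: the synthetic data, the three base predictors (one threshold rule per feature), and the blender, which is a small lookup table that learns the majority label for each pattern of base predictions.

```python
import random
from collections import Counter, defaultdict

def train_feature_stump(X, y, feature):
    """Base predictor: best single-threshold rule on one feature."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        errors = sum((row[feature] > threshold) != bool(label) for row, label in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, threshold)
    threshold = best[1]
    return lambda row: int(row[feature] > threshold)

def train_blender(meta_X, y):
    """Blender: for each pattern of base predictions, remember the majority label."""
    by_pattern = defaultdict(Counter)
    for pattern, label in zip(meta_X, y):
        by_pattern[tuple(pattern)][label] += 1
    table = {p: counts.most_common(1)[0][0] for p, counts in by_pattern.items()}
    default = Counter(y).most_common(1)[0][0]   # fallback for patterns never seen
    return lambda pattern: table.get(tuple(pattern), default)

# Synthetic data (hypothetical): 3 features, label depends on their sum.
rng = random.Random(0)
X = [[rng.random() for _ in range(3)] for _ in range(400)]
y = [int(sum(row) > 1.5) for row in X]

# Split: 1st subset for the base predictors, 2nd subset (hold-out) for the blender.
X_base, y_base = X[:200], y[:200]
X_hold, y_hold = X[200:], y[200:]

# Train: three base predictors on the 1st subset.
predictors = [train_feature_stump(X_base, y_base, f) for f in range(3)]

# Predict + assemble: hold-out predictions become a new 3-feature training set.
meta_X = [[p(row) for p in predictors] for row in X_hold]

# Train/Blend: fit the blender on those 3 features and the hold-out targets.
blender = train_blender(meta_X, y_hold)

# Predict: base predictors first, then the blender.
def stacked_predict(row):
    return blender([p(row) for p in predictors])

def accuracy(predict, rows, labels):
    return sum(predict(r) == t for r, t in zip(rows, labels)) / len(rows)

for i, p in enumerate(predictors):
    print(f"base predictor {i}: {accuracy(p, X_hold, y_hold):.3f}")
print(f"stacked:          {accuracy(stacked_predict, X_hold, y_hold):.3f}")
```

The accuracies printed here are measured on the same hold-out set the blender was trained on, so they are optimistic; an honest estimate would need a third, untouched test set.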