Lecture 9 · Chapter 17 · 28 April

Combining Methods

No single model is best for every problem. By combining many learners (varying algorithms, data samples, or specializations) we can reduce bias, variance, or both, and often outperform any individual model.

Authors: Seixas Junior, Mahmud & Koren
Key algorithms: Bagging · AdaBoost · Stacking · ECOC
Date: 28 April

00 · Motivation

No Free Lunch Theorem

No Free Lunch Theorem — Wolpert & Macready

Averaged over all possible problems, every learning algorithm performs the same. There is no universally best model — only models that excel at specific problems.

The practical consequence: instead of searching for the one perfect algorithm, we design portfolios of diverse learners and combine their outputs. Different learners make errors on different examples — combined, they cover each other's weaknesses.

  • Different algorithms have different inductive biases — linear, tree-based, kernel-based
  • Different hyperparameters produce models with different bias-variance profiles
  • Different training sets (subsamples) produce differently specialized models
  • Different views/modalities capture complementary information about the input
The Core Assumption

Combining works because base learners are diverse: their errors are uncorrelated, or at least only weakly correlated. If all base learners fail on the same examples, combining won't help. Diversity is the key ingredient.
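To make the role of uncorrelated errors concrete, here is a minimal sketch under an idealized assumption: every base classifier is correct with the same probability `p`, independently of the others. The function name and the numbers are illustrative, not from the lecture. Under that assumption the majority vote is correct whenever more than half of the voters are correct, which is a binomial tail probability.

```python
from math import comb

def majority_vote_accuracy(p: float, n_learners: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each voter is correct with probability p, independently of the
    others, and that n_learners is odd so ties cannot occur.
    """
    if n_learners % 2 == 0:
        raise ValueError("use an odd number of learners to avoid ties")
    k_min = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )

# Accuracy of the vote grows with the number of (independent) learners.
for n in (1, 5, 21):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

If the errors were perfectly correlated instead, all voters would agree on every example and the ensemble would score exactly `p`, no matter how many learners are added.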

01 · Types

Ensemble · Hybrid · Multimodal

🎯

Ensemble

Multiple models, often of the same type, on the same task and modality. Goal: reduce variance or bias through aggregation. Examples: Bagging, Boosting, Voting, Stacking.

🔀

Hybrid

Integrate different model types or learning paradigms. Richer reasoning through complementary strengths. Examples: Rule-based + Neural network, SVM + Deep learning.

🌐

Multimodal

Fuse inputs from different data types: text, image, audio, video. Examples: CLIP (image + text), VQA (image + question), audio-visual speech recognition.

02 · Model Combination

Voting

The simplest combination: each base learner casts a vote, and the final prediction is some aggregate of those votes.

Weighted Voting (Regression & Classification)

$$y = \sum_{j=1}^{L} w_j d_j, \qquad w_j \geq 0,\;\sum_j w_j = 1$$

$$y_i = \sum_{j=1}^{L} w_j d_{ji} \quad \text{(class } i \text{ output)}$$

| Strategy | How it works | Best for |
|---|---|---|
| Majority vote | Most-voted label wins | Simple classification, equal trust in models |
| Weighted vote | Vote weighted by model accuracy on validation set | Unequal model quality |
| Soft (avg probs) | Average class probability outputs | Probabilistic classifiers |
| Product | Multiply class probabilities | Independent models; fuzzy logic |
| Min / Max | Most conservative / most confident prediction | Risk-averse or aggressive decisions |
| Median | Middle prediction across models | Outlier-robust regression |
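The combination rules above can be sketched in a few lines. This is an illustrative sketch only: the function name `combine`, the rule names, and the example probabilities are assumptions made for the demonstration. `probs[j][i]` plays the role of $d_{ji}$, learner $j$'s output for class $i$.

```python
import math
from statistics import median

def combine(probs, rule="mean", weights=None):
    """Return the index of the winning class under the given combination rule.

    probs[j][i] is learner j's probability for class i.
    weights is only used by the "weighted" rule and should sum to 1.
    """
    n_classes = len(probs[0])
    # per_class[i] lists every learner's output for class i.
    per_class = [[p[i] for p in probs] for i in range(n_classes)]
    if rule == "majority":
        scores = [0] * n_classes
        for p in probs:
            scores[p.index(max(p))] += 1  # each learner votes for its top class
    elif rule == "weighted":
        scores = [sum(w * d for w, d in zip(weights, col)) for col in per_class]
    else:
        reducers = {
            "mean": lambda col: sum(col) / len(col),
            "product": math.prod,
            "min": min,
            "max": max,
            "median": median,
        }
        scores = [reducers[rule](col) for col in per_class]
    return scores.index(max(scores))

# One confident learner for class 0, two mildly confident learners for class 1.
probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
for rule in ("majority", "mean", "product", "min", "max", "median"):
    print(rule, combine(probs, rule))
```

In this toy example the rules disagree: majority and median follow the two mild votes for class 1, while mean and product are pulled to class 0 by the single confident learner. Which behavior is preferable depends on how much the confidence values can be trusted.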
03 · Multi-Class

Error-Correcting Output Codes (ECOC)

ECOC (Dietterich & Bakiri, 1995) solves multi-class classification using an ensemble of binary classifiers, inspired by error-correcting codes in communications.

📡

Error-Correcting Codes

In data transmission, redundant bits allow error correction. Send "1" as "111". Receive "101" (one bit error) → majority vote → "1" (corrected).

🏷️

ECOC in ML

Assign each class a unique binary codeword. Train one binary classifier per bit position. Classify by finding the class whose codeword has the smallest Hamming distance from the predicted bits.

ECOC Strategies: Code Matrix $W$

$$\text{One-vs-rest: } L=K \qquad \text{Pairwise: } L=\frac{K(K-1)}{2} \qquad \text{Full code: } L=2^{K-1}-1$$

Why Redundancy Helps

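Redundancy lets the decoder tolerate mistakes by individual binary classifiers. If the codewords have minimum pairwise Hamming distance $d_{\min}$, then up to $\lfloor (d_{\min}-1)/2 \rfloor$ wrong bits still leave the predicted bit string closer to the correct class's codeword than to any other. This helps only when the classifiers' errors are sufficiently uncorrelated. More bits give more error-correction capacity, but also more classifiers to train.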
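A small sketch of ECOC decoding, assuming $K = 4$ classes and the full code with $L = 2^{K-1} - 1 = 7$ bits. The class names `A` to `D` and the helper names are hypothetical. The predicted bits would come from seven trained binary classifiers; here they are typed in by hand to show the decoding step only.

```python
from itertools import combinations

# One 7-bit codeword per class (exhaustive code for K = 4).
CODEWORDS = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (0, 0, 0, 0, 1, 1, 1),
    "C": (0, 0, 1, 1, 0, 0, 1),
    "D": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of positions where two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(predicted_bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], predicted_bits))

def min_distance():
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(CODEWORDS.values(), 2))

d_min = min_distance()
correctable = (d_min - 1) // 2  # bit errors guaranteed to be corrected

# Class C's codeword with the first bit flipped by a faulty classifier.
noisy = (1, 0, 1, 1, 0, 0, 1)
print(d_min, correctable, decode(noisy))
```

With this code matrix every pair of codewords differs in four positions, so one wrong binary classifier per example is always corrected, at the cost of training seven classifiers instead of four.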

04 · Model Combination

Mixture of Experts

Standard voting uses fixed weights $w_j$. Mixture of Experts makes the weights input-dependent: each expert specializes in a different region of the input space, and a gating network routes each input to the relevant expert(s).

Mixture of Experts

$$y = \sum_{j=1}^{L} w_j(\mathbf{x})\, d_j(\mathbf{x}), \qquad w_j(\mathbf{x}) \geq 0,\;\sum_j w_j(\mathbf{x}) = 1$$
  • Gating network — learns which expert to trust for each region of input space
  • Expert specialization — each $d_j$ optimizes on the sub-region where $w_j(\mathbf{x})$ is large
  • Key requirement — all experts must collectively cover the entire input space. A specialized expert that misses a region leaves it unhandled.
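The formula can be sketched with a softmax gate over a one-dimensional input. Everything specific here is an assumption for illustration: the gate is linear in $x$, its parameters are set by hand rather than learned, and the two toy experts are chosen so that together they approximate $|x|$. In practice the gate and the experts are trained jointly.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, experts, gate_params):
    """Compute y = sum_j w_j(x) * d_j(x) with w(x) = softmax(a_j * x + b_j)."""
    weights = softmax([a * x + b for a, b in gate_params])
    outputs = [expert(x) for expert in experts]
    y = sum(w * d for w, d in zip(weights, outputs))
    return y, weights

# Expert 0 is accurate for x < 0, expert 1 for x > 0 (hand-picked toy experts).
experts = [lambda x: -x, lambda x: x]
# Hand-set gate: strongly prefers expert 0 on the left, expert 1 on the right.
gate_params = [(-10.0, 0.0), (10.0, 0.0)]

for x in (-2.0, 0.0, 2.0):
    y, w = mixture_of_experts(x, experts, gate_params)
    print(x, round(y, 4), [round(wi, 4) for wi in w])
```

Away from the boundary at $x = 0$ the gate puts almost all weight on one expert, so each expert only needs to be accurate in its own region.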
Modern Relevance

Mixture of Experts is the idea behind sparse MoE transformers (for example Mixtral from Mistral AI, and reportedly GPT-4). The gating network selects only a few experts per token, so per-token compute scales with the number of active experts rather than with the total parameter count.

05 · Model Combination

Stacking

Stacking (stacked generalization) trains a meta-learner to combine the outputs of base learners — learning when to trust which model from data, rather than using fixed weights.

  1. Train base models: fit several diverse models (e.g., decision tree, SVM, logistic regression) on the original training set.
  2. Generate predictions: use cross-validation or a held-out set to get predictions from each base model on unseen data. This avoids information leakage. Each training example becomes a vector of base model predictions $[p_1, \ldots, p_L]$.
  3. Train meta-learner: use the prediction vectors as input features and original labels as targets. The meta-learner learns how to best combine base model outputs.
  4. Final prediction: for a new instance, collect base model predictions, pass them to the meta-learner, and output its prediction.
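The leakage-critical part is the second step, so here is a minimal sketch of out-of-fold prediction. The helper name `out_of_fold_predictions`, the interleaved fold assignment, and the two toy base learners are all assumptions for illustration; each `fit` function is assumed to return a prediction function. The resulting rows would then be the input features for whatever meta-learner is chosen.

```python
def out_of_fold_predictions(X, y, fit_fns, n_folds=5):
    """Build meta-features for stacking.

    Row i holds each base model's prediction for X[i], made by a copy of that
    model that did not see example i during training.
    """
    n = len(X)
    meta = [[None] * len(fit_fns) for _ in range(n)]
    for fold in range(n_folds):
        # Interleaved folds for simplicity; shuffle first if order matters.
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        for j, fit in enumerate(fit_fns):
            predict = fit(X_tr, y_tr)
            for i in held_out:
                meta[i][j] = predict(X[i])
    return meta

def fit_mean(X, y):
    """Toy base learner: always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean

def fit_1nn(X, y):
    """Toy base learner: 1-nearest-neighbour on a scalar input."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda i: abs(X[i] - x))
        return y[nearest]
    return predict

X = [float(i) for i in range(10)]
y = [2.0 * x for x in X]
meta = out_of_fold_predictions(X, y, [fit_mean, fit_1nn], n_folds=5)
print(meta[3])  # predictions for example 3 from models trained without it
```

The 1-nearest-neighbour learner shows why this matters: trained on all the data it would reproduce every training label exactly, and the meta-learner would learn to trust it completely. With out-of-fold predictions it can never return an example's own label.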
Why Not Use Hard Decisions?

Stacking works better with soft (probabilistic) outputs from base learners — full probability vectors rather than class labels. Hard decisions discard confidence information that the meta-learner can exploit. Concatenating the probability vectors gives the meta-learner a richer feature set.

06 · Sequential Combination

Cascading

Cascading arranges models in a pipeline where each model decides whether to produce a prediction or pass the input to a more complex downstream model.

Early models

Fast, simple. Handle the "easy" cases where one class is overwhelmingly likely. If confidence exceeds a threshold, output immediately. Most inputs are resolved here.

🔬

Later models

Slower, more complex. Handle the "hard" cases that earlier models are uncertain about. By the time they run, the input is already known to be difficult.

  • Face detection (Viola-Jones) — reject most background windows with a simple classifier; escalate to complex classifiers only for face candidates
  • Spam filtering — obvious spam caught by simple rules; ambiguous emails escalated to ML model
  • Medical diagnosis — cheap tests first; expensive tests only for patients who fail simpler screens
Cascading vs. Stacking

Stacking runs all base models on every input, then combines. Cascading runs only as many models as needed — most inputs exit early. Cascading is computationally adaptive; stacking is computationally fixed but uses all model information.
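The early-exit logic of a cascade can be sketched as follows. The function `cascade_predict`, the two toy spam models, and the 0.9 threshold are hypothetical stand-ins; each model is assumed to return a `(label, confidence)` pair.

```python
def cascade_predict(x, stages, fallback):
    """Run stages in order and stop at the first confident prediction.

    stages: list of (model, threshold); model(x) returns (label, confidence).
    fallback: final model whose answer is always accepted.
    Returns (label, number_of_models_run).
    """
    for n_run, (model, threshold) in enumerate(stages, start=1):
        label, confidence = model(x)
        if confidence >= threshold:
            return label, n_run  # easy case: exit early
    label, _ = fallback(x)
    return label, len(stages) + 1

def keyword_rule(text):
    """Cheap first stage: confident only when an obvious phrase appears."""
    if "free money" in text.lower():
        return "spam", 0.99
    return "ham", 0.60

def slow_model(text):
    """Stand-in for an expensive classifier (here just a punctuation count)."""
    return ("spam" if text.count("!") >= 3 else "ham"), 1.0

stages = [(keyword_rule, 0.9)]
for message in ("FREE MONEY now", "hello!!!", "see you at lunch"):
    print(message, cascade_predict(message, stages, slow_model))
```

The second return value counts how many models ran, which makes the adaptive cost visible: the obvious spam exits after one model, while the other two messages fall through to the slower one.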

07 · Data-Based

Bagging — Bootstrap Aggregating

Instead of combining different model types, Bagging trains many copies of the same algorithm on different random subsamples of the training data.

Bagging

$$\text{For } j = 1,\ldots,L: \;\text{draw } X_j \sim \text{Bootstrap}(\mathcal{X}); \;\text{train } d_j \text{ on } X_j$$

$$\hat{y}(\mathbf{x}) = \text{MajorityVote}(d_1(\mathbf{x}),\ldots,d_L(\mathbf{x})) \;\text{ or }\; \frac{1}{L}\sum_j d_j(\mathbf{x})$$

Bootstrapping

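Bootstrap sampling: sample $N$ examples from the training set with replacement. The chance that a given example is never drawn is $(1 - 1/N)^N \approx e^{-1} \approx 0.368$, so each bootstrap sample contains about 63% of the distinct training examples (the remaining draws are duplicates). The roughly 37% of examples that are never selected for a given sample are its "out-of-bag" (OOB) examples, which can be used to estimate generalization error without a separate validation set.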
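The 63% / 37% split can be checked numerically. This is a small sketch with illustrative names; the seed and the sample size are arbitrary choices.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def expected_unique_fraction(n):
    """Expected fraction of distinct examples in one bootstrap sample."""
    return 1.0 - (1.0 - 1.0 / n) ** n

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)
in_bag = set(sample)
oob = [i for i in range(n) if i not in in_bag]  # out-of-bag examples

print("expected unique fraction:", round(expected_unique_fraction(n), 4))
print("observed unique fraction:", len(in_bag) / n)
print("observed OOB fraction:   ", len(oob) / n)
```

In a bagging loop, each base learner $d_j$ would be trained on its own `sample` and evaluated on its own `oob` list, and the OOB error is the average over examples of the error made by the learners that did not see them.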

Benefits from Bagging

Unstable algorithms: Decision Trees, Neural Networks. These produce very different models on different training sets, so averaging cancels out much of the variance.

Don't Benefit Much

Stable algorithms: SVM, k-NN, Naïve Bayes. These produce similar models regardless of data subsample. Bagging adds little diversity and thus little improvement.

Random Forests

Random Forests = Bagging + feature randomization. At each split in each tree, only a random subset of the $d$ features is considered (commonly about $\sqrt{d}$ for classification). This adds a layer of diversity beyond bootstrapping: it reduces correlation between trees and so further reduces the variance of the averaged prediction. Random Forests are a strong, widely used baseline for tabular data.

08 · Sequential Learning

Boosting

Where Bagging trains base learners in parallel on random subsamples, Boosting trains them sequentially — each new learner explicitly focuses on examples the previous ones got wrong.

Key Idea: Focus on Hard Examples

Maintain a probability distribution over training examples. Initially uniform. After each learner is trained, increase the weight of misclassified examples (make them more likely to be selected next round) and decrease the weight of correctly classified ones.

Each new weak learner must pay more attention to previously difficult examples. The final prediction is a weighted vote across all learners, where more accurate learners get higher weight.

📉

Primarily reduces Bias

Each learner targets the residual error of its predecessors. The ensemble gradually moves from high bias to a complex composite model. Risk: overfitting if run too long.

📊

vs. Bagging

Bagging reduces variance (parallel, independent learners). Boosting reduces bias (sequential, error-correcting). Both improve accuracy but through different mechanisms.

  • AdaBoost — adaptive weights on training examples; log-weight on learner votes
  • Gradient Boosting — fit each new model to the negative gradient of the loss
  • XGBoost / LightGBM — optimized gradient boosting; currently dominant for tabular data competitions
09 · Algorithm

AdaBoost — Adaptive Boosting

Training

// Initialize uniform weights
p¹ᵢ = 1/N for all (xᵢ, rᵢ) ∈ X
for j = 1, ..., L:
    Draw Xⱼ from X with probabilities pʲ
    Train base learner dⱼ on Xⱼ
    for each (xᵢ, rᵢ):
        yʲᵢ ← dⱼ(xᵢ)
    ϵⱼ ← Σᵢ pʲᵢ · 1(yʲᵢ ≠ rᵢ)          // weighted error rate
    if ϵⱼ > 1/2: stop                   // worse than random → abort
    βⱼ ← ϵⱼ / (1 - ϵⱼ)                  // learner weight (< 1 if ϵⱼ < 1/2)
    for each (xᵢ, rᵢ):
        if yʲᵢ = rᵢ: pʲ⁺¹ᵢ ← βⱼ · pʲᵢ   // correct → decrease weight
        else:        pʲ⁺¹ᵢ ← pʲᵢ        // wrong → keep weight
    Normalize: pʲ⁺¹ᵢ ← pʲ⁺¹ᵢ / Zⱼ
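A runnable sketch of this training loop for two classes with labels in {+1, -1}. Two things differ from the pseudocode and are assumptions of this sketch: instead of drawing a sample $X_j$ with probabilities $p^j$, each decision stump is trained directly on the weighted examples (a common deterministic variant), and a tiny floor on $\epsilon_j$ avoids taking the logarithm of infinity when a learner is perfect. The stump learner, the function names, and the toy data are illustrative.

```python
import math

def train_stump(X, r, p):
    """Pick the threshold and polarity with the lowest weighted error."""
    xs = sorted(set(X))
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            err = sum(pi for xi, ri, pi in zip(X, r, p)
                      if (sign if xi > t else -sign) != ri)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign if x > t else -sign

def adaboost(X, r, n_rounds):
    n = len(X)
    p = [1.0 / n] * n                      # uniform initial weights
    ensemble = []                          # (vote weight, learner) pairs
    for _ in range(n_rounds):
        d = train_stump(X, r, p)
        y = [d(x) for x in X]
        eps = sum(pi for pi, yi, ri in zip(p, y, r) if yi != ri)
        if eps > 0.5:                      # worse than random: stop
            break
        eps = max(eps, 1e-10)              # assumption: guard for a perfect learner
        beta = eps / (1.0 - eps)
        ensemble.append((math.log(1.0 / beta), d))
        # Shrink the weight of correctly classified examples, then normalize.
        p = [pi * beta if yi == ri else pi for pi, yi, ri in zip(p, y, r)]
        z = sum(p)
        p = [pi / z for pi in p]
    return ensemble

def predict(ensemble, x):
    """Weighted vote; for two classes this is the sign of the weighted sum."""
    score = sum(w * d(x) for w, d in ensemble)
    return 1 if score > 0 else -1

# No single stump can separate this pattern; three boosted stumps can.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
r = [1, 1, -1, -1, 1, 1]
for rounds in (1, 2, 3):
    model = adaboost(X, r, rounds)
    errors = sum(predict(model, x) != ri for x, ri in zip(X, r))
    print(rounds, "rounds:", errors, "training errors")
```

On this toy set the first stump has weighted error 1/3, so its vote weight is $\log 2$. The next two stumps are trained on reweighted data and receive larger votes, and the three-stump ensemble classifies all six points correctly.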

Prediction

AdaBoost: Weighted Class Vote

$$y_i = \sum_{j=1}^{L} \left(\log\frac{1}{\beta_j}\right) d_{ji}(\mathbf{x})$$

Each learner's vote is weighted by $\log(1/\beta_j) = \log\frac{1-\varepsilon_j}{\varepsilon_j}$. A learner with low error ($\varepsilon_j \to 0$) gets $\beta_j \to 0$, so $\log(1/\beta_j) \to \infty$ — a very high vote weight. A learner at chance level ($\varepsilon_j = 0.5$) gets $\beta_j = 1$, $\log(1/\beta_j) = 0$ — zero vote.

Theoretical Guarantee

If each base learner achieves weighted error $\varepsilon_j \leq 1/2 - \gamma$ for some $\gamma > 0$, then the training error of AdaBoost's ensemble is at most $e^{-2\gamma^2 L}$, so it decreases exponentially in the number of rounds $L$. It can drive training error to zero, but may overfit on noisy data. Early stopping helps control this.
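A short numerical sketch of this guarantee, using the standard bound $\prod_j 2\sqrt{\varepsilon_j(1-\varepsilon_j)} \leq e^{-2\gamma^2 L}$ on the training error (the bound is the classical AdaBoost result, not something stated in the slides; the function names and the values of $\gamma$ and $N$ are illustrative). Since training error is a multiple of $1/N$, a bound below $1/N$ means zero training errors.

```python
import math

def adaboost_training_bound(errors):
    """Upper bound prod_j 2*sqrt(eps_j * (1 - eps_j)) on the training error."""
    return math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in errors)

def rounds_for_zero_training_error(gamma, n_examples):
    """Smallest L with exp(-2 * gamma^2 * L) < 1 / n_examples."""
    return math.floor(math.log(n_examples) / (2.0 * gamma**2)) + 1

# Every weak learner has error 0.4, i.e. an edge of gamma = 0.1 over chance.
for n_rounds in (10, 50, 200):
    print(n_rounds, round(adaboost_training_bound([0.4] * n_rounds), 4))

print(rounds_for_zero_training_error(gamma=0.1, n_examples=1000))
```

The guarantee concerns training error only. It says nothing about test error, which is why early stopping or a validation set is still needed on noisy data.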

Gradient Boosting — The Modern Variant

Gradient Boosting generalizes boosting to arbitrary differentiable loss functions. Each new model is fit to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals). With squared error loss, these are exactly the ordinary residuals. XGBoost and LightGBM add regularization, second-order Taylor approximations, and hardware-optimized implementations, which makes them a common first choice for structured/tabular data.
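For squared error loss the procedure reduces to repeatedly fitting the residuals, which can be sketched with regression stumps on a scalar input. The shrinkage factor `learning_rate`, the stump learner, the function names, and the toy data are assumptions of this sketch; it is not how XGBoost or LightGBM are implemented.

```python
def fit_stump(X, residuals):
    """Regression stump: one threshold, predicts the mean residual on each side."""
    xs = sorted(set(X))
    best = None
    for t in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        left = [r for x, r in zip(X, residuals) if x <= t]
        right = [r for x, r in zip(X, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1):
    """Boosting for the loss 1/2 * (y - F(x))^2, whose negative gradient is y - F(x)."""
    base = sum(y) / len(y)               # initial constant model
    pred = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, X)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, X, y):
    return sum((model(x) - yi) ** 2 for x, yi in zip(X, y)) / len(X)

X = [float(i) for i in range(10)]
y = [x * x for x in X]
for rounds in (0, 5, 50):
    print(rounds, round(mse(gradient_boost(X, y, rounds), X, y), 3))
```

With another differentiable loss only the `residuals` line changes: it would compute that loss's negative gradient at the current predictions instead of `yi - pi`.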


Course Complete

This concludes the lecture series. The exam covers all lectures and practical sessions: supervised learning theory, Bayesian decision theory, parametric methods, graphical models, HMMs, rule-based learners, decision trees, lazy learners, SVMs, and combining methods.