Combining Methods
No single model is best for every problem. By combining many learners that differ in algorithm, data sample, or specialization, we can reduce bias, variance, or both, and often outperform the best individual model.
No Free Lunch Theorem
Averaged over all possible problems, every learning algorithm performs the same. There is no universally best model — only models that excel at specific problems.
The practical consequence: instead of searching for the one perfect algorithm, we design portfolios of diverse learners and combine their outputs. Different learners make errors on different examples — combined, they cover each other's weaknesses.
- Different algorithms have different inductive biases — linear, tree-based, kernel-based
- Different hyperparameters produce models with different bias-variance profiles
- Different training sets (subsamples) produce differently specialized models
- Different views/modalities capture complementary information about the input
Combining works when base learners are diverse, that is, when their errors are only weakly correlated. If all base learners fail on the same examples, combining won't help. Diversity is the key ingredient.
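To make the role of uncorrelated errors concrete, here is a minimal sketch under an idealized assumption that is not in the notes: a binary task where every learner is correct with the same probability `p`, fully independently of the others. The function name `majority_vote_accuracy` is illustrative.

```python
from math import comb

def majority_vote_accuracy(p: float, n_learners: int) -> float:
    """Probability that a strict majority of independent learners is correct.

    Assumes a binary task, an odd number of learners, and that each learner
    is correct with probability p independently of all the others.
    """
    if n_learners % 2 == 0:
        raise ValueError("use an odd number of learners to avoid ties")
    k_min = n_learners // 2 + 1  # fewest correct votes that still win
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )

for n in (1, 5, 21):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote improves with more learners when `p > 0.5` and gets worse when `p < 0.5`. With perfectly correlated learners the vote would simply equal a single learner, which is why diversity matters.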
Ensemble · Hybrid · Multimodal
Ensemble
Multiple models trained for the same task on the same modality, using the same or different algorithms. Goal: reduce variance or bias through aggregation. Examples: Bagging, Boosting, Voting, Stacking.
Hybrid
Integrate different model types or learning paradigms. Richer reasoning through complementary strengths. Examples: Rule-based + Neural network, SVM + Deep learning.
Multimodal
Fuse inputs from different data types: text, image, audio, video. Examples: CLIP (image + text), VQA (image + question), audio-visual speech recognition.
Voting
The simplest combination: each base learner casts a vote, and the final prediction is some aggregate of those votes.
| Strategy | How it works | Best for |
|---|---|---|
| Majority vote | Most-voted label wins | Simple classification, equal trust in models |
| Weighted vote | Vote weighted by model accuracy on validation set | Unequal model quality |
| Soft (avg probs) | Average class probability outputs | Probabilistic classifiers |
| Product | Multiply class probabilities | Independent models; fuzzy logic |
| Min / Max | Most conservative / most confident prediction | Risk-averse or aggressive decisions |
| Median | Middle prediction across models | Outlier-robust regression |
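The strategies in the table can be sketched in a few lines of plain Python. The function names and the toy numbers are illustrative assumptions; the example is chosen so that hard and soft voting disagree.

```python
from collections import Counter
from statistics import median

def majority_vote(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Each label accumulates the weight of the models that voted for it."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

def soft_vote(prob_rows):
    """Average the per-class probabilities; return the winning class index."""
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / len(prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

def product_vote(prob_rows):
    """Multiply the per-class probabilities; return the winning class index."""
    n_classes = len(prob_rows[0])
    prod = [1.0] * n_classes
    for row in prob_rows:
        for c in range(n_classes):
            prod[c] *= row[c]
    return max(range(n_classes), key=prod.__getitem__)

# Three models, two classes: one model is very sure of class 0,
# two models lean slightly towards class 1.
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
hard_labels = [max(range(2), key=row.__getitem__) for row in probs]  # [0, 1, 1]

print(majority_vote(hard_labels))                   # hard vote ignores confidence
print(soft_vote(probs), product_vote(probs))        # confidence-aware rules
print(weighted_vote(hard_labels, [0.9, 0.3, 0.3]))  # weights are made-up accuracies
print(median([2.0, 2.5, 10.0]))                     # median for regression outputs
```

The hard vote picks class 1 because two of three models prefer it, while the averaged and multiplied probabilities both pick class 0 because the one confident model outweighs the two hesitant ones.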
Error-Correcting Output Codes (ECOC)
ECOC (Dietterich & Bakiri, 1995) solves multi-class classification using an ensemble of binary classifiers, inspired by error-correcting codes in communications.
Error-Correcting Codes
In data transmission, redundant bits allow error correction. Send "1" as "111". Receive "101" (one bit error) → majority vote → "1" (corrected).
ECOC in ML
Assign each class a unique binary codeword. Train one binary classifier per bit position. Classify by finding the class whose codeword has the smallest Hamming distance from the predicted bits.
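A codebook whose codewords differ pairwise in at least $d$ bits can correct up to $\lfloor (d-1)/2 \rfloor$ wrong bits: even if a few binary classifiers err, the predicted bit string stays closer in Hamming distance to the correct codeword than to any other. This works best when the classifiers' errors are only weakly correlated. More bits give more error-correction capacity, but also more classifiers to train.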
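A minimal decoding sketch, assuming a hand-picked 7-bit codebook for four classes (the codebook and the names `CODEBOOK`, `hamming`, `min_distance`, and `decode` are illustrative, not taken from the notes). The bit predictions would come from seven trained binary classifiers; here they are supplied directly.

```python
from itertools import combinations

# One 7-bit codeword per class; every pair of codewords differs in 4 positions.
CODEBOOK = {
    "A": (0, 0, 0, 0, 0, 0, 0),
    "B": (0, 0, 0, 1, 1, 1, 1),
    "C": (0, 1, 1, 0, 0, 1, 1),
    "D": (1, 0, 1, 0, 1, 0, 1),
}

def hamming(a, b):
    """Number of positions in which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(codebook):
    """Smallest pairwise Hamming distance between codewords."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))

def decode(bits, codebook):
    """Class whose codeword is nearest to the predicted bits."""
    return min(codebook, key=lambda cls: hamming(bits, codebook[cls]))

# The true class is C, but the classifier for bit 0 made a mistake.
predicted_bits = (1, 1, 1, 0, 0, 1, 1)
print(min_distance(CODEBOOK), decode(predicted_bits, CODEBOOK))
```

With a minimum distance of 4, any single wrong bit is still decoded to the right class; two wrong bits can already produce a tie.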
Mixture of Experts
Standard voting combines the base-learner outputs $d_j(\mathbf{x})$ with fixed weights $w_j$. Mixture of Experts makes the weights input-dependent, $w_j(\mathbf{x})$: each expert specializes in a different region of the input space, and a gating network routes each input to the relevant expert(s).
- Gating network — learns which expert to trust for each region of input space
- Expert specialization — each $d_j$ optimizes on the sub-region where $w_j(\mathbf{x})$ is large
- Key requirement — all experts must collectively cover the entire input space. A specialized expert that misses a region leaves it unhandled.
Mixture of Experts is the architecture behind sparse MoE transformers (e.g., Mixtral from Mistral AI; reportedly GPT-4). The gating network selects only a few experts per token, so the total parameter count can grow much faster than the compute spent per token.
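The input-dependent weighting can be sketched as follows. This is a toy illustration with hand-written experts and gate scores rather than learned ones; `moe_predict`, `experts`, and `gate_scores` are hypothetical names, and the optional `top_k` argument mimics sparse routing by running only the highest-weighted experts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_predict(x, experts, gate_scores, top_k=None):
    """Compute y = sum_j w_j(x) * d_j(x) with w(x) = softmax(gate_scores(x)).

    If top_k is given, only the k experts with the largest gate weight are
    evaluated and their weights are renormalized to sum to 1.
    """
    weights = softmax(gate_scores(x))
    active = sorted(range(len(experts)), key=lambda j: weights[j], reverse=True)
    if top_k is not None:
        active = active[:top_k]
    norm = sum(weights[j] for j in active)
    return sum(weights[j] / norm * experts[j](x) for j in active)

# Toy setup: approximate |x| with one expert per half-line.
experts = [lambda x: -x, lambda x: x]       # d_1 handles x < 0, d_2 handles x > 0
gate_scores = lambda x: [-10 * x, 10 * x]   # hand-written gate, not learned

print(moe_predict(2.0, experts, gate_scores))
print(moe_predict(-3.0, experts, gate_scores, top_k=1))
```

In a trained model both the experts and the gate would be learned jointly; the sketch only shows how the gate output turns into per-input combination weights.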
Stacking
Stacking (stacked generalization) trains a meta-learner to combine the outputs of base learners — learning when to trust which model from data, rather than using fixed weights.
- 01 Train base models: fit several diverse models (e.g., decision tree, SVM, logistic regression) on the original training set.
- 02 Generate predictions: use cross-validation or a held-out set to get predictions from each base model on unseen data. This avoids information leakage. Each training example becomes a vector of base model predictions $[p_1, \ldots, p_L]$.
- 03 Train meta-learner: use the prediction vectors as input features and the original labels as targets. The meta-learner learns how to best combine base model outputs.
- 04 Final prediction: for a new instance, collect the base model predictions, pass them to the meta-learner, and output its prediction.
Stacking works better with soft (probabilistic) outputs from base learners — full probability vectors rather than class labels. Hard decisions discard confidence information that the meta-learner can exploit. Concatenating the probability vectors gives the meta-learner a richer feature set.
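The four steps can be sketched on a toy regression problem. Everything here is an illustrative assumption: two deliberately simple base learners (`fit_mean`, `fit_1nn`), an interleaved 5-fold split, and a meta-learner reduced to a single least-squares blend weight clipped to $[0, 1]$.

```python
def fit_mean(xs, ys):
    """Base learner 1 (toy): always predicts the training mean."""
    mu = sum(ys) / len(ys)
    return lambda x: mu

def fit_1nn(xs, ys):
    """Base learner 2 (toy): 1-nearest-neighbour regression on a scalar input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def out_of_fold_predictions(fitters, xs, ys, n_folds=5):
    """One row [p_1, ..., p_L] per example, predicted by models that never saw it."""
    n = len(xs)
    rows = [[None] * len(fitters) for _ in range(n)]
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        for j, fit in enumerate(fitters):
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in held_out:
                rows[i][j] = model(xs[i])
    return rows

def fit_blend_weight(rows, ys):
    """Meta-learner: least-squares w for w*p_1 + (1-w)*p_2, clipped to [0, 1]."""
    num = sum((y - b) * (a - b) for (a, b), y in zip(rows, ys))
    den = sum((a - b) ** 2 for a, b in rows)
    w = num / den if den > 0 else 0.5
    return min(1.0, max(0.0, w))

xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]
fitters = [fit_mean, fit_1nn]

rows = out_of_fold_predictions(fitters, xs, ys)  # step 02
w = fit_blend_weight(rows, ys)                   # step 03
base_models = [fit(xs, ys) for fit in fitters]   # refit on all data for deployment

def stacked_predict(x):
    """Step 04: base predictions first, then the learned blend."""
    p_mean, p_nn = (m(x) for m in base_models)
    return w * p_mean + (1 - w) * p_nn

print(w, stacked_predict(1.23))
```

The key point is that the meta-learner is trained only on out-of-fold predictions; fitting it on predictions for data the base models had already seen would reward whichever base model overfits most.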
Cascading
Cascading arranges models in a pipeline where each model decides whether to produce a prediction or pass the input to a more complex downstream model.
Early models
Fast, simple. Handle the "easy" cases where one class is overwhelmingly likely. If confidence exceeds a threshold, output immediately. Most inputs are resolved here.
Later models
Slower, more complex. Handle the "hard" cases that earlier models are uncertain about. By the time they run, the input is already known to be difficult.
- Face detection (Viola-Jones) — reject most background windows with a simple classifier; escalate to complex classifiers only for face candidates
- Spam filtering — obvious spam caught by simple rules; ambiguous emails escalated to ML model
- Medical diagnosis — cheap tests first; expensive tests only for patients who fail simpler screens
Stacking runs all base models on every input, then combines. Cascading runs only as many models as needed — most inputs exit early. Cascading is computationally adaptive; stacking is computationally fixed but uses all model information.
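A minimal cascade sketch for the spam example, assuming each stage returns a `(label, confidence)` pair. The two stages, their keywords, and their confidence values are made-up stand-ins for a rule set and a trained model.

```python
def cheap_rule(text):
    """Stage 1 (hypothetical): keyword rule, confident only on obvious spam."""
    if "free money" in text.lower():
        return "spam", 0.99
    return "ham", 0.5

def slower_model(text):
    """Stage 2 (hypothetical stand-in for a trained classifier)."""
    lowered = text.lower()
    suspicious = sum(word in lowered for word in ("winner", "click", "urgent"))
    return ("spam", 0.8) if suspicious >= 2 else ("ham", 0.8)

def cascade_predict(x, stages, threshold=0.9):
    """Return (label, confidence, stage_index) from the first confident stage.

    The last stage always answers, whatever its confidence.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    for i, stage in enumerate(stages):
        label, confidence = stage(x)
        is_last = i == len(stages) - 1
        if confidence >= threshold or is_last:
            return label, confidence, i

stages = [cheap_rule, slower_model]
for mail in ("FREE MONEY now", "Urgent: click here, winner!", "Lunch at noon?"):
    print(mail, "->", cascade_predict(mail, stages))
```

The returned stage index shows where each input exited, which is the quantity that determines the average cost of the cascade.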
Bagging — Bootstrap Aggregating
Instead of combining different model types, Bagging trains many copies of the same algorithm on different random subsamples of the training data.
Bootstrapping
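Bootstrap sampling: draw $N$ examples from the training set with replacement. Each bootstrap sample contains, on average, about 63% of the distinct training examples (the remaining draws are duplicates), because the probability that a given example is drawn at least once is $1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 0.632$. The other ~37% are never selected for that sample; they form its "out-of-bag" (OOB) examples, which give a nearly unbiased error estimate without a separate validation set.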
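The 63% / 37% split is easy to check empirically. The sketch below draws one bootstrap sample of indices with the standard library; `bootstrap_indices` and the choice of `n = 1000` and seed `0` are illustrative.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly with replacement."""
    return [rng.randrange(n) for _ in range(n)]

n = 1000
rng = random.Random(0)  # fixed seed only to make the run repeatable
sample = bootstrap_indices(n, rng)

in_bag = set(sample)                                   # distinct examples drawn
out_of_bag = [i for i in range(n) if i not in in_bag]  # never drawn this round

unique_fraction = len(in_bag) / n
expected_fraction = 1 - (1 - 1 / n) ** n  # chance of being drawn at least once

print(unique_fraction, len(out_of_bag) / n, expected_fraction)
```

In bagging, the model trained on `sample` can be evaluated on `out_of_bag`, and averaging these per-model estimates gives the OOB error.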
Benefits from Bagging
Unstable algorithms: decision trees, neural networks. These produce very different models on different training sets, so averaging cancels out much of the variance.
Don't Benefit Much
Stable algorithms: SVM, k-NN, Naïve Bayes. These produce similar models regardless of the data subsample, so bagging adds little diversity and thus little improvement.
Random Forests = Bagging + feature randomization. At each split in each tree, only a random subset of the $d$ features is considered (commonly about $\sqrt{d}$ for classification). This adds a layer of diversity beyond bootstrapping, reducing the correlation between trees and thereby the variance of their average. A strong, low-tuning baseline for tabular data.
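Why lower correlation between trees matters can be seen from the variance of an average. Assume, as a simplification not stated in the notes, that the $L$ tree predictions $d_1, \ldots, d_L$ each have variance $\sigma^2$ and that every pair has the same correlation $\rho$. Then

$$\operatorname{Var}\!\left(\frac{1}{L}\sum_{j=1}^{L} d_j\right) = \frac{1}{L^2}\left(L\,\sigma^2 + L(L-1)\,\rho\,\sigma^2\right) = \rho\,\sigma^2 + \frac{1-\rho}{L}\,\sigma^2 .$$

Adding trees shrinks only the second term; the first term $\rho\,\sigma^2$ remains however large $L$ gets, so it can only be reduced by making the trees less correlated, which is what the feature randomization aims at.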
Boosting
Where Bagging trains base learners in parallel on random subsamples, Boosting trains them sequentially — each new learner explicitly focuses on examples the previous ones got wrong.
Maintain a probability distribution over training examples. Initially uniform. After each learner is trained, increase the weight of misclassified examples (make them more likely to be selected next round) and decrease the weight of correctly classified ones.
Each new weak learner must pay more attention to previously difficult examples. The final prediction is a weighted vote across all learners, where more accurate learners get higher weight.
Primarily reduces Bias
Each learner targets the residual error of its predecessors. The ensemble gradually moves from high bias to a complex composite model. Risk: overfitting if run too long.
vs. Bagging
Bagging reduces variance (parallel, independent learners). Boosting reduces bias (sequential, error-correcting). Both improve accuracy but through different mechanisms.
- AdaBoost — adaptive weights on training examples; log-weight on learner votes
- Gradient Boosting — fit each new model to the negative gradient of the loss
- XGBoost / LightGBM — optimized gradient boosting; currently dominant for tabular data competitions
AdaBoost — Adaptive Boosting
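Training
Start from uniform example weights $1/N$. In round $j$: train $d_j$ on the weighted (or resampled) data; compute its weighted error $\varepsilon_j$; stop if $\varepsilon_j \ge 1/2$; set $\beta_j = \varepsilon_j / (1 - \varepsilon_j)$; multiply the weights of correctly classified examples by $\beta_j$; renormalize so the weights sum to 1.
Prediction
Given $\mathbf{x}$, collect the outputs $d_j(\mathbf{x})$, $j = 1, \ldots, L$, and take a weighted vote.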
Each learner's vote is weighted by $\log(1/\beta_j) = \log\frac{1-\varepsilon_j}{\varepsilon_j}$. A learner with low error ($\varepsilon_j \to 0$) gets $\beta_j \to 0$, so $\log(1/\beta_j) \to \infty$ — a very high vote weight. A learner at chance level ($\varepsilon_j = 0.5$) gets $\beta_j = 1$, $\log(1/\beta_j) = 0$ — zero vote.
If each base learner achieves weighted error $\varepsilon_j \le 1/2 - \gamma$ for some $\gamma > 0$, then the training error of AdaBoost's ensemble decreases exponentially in the number of rounds $L$ (it is bounded by $e^{-2\gamma^2 L}$). Training error can therefore be driven to zero, but the ensemble may overfit noisy data; stopping early, e.g. based on validation error, limits this.
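A compact sketch of the procedure with one-dimensional decision stumps as weak learners and labels in $\{-1, +1\}$. It takes $\beta_j = \varepsilon_j / (1 - \varepsilon_j)$, as implied by the vote-weight identity above, and uses the example weights directly in the error instead of resampling, which is a simplification. The toy data and all function names are illustrative.

```python
import math

def stump_predict(x, threshold, sign):
    """Predict `sign` if x lies above the threshold, otherwise `-sign`."""
    return sign if x > threshold else -sign

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best 1-D stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thr, sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, n_rounds):
    """Return a list of (vote_weight, threshold, sign) triples."""
    n = len(xs)
    weights = [1.0 / n] * n                    # uniform start
    ensemble = []
    for _ in range(n_rounds):
        err, thr, sign = fit_stump(xs, ys, weights)
        if err >= 0.5:                         # no better than chance: stop
            break
        err = max(err, 1e-10)                  # guard against division by zero
        beta = err / (1 - err)
        ensemble.append((math.log(1 / beta), thr, sign))
        # Shrink the weight of correctly classified examples, then renormalize.
        weights = [w * beta if stump_predict(x, thr, sign) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all stumps; ties resolve to +1."""
    score = sum(a * stump_predict(x, thr, sign) for a, thr, sign in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]        # no single stump separates this

for rounds in (1, 3):
    model = adaboost(xs, ys, rounds)
    errors = sum(ensemble_predict(model, x) != y for x, y in zip(xs, ys))
    print(rounds, "rounds:", errors, "training errors")
```

On this toy set a single stump misclassifies three points, while three boosted stumps classify all ten training points correctly.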
Gradient Boosting generalizes boosting to arbitrary differentiable loss functions. Each new model is fit to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals). With squared-error loss, the negative gradient is, up to a constant factor, the ordinary residual $y - \hat{y}$. XGBoost and LightGBM add regularization, second-order Taylor approximations of the loss, and hardware-optimized implementations, which makes them a common first choice for structured/tabular data.
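For the squared-error case, the idea reduces to repeatedly fitting a small model to the current residuals. The sketch below uses one-split regression stumps on a scalar input; the learning rate, round count, toy data, and function names are illustrative assumptions, and none of the XGBoost/LightGBM refinements are included.

```python
def fit_regression_stump(xs, targets):
    """Best single split by squared error: (threshold, left_value, right_value)."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return (initial_prediction, learning_rate, stumps) for squared-error loss."""
    base = sum(ys) / len(ys)                    # constant model minimizing squared error
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - p)^2 with respect to p is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_regression_stump(xs, residuals)
        stumps.append((thr, lv, rv))
        preds = [p + learning_rate * (lv if x <= thr else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    """Initial prediction plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= thr else rv)
                      for thr, lv, rv in stumps)

def mse(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

for rounds in (0, 5, 50):
    print(rounds, "rounds: training MSE =", mse(gradient_boost(xs, ys, rounds), xs, ys))
```

For another differentiable loss, only the `residuals` line would change: it would hold the negative gradient of that loss at the current predictions.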
This concludes the lecture series. The exam covers all lectures and practical sessions: supervised learning theory, Bayesian decision theory, parametric methods, graphical models, HMMs, rule-based learners, decision trees, lazy learners, SVMs, and combining methods.