Lecture 6 · Chapter 9 · 20 March

Rule-Based Learners & Decision Trees

Interpretable models that partition the feature space into regions using explicit if-then rules. Decision trees combine statistical learning with human-readable logic, striking a balance between expressiveness and explainability.

Authors: José Seixas Junior & Jiyan Salim Mahmud
Key algorithms: ID3 · C4.5 · CART · RIPPER
Date: 20 March

00 · Introduction

Rule-Based Methods

Definition

Pre-defined rules of the form "if condition then action" determine outcomes. The system evaluates each rule against incoming data and executes the associated action when conditions are met.

Example: if temperature > 38°C AND cough present → suspect infection
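A rule base of this kind can be sketched as an ordered list of (condition, action) pairs. This is only an illustration: the second rule, the record field names, and the first-match policy are assumptions, not part of the lecture.

```python
from typing import Callable, Dict, List, Tuple

# A rule pairs a condition (a predicate over a record) with an action label.
Rule = Tuple[Callable[[Dict], bool], str]

# Hypothetical rule base; only the first rule comes from the slide example.
RULES: List[Rule] = [
    (lambda r: r["temperature"] > 38.0 and r["cough"], "suspect infection"),
    (lambda r: r["temperature"] > 38.0, "monitor fever"),
]

def fire_rules(record: Dict, rules: List[Rule], default: str = "no action") -> str:
    """Return the action of the first rule whose condition holds."""
    for condition, action in rules:
        if condition(record):
            return action
    return default

print(fire_rules({"temperature": 39.2, "cough": True}, RULES))
```

Because the rules are checked in a fixed order, the same record always produces the same action, which is the determinism property described in the next section.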

Key Characteristics

🔍

Interpretability

Every decision is fully traceable. You can always explain exactly why a prediction was made — critical in regulated domains.

🎯

Determinism

Same input always yields the same output. No stochasticity in inference, making behaviour predictable and auditable.

🧩

Grouping assumption

The feature space can be partitioned into well-defined regions, and the boundaries between groups must be expressible as discrete conditions.

Variants of Rule-Based Methods

  • Expert systems: rules hand-crafted by domain experts. Used in game AI, clinical decision support, regulatory compliance.
  • Decision trees: rules learned from data by recursive splitting. The main focus of this lecture.
  • Fuzzy rule-based systems: extend crisp conditions with fuzzy logic to handle uncertainty and gradual membership.
  • Association rule mining: Apriori and FP-Growth (which builds an FP-tree) discover rules from large unlabeled datasets (covered in Lecture 2).
01 · Context

Comparison of ML Approaches

| Approach | Mechanism | Knowledge source | Characteristics |
|---|---|---|---|
| Rule-Based | Explicit if-then rules | Human expertise | Highly interpretable; expert systems; regulatory domains |
| Bayesian | Probabilistic reasoning | Prior + data | Uncertainty modelling; spam filtering; medical diagnosis |
| Statistical | Patterns from data | Data-driven | Balanced interpretability and performance |
| Deep Learning | Learned features (black-box) | Data | High accuracy; needs lots of data; less interpretable |

02 · Decision Trees

What Is a Decision Tree?

A decision tree is a supervised learning model that uses a hierarchical series of tests on features to arrive at a prediction. It recursively splits the dataset into increasingly homogeneous subsets.

Internal nodes

Tests on features: "Age < 30?". Each node applies a condition and routes examples to a child branch based on the outcome.

Branches

Outcomes of the test at the parent node. Binary (yes/no) for numeric features; multi-way for discrete features.

Leaves

Final predictions: class label or probability distribution (classification) or a numeric value (regression).

[Figure: credit scoring tree. Root: Income < 50k? Yes → Savings > 5k? (No → ✗ High risk; Yes → ✓ Low risk). No → ✓ Low risk (leaf).]

Example: credit scoring tree. Rectangles with rounded corners = decision nodes. Sharp rectangles = leaves.
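One way to hold such a tree in code is as nested node objects, with prediction as a walk from root to leaf. The sketch below assumes a particular reading of the figure (low income leads to the savings test, high income goes straight to a low-risk leaf); the class names, field names, and units are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    label: str

@dataclass
class Node:
    feature: str
    threshold: float
    test: str                      # "<" or ">"
    yes: Union["Node", Leaf]       # child followed when the test passes
    no: Union["Node", Leaf]        # child followed when the test fails

# Assumed layout of the credit scoring example.
TREE = Node(
    "income", 50_000, "<",
    yes=Node("savings", 5_000, ">", yes=Leaf("low risk"), no=Leaf("high risk")),
    no=Leaf("low risk"),
)

def predict(tree: Union[Node, Leaf], x: Dict[str, float]) -> str:
    """Route the example down the tree until a leaf is reached."""
    while isinstance(tree, Node):
        value = x[tree.feature]
        passed = value < tree.threshold if tree.test == "<" else value > tree.threshold
        tree = tree.yes if passed else tree.no
    return tree.label

print(predict(TREE, {"income": 30_000, "savings": 1_000}))
```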

03 · Strategy

Divide and Conquer

Decision trees are built using a greedy divide-and-conquer strategy: at each node, choose the feature and threshold that produce the purest split, then recurse on each subset.

Univariate splits

Numeric $x_i$: binary split on threshold $x_i > w_m$. Try all midpoints between consecutive values as candidates.
Discrete $x_i$ with $n$ values: $n$-way split (one branch per value).

Multivariate splits

Split on a linear combination of features: $w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \geq b$. Forms oblique boundaries in feature space. More expressive but less interpretable.

Greedy ≠ Optimal

The greedy strategy picks the locally best split at each step. This does not guarantee the globally optimal tree — finding that is NP-hard. In practice, greedy trees are competitive and computationally tractable.

04 · Splitting Criteria

Impurity Measures

At each node, we measure how "mixed" the class labels are. The goal is to choose the split that maximizes the reduction in impurity. Let $p_i^m$ be the proportion of class $i$ examples at node $m$; the formulas below drop the node index and write $p_i$.

Entropy

$-\sum_i p_i \log_2 p_i$

From information theory. Maximum at uniform distribution, zero at purity. The basis of Information Gain.

ID3 · C4.5

Gini Index

$1 - \sum_i p_i^2$

Probability of misclassifying a random sample if we label it at random according to the node's class distribution. Cheaper to compute than entropy because it needs no logarithm.

CART (preferred)

Misclassification Error

$1 - \max_i(p_i)$

Fraction of examples not belonging to the majority class. Simple but less sensitive to class distribution changes — rarely used for splitting.

Rarely used
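The three measures can be compared side by side with a few lines of code. This is a minimal sketch; the function names and the example distribution are illustrative.

```python
import math
from typing import Sequence

def entropy(p: Sequence[float]) -> float:
    """Entropy in bits; terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p: Sequence[float]) -> float:
    return 1 - sum(pi * pi for pi in p)

def misclassification(p: Sequence[float]) -> float:
    return 1 - max(p)

# All three peak at the uniform distribution and vanish at a pure node.
for dist in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(dist, round(entropy(dist), 3), round(gini(dist), 3),
          round(misclassification(dist), 3))
```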
Information Gain: Best Split Selection

$$\text{Gain}(m, \text{split}) = I_m - \sum_{j} \frac{N_{mj}}{N_m} \cdot I_{mj}$$

where $I_m$ is the impurity at node $m$ before splitting, $N_m$ is the number of examples at node $m$, and $I_{mj}$ is the impurity in child node $j$ containing $N_{mj}$ examples. Choose the split that maximizes Gain.
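As a worked instance of the gain formula with entropy as the impurity, the sketch below scores splits from per-class counts. The counts are made up for illustration and the function names are hypothetical.

```python
import math
from typing import List

def entropy_from_counts(counts: List[int]) -> float:
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts: List[int], children_counts: List[List[int]]) -> float:
    """I_m minus the size-weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy_from_counts(child)
                   for child in children_counts)
    return entropy_from_counts(parent_counts) - weighted

# Parent node with 5 examples of each class, three candidate splits.
print(round(gain([5, 5], [[4, 1], [1, 4]]), 4))   # fairly informative split
print(round(gain([5, 5], [[5, 0], [0, 5]]), 4))   # perfect split
print(round(gain([5, 5], [[3, 3], [2, 2]]), 4))   # uninformative split
```

The perfect split recovers the full parent entropy of 1 bit, while a split whose children keep the parent's class ratio gains nothing.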

Pure Node = Leaf

A node is pure when all its examples belong to a single class ($p_i = 1$ for some $i$, entropy = 0, Gini = 0). Pure nodes become leaves — no further splitting needed.

05 · Algorithm

Tree Generation Algorithm

function GenerateTree(X):
    if Entropy(X) < θ_I:                  // node is pure enough
        create leaf → majority class
        return
    i ← SplitAttribute(X)                 // find best feature
    for each branch of x_i:
        X_i ← examples falling in branch
        GenerateTree(X_i)                 // recurse

function SplitAttribute(X):
    MinEnt ← MAX
    for each attribute i = 1..d:
        if x_i is discrete (n values):
            split X into X_1..X_n
            e ← SplitEntropy(X_1..X_n)
            if e < MinEnt: MinEnt ← e; best ← i
        else:                             // numeric
            for each candidate threshold:
                split X into X_1, X_2
                e ← SplitEntropy(X_1, X_2)
                if e < MinEnt: MinEnt ← e; best ← i
    return best

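The pseudocode can be turned into a small runnable sketch. The version below is one possible reading, not a reference implementation: examples are (features, label) pairs, string-valued features are treated as discrete, a leaf is created when entropy is at or below the threshold (so that the default threshold of 0 still stops at pure nodes), and the dictionary-based tree layout is an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(groups):
    """Size-weighted entropy of the groups produced by a split."""
    groups = list(groups)
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * entropy([y for _, y in g]) for g in groups)

def candidate_splits(data, i):
    values = sorted({x[i] for x, _ in data})
    if isinstance(values[0], str):
        yield None                          # discrete: one n-way split
    else:
        for lo, hi in zip(values, values[1:]):
            yield (lo + hi) / 2             # numeric: midpoints

def branch_key(x, i, threshold):
    return x[i] if threshold is None else x[i] > threshold

def partition(data, i, threshold):
    groups = {}
    for x, y in data:
        groups.setdefault(branch_key(x, i, threshold), []).append((x, y))
    return groups

def split_attribute(data):
    best, min_ent = None, math.inf
    for i in range(len(data[0][0])):
        for threshold in candidate_splits(data, i):
            groups = partition(data, i, threshold)
            if len(groups) < 2:
                continue                    # split separates nothing
            e = split_entropy(groups.values())
            if e < min_ent:
                min_ent, best = e, (i, threshold)
    return best

def generate_tree(data, theta=0.0):
    labels = [y for _, y in data]
    best = None if entropy(labels) <= theta else split_attribute(data)
    if best is None:                        # pure enough, or nothing to split on
        return Counter(labels).most_common(1)[0][0]
    i, threshold = best
    children = {key: generate_tree(group, theta)
                for key, group in partition(data, i, threshold).items()}
    return {"attr": i, "threshold": threshold, "children": children}

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["children"][branch_key(x, tree["attr"], tree["threshold"])]
    return tree

# Toy data: (age, job_type) -> label; the label depends only on age.
data = [((25, "A"), "no"), ((30, "B"), "no"), ((35, "A"), "no"),
        ((45, "B"), "yes"), ((50, "A"), "yes"), ((60, "B"), "yes")]
tree = generate_tree(data)
print(tree)
```

On this toy data the learner picks the numeric feature with threshold 40.0, since that split leaves both children pure. A discrete value never seen in training would raise a KeyError in `predict`; a fuller implementation would fall back to the majority class.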
Candidate Thresholds for Numeric Features

Sort the unique values of the feature. Compute the midpoint between each consecutive pair. Each midpoint is a candidate threshold. This gives $O(N)$ candidates per feature — evaluate all of them and pick the one minimizing split entropy.
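The midpoint construction is short enough to show directly. A minimal sketch, with a hypothetical function name:

```python
from typing import List, Sequence

def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """Midpoints between consecutive distinct values of one feature."""
    unique = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(unique, unique[1:])]

print(candidate_thresholds([3, 1, 2, 2, 5]))   # [1.5, 2.5, 4.0]
```

Duplicates collapse before the midpoints are taken, so a feature with $k$ distinct values yields $k - 1$ candidates, which is at most $N - 1$.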

06 · Regression

Regression Trees

When the target $r^t \in \mathbb{R}$ is continuous, we build a regression tree. The mechanics are identical to classification trees, but impurity and leaf predictions change.

Classification tree

Splitting: minimize entropy / Gini.
Leaf prediction: majority class or class probability distribution.

Regression tree

Splitting: minimize sum of squared errors (SSE).
Leaf prediction: mean of target values in that region.

Regression Tree: Leaf Prediction and Node Error

$$g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)} \qquad E_m = \frac{1}{N_m}\sum_t (r^t - g_m)^2\, b_m(x^t)$$

where $b_m(x^t) = 1$ if example $x^t$ reaches node $m$, 0 otherwise. $g_m$ is the mean prediction; $E_m$ is the node's MSE.

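Split Selection: Sum of Squared Errors

$$\text{SSE} = \sum_{i \in \text{left}} (y_i - \bar{y}_{\text{left}})^2 + \sum_{i \in \text{right}} (y_i - \bar{y}_{\text{right}})^2$$

Here $y_i$ is the target value of example $i$ (written $r^t$ above) and $\bar{y}_{\text{left}}$, $\bar{y}_{\text{right}}$ are the child means, i.e. the leaf predictions $g_m$ of the two children. For each candidate threshold, compute SSE for the left and right child nodes. Choose the threshold that minimizes total SSE.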
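The threshold search for a single numeric feature can be sketched as follows. The data are made up, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(xs: Sequence[float],
                          ys: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, total SSE) of the best midpoint split, or None."""
    unique = sorted(set(xs))
    best = None
    for a, b in zip(unique, unique[1:]):
        t = (a + b) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total = best_regression_split(xs, ys)
print(threshold, round(total, 3))
```

The chosen threshold falls in the gap between the two clusters, and each child would predict the mean of its targets.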

07 · Extensions

Multivariate Trees

Standard trees split on a single feature at a time, creating axis-aligned boundaries. Multivariate trees split on linear combinations of features, forming oblique hyperplanes:

Multivariate Split Condition

$$w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \geq b$$

✅ Advantages

Captures feature interactions directly. Can form more compact trees by using oblique boundaries that align with the true decision surface. Fewer nodes needed for complex patterns.

❌ Disadvantages

Less interpretable — the split condition involves multiple features simultaneously. Computationally more intensive to find optimal weights. Harder to explain to domain experts.

Connection to Perceptrons

A multivariate split such as $w_1 x_1 + w_2 x_2 + w_0 \geq 0$ (the split condition with $w_0 = -b$) is exactly a linear classifier (perceptron) at each node. Multivariate trees can be viewed as hierarchically composed linear classifiers, a bridge between symbolic and connectionist AI.
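A node test of this form is a dot product followed by a comparison. The sketch below shows that an axis-aligned split is the special case where the weight vector has a single non-zero entry; the weights and points are illustrative, and learning the weights is not shown.

```python
from typing import Sequence

def oblique_test(x: Sequence[float], w: Sequence[float], b: float) -> bool:
    """True when the example falls on the w.x >= b side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= b

# Oblique boundary x1 + x2 >= 1.
print(oblique_test((0.8, 0.4), (1.0, 1.0), 1.0))   # True
print(oblique_test((0.2, 0.3), (1.0, 1.0), 1.0))   # False

# Axis-aligned split x1 >= 0.5 expressed in the same form.
print(oblique_test((0.8, 0.4), (1.0, 0.0), 0.5))   # True
```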

08 · Generalization

Overfitting in Decision Trees

A fully grown tree, one that keeps splitting until every leaf is pure, will severely overfit the training data. It memorizes noise and can grow to thousands of nodes while generalizing poorly.

📉

Symptoms

Very low training error but high validation/test error. A complex tree with many leaves, each covering only a handful of training examples.

🩹

Root causes

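The greedy algorithm keeps splitting as long as some split reduces training impurity, and one usually exists. With enough splits, any training set without contradictory examples (identical features, different labels) can be classified perfectly, even pure noise.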

Key Difference from Neural Networks

Trees overfit by adding more nodes; neural networks tend to overfit by growing weight magnitudes. The cure is the same in spirit (complexity control), but the mechanism differs: pruning for trees, regularization or early stopping for networks.

09 · Regularization

Pruning Strategies

Pruning controls tree complexity to prevent overfitting. Two approaches exist: stop early (pre-pruning) or grow fully then remove (post-pruning).

🛑

Pre-pruning (Early Stopping)

Prevent the tree from growing too large. Stop splitting when:

• Max depth reached
• Node entropy below threshold $\theta_I$
• Node has fewer than min-samples-per-leaf examples
• Split improvement below threshold
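The four stopping rules can be combined into a single check that a tree learner calls before splitting a node. This is a sketch: the parameter names and default values are hypothetical, and the order in which the rules are tested is an arbitrary choice.

```python
import math
from collections import Counter
from typing import Optional, Sequence

def entropy(labels: Sequence[str]) -> float:
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def should_stop(labels: Sequence[str], depth: int, best_gain: float, *,
                max_depth: int = 5, theta: float = 0.1,
                min_samples: int = 5, min_gain: float = 0.01) -> Optional[str]:
    """Return the name of the first stopping rule that fires, or None."""
    if depth >= max_depth:
        return "max depth"
    if entropy(labels) < theta:
        return "entropy below threshold"
    if len(labels) < min_samples:
        return "too few samples"
    if best_gain < min_gain:
        return "gain below threshold"
    return None

print(should_stop(["a", "b"] * 5, depth=1, best_gain=0.3))    # None: keep splitting
print(should_stop(["a"] * 10, depth=1, best_gain=0.5))        # entropy below threshold
```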

✂️

Post-pruning

Grow the full tree, then remove branches that don't help generalization:

• Minimum error: prune subtrees using a held-out pruning set; keep the subtree that minimizes validation error
• Smallest error + 1 SE: prune to the simplest tree within one standard error of minimum error
• Unbalanced subtree: prune branches far larger than siblings
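The difference between the minimum-error rule and the one-standard-error rule shows up when choosing among already pruned candidate trees. In the sketch below each candidate is a tuple of (number of leaves, validation error, standard error of that error); the numbers are invented for illustration, and how the candidates are generated is not shown.

```python
from typing import List, Tuple

Candidate = Tuple[int, float, float]   # (n_leaves, val_error, std_error)

def min_error_choice(candidates: List[Candidate]) -> Candidate:
    return min(candidates, key=lambda c: c[1])

def one_se_choice(candidates: List[Candidate]) -> Candidate:
    """Simplest tree whose error is within one SE of the minimum error."""
    best = min_error_choice(candidates)
    limit = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])

# Hypothetical pruning sequence, from the full tree down to a stump-like tree.
candidates = [(20, 0.120, 0.015), (12, 0.115, 0.015),
              (6, 0.125, 0.016), (3, 0.160, 0.018)]
print(min_error_choice(candidates))   # (12, 0.115, 0.015)
print(one_se_choice(candidates))      # (6, 0.125, 0.016)
```

Here the 6-leaf tree is preferred under the one-SE rule because its error is statistically indistinguishable from the minimum while the tree is half the size.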

Pre vs. Post in Practice

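Post-pruning generally achieves better results because it sees the tree's full structure before cutting: pre-pruning can halt at a split that looks weak on its own but would have enabled useful splits further down. Pre-pruning is faster and uses less memory, which is valuable for very large datasets. CART uses post-pruning (cost-complexity pruning), and C4.5 also post-prunes (error-based pruning); ID3 uses pre-pruning.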

10 · Interpretability

Rule Extraction from Trees

Every path from root to leaf in a decision tree is a conjunction of conditions — a rule. Extracting all such paths gives a complete, equivalent rule set.

// Rules extracted from a credit-scoring tree:
if age > 38.5 and years_in_job > 2.5 then y = 0.8
if age > 38.5 and years_in_job ≤ 2.5 then y = 0.6
if age ≤ 38.5 and job_type = 'A' then y = 0.4
if age ≤ 38.5 and job_type = 'B' then y = 0.3
if age ≤ 38.5 and job_type = 'C' then y = 0.2

  • One rule per leaf — the rule's antecedent is the conjunction of all conditions on the path from root to that leaf
  • Rules are mutually exclusive (any example satisfies exactly one rule) and exhaustive (every example satisfies some rule)
  • Individual rules can be simplified by removing conditions that don't change the leaf's prediction (pruning conditions)
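Path enumeration is a short recursive traversal. The sketch below encodes the credit-scoring tree behind the rules above as nested dictionaries (this encoding is an assumption made for the example) and emits one rule per leaf.

```python
from typing import List, Tuple

# Internal node: {"branches": [(condition, subtree), ...]}; leaf: a number.
TREE = {"branches": [
    ("age > 38.5", {"branches": [
        ("years_in_job > 2.5", 0.8),
        ("years_in_job <= 2.5", 0.6),
    ]}),
    ("age <= 38.5", {"branches": [
        ("job_type = 'A'", 0.4),
        ("job_type = 'B'", 0.3),
        ("job_type = 'C'", 0.2),
    ]}),
]}

def extract_rules(tree, path: Tuple[str, ...] = ()) -> List[Tuple[List[str], float]]:
    """Collect (conditions on the path, leaf value) for every leaf."""
    if not isinstance(tree, dict):
        return [(list(path), tree)]
    rules = []
    for condition, subtree in tree["branches"]:
        rules.extend(extract_rules(subtree, path + (condition,)))
    return rules

rules = extract_rules(TREE)
for conditions, y in rules:
    print("if " + " and ".join(conditions) + f" then y = {y}")
```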
Tree vs. Rule Induction

Tree induction works breadth-first: it builds the whole tree top-down and generates all paths at once. Rule induction works depth-first: it learns one rule (one path) at a time, each targeting a subset of the positive examples. Induced rules can overlap (an example may satisfy several rules) and are often more compact than an equivalent tree.

11 · Rule Learning

Rule Induction — IREP & RIPPER

Rather than extracting rules from a tree, rule induction algorithms learn rules directly from data, using the Minimum Description Length (MDL) principle to balance fit against complexity.

MDL Principle

Choose the rule set that gives the shortest total description of the data: description length of the rules + description length of the errors they make. This formalizes Occam's Razor — simpler rule sets are preferred unless extra rules genuinely compress the data.

LearnRuleSet — RIPPER (Cohen, 1995)

function LearnRuleSet(Pos, Neg):
    RuleSet ← ∅
    DL ← DescriptionLength(RuleSet, Pos, Neg)
    repeat:
        Rule ← LearnRule(Pos, Neg)        // grow one rule
        Add Rule to RuleSet
        DL' ← DescriptionLength(RuleSet, Pos, Neg)
        if DL' > DL + c:                  // exceeds tolerance c (64 bits in Cohen's RIPPER): stop
            PruneRuleSet(RuleSet, Pos, Neg)
            return RuleSet
        if DL' < DL:                      // improvement
            DL ← DL'
        Delete instances covered by Rule from Pos and Neg
    until Pos = ∅
    return RuleSet
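The outer loop is an instance of sequential covering: learn a rule, remove what it covers, repeat. The sketch below shows only that skeleton and is deliberately not RIPPER: it has no description-length test, no grow/prune split, and no optimization pass, and its `learn_rule` is a simple stand-in that greedily adds the `feature == value` condition with the best precision. All names and data are hypothetical.

```python
from typing import Dict, List

Example = Dict[str, str]
Rule = Dict[str, str]          # conjunction of feature == value conditions

def covers(rule: Rule, x: Example) -> bool:
    return all(x[f] == v for f, v in rule.items())

def learn_rule(pos: List[Example], neg: List[Example], features: List[str]) -> Rule:
    """Add conditions greedily until no negative example is covered."""
    rule: Rule = {}
    while neg:
        best = None
        for f in features:
            if f in rule:
                continue
            for v in sorted({x[f] for x in pos}):
                p = sum(x[f] == v for x in pos)
                n = sum(x[f] == v for x in neg)
                score = p / (p + n)                 # precision of the condition
                if best is None or (score, p) > (best[0], best[1]):
                    best = (score, p, f, v)
        if best is None:                            # no features left to add
            break
        _, _, f, v = best
        rule[f] = v
        pos = [x for x in pos if x[f] == v]
        neg = [x for x in neg if x[f] == v]
    return rule

def learn_rule_set(pos: List[Example], neg: List[Example],
                   features: List[str]) -> List[Rule]:
    rule_set = []
    while pos:                                      # until Pos is empty
        rule = learn_rule(pos, neg, features)
        rule_set.append(rule)
        pos = [x for x in pos if not covers(rule, x)]
        neg = [x for x in neg if not covers(rule, x)]
    return rule_set

pos = [{"outlook": "sunny", "windy": "no"},
       {"outlook": "overcast", "windy": "no"},
       {"outlook": "overcast", "windy": "yes"}]
neg = [{"outlook": "sunny", "windy": "yes"},
       {"outlook": "rain", "windy": "yes"}]
rule_set = learn_rule_set(pos, neg, ["outlook", "windy"])
print(rule_set)
```

Every learned rule covers at least one remaining positive example, so the loop terminates. The two rules found here overlap on the overcast, non-windy example, which illustrates why induced rule sets need not be mutually exclusive.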

OptimizeRuleSet — RIPPER's Second Pass

After learning, RIPPER iterates over each rule and considers two alternatives:

  • Replace: grow and prune a completely new rule from scratch in place of the current one. Keep it if it shortens the total description length.
  • Revise: start from the existing rule, greedily add conditions to it, then prune. Keep it if it shortens the total description length.
  • If neither improves DL, keep the original rule unchanged.
RIPPER vs. Decision Trees

RIPPER often produces more compact models than trees for the same data, especially with many irrelevant features. However, trees are generally faster to learn and easier to visualize. Both are competitive in accuracy on tabular data — the right choice depends on interpretability requirements and dataset characteristics.


Remaining Lectures

The next topics include Kernel Machines (SVMs), Lazy Learning (k-NN), Combining Methods (ensembles: bagging, boosting, random forests), and Unsupervised Learning. The exam covers all lectures and practical sessions.