Rule-Based Learners & Decision Trees
Interpretable models that partition the feature space into regions using explicit if-then rules. Decision trees combine statistical learning with human-readable logic, striking a balance between expressiveness and explainability.
Rule-Based Methods
Pre-defined rules of the form "if condition then action" determine outcomes. The system evaluates each rule against incoming data and executes the associated action when conditions are met.
Example: if temperature > 38°C AND cough present → suspect infection
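The evaluate-then-act loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production rule engine: the `Rule` class, the `fire_rules` helper, the fact names (`temperature_c`, `cough`) and the second rule are all hypothetical, chosen only to mirror the example rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, object]

@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]  # the "if" part
    action: str                         # the "then" part

# Hypothetical rule base; the first rule mirrors the example above.
RULES = [
    Rule("fever_and_cough",
         lambda f: f["temperature_c"] > 38 and f["cough"],
         "suspect infection"),
    Rule("fever_only",
         lambda f: f["temperature_c"] > 38,
         "monitor temperature"),
]

def fire_rules(facts: Facts, rules: List[Rule] = RULES) -> List[str]:
    """Return the action of every rule whose condition holds, in rule order."""
    return [rule.action for rule in rules if rule.condition(facts)]

print(fire_rules({"temperature_c": 39.2, "cough": True}))
print(fire_rules({"temperature_c": 36.8, "cough": True}))
```

Because the rules are plain data, the list of fired rule names doubles as an explanation of the outcome, which is the traceability property discussed next.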
Key Characteristics
Interpretability
Every decision is fully traceable. You can always explain exactly why a prediction was made — critical in regulated domains.
Determinism
Same input always yields the same output. No stochasticity in inference, making behaviour predictable and auditable.
Grouping assumption
The feature space can be partitioned into well-defined regions. Boundaries between groups must be capturable by discrete conditions.
Variants of Rule-Based Methods
- Expert systems — rules hand-crafted by domain experts. Used in game AI, clinical decision support, regulatory compliance.
- Decision trees — rules learned from data by recursive splitting. The main focus of this lecture.
- Fuzzy rule-based systems — extend crisp conditions with fuzzy logic to handle uncertainty and gradual membership.
- Association rule mining — Apriori and FP-Growth (which builds an FP-tree) discover rules from large unlabeled datasets (covered in Lecture 2).
Comparison of ML Approaches
| Approach | Mechanism | Knowledge source | Characteristics |
|---|---|---|---|
| Rule-Based | Explicit if-then rules | Human expertise | Highly interpretable; expert systems; regulatory domains |
| Bayesian | Probabilistic reasoning | Prior + data | Uncertainty modelling; spam filtering; medical diagnosis |
| Statistical | Patterns from data | Data-driven | Balanced interpretability and performance |
| Deep Learning | Learned features (black-box) | Data | High accuracy; needs lots of data; less interpretable |
What Is a Decision Tree?
A decision tree is a supervised learning model that uses a hierarchical series of tests on features to arrive at a prediction. It recursively splits the dataset into increasingly homogeneous subsets.
Internal nodes
Tests on features: "Age < 30?". Each node applies a condition and routes examples to a child branch based on the outcome.
Branches
Outcomes of the test at the parent node. Binary (yes/no) for numeric features; multi-way for discrete features.
Leaves
Final predictions: class label or probability distribution (classification) or a numeric value (regression).
Figure: example credit-scoring tree. Rounded rectangles are decision nodes; sharp-cornered rectangles are leaves.
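The three building blocks (internal nodes, branches, leaves) map directly onto a small recursive data structure. The sketch below assumes binary tests of the form `x[feature] < threshold`; the `Leaf` and `Node` classes and the tiny credit tree with its thresholds are hypothetical and only illustrate how a prediction is obtained by walking from the root to a leaf.

```python
from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    prediction: str            # final class label

@dataclass
class Node:
    feature: str               # feature tested at this internal node
    threshold: float           # the test is "x[feature] < threshold"
    if_true: "Tree"            # branch taken when the test succeeds
    if_false: "Tree"           # branch taken when the test fails

Tree = Union[Leaf, Node]

def predict(tree: Tree, x: Dict[str, float]) -> str:
    """Follow the branch chosen by each test until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.if_true if x[tree.feature] < tree.threshold else tree.if_false
    return tree.prediction

# Hypothetical credit-scoring tree (made-up thresholds).
credit_tree = Node("age", 30,
                   if_true=Node("income", 40000, Leaf("reject"), Leaf("accept")),
                   if_false=Leaf("accept"))

print(predict(credit_tree, {"age": 25, "income": 30000}))
print(predict(credit_tree, {"age": 45, "income": 10000}))
```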
Divide and Conquer
Decision trees are built using a greedy divide-and-conquer strategy: at each node, choose the feature and threshold that produce the purest split, then recurse on each subset.
Univariate splits
Numeric $x_i$: binary split on threshold $x_i > w_m$. Try all midpoints between consecutive values as candidates.
Discrete $x_i$ with $n$ values: $n$-way split (one branch per value).
Multivariate splits
Split on a linear combination of features: $w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \geq b$. Forms oblique boundaries in feature space. More expressive but less interpretable.
The greedy strategy picks the locally best split at each step. This does not guarantee the globally optimal tree — finding that is NP-hard. In practice, greedy trees are competitive and computationally tractable.
Impurity Measures
At each node, we measure how "mixed" the class labels are. The goal is to choose the split that maximizes the reduction in impurity. Let $p_i^m$ be the proportion of class $i$ examples at node $m$ (written $p_i$ when the node is clear from context).
Entropy (ID3 · C4.5)
$-\sum_i p_i \log_2 p_i$
From information theory. Maximum at the uniform distribution, zero at purity. The basis of Information Gain.
Gini Index (CART, preferred)
$1 - \sum_i p_i^2$
Probability of misclassifying a random sample if it is labelled at random according to the node's class distribution. Computationally cheaper than entropy because it needs no logarithm.
Misclassification Error (rarely used)
$1 - \max_i(p_i)$
Fraction of examples not belonging to the majority class. Simple, but less sensitive to changes in the class distribution, so it is rarely used for splitting.
$\text{Gain} = I_m - \sum_j \frac{N_{mj}}{N_m} I_{mj}$
where $I_m$ is the impurity at node $m$ before splitting, $I_{mj}$ is the impurity in child node $j$ containing $N_{mj}$ examples, and $N_m = \sum_j N_{mj}$ is the number of examples at node $m$. Choose the split that maximizes Gain.
A node is pure when all its examples belong to a single class ($p_i = 1$ for some $i$, entropy = 0, Gini = 0). Pure nodes become leaves — no further splitting needed.
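To make the three measures concrete, here is a small sketch that computes each of them from a list of class labels and then scores a split. It assumes the usual definition of gain as the parent's impurity minus the size-weighted average impurity of the children; the function names and the toy labels are illustrative only.

```python
from collections import Counter
from math import log2

def proportions(labels):
    """Class proportions p_i at a node, given the labels of its examples."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    return sum(-p * log2(p) for p in proportions(labels))

def gini(labels):
    return 1 - sum(p * p for p in proportions(labels))

def misclassification(labels):
    return 1 - max(proportions(labels))

def gain(parent, children, impurity):
    """Impurity before the split minus the size-weighted impurity after it."""
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)

# Toy node with 4 "yes" and 4 "no", split into two children of 4 examples each.
parent = ["yes"] * 4 + ["no"] * 4
split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]
for measure in (entropy, gini, misclassification):
    print(measure.__name__, round(gain(parent, split, measure), 4))
```

A pure child contributes zero to the weighted sum under all three measures, so a split that separates the classes perfectly recovers the full parent impurity as gain.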
Tree Generation Algorithm
Sort the unique values of the feature. Compute the midpoint between each consecutive pair. Each midpoint is a candidate threshold. This gives $O(N)$ candidates per feature — evaluate all of them and pick the one minimizing split entropy.
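The threshold search just described can be sketched as follows. This is a deliberately simple version that rescans the data for every candidate (quadratic in the number of examples); it assumes the split convention $x_i > w_m$ for the right branch, and the helper names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def best_threshold(values, labels):
    """Return (threshold, split entropy) minimizing the weighted child entropy."""
    n = len(values)
    best_w, best_entropy = None, float("inf")
    for w in candidate_thresholds(values):
        left = [y for x, y in zip(values, labels) if x <= w]
        right = [y for x, y in zip(values, labels) if x > w]
        split_entropy = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if split_entropy < best_entropy:
            best_w, best_entropy = w, split_entropy
    return best_w, best_entropy

ages = [22, 25, 30, 35, 40, 52]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(candidate_thresholds(ages))
print(best_threshold(ages, labels))
```

Practical implementations usually sort once and update class counts incrementally while sweeping the thresholds, which avoids the repeated scans used here for clarity.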
Regression Trees
When the target $r^t \in \mathbb{R}$ is continuous, we build a regression tree. The mechanics are identical to classification trees, but impurity and leaf predictions change.
Classification tree
Splitting: minimize entropy / Gini.
Leaf prediction: majority class or class probability distribution.
Regression tree
Splitting: minimize sum of squared errors (SSE).
Leaf prediction: mean of target values in that region.
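$g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)}$
$E_m = \frac{1}{N_m} \sum_t (r^t - g_m)^2\, b_m(x^t)$
where $b_m(x^t) = 1$ if example $x^t$ reaches node $m$, 0 otherwise, and $N_m = \sum_t b_m(x^t)$ is the number of examples at node $m$. $g_m$ is the mean prediction; $E_m$ is the node's MSE.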
For each candidate threshold, compute SSE for the left and right child nodes. Choose the threshold that minimizes total SSE.
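The same search, with SSE in place of entropy, gives a one-feature regression split. The sketch below assumes midpoint candidates and mean-valued leaves as described above; the function names and the toy data are illustrative.

```python
def sse(targets):
    """Sum of squared errors around the mean (the leaf prediction)."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((r - mean) ** 2 for r in targets)

def best_regression_split(xs, rs):
    """Return (threshold, total SSE) for the best split of one numeric feature."""
    distinct = sorted(set(xs))
    best_w, best_total = None, float("inf")
    for a, b in zip(distinct, distinct[1:]):
        w = (a + b) / 2                      # candidate threshold (midpoint)
        left = [r for x, r in zip(xs, rs) if x <= w]
        right = [r for x, r in zip(xs, rs) if x > w]
        total = sse(left) + sse(right)
        if total < best_total:
            best_w, best_total = w, total
    return best_w, best_total

xs = [1, 2, 3, 10, 11, 12]
rs = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
print(best_regression_split(xs, rs))
```

On this toy data the two groups of targets are far apart, so the chosen threshold falls in the gap between them and each side is predicted by its own mean.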
Multivariate Trees
Standard trees split on a single feature at a time, creating axis-aligned boundaries. Multivariate trees split on linear combinations of features, forming oblique hyperplanes of the form $w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \geq b$.
✅ Advantages
Captures feature interactions directly. Can form more compact trees by using oblique boundaries that align with the true decision surface. Fewer nodes needed for complex patterns.
❌ Disadvantages
Less interpretable — the split condition involves multiple features simultaneously. Computationally more intensive to find optimal weights. Harder to explain to domain experts.
A multivariate split $w_1 x_1 + w_2 x_2 + w_0 \geq 0$ is exactly a linear classifier (perceptron) at each node. Multivariate trees can be viewed as hierarchically composed linear classifiers — a bridge between symbolic and connectionist AI.
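A small experiment illustrates the difference. The sketch below builds a toy grid of points whose true class is $x_1 + x_2 \geq 1$ and compares a single oblique test against the best single axis-aligned test. The data, the weights `(1.0, 1.0)` and the threshold grid are assumptions made for the illustration; no weight learning is shown.

```python
from itertools import product

# Toy data: a 5x5 grid on [0, 1]^2, labelled by a diagonal boundary.
grid = [i / 4 for i in range(5)]
points = list(product(grid, grid))
labels = [x1 + x2 >= 1.0 for x1, x2 in points]

def accuracy(predict):
    return sum(predict(p) == y for p, y in zip(points, labels)) / len(points)

def oblique(p, w=(1.0, 1.0), b=1.0):
    """Multivariate test: w1*x1 + w2*x2 >= b."""
    return w[0] * p[0] + w[1] * p[1] >= b

def best_axis_aligned_accuracy():
    """Best accuracy of any single univariate test x_i > t (either polarity)."""
    best = 0.0
    for feature in (0, 1):
        for t in [-0.125 + 0.25 * k for k in range(6)]:  # thresholds between grid values
            acc = accuracy(lambda p: p[feature] > t)
            best = max(best, acc, 1 - acc)               # 1 - acc is the flipped test
    return best

print("oblique split:", accuracy(oblique))
print("best axis-aligned split:", best_axis_aligned_accuracy())
```

One oblique node separates the classes exactly, whereas an axis-aligned tree needs a staircase of several splits to approximate the same diagonal boundary.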
Overfitting in Decision Trees
A fully grown tree, one that keeps splitting until every leaf is pure, typically overfits the training data severely. It memorizes noise, can grow to thousands of nodes on large noisy datasets, and generalizes poorly.
Symptoms
Very low training error but high validation/test error. A complex tree with many leaves, each covering only a handful of training examples.
Root causes
The greedy algorithm keeps splitting as long as some split reduces training impurity. With enough splits, any training set without contradictory examples (identical inputs with different labels) can be classified perfectly, even if the labels are pure noise.
Trees overfit by adding more nodes; neural networks overfit by increasing weight magnitudes. The cure is the same in spirit — complexity control — but the mechanism differs: pruning for trees, regularization/early stopping for networks.
Pruning Strategies
Pruning controls tree complexity to prevent overfitting. Two approaches exist: stop early (pre-pruning) or grow fully then remove (post-pruning).
Pre-pruning (Early Stopping)
Prevent the tree from growing too large. Stop splitting when:
• Max depth reached
• Node entropy below threshold $\theta_I$
• Node has fewer than min-samples-per-leaf examples
• Split improvement below threshold
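The four stopping criteria can be combined into a single check that runs before each split. This is a sketch only: the function name `stop_reason`, its parameter names and the default values are hypothetical, and real libraries expose similar controls under their own names.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def stop_reason(labels, depth, best_gain,
                max_depth=5, theta_i=0.1, min_samples=5, min_gain=0.01):
    """Return why the node should become a leaf, or None to keep splitting."""
    if depth >= max_depth:
        return "max depth reached"
    if entropy(labels) < theta_i:
        return "node entropy below threshold"
    if len(labels) < min_samples:
        return "too few examples"
    if best_gain < min_gain:
        return "split improvement below threshold"
    return None

print(stop_reason(["a"] * 10, depth=1, best_gain=0.5))
print(stop_reason(["a", "b"] * 5, depth=1, best_gain=0.3))
```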
Post-pruning
Grow the full tree, then remove branches that don't help generalization:
• Minimum error: prune subtrees using a held-out pruning set; keep the subtree that minimizes validation error
• Smallest error + 1 SE: prune to the simplest tree within one standard error of minimum error
• Unbalanced subtree: prune branches far larger than siblings
Post-pruning generally achieves better results because it can observe the tree's full structure before cutting. Pre-pruning is faster and uses less memory — valuable for very large datasets. CART uses post-pruning (cost-complexity pruning); ID3 uses pre-pruning.
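As an illustration of post-pruning with a held-out pruning set, the sketch below implements reduced-error pruning: working bottom-up, a subtree is replaced by a leaf whenever the leaf makes no more errors on the pruning set than the subtree does. This is one simple instance of the minimum-error idea, not CART's cost-complexity pruning; the `Node` layout and the toy tree are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    majority: str                    # majority training class at this node
    feature: Optional[str] = None    # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch for x[feature] <= threshold
    right: Optional["Node"] = None   # branch for x[feature] > threshold

def predict(node, x):
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.majority

def errors(node, examples):
    return sum(predict(node, x) != y for x, y in examples)

def prune(node, pruning_set):
    """Bottom-up reduced-error pruning; ties favour the simpler tree."""
    if node.feature is None:
        return node
    left_set = [(x, y) for x, y in pruning_set if x[node.feature] <= node.threshold]
    right_set = [(x, y) for x, y in pruning_set if x[node.feature] > node.threshold]
    node.left = prune(node.left, left_set)
    node.right = prune(node.right, right_set)
    leaf = Node(majority=node.majority)
    if errors(leaf, pruning_set) <= errors(node, pruning_set):
        return leaf
    return node

# Toy tree: the split at x <= 8 was fitted to a noisy training example.
tree = Node("b", "x", 5,
            left=Node("a"),
            right=Node("b", "x", 8, left=Node("b"), right=Node("a")))
pruning_set = [({"x": 1}, "a"), ({"x": 3}, "a"), ({"x": 6}, "b"),
               ({"x": 7}, "b"), ({"x": 9}, "b"), ({"x": 10}, "b")]
pruned = prune(tree, pruning_set)
print(pruned.right.feature, errors(pruned, pruning_set))
```

Note that a node reached by no pruning examples is collapsed under this tie-breaking rule, so the pruning set needs to be large enough to cover the branches that matter.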
Rule Extraction from Trees
Every path from root to leaf in a decision tree is a conjunction of conditions — a rule. Extracting all such paths gives a complete, equivalent rule set.
- One rule per leaf — the rule's antecedent is the conjunction of all conditions on the path from root to that leaf
- Rules are mutually exclusive (no example satisfies more than one rule) and exhaustive (every example satisfies at least one rule), so each example matches exactly one rule
- Individual rules can be simplified by removing conditions that don't change the leaf's prediction (pruning conditions)
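Path-to-rule extraction is a short recursive traversal. The sketch below assumes binary tests of the form `feature < threshold`; the `Leaf` and `Node` classes and the toy credit tree are hypothetical, and no rule simplification is attempted.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Leaf:
    prediction: str

@dataclass
class Node:
    feature: str
    threshold: float
    if_true: "Tree"     # subtree for feature < threshold
    if_false: "Tree"    # subtree for feature >= threshold

Tree = Union[Leaf, Node]
Rule = Tuple[List[str], str]   # (conditions joined by AND, prediction)

def extract_rules(tree: Tree, conditions: Tuple[str, ...] = ()) -> List[Rule]:
    """Emit one rule per leaf: the conjunction of tests on its root-to-leaf path."""
    if isinstance(tree, Leaf):
        return [(list(conditions), tree.prediction)]
    test = f"{tree.feature} < {tree.threshold}"
    negated = f"{tree.feature} >= {tree.threshold}"
    return (extract_rules(tree.if_true, conditions + (test,))
            + extract_rules(tree.if_false, conditions + (negated,)))

credit_tree = Node("age", 30,
                   if_true=Node("income", 40000, Leaf("reject"), Leaf("accept")),
                   if_false=Leaf("accept"))

for conditions, prediction in extract_rules(credit_tree):
    print("IF", " AND ".join(conditions), "THEN", prediction)
```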
Tree induction proceeds breadth-first: it grows all paths of the tree simultaneously, top-down. Rule induction proceeds depth-first: it learns one rule at a time, each covering a subset of the positive examples. Unlike rules extracted from a tree, directly induced rules can overlap (an example may satisfy several rules), and the resulting rule set is often more compact than an equivalent tree.
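The one-rule-at-a-time idea can be sketched as a basic sequential covering loop: grow a rule greedily until it covers no negatives, remove the positives it covers, and repeat. This is a simplified illustration for discrete features only, with no pruning and no description-length criterion, so it should not be read as IREP or RIPPER. The function names, the precision-based scoring and the toy data are all assumptions.

```python
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], bool]   # (discrete features, is_positive)
Rule = Dict[str, str]                   # conjunction of feature == value tests

def covers(rule: Rule, x: Dict[str, str]) -> bool:
    return all(x[f] == v for f, v in rule.items())

def grow_rule(examples: List[Example]) -> Rule:
    """Greedily add the most precise condition until no negatives are covered."""
    rule: Rule = {}
    covered = examples
    while any(not y for _, y in covered):
        best = None                      # (precision, positives, feature, value)
        for x, _ in covered:
            for f, v in x.items():
                if f in rule:
                    continue
                subset = [yy for xx, yy in covered if xx[f] == v]
                score = (sum(subset) / len(subset), sum(subset))
                if best is None or score > best[:2]:
                    best = (*score, f, v)
        if best is None:                 # every feature is already used
            break
        rule[best[2]] = best[3]
        covered = [(x, y) for x, y in covered if covers(rule, x)]
    return rule

def learn_rule_set(examples: List[Example]) -> List[Rule]:
    rules, remaining = [], list(examples)
    while any(y for _, y in remaining):
        rule = grow_rule(remaining)
        rules.append(rule)
        # Remove everything the new rule covers; it covers at least one positive.
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return rules

data = [({"outlook": "sunny", "windy": "no"}, True),
        ({"outlook": "sunny", "windy": "yes"}, True),
        ({"outlook": "rain", "windy": "no"}, True),
        ({"outlook": "rain", "windy": "yes"}, False),
        ({"outlook": "overcast", "windy": "yes"}, False)]
rules = learn_rule_set(data)
print(rules)
```

On this toy data the two learned rules overlap: the first example satisfies both of them, which could not happen with rules extracted from a single tree.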
Rule Induction — IREP & RIPPER
Rather than extracting rules from a tree, rule induction algorithms learn rules directly from data, using the Minimum Description Length (MDL) principle to balance fit against complexity.
MDL Principle
Choose the rule set that gives the shortest total description of the data: description length of the rules + description length of the errors they make. This formalizes Occam's Razor — simpler rule sets are preferred unless extra rules genuinely compress the data.
LearnRuleSet — RIPPER (Cohen, 1995)
OptimizeRuleSet — RIPPER's Second Pass
After learning, RIPPER iterates over each rule and considers two alternatives:
- Replace: grow and prune a completely new rule, starting from an empty rule. Keep it if it shortens the total description length.
- Revise: start from the existing rule, grow it further by adding conditions, then prune it. Keep it if it shortens the total description length.
- If neither improves DL, keep the original rule unchanged.
RIPPER often produces more compact models than trees for the same data, especially with many irrelevant features. However, trees are generally faster to learn and easier to visualize. Both are competitive in accuracy on tabular data — the right choice depends on interpretability requirements and dataset characteristics.
The next topics include Kernel Machines (SVMs), Lazy Learning (k-NN), Combining Methods (ensembles: bagging, boosting, random forests), and Unsupervised Learning. The exam covers all lectures and practical sessions.