Introduction to Decision Tree in Supervised Learning
Introduction
Decision Tree is one of the Supervised Learning algorithms in Machine Learning. Supervised Learning algorithms are the ones that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In practice, the output y may be difficult to collect automatically and must be provided by a human “supervisor”; this step is also called “Data Labelling”. There is a variety of algorithms available in Supervised Learning, broadly classified as “Probabilistic” and “Non-Probabilistic”. The Decision Tree algorithm falls in the “Non-Probabilistic” category, similar to the k-nearest neighbors algorithm.
The Decision Tree algorithm breaks the input space into regions and has a separate set of parameters for each region. Decision Trees can be used for both classification and regression purposes. A decision tree can be imagined as a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of that test, and each leaf node holds a class label. The topmost node in the tree is the root node.

As shown in the diagram above, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one subregion for each child of the node.

The input space is thus subdivided into non-overlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Each leaf requires at least one training example to define, so it is not possible for a decision tree to learn a function that has more local maxima than the number of training examples.
This algorithm can be considered nonparametric if it is allowed to learn a tree of arbitrary size, but typically decision trees are regularised with size constraints that turn them into parametric models in practice.
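To make the idea of regions concrete, here is a small, hedged sketch (not part of the original text) using scikit-learn's DecisionTreeRegressor: a depth-2 tree on one-dimensional data yields a piecewise-constant prediction, one constant value per leaf region.

```python
# Illustrative sketch: a shallow regression tree predicts one constant per leaf/region.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)   # 1-D input space
y = np.sin(X).ravel()                        # target to approximate

reg = DecisionTreeRegressor(max_depth=2)     # at most 4 leaves -> at most 4 regions
reg.fit(X, y)

# Every point that falls into the same leaf receives the same output value
print(np.unique(reg.predict(X)))             # at most 4 distinct predictions
```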
Classification And Regression Trees (CART)
Nowadays, decision tree algorithms are often referred to as CART, which stands for Classification And Regression Trees. CART is a term introduced by Leo Breiman in 1984 to refer to Decision Tree algorithms that can be used for classification and regression modeling problems. The CART algorithm provides a foundation for other important algorithms like bagged decision trees, random forests and boosted decision trees.
Decision Trees are preferred for their stability, reliability and explainability. These algorithms are also very useful when computational resources are constrained.
Terminologies
As described earlier, the algorithm follows a tree-like structure in which each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. The paths from the root node to the leaf nodes represent the classification rules. The following terminologies are involved in the Decision Tree algorithm:
Root Node: Represents the entire input space.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: A node that gets further divided into sub-nodes.
Leaf or Terminal Node: A node that does not split further.
Pruning: The process of removing sub-nodes of a decision node; it is basically the opposite of splitting.
Branch/Sub-Tree: A sub-section of the entire tree.
Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.

The algorithm can be simply stated as follows (a minimal traversal sketch is given after this list):
- For each attribute in the dataset, the algorithm forms a node; the most important attribute is placed at the root node.
- To evaluate the task at hand, the algorithm starts at the root node and works its way down the tree by following the branch that satisfies the corresponding condition or decision.
- The process continues until a leaf node is reached. The leaf node contains the prediction or the outcome of the tree.
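As a purely illustrative sketch (the node structure, feature names and thresholds below are hypothetical, not taken from any particular library), the traversal described above can be written as follows:

```python
# Minimal hand-rolled sketch of prediction by walking from the root to a leaf.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # attribute tested at this internal node
        self.threshold = threshold  # split point for the test
        self.left = left            # subtree followed when the test is satisfied
        self.right = right          # subtree followed otherwise
        self.label = label          # class label if this node is a leaf

def predict(node, sample):
    """Follow the branch whose condition holds until a leaf node is reached."""
    while node.label is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label

# Hypothetical tree: test 'safety' at the root, then 'persons'
root = Node(feature='safety', threshold=1.5,
            left=Node(label='unacc'),
            right=Node(feature='persons', threshold=2.5,
                       left=Node(label='unacc'),
                       right=Node(label='acc')))

print(predict(root, {'safety': 2, 'persons': 4}))  # -> 'acc'
```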
Attribute selection measures
As you may have observed, the primary challenge in a decision tree implementation is to identify the attributes that form the root node and the decision nodes at each level of the tree. This process is known as attribute selection. There are two popular attribute selection measures:
- Information Gain
- Gini Index
Information Gain
The key assumption here is that the attributes or features are categorical. The idea is to estimate the information contained in each attribute. To understand this better, it is important to know the concept of Entropy. Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content.
Information gain is a measure of this change in entropy. It computes the difference between the entropy before the split and the weighted average entropy after splitting the dataset on the given attribute's values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses entropy to calculate the information gain. The attribute with the highest information gain is chosen as the splitting attribute at the node. The mathematical formula to compute Entropy is:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Note: Here, c is the number of classes and p_i is the probability associated with the i-th class.
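To make the computation concrete, here is a small, hedged sketch (not an ID3 implementation, just an illustration with hypothetical toy data) that computes the entropy of a label column and the information gain of a categorical attribute:

```python
# Illustrative sketch: entropy of a label column and information gain of a categorical split.
import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the class probabilities p_i."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, attribute, target):
    """Entropy before the split minus the weighted entropy after splitting on `attribute`."""
    before = entropy(df[target])
    weights = df[attribute].value_counts(normalize=True)
    after = sum(weights[v] * entropy(df.loc[df[attribute] == v, target])
                for v in weights.index)
    return before - after

# Hypothetical toy data
toy = pd.DataFrame({'safety': ['low', 'low', 'high', 'high'],
                    'class':  ['unacc', 'unacc', 'acc', 'unacc']})
print(information_gain(toy, 'safety', 'class'))  # ~0.31
```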
Gini Index
The key assumption here is that the attributes or features are continuous. The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified. Unlike Information Gain, where the attribute with the highest gain is chosen, with the Gini Index the attribute with the lowest index is preferred. The formula to compute the Gini Index is:

$$Gini = 1 - \sum_{i=1}^{c} p_i^2$$

Note: Here, c is the number of classes and p_i is the probability associated with the i-th class.
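Analogously, here is a small, hedged sketch (again with hypothetical toy data) of the Gini impurity and the weighted Gini index of a candidate split; the attribute with the lowest weighted Gini would be preferred:

```python
# Illustrative sketch: Gini impurity of a label column and weighted Gini of a categorical split.
import pandas as pd

def gini(labels):
    """Gini impurity = 1 - sum(p_i^2) over the class probabilities p_i."""
    probs = labels.value_counts(normalize=True)
    return float(1.0 - (probs ** 2).sum())

def gini_of_split(df, attribute, target):
    """Weighted Gini impurity after splitting on `attribute`."""
    weights = df[attribute].value_counts(normalize=True)
    return sum(weights[v] * gini(df.loc[df[attribute] == v, target])
               for v in weights.index)

# Hypothetical toy data (same as the entropy example above)
toy = pd.DataFrame({'safety': ['low', 'low', 'high', 'high'],
                    'class':  ['unacc', 'unacc', 'acc', 'unacc']})
print(gini_of_split(toy, 'safety', 'class'))  # 0.25
```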
Common Types of Decision Tree Algorithms
- Iterative Dichotomiser 3 (ID3): This algorithm uses Information Gain to decide which attribute is to be used to classify the current subset of the data. For each level of the tree, information gain is calculated recursively for the remaining data.
- C4.5: This algorithm is the successor of ID3. It uses either Information Gain or Gain Ratio to decide upon the classifying attribute. It is a direct improvement over ID3, as it can handle both continuous and missing attribute values.
- CART: A dynamic learning algorithm which can produce either a regression tree or a classification tree, depending on the dependent variable.
Overfitting in Decision Tree
Overfitting is a common problem in almost all supervised learning algorithms. The problem is that the algorithm or model memorises the entire training set, resulting in very low training-set error but high test-set error. In a decision tree, overfitting means the model has built many branches due to outliers and irregularities in the data. In practice, the following two approaches are used to avoid overfitting (a short scikit-learn sketch is given after this list):
- Pre-Pruning: Stops the tree construction a bit early; basically, a node is not split if its goodness measure is below a threshold value. In practice it is difficult to choose an appropriate stopping point.
- Post-Pruning: In this case, pruning happens after the entire tree is built and an overfitting problem is observed. Cross-validation is used to check the effect of pruning: using the cross-validation data, check whether expanding a node results in an improvement. If it does, continue expanding the node; otherwise stop the expansion (or splitting) and convert the node into a terminal node.
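As a hedged sketch of both ideas in scikit-learn (the dataset and parameter values here are illustrative and not part of the worked example below): pre-pruning is expressed through size constraints such as max_depth and min_samples_leaf, while post-pruning can be done with cost-complexity pruning (ccp_alpha), keeping the alpha that scores best under cross-validation.

```python
# Illustrative sketch of pre-pruning and cost-complexity post-pruning with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via size constraints (thresholds are illustrative)
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_tr, y_tr)

# Post-pruning: grow a full tree, get candidate alphas from its pruning path,
# then keep the alpha with the best cross-validated score.
path = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).cost_complexity_pruning_path(X_tr, y_tr)
alphas = [a for a in path.ccp_alphas if a >= 0.0]   # guard against tiny negative values
best_alpha = max(alphas,
                 key=lambda a: cross_val_score(
                     DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                     X_tr, y_tr, cv=5).mean())
post_pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)

print(pre_pruned.score(X_te, y_te), post_pruned.score(X_te, y_te))
```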
Example Decision Tree Notebook
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Import Car Evaluation Dataset
# Refer https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', header=None)
df.shape
(1728, 7)
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
# Rename Columns
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df.sample()
| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 313 | vhigh | med | 5more | 4 | big | med | acc |
# Summary of Data
df.info()
df.describe().T
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 buying 1728 non-null object
1 maint 1728 non-null object
2 doors 1728 non-null object
3 persons 1728 non-null object
4 lug_boot 1728 non-null object
5 safety 1728 non-null object
6 class 1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
| | count | unique | top | freq |
|---|---|---|---|---|
| buying | 1728 | 4 | low | 432 |
| maint | 1728 | 4 | low | 432 |
| doors | 1728 | 4 | 4 | 432 |
| persons | 1728 | 3 | 4 | 576 |
| lug_boot | 1728 | 3 | small | 576 |
| safety | 1728 | 3 | low | 576 |
| class | 1728 | 4 | unacc | 1210 |
# Frequency Distribution of Attributes
for col in df.columns:
    print(df[col].value_counts())
low 432
vhigh 432
high 432
med 432
Name: buying, dtype: int64
low 432
vhigh 432
high 432
med 432
Name: maint, dtype: int64
4 432
5more 432
3 432
2 432
Name: doors, dtype: int64
4 576
more 576
2 576
Name: persons, dtype: int64
small 576
med 576
big 576
Name: lug_boot, dtype: int64
low 576
high 576
med 576
Name: safety, dtype: int64
unacc 1210
acc 384
good 69
vgood 65
Name: class, dtype: int64
Data Summary
There are 7 attributes in the dataset. All the attributes are of categorical data type.
These are given by buying, maint, doors, persons, lug_boot, safety and class.
class is the target attribute, which has the following values:
- unacc (70%)
- acc (22.22%)
- good (4%)
- vgood (3.8%)
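The percentages above can be reproduced directly from the dataframe with a one-line check (consistent with the counts printed earlier):

```python
# Reproduce the class distribution percentages quoted above
print((df['class'].value_counts(normalize=True) * 100).round(2))
```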
# check missing values in variables
df.isnull().sum()
buying 0
maint 0
doors 0
persons 0
lug_boot 0
safety 0
class 0
dtype: int64
# Split Feature & Target attributes
X = df.drop(['class'], axis=1)
y = df['class']
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
# Check the shapes of training and testing sets
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1209, 6), (1209,), (519, 6), (519,))
# Encode the categorical variables
import category_encoders as ce
# encode variables with ordinal encoding
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
X_train.head()
| | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 774 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1021 | 2 | 2 | 2 | 1 | 2 | 2 |
| 105 | 3 | 3 | 3 | 1 | 3 | 1 |
| 44 | 3 | 3 | 2 | 2 | 3 | 3 |
| 1374 | 4 | 3 | 4 | 1 | 3 | 1 |
X_test.head()
| | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 27 | 3 | 3 | 2 | 3 | 1 | 1 |
| 1156 | 2 | 4 | 4 | 1 | 2 | 2 |
| 1668 | 4 | 1 | 2 | 1 | 2 | 1 |
| 1622 | 4 | 1 | 1 | 3 | 1 | 3 |
| 692 | 1 | 4 | 2 | 2 | 3 | 3 |
# The dataset is now ready for modelling
# import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# instantiate the DecisionTreeClassifier model with criterion gini index
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
# Fit the model
clf_gini.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=0)
# Predict by using Testing set
y_pred_gini = clf_gini.predict(X_test)
from sklearn.metrics import accuracy_score
print('Model accuracy score with criterion gini index: {0:0.4f}'. format(accuracy_score(y_test, y_pred_gini)))
Model accuracy score with criterion gini index: 0.7842
# Let's compare training and test set accuracy to check whether we have overfitting problem
y_pred_train_gini = clf_gini.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_gini)))
Training-set accuracy score: 0.7750
Here, the training-set accuracy score is 0.7750 while the test-set accuracy is 0.7842. These two values are quite comparable, so there is no sign of overfitting.
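As an optional, hedged sanity check (not part of the original notebook), growing the tree without a depth limit should push training accuracy close to 1.0; a large gap between training and test accuracy would then be the signature of overfitting discussed earlier:

```python
# Sanity check: an unconstrained tree fits the training data much more closely
clf_full = DecisionTreeClassifier(criterion='gini', random_state=0)
clf_full.fit(X_train, y_train)
print('Full-depth tree - train: {0:0.4f}, test: {1:0.4f}'.format(
    clf_full.score(X_train, y_train), clf_full.score(X_test, y_test)))
```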
Visualising Decision Tree
from sklearn import tree
plt.figure(figsize=(12,8))
tree.plot_tree(clf_gini)
[Text(267.84000000000003, 380.52, 'X[5] <= 1.5\ngini = 0.463\nsamples = 1209\nvalue = [274, 48, 840, 47]'),
Text(133.92000000000002, 271.8, 'gini = 0.0\nsamples = 400\nvalue = [0, 0, 400, 0]'),
Text(401.76000000000005, 271.8, 'X[3] <= 2.5\ngini = 0.583\nsamples = 809\nvalue = [274, 48, 440, 47]'),
Text(267.84000000000003, 163.07999999999998, 'X[0] <= 3.5\ngini = 0.628\nsamples = 546\nvalue = [274, 48, 177, 47]'),
Text(133.92000000000002, 54.360000000000014, 'gini = 0.578\nsamples = 406\nvalue = [205, 19, 164, 18]'),
Text(401.76000000000005, 54.360000000000014, 'gini = 0.663\nsamples = 140\nvalue = [69, 29, 13, 29]'),
Text(535.6800000000001, 163.07999999999998, 'gini = 0.0\nsamples = 263\nvalue = [0, 0, 263, 0]')]

Visualising Decision Tree using Graphwiz
import graphviz
dot_data = tree.export_graphviz(clf_gini, out_file=None,
                                feature_names=X_train.columns,
                                class_names=clf_gini.classes_,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
# Decision Tree using Entropy method
clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
# fit the model
clf_en.fit(X_train, y_train)
y_pred_en = clf_en.predict(X_test)
y_pred_train_en = clf_en.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_en)))
print('Model accuracy score with criterion entropy: {0:0.4f}'. format(accuracy_score(y_test, y_pred_en)))
Training-set accuracy score: 0.7750
Model accuracy score with criterion entropy: 0.7842
We can see that the training-set and test-set scores are the same as above: the training-set accuracy score is 0.7750 while the test-set accuracy is 0.7842. These two values are quite comparable, so there is no sign of overfitting.
# Visualising the Trees
tree.plot_tree(clf_en)
[Text(133.92000000000002, 190.26, 'X[5] <= 1.5\nentropy = 1.217\nsamples = 1209\nvalue = [274, 48, 840, 47]'),
Text(66.96000000000001, 135.9, 'entropy = 0.0\nsamples = 400\nvalue = [0, 0, 400, 0]'),
Text(200.88000000000002, 135.9, 'X[3] <= 2.5\nentropy = 1.487\nsamples = 809\nvalue = [274, 48, 440, 47]'),
Text(133.92000000000002, 81.53999999999999, 'X[0] <= 3.5\nentropy = 1.639\nsamples = 546\nvalue = [274, 48, 177, 47]'),
Text(66.96000000000001, 27.180000000000007, 'entropy = 1.432\nsamples = 406\nvalue = [205, 19, 164, 18]'),
Text(200.88000000000002, 27.180000000000007, 'entropy = 1.762\nsamples = 140\nvalue = [69, 29, 13, 29]'),
Text(267.84000000000003, 81.53999999999999, 'entropy = 0.0\nsamples = 263\nvalue = [0, 0, 263, 0]')]

dot_data = tree.export_graphviz(clf_en, out_file=None,
                                feature_names=X_train.columns,
                                class_names=clf_en.classes_,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
Now, based on the above analysis, we can conclude that our classification model accuracy is very good; the model is doing a good job of predicting the class labels.
However, accuracy alone does not give the underlying distribution of predictions, nor does it tell us anything about the type of errors our classifier is making.
The confusion matrix and classification report help us here.
from sklearn.metrics import confusion_matrix,classification_report
cm = confusion_matrix(y_test, y_pred_en)
print('Confusion matrix\n', cm)
print('\nClassification Report\n', classification_report(y_test, y_pred_en))
Confusion matrix
[[110 0 0 0]
[ 21 0 0 0]
[ 73 0 297 0]
[ 18 0 0 0]]
Classification Report
precision recall f1-score support
acc 0.50 1.00 0.66 110
good 0.00 0.00 0.00 21
unacc 1.00 0.80 0.89 370
vgood 0.00 0.00 0.00 18
accuracy 0.78 519
macro avg 0.37 0.45 0.39 519
weighted avg 0.82 0.78 0.78 519
Results and Conclusion
- In the above example, two Decision Tree classifier models are built to predict the acceptability (class) of a car: one with the Gini Index criterion and another with the Information Gain (Entropy) criterion. Both models yield very good performance, as indicated by the model accuracy.
- In both models, the accuracy scores on the training and testing sets are quite comparable, so there is no sign of overfitting.
- Both models produce identical training-set and test-set accuracy scores; this can happen with a small dataset and a shallow tree (max_depth=3).
- The confusion matrix and classification report show good overall performance, although the minority classes (good and vgood) are never predicted by this shallow tree.
Jupyter Notebook Download
Click here to download the notebook
Reference Links
- https://www.geeksforgeeks.org/decision-tree-introduction-example/
- https://www.kaggle.com/prashant111/decision-tree-classifier-tutorial
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville