How a Decision Tree Works

Decision Tree:
* A Decision Tree is a non-parametric supervised learning method for regression & classification.
* It's similar to playing "dumb charades": you narrow down to the answer by asking a series of questions.
* A good algorithm asks fewer & better-chosen questions than a not-so-good one.
* The internal nodes are questions & the leaves are predictions.

Decision Tree Algorithm:
* The Decision Tree used here is based on CART, which is closely related to ID3, developed in 1986 by Ross Quinlan.
* ID3 works when both the features & the target are categorical in nature.
* C4.5 is an advancement of ID3: it converts continuous features into categorical ones and then proceeds as ID3 does.
* CART is very similar to C4.5, with the added capability that the target can also be continuous (i.e., it supports regression).
* scikit-learn decision trees are based on CART.
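
Since scikit-learn's trees implement CART, the same example can also be run end to end with DecisionTreeClassifier. The snippet below is a minimal sketch, not part of the original walkthrough: the dataset URL and the column names (outlook, temp, humidity, windy, play) are assumed to match the tennis data used later in this post, and the categorical features are one-hot encoded because scikit-learn trees expect numeric inputs.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the same tennis play dataset used later in this post
play_data = pd.read_csv("https://raw.githubusercontent.com/cogxta/datasets/main/tennis_data.csv")

# One-hot encode the categorical features; the target stays as yes/no labels
# (column names are assumed to be outlook, temp, humidity, windy, play)
X = pd.get_dummies(play_data[["outlook", "temp", "humidity", "windy"]])
y = play_data["play"]

# criterion="entropy" makes the splits use information gain, as discussed below
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned tree as text
print(export_text(clf, feature_names=list(X.columns)))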

Criteria for creating a Decision Tree:
1. Entropy – the objective of CART here is to maximize information gain at each split.
2. Gini Impurity – Gini impurity is maximal when the classes in a node are evenly mixed and zero when the node is pure.
Both approaches yield almost the same results. We will discuss the algorithm using entropy; a short sketch of both criteria follows.
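
As a quick illustration of the two criteria, here is a minimal sketch (the helper names entropy and gini are ours, not from any library): both measures are high for a mixed node such as 9 yes / 5 no and zero for a pure node.

import numpy as np

def entropy(counts):
    # Entropy of a node given its class counts, e.g. [9, 5]
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # skip empty classes to avoid log2(0)
    return -(p * np.log2(p)).sum()

def gini(counts):
    # Gini impurity of a node given its class counts
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - (p ** 2).sum()

print(entropy([9, 5]), gini([9, 5]))   # mixed node: roughly 0.94 and 0.46
print(entropy([4, 0]), gini([4, 0]))   # pure node: both are zero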

Information Gain:
* Information gain is the decrease in entropy after a dataset is split on an attribute.
* Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
* Gain(S, A) = Entropy(S) − Σ_v [ (|S_v| / |S|) · Entropy(S_v) ], where S_v is the subset of S for which attribute A takes value v.
* At each split we choose the attribute for which the information gain is the largest; a small code sketch of this computation follows this list.
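
Putting the formula into code, a hypothetical helper might look like this (the entropy function from the previous sketch is redefined so the block is self-contained); it weights each child's entropy by its share of the parent's samples:

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(parent_counts, child_counts_list):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# Outlook split of the tennis data: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2]
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # about 0.247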

Let us take the tennis play data to understand the decision tree algorithm better. The image below shows the decision tree for the tennis play data.

Step 1: Import the necessary packages.
Step 2: Load the dataset.
Step 3: Calculate the entropy of the play attribute (there are 9 yes and 5 no in the play column).
Step 4: To find the root node, calculate the information gain for every attribute. Let us start with the outlook attribute: there are 2 yes and 3 no for sunny outlook, and 3 yes and 2 no for rainy outlook. Overcast outlook is homogeneous, so its entropy = 0.
Step 5: Calculate the information gain for the rest of the attributes (temp, humidity, windy).
Step 6: Conclude the root node and repeat from Step 4 to split the tree further.


import pandas as pd
import numpy as np

# Step 2: load the tennis play dataset
play_data = pd.read_csv("https://raw.githubusercontent.com/cogxta/datasets/main/tennis_data.csv")
print("Tennis Play Data");print("");print(play_data);print("")

# Step 3: entropy of the play column (9 yes, 5 no out of 14)
Entropy_Play = -(9/14)*np.log2(9/14) - (5/14)*np.log2(5/14)
print("Entropy of play:");print(Entropy_Play);print("")

# Step 4: information gain for outlook
# sunny: 2 yes, 3 no | rainy: 3 yes, 2 no | overcast: 4 yes, 0 no (pure, entropy = 0)
Entropy_Play_Outlook_Sunny = -(2/5)*np.log2(2/5) - (3/5)*np.log2(3/5)
Entropy_Play_Outlook_Rainy = -(3/5)*np.log2(3/5) - (2/5)*np.log2(2/5)
outlook_gain = Entropy_Play - (5/14)*Entropy_Play_Outlook_Sunny - (4/14)*0 - (5/14)*Entropy_Play_Outlook_Rainy
print("Information gain on outlook:");print(outlook_gain);print("")

# Step 5: information gain for temperature
# hot: 2 yes, 2 no | mild: 4 yes, 2 no | cool: 3 yes, 1 no
Entropy_Play_temp_hot = -(2/4)*np.log2(2/4) - (2/4)*np.log2(2/4)
Entropy_Play_temp_mild = -(4/6)*np.log2(4/6) - (2/6)*np.log2(2/6)
Entropy_Play_temp_cool = -(3/4)*np.log2(3/4) - (1/4)*np.log2(1/4)
temp_gain = Entropy_Play - (4/14)*Entropy_Play_temp_hot - (6/14)*Entropy_Play_temp_mild - (4/14)*Entropy_Play_temp_cool
print("Information gain on temperature:");print(temp_gain);print("")

# Step 5: information gain for humidity
# high: 3 yes, 4 no | normal: 6 yes, 1 no
Entropy_Play_humidity_high = -(3/7)*np.log2(3/7) - (4/7)*np.log2(4/7)
Entropy_Play_humidity_normal = -(6/7)*np.log2(6/7) - (1/7)*np.log2(1/7)
humidity_gain = Entropy_Play - (7/14)*Entropy_Play_humidity_high - (7/14)*Entropy_Play_humidity_normal
print("Information gain on humidity:");print(humidity_gain);print("")

# Step 5: information gain for windy
# true: 3 yes, 3 no | false: 6 yes, 2 no
Entropy_Play_windy_true = -(3/6)*np.log2(3/6) - (3/6)*np.log2(3/6)
Entropy_Play_windy_false = -(6/8)*np.log2(6/8) - (2/8)*np.log2(2/8)
windy_gain = Entropy_Play - (6/14)*Entropy_Play_windy_true - (8/14)*Entropy_Play_windy_false
print("Information gain on windy:");print(windy_gain);print("")
