Decision Tree: Introduction

Decision Tree is one of the most widely used algorithms in data science. Ensemble methods built on decision trees, such as random forests, are also widely used in machine learning.

It is a supervised learning algorithm in which the data is repeatedly divided into different categories according to given conditions.

As the name suggests, a decision tree uses a flowchart-like tree structure to show predictions based on feature-based splits. It starts at a root node and ends at a leaf node, where the final decision is made.

Decision trees can be used for solving both regression and classification problems.
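The difference between the two tasks shows up mainly at the leaves: a classification leaf predicts the majority class of its samples, while a regression leaf predicts the mean target value. A minimal sketch (function names are illustrative, not from any particular library):

```python
from collections import Counter

def classification_leaf(labels):
    # A classification leaf predicts the majority class of its samples.
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    # A regression leaf predicts the mean of its samples' target values.
    return sum(values) / len(values)

print(classification_leaf(["yes", "no", "yes"]))  # → yes
print(regression_leaf([2.0, 4.0, 6.0]))           # → 4.0
```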

What is a Decision Tree?

It is typically a binary tree that recursively splits the given dataset on a chosen condition until we are left with pure leaf nodes, i.e. nodes containing data of only one class.
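"Purity" here is usually measured with an impurity score such as Gini impurity, which is 0 for a node containing a single class. A short illustrative sketch in plain Python:

```python
def gini(labels):
    # Gini impurity: 0.0 for a pure node, larger when classes are mixed.
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def is_pure(labels):
    # A node is pure when it contains only one class of data.
    return len(set(labels)) == 1

print(gini(["a", "a", "a"]))  # 0.0 — a pure leaf
print(gini(["a", "b"]))       # 0.5 — maximally mixed for two classes
```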

Here is some basic terminology frequently used with decision trees:

  • Root Node: Present at the beginning of a decision tree. It represents the entire dataset, which gets divided into two sets on the given condition.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Decision Node: A node obtained by splitting a parent node, which is itself split further.
  • Leaf Node: A node that cannot be split any further.
  • Sub-Tree: A sub-section of the decision tree.
  • Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children.
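The terms above can be seen in a toy recursive build. This is a hedged sketch, not a real training algorithm: it takes `(feature_value, label)` records, splits on an assumed median threshold, and stops when a node is pure:

```python
def build_tree(records):
    # records: list of (feature_value, label) pairs.
    labels = [lab for _, lab in records]
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}           # pure → leaf node
    # Illustrative splitting rule: threshold at the median feature value.
    threshold = sorted(x for x, _ in records)[len(records) // 2]
    left = [r for r in records if r[0] < threshold]    # child nodes of
    right = [r for r in records if r[0] >= threshold]  # this parent node
    if not left or not right:
        return {"leaf": max(set(labels), key=labels.count)}  # no valid split
    return {"split": threshold,
            "left": build_tree(left),
            "right": build_tree(right)}

# The first call builds the root node; inner dicts are decision nodes,
# and {"leaf": ...} dicts are leaf nodes.
tree = build_tree([(1, "no"), (2, "no"), (8, "yes"), (9, "yes")])
print(tree)  # {'split': 8, 'left': {'leaf': 'no'}, 'right': {'leaf': 'yes'}}
```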

Assumptions

Decision Tree has certain assumptions:

  • Records are distributed recursively on the basis of attribute values.
  • Continuous variables require discretization prior to building the model.
  • At the start, the whole training set is considered as the root.