Lecture 6 - Comparison of decision trees used by the sklearn software with the theory seen in class.

by Salvatore Gatto

Compare the version of decision trees used by the sklearn software with the theory seen in class.

The sklearn library uses an optimized version of the CART algorithm to implement its decision trees.

CART (Classification and Regression Trees) is an algorithm for building a decision tree that, for classification, uses the Gini impurity index as its splitting criterion.

The first difference between ID3 and CART is that CART supports numerical target variables, which makes it usable for regression tasks as well, whereas ID3 can only be used for classification tasks.
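A minimal sketch of this point, using sklearn's `DecisionTreeRegressor` on hypothetical toy data (the values are made up for illustration): because CART handles numerical targets, the same tree machinery fits a regression problem directly.

```python
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (hypothetical values for illustration)
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# A CART-based tree fits a numerical target directly
reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

# Predictions are the mean target value of the matching leaf
pred = reg.predict([[3.5]])
print(pred)
```

ID3, by contrast, has no notion of a numerical target, so a task like this would first require discretizing `y` into classes.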

The second difference is in the way ID3 and CART generate the decision tree. The ID3 algorithm creates a multiway tree, finding for each node the categorical feature that yields the largest information gain for categorical targets. The CART algorithm instead creates a binary tree, splitting each node into two sub-nodes based on a threshold value of an attribute. The idea behind CART is to split each node into the subsets with the best homogeneity, using the Gini index criterion. The splitting phase continues until the last pure subset is found in the tree or the maximum number of leaves allowed for the growing tree is reached.
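The binary structure can be verified on a fitted sklearn tree: every node in the exported `tree_` attribute is either a leaf or has exactly two children, and each internal node stores a single feature/threshold pair. A short sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

tree = clf.tree_
for node in range(tree.node_count):
    left = tree.children_left[node]
    right = tree.children_right[node]
    # Each node is either a leaf (both children are -1)
    # or an internal node with exactly two children: a binary tree
    assert (left == -1) == (right == -1)

# The root splits one feature against one threshold value
print("root split: feature", tree.feature[0], "<=", tree.threshold[0])
```

An ID3 tree, in contrast, would branch into one child per category of the chosen feature, so a node could have more than two children.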

Another difference is that ID3 uses Shannon entropy to pick the features with the greatest information gain as nodes. When CART is used for a classification task, it instead selects its split attribute so as to obtain the subsets that minimize the Gini impurity.
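The two criteria can be computed by hand for a node's class distribution; this sketch shows both on a perfectly mixed two-class node and a pure node:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (CART)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum of p * log2(p) over classes (ID3)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

mixed = ["a", "a", "b", "b"]  # perfectly mixed node
pure = ["a", "a", "a", "a"]   # pure node

print(gini(mixed), entropy(mixed))  # 0.5 1.0 (maximal for two classes)
print(gini(pure), entropy(pure))    # 0.0 -0.0 (no impurity)
```

Both measures are zero on pure subsets and maximal on uniformly mixed ones, which is why they lead to similar (though not identical) trees in practice.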

Both algorithms can prune the tree to reduce the phenomenon of overfitting.

Sources:

https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart

https://www.analyticssteps.com/blogs/classification-and-regression-tree-cart-algorithm