2024-07-12
Table of contents
1. Understanding decision trees
2. Decision tree generation process
   1. tree.DecisionTreeClassifier (classification tree)
      (1) Basic parameters of the model
   2. tree.DecisionTreeRegressor (regression tree)
   3. tree.export_graphviz (export the generated decision tree to DOT format, for drawing)
3. Advantages and disadvantages of decision trees
A decision tree is a nonparametric supervised learning method that summarizes decision rules from data with features and labels and presents those rules in a tree structure to solve classification and regression problems. The decision tree algorithm is easy to understand, applicable to many kinds of data, and performs well on a wide range of problems. In particular, ensemble algorithms built around tree models are widely used across industries and fields.
The dataset above is a list of known species and their categories. Our goal is to divide the animals into mammals and non-mammals. From the data that has been collected, the decision tree algorithm can derive the following decision tree:
If we now discover a new species A that is cold-blooded, has scales on its body, and is not viviparous, we can use this decision tree to determine its category.
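As a minimal sketch of the same idea (the feature encoding and the toy samples below are hypothetical, not the original dataset), sklearn can learn such rules directly:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: [cold_blooded, has_scales, viviparous], 1 = yes, 0 = no
X = [
    [0, 0, 1],  # e.g. a dog  -> mammal
    [0, 0, 1],  # e.g. a whale -> mammal
    [1, 1, 0],  # e.g. a snake -> non-mammal
    [1, 0, 0],  # e.g. a frog  -> non-mammal
    [0, 0, 0],  # e.g. a bird  -> non-mammal
]
y = ["mammal", "mammal", "non-mammal", "non-mammal", "non-mammal"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Species A: cold-blooded, has scales, not viviparous
print(clf.predict([[1, 1, 0]]))  # -> ['non-mammal']

# Print the learned decision rules as text
print(export_text(clf, feature_names=["cold_blooded", "has_scales", "viviparous"]))
```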
Key concepts involved: nodes
① Root node: has no incoming edge but has outgoing edges. It contains the initial question about a feature.
② Intermediate nodes: have both incoming and outgoing edges. There is exactly one incoming edge, but there can be several outgoing edges. They all represent questions about features.
③ Leaf nodes: have an incoming edge but no outgoing edges. Each leaf node corresponds to a category label.
④ Child node and parent node: of two connected nodes, the one closer to the root node is the parent and the other is the child.
Involved modules: sklearn.tree
Important parameters: criterion
The branch quality criterion of the regression tree supports three standards:
① Enter "mse" to use mean squared error (MSE): the reduction in mean squared error between the parent node and its leaf nodes is used as the criterion for feature selection, and the L2 loss is minimized by using the mean of each leaf node.
② Enter "friedman_mse" to use the Friedman mean squared error, a version of MSE modified by Friedman to deal with problems in potential branches.
③ Enter "mae" to use mean absolute error (MAE): this metric uses the median of each leaf node to minimize the L1 loss.
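Note that in recent versions of scikit-learn the strings "mse" and "mae" have been renamed to "squared_error" and "absolute_error". A minimal sketch of a regression tree (the diabetes dataset is just a convenient example, not from the original text):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# "squared_error" is the current name for what older versions called "mse"
reg = DecisionTreeRegressor(criterion="squared_error", random_state=0)

# sklearn reports losses as negative scores, so values closer to 0 are better
scores = cross_val_score(reg, X, y, cv=10, scoring="neg_mean_squared_error")
print(scores.mean())
```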
① Computing information entropy is slower than computing the Gini coefficient, because the Gini calculation involves no logarithms. In addition, because information entropy is more sensitive to impurity, a tree grown with information entropy as the criterion tends to be more "fine-grained". Therefore, on high-dimensional data or data with a lot of noise, information entropy easily overfits, and the Gini coefficient often works better in such cases.
② random_state sets the seed of the random mode used in branching; the default is None. The randomness is more noticeable on high-dimensional data, while on low-dimensional data (such as the iris dataset) it is barely apparent. Passing any fixed integer will always grow the same tree, making the model stable.
③ splitter also controls the random options in the decision tree and takes two values. With "best", the branching is still random, but the tree preferentially selects the more important features for splitting (importance can be viewed through the attribute feature_importances_). With "random", the tree branches more randomly: it grows deeper and larger because it absorbs more unnecessary information, and the fit to the training set decreases because of that unnecessary information.
④ Without restrictions, a decision tree will grow until the impurity measure is optimal or no more features are available. Such a tree tends to overfit. To give the decision tree better generalization, it must be pruned, and the pruning strategy has a huge impact on the tree: the right pruning strategy is the core of optimizing a decision tree algorithm.
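Putting these parameters together, a minimal sketch on the iris dataset (the specific pruning values are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(
    criterion="gini",     # or "entropy"; Gini is faster and less prone to overfitting
    splitter="best",      # "random" makes branching more random
    random_state=25,      # fixes the random mode so the same tree grows every run
    max_depth=3,          # pruning: limit the depth of the tree (illustrative value)
    min_samples_leaf=5,   # pruning: each leaf must contain at least 5 training samples
)
clf.fit(Xtrain, ytrain)

print(clf.score(Xtest, ytest))    # accuracy on the test set
print(clf.feature_importances_)   # importance of each feature in branching
```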
Advantages:
① Easy to understand and interpret, because the tree can be drawn and inspected.
② Requires little data preparation. Many other algorithms usually require data normalization, creating dummy variables, and removing null values. Note, however, that the decision tree module in sklearn does not support missing values.
③ The cost of using the tree (e.g., when predicting data) is logarithmic in the number of data points used to train it, which is very low compared with other algorithms.
④ Able to handle both numerical and categorical data, and can do both regression and classification. Other techniques are often specialized for analyzing datasets with only one type of variable.
⑤ Able to handle multi-output problems, i.e. problems with multiple labels (note that this is different from multi-class problems within a single label); see the sketch after this list.
⑥ A white-box model whose results are easily interpretable. If a given situation can be observed in the model, the conditions can be explained by simple Boolean logic. In contrast, the results of black-box models (for example, artificial neural networks) may be more difficult to interpret.
⑦ Models can be validated using statistical tests, which lets us assess the reliability of a model. It can perform well even if its assumptions are violated to some extent by the true model that generated the data.
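To illustrate the multi-output point (⑤), a sketch with a hypothetical two-label target, where a single tree predicts both labels at once:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)

# Hypothetical multi-output target: two labels predicted at the same time,
# each derived from a different feature threshold
y = np.column_stack([
    (X[:, 0] > 0.5).astype(int),   # label 1
    (X[:, 1] > 0.5).astype(int),   # label 2
])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:3]))          # shape (3, 2): one column per output label
```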
Disadvantages:
① Decision tree learners may create overly complex trees that do not generalize well to the data. This is called overfitting. Pruning mechanisms, such as setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree, are necessary to avoid this problem.
② Decision trees may be unstable: slight changes in the data may produce completely different trees. This problem needs to be addressed with ensemble algorithms.
③ Decision tree learning is based on a greedy algorithm: it relies on optimizing local optima (the best split at each node) to try to reach a global optimum, but this approach cannot guarantee returning the globally optimal decision tree. This problem can also be addressed with ensemble algorithms; in a random forest, features and samples are sampled randomly during branching.
④ Some concepts are difficult to learn because decision trees cannot easily express them, such as XOR, parity, or multiplexer problems.
⑤ If some classes in the labels are dominant, the decision tree learner will create trees biased towards the dominant classes. It is therefore recommended to balance the dataset before fitting; a sketch follows.
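One way to address the class-imbalance point (⑤), assuming you would rather weight classes than resample, is the class_weight parameter of DecisionTreeClassifier; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 90% of samples belong to class 0
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# "balanced" reweights each class inversely proportional to its frequency,
# so the minority class is not simply ignored during branching
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```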