Technology Sharing

Introduction to decision tree algorithm, principle and case implementation

2024-07-12


The decision tree is a popular machine learning algorithm that can be used for both classification and regression tasks. The following is a detailed introduction to the algorithm, covering its principles and a case implementation with the corresponding Python code.

Introduction to Decision Tree Algorithm

Basic Concepts

A decision tree is a tree-like structure used to classify data or predict a continuous value. It consists of nodes and edges: each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node holds a class label or a regression value.

Build Process

The process of building a decision tree usually includes the following steps:

  1. Select the best feature: choose the feature (and split point) that best divides the dataset according to some criterion (such as information gain or the Gini index).
  2. Create a node: split the dataset on the chosen feature and create a new child node for each branch.
  3. Recursively build subtrees: repeat the selection and splitting process for each child node until a stopping condition is met (for example, the node is sufficiently pure or the tree reaches a preset maximum depth).
  4. Build leaf nodes: when no further splits are needed, create leaf nodes. A leaf usually holds the majority class label for a classification tree, or the mean of the subset's target values for a regression tree.
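The steps above can be sketched in plain Python. The names below (`gini`, `best_split`, `build_tree`) and the toy data are illustrative for this sketch, not scikit-learn APIs; it uses Gini impurity as the criterion and the majority class at the leaves:

```python
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    # Try every feature/threshold pair; keep the split with the lowest weighted Gini
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best  # (weighted_gini, feature_index, threshold) or None

def build_tree(X, y, depth=0, max_depth=3):
    # Stopping condition: node is pure or maximum depth reached -> majority-class leaf
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    split = best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left_idx = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i in range(len(X)) if i not in left_idx]
    return {
        "feature": f, "threshold": t,
        "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx],
                           depth + 1, max_depth),
        "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx],
                            depth + 1, max_depth),
    }

# Toy data: one feature, cleanly separable at 2.0
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
tree = build_tree(X, y)
print(tree)
```

Running this on the toy data yields a single split on feature 0 with leaves 0 and 1 on either side.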
Splitting criteria
  • Information Gain: measures the reduction in entropy (uncertainty about the class labels) achieved by splitting the dataset on a feature.
  • Gini Index: measures the impurity of the dataset; the smaller the Gini index, the higher the purity.
  • Minimum Mean Squared Error (MSE): the splitting criterion for regression trees; the split that minimizes the MSE within each child node is chosen.
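A minimal sketch of how these criteria are computed for a set of class labels (the function names here are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # Entropy of the parent minus the size-weighted entropy of the child subsets
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

labels = [0, 0, 1, 1]
print(gini(labels))                                # 0.5
print(entropy(labels))                             # 1.0
print(information_gain(labels, [[0, 0], [1, 1]]))  # 1.0 (a perfect split)
```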

Case Implementation

The following is a decision tree classification case implemented using Python and the scikit-learn library. We will use the famous Iris dataset, which contains measurements and class labels for three iris species (Setosa, Versicolor, Virginica).

1. Data Preparation

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
2. Training the Decision Tree Model

```python
from sklearn.tree import DecisionTreeClassifier

# Initialize the decision tree classifier
clf = DecisionTreeClassifier()

# Train the model
clf.fit(X_train, y_train)
```
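By default the tree grows until every leaf is pure, which can overfit. Growth can be constrained with standard scikit-learn parameters such as `criterion`, `max_depth`, and `min_samples_leaf`; a self-contained sketch (repeating the data-loading steps so it runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Limit depth and require a minimum leaf size to reduce overfitting
clf = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_leaf=5, random_state=42)
clf.fit(X_train, y_train)
print(clf.get_depth())            # at most 3
print(clf.score(X_test, y_test))  # test-set accuracy
```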
3. Evaluating the Model

```python
from sklearn.metrics import accuracy_score

# Predict on the test set
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
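Accuracy alone can hide per-class behavior. scikit-learn's `confusion_matrix` and `classification_report` give a per-class breakdown; a self-contained example on the same split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```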
4. Visualizing the Decision Tree

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the decision tree
plt.figure(figsize=(12, 12))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()
```
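If matplotlib is not available, `sklearn.tree.export_text` prints the same tree as indented text rules instead; for example, on a shallow tree fit to the full dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Print the tree as indented if/else rules; no plotting library needed
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```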

Summary:

The above code shows how to use the scikit-learn library to load the Iris dataset, train a decision tree classifier, evaluate the model performance, and visualize the decision tree. Through this case, you can see how the decision tree works and how to use it in practical applications.

I hope you found this useful. If you did, please like and bookmark it.