2024-07-12
Support vector machine (SVM) is a powerful and versatile machine learning model for linear and nonlinear classification, regression, and outlier detection. This article introduces the support vector machine algorithm and its implementation in scikit-learn, and briefly explores principal component analysis and its application in scikit-learn.
Support Vector Machine (SVM) is a widely used machine learning algorithm, favored because it can reach high accuracy with relatively modest computing resources. SVM can be used for both classification and regression tasks, but it is most widely used for classification.
The goal of a support vector machine is to find a hyperplane in N-dimensional space (where N is the number of features) that clearly separates the data points. This hyperplane separates data points of different categories while keeping them as far from the hyperplane as possible, which makes the classification more robust.
Many different hyperplanes may separate the two classes of data points. Our goal is to select the one with the maximum margin, that is, the maximum distance to the nearest data points of both classes. Maximizing the margin helps improve classification accuracy on new data.
A hyperplane is a decision boundary that divides data points. Data points on either side of the hyperplane are assigned to different categories. The dimension of the hyperplane depends on the number of features: with 2 input features, the hyperplane is a straight line; with 3 features, it is a two-dimensional plane. When the number of features exceeds 3, the hyperplane becomes difficult to visualize.
Support vectors are those points that are closest to the hyperplane and these points influence the position and orientation of the hyperplane. With these support vectors, we can maximize the margin of the classifier. Removing support vectors changes the position of the hyperplane, so they are crucial in building SVM.
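The role of the support vectors can be checked on a small example. The sketch below is only an illustration: the six 2-D points are made up, and the very large `C` value is an assumption used to approximate a hard-margin SVM. It prints the support vectors, computes the margin width as 2 / ||w||, and then refits the model on the support vectors alone to show that the remaining points do not affect the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable clusters (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM (assumption for this sketch).
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]                      # normal vector of the hyperplane
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries
print("Support vectors:\n", clf.support_vectors_)
print(f"Margin width: {margin_width:.3f}")

# Refit using only the support vectors: the hyperplane should stay the same.
sv_idx = clf.support_
clf_sv = SVC(kernel='linear', C=1e6).fit(X[sv_idx], y[sv_idx])
print("Same hyperplane:", np.allclose(clf.coef_, clf_sv.coef_, atol=1e-2))
```

Dropping one of the support vectors instead of the other points would, in general, move the hyperplane, which is the behaviour described above.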
In logistic regression, we use the sigmoid function to squash the output of the linear function into the range [0, 1] and assign labels based on a threshold (0.5). In SVM, we use the output of the linear function directly: if the output is greater than 1, the point belongs to one class; if it is less than -1, it belongs to the other class. Setting the thresholds at 1 and -1 gives the margin range [-1, 1].
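As a rough illustration of this decision rule, the sketch below fits a linear `SVC` on synthetic data (the `make_blobs` data set is an assumption, not part of the article) and compares the raw linear output from `decision_function` with the predicted labels. It also counts how many samples fall inside the margin range.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data, used only for illustration.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear').fit(X, y)

# Raw output of the linear function w.x + b for each sample.
scores = clf.decision_function(X)
manual_scores = X @ clf.coef_[0] + clf.intercept_[0]

# The sign of the score decides the class: positive side -> class 1.
pred_from_sign = (scores > 0).astype(int)

print("decision_function equals w.x + b:", np.allclose(scores, manual_scores))
print("Sign rule matches predict():", np.array_equal(pred_from_sign, clf.predict(X)))
print("Samples inside the margin (|score| < 1):", int(np.sum(np.abs(scores) < 1)))
```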
Use support vector machines to predict benign and malignant cancer diagnoses.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# Create the DataFrame
col_names = list(cancer.feature_names)
col_names.append('target')
df = pd.DataFrame(np.c_[cancer.data, cancer.target], columns=col_names)
df.head()
df.info()
print(cancer.target_names)
# ['malignant', 'benign']
# Descriptive statistics:
df.describe()
# Column types and non-null counts:
df.info()
sns.pairplot(df, hue='target', vars=[
'mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension'
])
sns.countplot(x=df['target'], label="Count")
plt.figure(figsize=(10, 8))
sns.scatterplot(x='mean area', y='mean smoothness', hue='target', data=df)
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot=True)
In machine learning, model training is a key step in finding solutions to problems. Below, we use scikit-learn to train models and demonstrate the performance of support vector machines (SVMs) with different kernels.
First, we need to prepare and preprocess the data. Here is a code example of data preprocessing:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X = df.drop('target', axis=1)
y = df.target
print(f"'X' shape: {X.shape}")
print(f"'y' shape: {y.shape}")
# 'X' shape: (569, 30)
# 'y' shape: (569,)
pipeline = Pipeline([
('min_max_scaler', MinMaxScaler()),
('std_scaler', StandardScaler())
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In the code, we define a pipeline that scales the data with MinMaxScaler and StandardScaler. The data is split into training and test sets, with 30% of the data used for testing.
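To make the effect of the scaling pipeline concrete, here is a minimal sketch that applies it to the same split. It assumes the scalers are fitted on the training split only and then reused on the test split, and it loads the data with `return_X_y=True` instead of the DataFrame purely to stay self-contained.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler()),
])

# Fit the scalers on the training set only, then reuse them on the test set.
X_train_scaled = pipeline.fit_transform(X_train)
X_test_scaled = pipeline.transform(X_test)

print("Train/test shapes:", X_train.shape, X_test.shape)
print("Train mean ~ 0:", np.allclose(X_train_scaled.mean(axis=0), 0, atol=1e-8))
print("Train std  ~ 1:", np.allclose(X_train_scaled.std(axis=0), 1, atol=1e-8))
```

The test set is transformed with the statistics learned from the training set, so its mean and standard deviation are close to, but not exactly, 0 and 1.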
To evaluate the performance of the model, we define a print_score function, which outputs the accuracy, classification report, and confusion matrix for the training or test results:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    # Report on the training set when train=True, otherwise on the test set.
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
Support Vector Machine (SVM) is a powerful classification algorithm whose performance depends on its hyperparameters, such as the C, gamma, degree, and coef0 arguments used in the examples below. Some of them, such as degree, apply only to the polynomial kernel ('poly') and are ignored by other kernels. The best hyperparameter values can be found by grid search.
Linear kernel SVM often works well, especially when the dataset has a large number of features. Here is a code example using linear kernel SVM:
from sklearn.svm import LinearSVC
model = LinearSVC(loss='hinge', dual=True)
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)
The training and testing results are as follows:
Training results:
Accuracy Score: 86.18%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 1.000000 0.819079 0.861809 0.909539 0.886811
recall 0.630872 1.000000 0.861809 0.815436 0.861809
f1-score 0.773663 0.900542 0.861809 0.837103 0.853042
support 149.000000 249.000000 0.861809 398.000000 398.000000
_______________________________________________
Confusion Matrix:
[[ 94 55]
[ 0 249]]
Test Results:
Accuracy Score: 89.47%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 1.000000 0.857143 0.894737 0.928571 0.909774
recall 0.714286 1.000000 0.894737 0.857143 0.894737
f1-score 0.833333 0.923077 0.894737 0.878205 0.890013
support 63.000000 108.000000 0.894737 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 45 18]
[ 0 108]]
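The numbers in the classification report follow directly from the confusion matrix. As a quick check, the sketch below recomputes accuracy, precision, recall, and F1 from the training confusion matrix reported above, assuming the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Training confusion matrix of the linear SVM reported above
# (rows: true class, columns: predicted class).
cm = np.array([[94, 55],
               [0, 249]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.6f}")
print(f"precision: {np.round(precision, 6)}")
print(f"recall   : {np.round(recall, 6)}")
print(f"f1-score : {np.round(f1, 6)}")
```

The values agree with the report: the model never misses a class 1 sample (recall 1.0), but it labels 55 class 0 samples as class 1, which lowers the recall for class 0 to about 0.63.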
The polynomial kernel SVM is suitable for nonlinear data. Here is a code example using a second-order polynomial kernel:
from sklearn.svm import SVC
model = SVC(kernel='poly', degree=2, gamma='auto', coef0=1, C=5)
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)
The training and testing results are as follows:
Training results:
Accuracy Score: 96.98%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.985816 0.961089 0.969849 0.973453 0.970346
recall 0.932886 0.991968 0.969849 0.962427 0.969849
f1-score 0.958621 0.976285 0.969849 0.967453 0.969672
support 149.000000 249.000000 0.969849 398.000000 398.000000
_______________________________________________
Confusion Matrix:
[[139 10]
[ 2 247]]
Test Results:
Accuracy Score: 97.08%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.967742 0.972477 0.97076 0.970109 0.970733
recall 0.952381 0.981481 0.97076 0.966931 0.970760
f1-score 0.960000 0.976959 0.97076 0.968479 0.970711
support 63.000000 108.000000 0.97076 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 60 3]
[ 2 106]]
The radial basis function (RBF) kernel is suitable for processing nonlinear data. The following is a code example using the RBF kernel:
model = SVC(kernel='rbf', gamma=0.5, C=0.1)
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)
The training and testing results are as follows:
Training results:
Accuracy Score: 62.56%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.0 0.625628 0.625628 0.312814 0.392314
recall 0.0 1.000000 0.625628 0.500000 0.625628
f1-score 0.0 0.769231 0.625628 0.384615 0.615385
support 149.0 249.0 0.625628 398.0 398.0
_______________________________________________
Confusion Matrix:
[[ 0 149]
[ 0 249]]
Test Results:
Accuracy Score: 64.97%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.0 0.655172 0.649661 0.327586 0.409551
recall 0.0 1.000000 0.649661 0.500000 0.649661
f1-score 0.0 0.792453 0.649661 0.396226 0.628252
support 63.0 108.0 0.649661 171.0 171.0
_______________________________________________
Confusion Matrix:
[[ 0 63]
[ 0 108]]
The results above show how much the choice of kernel and hyperparameters matters. On the unscaled data, the linear kernel reached 89.47% test accuracy and the second-degree polynomial kernel 97.08%, while the RBF kernel with gamma=0.5 and C=0.1 assigned every sample to class 1. Linear kernel SVM is a good fit for high-dimensional data; polynomial and RBF kernels can model nonlinear boundaries, but they may overfit or underfit under certain parameter settings. Choosing the right kernel and hyperparameters is crucial to improving model performance.
Numerical input: SVM assumes that the input data is numeric. Categorical inputs should be converted into binary dummy variables (one variable per category).
Binary classification: the basic SVM is designed for binary classification problems, although extensions exist for regression and multi-class classification.
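The conversion of categorical inputs into dummy variables can be sketched with `pandas.get_dummies`. The small table below is hypothetical (the breast cancer data set has no categorical columns); it only shows the shape of the transformation.

```python
import pandas as pd

# Hypothetical toy data with one categorical column (not part of the cancer data set).
toy = pd.DataFrame({
    'size': [1.2, 3.4, 2.2, 5.1],
    'color': ['red', 'blue', 'green', 'red'],
})

# One binary (0/1) column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=['color'], dtype=int)
print(encoded)
```

Each row has exactly one of the `color_*` columns set to 1, so the SVM receives purely numeric features.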
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
The following shows the training and testing results for different SVM kernels:
Linear Kernel SVM
print("=======================Linear Kernel SVM==========================")
model = SVC(kernel='linear')
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)
Training results:
Accuracy Score: 98.99%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 1.000000 0.984190 0.98995 0.992095 0.990109
recall 0.973154 1.000000 0.98995 0.986577 0.989950
f1-score 0.986395 0.992032 0.98995 0.989213 0.989921
support 149.000000 249.000000 0.98995 398.000000 398.000000
_______________________________________________
Confusion Matrix:
[[145 4]
[ 0 249]]
Test results:
Accuracy Score: 97.66%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.968254 0.981481 0.976608 0.974868 0.976608
recall 0.968254 0.981481 0.976608 0.974868 0.976608
f1-score 0.968254 0.981481 0.976608 0.974868 0.976608
support 63.000000 108.000000 0.976608 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 61 2]
[ 2 106]]
print("=======================Polynomial Kernel SVM==========================")
from sklearn.svm import SVC
model = SVC(kernel='poly', degree=2, gamma='auto')
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)
Training results:
Accuracy Score: 85.18%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.978723 0.812500 0.851759 0.895612 0.874729
recall 0.617450 0.991968 0.851759 0.804709 0.851759
f1-score 0.757202 0.893309 0.851759 0.825255 0.842354
support 149.000000 249.000000 0.851759 398.000000 398.000000
_______________________________________________
Confusion Matrix:
[[ 92 57]
[ 2 247]]
Test results:
Accuracy Score: 82.46%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.923077 0.795455 0.824561 0.859266 0.842473
recall 0.571429 0.972222 0.824561 0.771825 0.824561
f1-score 0.705882 0.875000 0.824561 0.790441 0.812693
support 63.000000 108.000000 0.824561 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 36 27]
[ 3 105]]
print("=======================Radial Kernel SVM==========================")
from sklearn.svm import SVC
model = SVC(kernel='rbf', gamma=1)
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)
Training results:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 1.0 1.0 1.0 1.0 1.0
recall 1.0 1.0 1.0 1.0 1.0
f1-score 1.0 1.0 1.0 1.0 1.0
support 149.0 249.0 1.0 398.0 398.0
_______________________________________________
Confusion Matrix:
[[149 0]
[ 0 249]]
Test results:
Accuracy Score: 63.74%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 1.000000 0.635294 0.637427 0.817647 0.769659
recall 0.015873 1.000000 0.637427 0.507937 0.637427
f1-score 0.031250 0.776978 0.637427 0.404114 0.502236
support 63.000000 108.000000 0.637427 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 1 62]
[ 0 108]]
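The gap between 100% training accuracy and the much lower test accuracy suggests that gamma=1 overfits the scaled data. A minimal sketch of how one might probe this is shown below: it refits the RBF kernel for a few gamma values and compares training and test accuracy. The list of gamma values is an arbitrary choice, and `StandardScaler` alone is used here as a simplification of the two-step pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Map each gamma to (training accuracy, test accuracy).
results = {}
for gamma in [1, 0.1, 0.01, 0.001]:
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_train_s, y_train)
    results[gamma] = (clf.score(X_train_s, y_train), clf.score(X_test_s, y_test))
    print(f"gamma={gamma}: train={results[gamma][0]:.4f}, test={results[gamma][1]:.4f}")
```

A large gamma makes each training point influence only its immediate neighbourhood, so the model can memorize the training set while generalizing poorly; smaller values give a smoother decision boundary.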
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100],
'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'poly', 'linear']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5)
grid.fit(X_train, y_train)
best_params = grid.best_params_
print(f"Best params: {best_params}")
svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)
Training results:
Accuracy Score: 98.24%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.986301 0.980159 0.982412 0.983230 0.982458
recall 0.966443 0.991968 0.982412 0.979205 0.982412
f1-score 0.976271 0.986028 0.982412 0.981150 0.982375
support 149.000000 249.000000 0.982412 398.000000 398.000000
_______________________________________________
Confusion Matrix:
[[144 5]
[ 2 247]]
Test results:
Accuracy Score: 98.25%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.983871 0.981651 0.982456 0.982761 0.982469
recall 0.968254 0.990741 0.982456 0.979497 0.982456
f1-score 0.976000 0.986175 0.982456 0.981088 0.982426
support 63.000000 108.000000 0.982456 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 61 2]
[ 1 107]]
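One possible variation on the grid search above is to place the scaler inside a `Pipeline`, so that scaling is refitted on each cross-validation fold rather than once on the whole training set. The sketch below is an assumption-laden illustration, not the article's method: the step names `scaler` and `svc` and the reduced parameter grid are hypothetical choices made to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Smaller, hypothetical grid; the 'svc__' prefix routes each parameter to the SVC step.
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.1, 0.01, 0.001],
    'svc__kernel': ['rbf', 'linear'],
}

grid = GridSearchCV(pipe, param_grid, cv=5, refit=True)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
# With refit=True, best_estimator_ is already trained on the full training set.
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Because `refit=True` retrains the best configuration on the whole training set, `grid.best_estimator_` can be used directly without constructing a new `SVC` from `best_params_`.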
Principal component analysis (PCA) is a technique that achieves linear dimensionality reduction by projecting data into a lower dimensional space. The specific steps are as follows:
Since high-dimensional data is difficult to visualize directly, we can use PCA to find the first two principal components and visualize the data in two dimensions. To do this, the data first needs to be standardized so that each feature has unit variance.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Standardize the data
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Reduce dimensionality with PCA
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Visualize the first two principal components
plt.figure(figsize=(8,6))
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
With the first two principal components, we can easily separate different categories of data points in two-dimensional space.
Although dimensionality reduction is powerful, the meaning of the components is difficult to understand directly. Each component corresponds to a combination of the original features, and these components can be obtained by fitting a PCA object.
The relevant properties of the component include:
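As a minimal sketch, two attributes of a fitted scikit-learn `PCA` object are commonly inspected: `components_`, which holds the weight of each original feature in each component, and `explained_variance_ratio_`, which gives the share of total variance captured by each component. For brevity, the example fits PCA on the full standardized data set rather than on the training split; the `PC1`/`PC2` row labels are just illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=2).fit(X_scaled)

# components_: one row per principal component, one column per original feature.
loadings = pd.DataFrame(pca.components_, columns=cancer.feature_names,
                        index=['PC1', 'PC2'])
print(loadings.iloc[:, :5])

# explained_variance_ratio_: fraction of total variance captured by each component.
ratios = pca.explained_variance_ratio_
print("Explained variance ratio:", ratios)
print(f"Total for two components: {ratios.sum():.4f}")
```

Large absolute values in a row of `loadings` indicate the original features that contribute most to that component, which is one way to interpret what a component represents.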
When using a support vector machine (SVM) for model training, we need to adjust the hyperparameters to get the best model. The following is an example code for adjusting SVM parameters using grid search (GridSearchCV):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100],
'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'poly', 'linear']}
# Run the grid search
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5)
grid.fit(X_train, y_train)
best_params = grid.best_params_
print(f"Best params: {best_params}")
# Train the model with the best parameters
svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)
Training results:
Accuracy Score: 96.48%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.978723 0.957198 0.964824 0.967961 0.965257
recall 0.926174 0.987952 0.964824 0.957063 0.964824
f1-score 0.951724 0.972332 0.964824 0.962028 0.964617
support 149.000000 249.000000 0.964824 398.000000 398.000000
_______________________________________________
Confusion Matrix:
[[138 11]
[ 3 246]]
Test results:
Accuracy Score: 96.49%
_______________________________________________
CLASSIFICATION REPORT:
0.0 1.0 accuracy macro avg weighted avg
precision 0.967213 0.963636 0.964912 0.965425 0.964954
recall 0.936508 0.981481 0.964912 0.958995 0.964912
f1-score 0.951613 0.972477 0.964912 0.962045 0.964790
support 63.000000 108.000000 0.964912 171.000000 171.000000
_______________________________________________
Confusion Matrix:
[[ 59 4]
[ 2 106]]
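The scaling, PCA, and SVM steps can also be chained into a single estimator and evaluated with cross-validation. The sketch below uses `make_pipeline` and `cross_val_score`; the default RBF settings and the 5-fold split are assumptions chosen for illustration, not tuned values from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Standardize, project onto two principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=2), SVC(kernel='rbf'))

# Each fold refits the scaler and PCA on its own training portion.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores.round(4))
print(f"Mean CV accuracy with 2 components: {scores.mean():.4f}")
```

Keeping the preprocessing inside the pipeline means the held-out fold never influences the scaler or the PCA projection.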
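In this article, we covered the basic concepts of SVM (hyperplanes, support vectors, and the margin), trained SVMs with linear, polynomial, and RBF kernels in scikit-learn, prepared the data by scaling, tuned hyperparameters with grid search, and used PCA for dimensionality reduction and visualization.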
Reference: Support Vector Machine & PCA Tutorial for Beginner