
【Machine Learning】Application of Support Vector Machine and Principal Component Analysis in Machine Learning

2024-07-12


Support vector machine (SVM) is a powerful and versatile machine learning model for linear and nonlinear classification, regression, and outlier detection. This article introduces the support vector machine algorithm and its implementation in scikit-learn, and briefly explores principal component analysis and its application in scikit-learn.

1. Overview of Support Vector Machine

Support Vector Machine (SVM) is a widely used algorithm in machine learning, favored for delivering high accuracy with relatively little computing power. SVM can be used for both classification and regression tasks, but it is most widely applied to classification problems.

What is a Support Vector Machine?

The goal of a support vector machine is to find a hyperplane in N-dimensional space (where N is the number of features) that clearly separates the data points. This hyperplane divides the data points of different categories while keeping them as far from it as possible, which makes the classification robust.


In order to achieve effective separation of data points, there may be multiple hyperplanes. Our goal is to select a hyperplane with the maximum margin, that is, the maximum distance between the two classes. Maximizing the margin helps improve the accuracy of classification.
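For reference (this is the standard textbook formulation; the article itself only gives the intuition), if the hyperplane is written as w·x + b = 0, the distance between the two margin boundaries is 2/‖w‖, so finding the maximum-margin hyperplane amounts to solving:

minimize (1/2)‖w‖²  over w and b
subject to y_i (w·x_i + b) ≥ 1 for every training point (x_i, y_i), with labels y_i ∈ {-1, +1}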

Hyperplanes and Support Vectors


A hyperplane is a decision boundary that divides data points. Data points on both sides of the hyperplane can be classified into different categories. The dimension of the hyperplane depends on the number of features: if the input features are 2, the hyperplane is a straight line; if the features are 3, the hyperplane is a two-dimensional plane. When the number of features exceeds 3, the hyperplane becomes difficult to understand intuitively.


Support vectors are the points closest to the hyperplane; they determine its position and orientation. Using these support vectors, we maximize the classifier's margin. Removing a support vector changes the position of the hyperplane, which is why these points are crucial when building an SVM.

Large margin intuition

In logistic regression, we use the sigmoid function to squash the output of the linear function into the range [0, 1] and assign labels based on a threshold (0.5). In SVM, we use the output of the linear function directly: if it is greater than 1, the point is assigned to one class; if it is less than -1, it is assigned to the other. By using 1 and -1 as the thresholds, SVM creates the margin band [-1, 1].
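As a minimal illustration (a toy example, not part of the original tutorial), a fitted scikit-learn SVC exposes both the signed distance to the hyperplane (decision_function) and the support vectors themselves:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small two-class toy dataset (any roughly separable data works here)
X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel='linear', C=1.0).fit(X_toy, y_toy)

scores = clf.decision_function(X_toy)      # signed distance from the hyperplane
in_margin = np.abs(scores) <= 1            # points on or inside the [-1, 1] band
print(f"{in_margin.sum()} points lie inside the margin band")
print(clf.support_vectors_)                # the points that define the hyperplane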

2. Data Preprocessing and Visualization

We use a support vector machine to predict whether a breast-cancer diagnosis is malignant or benign.

Basic information of the dataset

  • The dataset has 30 features, for example:
    • Radius (mean distance from the center to points on the perimeter)
    • Texture (standard deviation of grayscale values)
    • Perimeter
    • Area
    • Smoothness (local variation in radius lengths)
    • Compactness (perimeter^2 / area - 1.0)
    • Concavity (severity of concave portions of the contour)
    • Concave points (number of concave portions of the contour)
    • Symmetry
    • Fractal dimension ("coastline approximation" - 1)
  • The dataset contains 569 samples: 212 malignant and 357 benign.
  • Target categories:
    • Malignant
    • Benign

Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

Loading the dataset

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Create a DataFrame from the features and target
col_names = list(cancer.feature_names)
col_names.append('target')
df = pd.DataFrame(np.c_[cancer.data, cancer.target], columns=col_names)
df.head()

Data Overview

print(cancer.target_names)
# ['malignant', 'benign']

# Statistical summary of each feature:
df.describe()

# Column types and non-null counts:
df.info()

Data visualization

Scatter plot matrix of feature pairs
sns.pairplot(df, hue='target', vars=[
    'mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension'
])
Category distribution bar chart
sns.countplot(x=df['target'], label="Count")


Scatter plot of mean area vs. mean smoothness
plt.figure(figsize=(10, 8))
sns.scatterplot(x='mean area', y='mean smoothness', hue='target', data=df)


Heatmap of correlations between variables
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot=True)


3. Model training (problem solving)

In machine learning, model training is the key step in solving a problem. In this section we use scikit-learn to train the models and compare the performance of support vector machines (SVMs) with different kernels.

Data preparation and preprocessing

First, we need to prepare and preprocess the data. Here is a code example of data preprocessing:

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = df.drop('target', axis=1)
y = df.target

print(f"'X' shape: {X.shape}")
print(f"'y' shape: {y.shape}")
# 'X' shape: (569, 30)
# 'y' shape: (569,)

pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler())
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In this code, we build a pipeline that scales the data with MinMaxScaler and StandardScaler, and we split the data into training and test sets, with 30% of the samples held out for testing.

Evaluating Model Performance

To evaluate the performance of the model, we define a print_score function that outputs the accuracy, classification report, and confusion matrix for either the training or the test set:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful classification algorithm whose performance is affected by hyperparameters. The following will introduce the main parameters of SVM and their impact on model performance:

  • C parameter: controls the trade-off between classifying training points correctly and keeping a smooth decision boundary. A smaller C (lenient) makes misclassification cheap (a soft margin), while a larger C (strict) makes misclassification costly (a hard margin), forcing the model to fit the training data more tightly.
  • gamma parameter: controls how far the influence of a single training example reaches. A larger gamma makes the influence more local (nearby points carry more weight), while a smaller gamma makes the influence broader (smoother decision boundaries).
  • degree parameter: sets the degree of the polynomial kernel ('poly') and is ignored by the other kernels. Good values for these hyperparameters can be found by grid search; a quick sketch of how C and gamma change cross-validated accuracy follows this list.
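The sketch below is illustrative only (it reuses the X_train/y_train split created earlier, without scaling, so the exact numbers will vary); it simply shows how changing C and gamma shifts cross-validated accuracy:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Compare a few C / gamma settings for the RBF kernel on the training split
for C in (0.1, 10):
    for gamma in (0.01, 1.0):
        scores = cross_val_score(SVC(kernel='rbf', C=C, gamma=gamma), X_train, y_train, cv=5)
        print(f"C={C}, gamma={gamma}: mean CV accuracy = {scores.mean():.3f}")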

Linear Kernel SVM

Linear kernel SVM works well in most cases, especially when the dataset has a large number of features. Here is a code example using linear kernel SVM:

from sklearn.svm import LinearSVC

model = LinearSVC(loss='hinge', dual=True)
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

The training and testing results are as follows:

Training results:

Accuracy Score: 86.18%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    1.000000    0.819079  0.861809    0.909539      0.886811
recall       0.630872    1.000000  0.861809    0.815436      0.861809
f1-score     0.773663    0.900542  0.861809    0.837103      0.853042
support    149.000000  249.000000  0.861809  398.000000    398.000000
_______________________________________________
Confusion Matrix: 
 [[ 94  55]
 [  0 249]]

Test Results:

Accuracy Score: 89.47%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   1.000000    0.857143  0.894737    0.928571      0.909774
recall      0.714286    1.000000  0.894737    0.857143      0.894737
f1-score    0.833333    0.923077  0.894737    0.878205      0.890013
support    63.000000  108.000000  0.894737  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[ 45  18]
 [  0 108]]

Polynomial Kernel SVM

The polynomial kernel SVM is suitable for nonlinear data. Here is a code example using a second-order polynomial kernel:

from sklearn.svm import SVC

model = SVC(kernel='poly', degree=2, gamma='auto', coef0=1, C=5)
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

The training and testing results are as follows:

Training results:

Accuracy Score: 96.98%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.985816    0.961089  0.969849    0.973453      0.970346
recall       0.932886    0.991968  0.969849    0.962427      0.969849
f1-score     0.958621    0.976285  0.969849    0.967453      0.969672
support    149.000000  249.000000  0.969849  398.000000    398.000000
_______________________________________________
Confusion Matrix: 
 [[139  10]
 [  2 247]]

Test Results:

Accuracy Score: 97.08%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   0.967742    0.972477   0.97076    0.970109      0.970733
recall      0.952381    0.981481   0.97076    0.966931      0.970760
f1-score    0.960000    0.976959   0.97076    0.968479      0.970711
support    63.000000  108.000000   0.97076  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[ 60   3]
 [  2 106]]

Radial Basis Function (RBF) Kernel SVM

The radial basis function (RBF) kernel is suitable for processing nonlinear data. The following is a code example using the RBF kernel:

model = SVC(kernel='rbf', gamma=0.5, C=0.1)
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

The training and testing results are as follows:

Training results:

Accuracy Score: 62.56%
_______________________________________________
CLASSIFICATION REPORT:
             0.0         1.0  accuracy   macro avg  weighted avg
precision    0.0    0.625628  0.625628    0.312814      0.392314
recall       0.0    1.000000  0.625628    0.500000      0.625628
f1-score     0.0    0.769231  0.625628    0.384615      0.615385
support    149.0  249.0    0.625628  398.0        398.0
_______________________________________________
Confusion Matrix: 
 [[  0 149]
 [  0 249]]

Test Results:

Accuracy Score: 64.97%
_______________________________________________
CLASSIFICATION REPORT:
             0.0         1.0  accuracy   macro avg  weighted avg
precision    0.0    0.655172  0.649661    0.327586      0.409551
recall       0.0    1.000000  0.649661    0.500000      0.649661
f1-score     0.0    0.792453  0.649661    0.396226      0.628252
support    63.0  108.0    0.649661  171.0        171.0
_______________________________________________
Confusion Matrix: 
 [[  0  63]
 [  0 108]]

Summary

Through the above model training and evaluation process, we can observe the performance differences of different SVM kernels. Linear kernel SVM performs well in accuracy and training time, and is suitable for situations with high data dimensions. Polynomial kernel SVM and RBF kernel SVM have better performance on nonlinear data, but may cause overfitting under certain parameter settings. Choosing the right kernel and hyperparameters is crucial to improving model performance.

4. SVM Data Preparation

Numeric input: SVM assumes that the input data is numeric. If the input data is categorical, you may need to convert it into binary dummy variables (one variable per category).
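As a small illustration (the column name here is hypothetical; the breast-cancer dataset is already fully numeric), pandas can create such dummy variables directly:

import pandas as pd

# Hypothetical categorical column, purely for illustration
raw = pd.DataFrame({'tumor_site': ['left', 'right', 'left', 'central']})
dummies = pd.get_dummies(raw, columns=['tumor_site'])
print(dummies)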

Binary classification: The basic SVM is designed for binary classification problems. Although SVM is mainly used for binary classification, extensions exist for regression and multi-class classification. The code below applies the scaling pipeline defined in the previous section to the training and test sets before the models in this section are fitted.

X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)

Model training and evaluation

The following shows the training and testing results for different SVM kernels:

Linear Kernel SVM

print("=======================Linear Kernel SVM==========================")
model = SVC(kernel='linear')
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

Training results:

Accuracy Score: 98.99%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    1.000000    0.984190   0.98995    0.992095      0.990109
recall       0.973154    1.000000   0.98995    0.986577      0.989950
f1-score     0.986395    0.992032   0.98995    0.989213      0.989921
support    149.000000  249.000000   0.98995  398.000000    398.000000
_______________________________________________
Confusion Matrix: 
 [[145   4]
 [  0 249]]

Test Results:

Accuracy Score: 97.66%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   0.968254    0.981481  0.976608    0.974868      0.976608
recall      0.968254    0.981481  0.976608    0.974868      0.976608
f1-score    0.968254    0.981481  0.976608    0.974868      0.976608
support    63.000000  108.000000  0.976608  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[ 61   2]
 [  2 106]]
Polynomial Kernel SVM
print("=======================Polynomial Kernel SVM==========================")
from sklearn.svm import SVC

model = SVC(kernel='poly', degree=2, gamma='auto')
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

Training results:

Accuracy Score: 85.18%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.978723    0.812500  0.851759    0.895612      0.874729
recall       0.617450    0.991968  0.851759    0.804709      0.851759
f1-score     0.757202    0.893309  0.851759    0.825255      0.842354
support    149.000000  249.000000  0.851759  398.000000    398.000000
_______________________________________________
Confusion Matrix: 
 [[ 92  57]
 [  2 247]]

Test Results:

Accuracy Score: 82.46%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   0.923077    0.795455  0.824561    0.859266      0.842473
recall      0.571429    0.972222  0.824561    0.771825      0.824561
f1-score    0.705882    0.875000  0.824561    0.790441      0.812693
support    63.000000  108.000000  0.824561  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[ 36  27]
 [  3 105]]
Radial Basis Function Kernel SVM
print("=======================Radial Kernel SVM==========================")
from sklearn.svm import SVC

model = SVC(kernel='rbf', gamma=1)
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

Training results:

Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
             0.0    1.0  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    149.0  249.0       1.0      398.0         398.0
_______________________________________________
Confusion Matrix: 
 [[149   0]
 [  0 249]]

Test Results:

Accuracy Score: 63.74%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   1.000000    0.635294  0.637427    0.817647      0.769659
recall      0.015873    1.000000  0.637427    0.507937      0.637427
f1-score    0.031250    0.776978  0.637427    0.404114      0.502236
support    63.000000  108.000000  0.637427  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[  1  62]
 [  0 108]]

Support Vector Machine Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100], 
              'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001], 
              'kernel': ['rbf', 'poly', 'linear']} 

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5)
grid.fit(X_train, y_train)

best_params = grid.best_params_
print(f"Best params: {best_params}")

svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)

Training results:

Accuracy Score: 98.24%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.986301    0.980159  0.982412    0.983230      0.982458
recall       0.966443    0.991968  0.982412    0.979205      0.982412
f1-score     0.976271    0.986028  0.982412    0.981150      0.982375
support    149.000000  249.000000  0.982412  398.000000    398.000000
_______________________________________________
Confusion Matrix: 
 [[144   5]
 [  2 247]]

Test Results:

Accuracy Score: 98.25%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   0.983871    0.981651  0.982456    0.982761      0.982469
recall      0.968254    0.990741  0.982456    0.979497      0.982456
f1-score    0.976000    0.986175  0.982456    0.981088      0.982426
support    63.000000  108.000000  0.982456  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[ 61   2]
 [  1 107]]

5. Principal Component Analysis

Introduction to PCA

Principal component analysis (PCA) is a technique that achieves linear dimensionality reduction by projecting data into a lower dimensional space. The specific steps are as follows:

  • Using Singular Value Decomposition: Project the data into a low-dimensional space through singular value decomposition.
  • Unsupervised Learning: PCA does not require labeled data for dimensionality reduction.
  • Feature transformation: PCA finds the directions (combinations of features) that explain the most variance in the data; a short sketch of the scikit-learn API follows this list.
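A minimal sketch (assuming the feature matrix X from the earlier sections; the exact ratios depend on the data) showing how scikit-learn reports the variance captured by each component:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so each feature contributes comparably to the variance
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)        # fraction of variance explained by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained by the first two components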

PCA Visualization

Since high-dimensional data is difficult to visualize directly, we can use PCA to find the first two principal components and visualize the data in two dimensions. To achieve this, the data first needs to be standardized so that each feature has unit variance.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Standardize the data
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Reduce to two dimensions with PCA
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Visualize the first two principal components
plt.figure(figsize=(8,6))
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()


With the first two principal components, we can easily separate different categories of data points in two-dimensional space.
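One quick sanity check (a sketch, using the two-component X_train/X_test produced by the code above; the exact score will vary) is to fit a simple classifier on just these two features:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

clf_2d = SVC(kernel='linear')
clf_2d.fit(X_train, y_train)            # X_train here contains only the two principal components
pred = clf_2d.predict(X_test)
print(f"Accuracy using 2 principal components: {accuracy_score(y_test, pred):.3f}")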

Interpreting the Components

Although dimensionality reduction is powerful, the meaning of the components is difficult to interpret directly. Each component corresponds to a combination of the original features; these combinations can be read from a fitted PCA object, as sketched after the list below.

Relevant properties and uses of the components include:

  • Component scores: the transformed values of each sample (the data expressed in the new coordinates).
  • Loadings (weights): the weight each original feature carries in a component.
  • Data compression and information preservation: PCA is used to compress data while retaining key information.
  • Noise Filtering: Noise can be filtered out during the dimensionality reduction process.
  • Feature extraction and engineering: Used to extract and construct new features.
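A common way to inspect the loadings (a sketch, assuming the pca object fitted above and the feature names from the original dataset) is to put components_ into a DataFrame and plot it as a heatmap:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Each row is a principal component; each column is the weight of an original feature
loadings = pd.DataFrame(pca.components_,
                        columns=cancer.feature_names,
                        index=['PC1', 'PC2'])

plt.figure(figsize=(14, 4))
sns.heatmap(loadings, cmap='coolwarm')
plt.show()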

Parameter Tuning of Support Vector Machine (SVM) Model

When using a support vector machine (SVM) for model training, we need to adjust the hyperparameters to get the best model. The following is an example code for adjusting SVM parameters using grid search (GridSearchCV):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100], 
              'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001], 
              'kernel': ['rbf', 'poly', 'linear']} 

# Grid search
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5)
grid.fit(X_train, y_train)
best_params = grid.best_params_
print(f"Best params: {best_params}")

# Train the model with the best parameters
svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)

Training results:

Accuracy Score: 96.48%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.978723    0.957198  0.964824    0.967961      0.965257
recall       0.926174    0.987952  0.964824    0.957063      0.964824
f1-score     0.951724    0.972332  0.964824    0.962028      0.964617
support    149.000000  249.000000  0.964824  398.000000    398.000000
_______________________________________________
Confusion Matrix: 
 [[138  11]
 [  3 246]]

Test Results:

Accuracy Score: 96.49%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   0.967213    0.963636  0.964912    0.965425      0.964954
recall      0.936508    0.981481  0.964912    0.958995      0.964912
f1-score    0.951613    0.972477  0.964912    0.962045      0.964790
support    63.000000  108.000000  0.964912  171.000000    171.000000
_______________________________________________
Confusion Matrix: 
 [[ 59   4]
 [  2 106]]

6. Conclusion

In this article, we learned the following:

  • Support Vector Machine (SVM): Understand the basic concepts of SVM and its implementation in Python.
  • SVM Kernel Function: Includes linear, radial basis function (RBF), and polynomial kernel functions.
  • Data preparation: How to prepare data for the SVM algorithm.
  • Hyperparameter tuning: Tuning SVM hyperparameters via grid search.
  • Principal Component Analysis (PCA): How to use PCA in scikit-learn to reduce the dimensionality of the data.

Reference: Support Vector Machine & PCA Tutorial for Beginner