[Python learning notes] The Optuna hyperparameter tuning tool: a Titanic case study

2024-07-12


Background (feel free to skip this part):
I recently landed an internship as an AI data annotator, but I spent the whole time on text work, which was essentially assembly-line labor. So I wanted to pick up skills such as hyperparameter tuning and deployment to become more competitive and to have something to put on my resume later.
Naturally, my first stop for tutorials was Bilibili, where I found a free, well-reviewed hyperparameter tuning tool: Optuna.
By the way, my internship was text annotation for content safety (in plain language: making sure the content of the training set is politically correct). While watching other videos to learn YOLOv5 for mask and face detection, though, I came across a free image annotation tool that is very pleasant to use. I tried it myself: it is easy to install and use, has no ads or other clutter, and is simple and efficient.
Tool name: LabelImg
An article from a WeChat official account that my mentor shared earlier covers it thoroughly:
https://mp.weixin.qq.com/s/AE_rJwd9cKQkUGFGD6EfAg
————————————————————————————————————————————

Main content: about Optuna and my learning process
Video tutorial link: https://www.bilibili.com/list/watchlater?oid=832000670&bvid=BV1c34y1G7E8&spm_id_from=333.1007.top_right_bar_window_view_later.content.click
Installation method:

With Anaconda, Optuna can presumably be installed through conda as well. To save trouble, I simply installed it with pip, as shown in the video.
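
Either route comes down to one command: pip install optuna, or conda install -c conda-forge optuna inside a conda environment. Before moving on, it can help to confirm that the interpreter Jupyter is actually running can see the packages. Below is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util
import sys


def is_installed(package_name: str) -> bool:
    """Return True if the current interpreter can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


# sys.executable shows which Python is running, handy when several environments exist
print("Interpreter:", sys.executable)
for name in ("optuna", "lightgbm", "xgboost", "catboost"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

If a package shows as missing here but is installed elsewhere, the notebook is most likely running on a different environment's kernel.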
————————————————————————————————————————————

Link to the Kaggle notebook used by the video's author:
https://www.kaggle.com/code/yunsuxiaozi/learn-to-use-the-optuna/notebook
Copying and pasting is comfortable, but to keep a feel for the details I recommend writing the code out slowly by hand if you have time. Even retyping it is better than pasting it directly. When you are just starting to learn a skill, slow is fast.
Also, if you have Anaconda on your computer, I recommend creating a dedicated virtual environment for trying out Optuna, to avoid package conflicts.
————————————————————————————————————————————

A CSDN tutorial on creating a conda virtual environment and making it available in Jupyter:
https://blog.csdn.net/fanstering/article/details/123459665
If you don't have Anaconda, you can skip this step.

I was already familiar with setting up a virtual environment. The environment and the Optuna package were clearly installed, yet Jupyter still reported that the Optuna library was missing. I followed this tutorial and installed nb_conda, after which the problem was solved.

————————————————————————————————————————————

Then we enter the familiar coding phase. First, import the Python libraries we need and install whatever is missing, with either pip install or conda install. If you have Anaconda, I recommend the latter.

If Jupyter reports this warning: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.read

Don't panic; as the message suggests, you just need to upgrade Jupyter (and ipywidgets).

After I completed the first three upgrade steps, I refreshed the page, ran the program again, and everything was fine.

————————————————————————————————

Dataset download link: https://www.kaggle.com/competitions/titanic/data?select=train.csv

total_df['Embarked_is_nan']=(total_df['Embarked']!=total_df['Embarked'])

This line creates a new column 'Embarked_is_nan' that flags missing values (NaN) in the 'Embarked' column. NaN is the only value that is not equal to itself, so wherever an element of 'Embarked' differs from itself, the new column is True; everywhere else it is False.
This is the first time I have seen this idiom.
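
To see the idiom in isolation, here is a small sketch on a toy Series (the data is made up, standing in for the real 'Embarked' column). It also compares the result with pandas' isna(), which is the more self-documenting way to write the same check.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Embarked' column, with one missing value
embarked_demo = pd.Series(["S", "C", np.nan, "Q"])

# NaN is the only value that is not equal to itself
flag_by_comparison = embarked_demo != embarked_demo
# The more common, self-documenting spelling
flag_by_isna = embarked_demo.isna()

print(flag_by_comparison.tolist())
print(flag_by_comparison.equals(flag_by_isna))
```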

keys=['Pclass','Sex','SibSp','Parch']
for key in keys:
    values=np.unique(train_df[key].values)
    
    if len(values)<10 and key!="Survived":
        print(f"key:{key},values:{values}") 
        
        key_target=train_df['Survived'].groupby([train_df[key]]).mean()
        keys=key_target.keys().values
        target=key_target.values
        key_target=pd.DataFrame({key:keys,key+"_target":target})
        total_df=pd.merge(total_df,key_target,on=key,how="left")
total_df.head()

This code is a bit complicated. First, it picks four important features, ['Pclass', 'Sex', 'SibSp', 'Parch'], and stores them in the keys list. These columns are assumed to be related to passenger survival.
It then goes through the list one by one and checks how many unique values each feature has in the training set.
If a column has fewer than 10 unique values (len(values) < 10), the code goes on to analyze the relationship between that column and survival.
(I am a little confused about why key != "Survived" is checked, since "Survived" is not in the keys list at all. It is presumably a leftover guard from a version that looped over every column; here it has no effect.)
Printing shows that all four features meet the condition.

key_target=train_df['Survived'].groupby([train_df[key]]).mean()

In this step the code computes, on the training set train_df, the mean of the 'Survived' column (that is, the survival rate) for each unique value of the current key column.
The contents of key_target change with the key in each iteration. For example, when key = Pclass, the index of key_target is 1, 2, 3 and the values are the corresponding survival rates; when key = Sex, the index is 'female' and 'male', and the values are again the corresponding survival rates.
The next steps are a bit convoluted, so I drew a table in Word to make them easier to follow:
keys holds four features. We take one at a time, and in the end the data for all four is merged into total_df.
Take Pclass as an example. Pclass has three values, 1/2/3. The survival rate is computed for each value and saved in key_target.
Then the index of key_target is pulled out into a variable named keys (yes, it reuses the name keys..., but it now holds the three Pclass values 1/2/3), and the values go into target (the three numbers 0.629630, 0.472826, and 0.242363). The loop still works because for keeps iterating over the original list object. The two columns are then named explicitly: the key column keeps the name Pclass, and the value column gets the more distinctive name 'Pclass_target'.
The key_target tables of the remaining three features follow the same pattern, and total_df keeps gaining columns.

(The 'male' row fell outside the screenshot, so the picture was cut off and a separate piece was added afterwards.)
In the end total_df becomes one large table.
To be honest, for an Optuna example whose main purpose is to teach a tuning tool, the way the original author wrote this step really added to my burden of understanding...
I asked Kimi to rewrite this code into a clearer, simpler version, so that the many keys and targets filling the screen are harder to mix up.
This part is not the focus of the tuning steps. If you really can't follow it, you can skip it.
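
For readers who do want to follow it, here is a minimal sketch of one way to express the same idea more compactly. It is not necessarily the rewritten version mentioned above, and the toy tables (toy_train, toy_total) are made up for illustration. It replaces the build-a-DataFrame-then-merge step with groupby plus map, which gives the same result as a left merge when each category appears once in the lookup table.

```python
import pandas as pd

# Toy stand-ins for the tutorial's train_df and total_df
toy_train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 1],
})
toy_total = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Sex":    ["male", "female", "female", "male"],
})

for col in ["Pclass", "Sex"]:
    # Mean of 'Survived' per category, computed on the training rows only
    survival_rate = toy_train.groupby(col)["Survived"].mean()
    # Look up each row's category and store the rate in a new '<col>_target' column
    toy_total[f"{col}_target"] = toy_total[col].map(survival_rate)

print(toy_total)
```

Using a distinct name such as survival_rate for the lookup table avoids reusing keys inside the loop, which is what made the original hard to read.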
Filling missing values with the mean is a common data-processing step; I am noting it down here.
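
A minimal sketch of mean imputation with pandas. The column name Age and its values are assumptions for illustration; which columns the original notebook fills is not shown here.

```python
import numpy as np
import pandas as pd

# Toy column with gaps, standing in for a numeric feature such as 'Age'
df_demo = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})

# mean() skips NaN by default, so the average is taken over the known values only
age_mean = df_demo["Age"].mean()
df_demo["Age"] = df_demo["Age"].fillna(age_mean)

print(df_demo["Age"].tolist())
```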

Supplement: When I was taking the Data Analysis and Tableau Visualization course, the teacher shared a post that specifically introduced many types of missing data and how to handle them: https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184
The post is in English and you need to register and log in to view it.
————————————————————————————————————————————
Once total_df has been enriched with the extra features, it is split back into the training set and the test set according to their original lengths.
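
The concatenate-then-split pattern can be sketched as follows. The toy tables and the extra column are made up; the point is only that the training rows come first, so the original training length is enough to cut the table apart again.

```python
import pandas as pd

# Toy stand-ins: 3 training rows (with labels) and 2 test rows (without)
toy_train = pd.DataFrame({"Pclass": [1, 2, 3], "Survived": [1, 1, 0]})
toy_test = pd.DataFrame({"Pclass": [3, 1]})

# Stack the two sets so feature engineering is applied once to all rows
toy_total = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
toy_total["Pclass_is_first"] = toy_total["Pclass"] == 1

# Cut the combined table back apart using the original training length
n_train = len(toy_train)
toy_train = toy_total.iloc[:n_train].copy()
toy_test = toy_total.iloc[n_train:].copy()

print(toy_train.shape, toy_test.shape)
```

The test rows carry NaN in 'Survived' after the concatenation, which is expected since the labels are unknown there.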
————————————————————————————————————————————

The ratio of training set to test set used here is 8:2, or 4:1, which is a very common split.
The original author probably treated test and valid as the same thing, so the naming does not distinguish between them.
Strictly speaking, the two concepts are not the same (a validation set is used during tuning, a test set only for the final check), although I have seen the distinction applied loosely. Here I have used test_X and test_y throughout.
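
A minimal sketch of an 8:2 split with scikit-learn's train_test_split. Whether the original notebook used exactly this call is an assumption, and the arrays are toy data with _demo names to keep them apart from the real variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the engineered Titanic data
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# test_size=0.2 gives the 8:2 split; random_state makes it repeatable
train_X_demo, test_X_demo, train_y_demo, test_y_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2023
)

print(train_X_demo.shape, test_X_demo.shape)
```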

————————————————————————————————————————————
Import LGBMRegressor. Although the Titanic dataset is mostly used for classification (predicting whether a passenger survives), the original author treats it as a regression problem here. My thinking was that beginners are better off with a simple, familiar dataset and a tutorial that comes with a video explanation, so I did not dwell on regression versus classification. Once you are familiar with the workflow, you can find more complex, more standard data online to practice on.
Supplementary reading: "LGBMRegressor parameter setting lgbmclassifier parameters"
https://blog.51cto.com/u_12219/10333606
(Doesn't this post cover an LGBM classifier too? Why did the author of the video specifically treat this as a regression problem?)
————————————————————————————————————————————
RMSE (root mean squared error) serves here as both the loss and the evaluation metric. It is implemented as a small Python function that computes the root mean squared error; the smaller, the better.
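
A minimal sketch of such a function with NumPy. The name RMSE matches the call in the objective function further down, but the body is an assumption and may differ from the original author's version.

```python
import numpy as np


def RMSE(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of 0, 0, and 3 give sqrt((0 + 0 + 9) / 3)
print(RMSE([1, 0, 1], [1, 0, -2]))
```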
————————————————————————————————————————————
Define the objective function and its search space, following the original author's video:
For specific parameter names, meanings, and recommended values, please refer to the supplementary learning post "LGBMRegressor parameter settings lgbmclassifier parameters" above.

def objective(trial):
    param = {
        'metric': 'rmse',
        'random_state': trial.suggest_int('random_state', 2023, 2023),  # the seed is fixed, so the range is 2023-2023
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),  # number of boosting rounds, an integer in 50-300
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),  # sampled log-uniformly
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),  # replaces the deprecated suggest_loguniform
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1),  # float
        'subsample': trial.suggest_float('subsample', 0.5, 1),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 8, 64),  # integer
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 100),  # name fixed from the typo 'min_child_smaples'
    }
    model = LGBMRegressor(**param)  # build the model
    # As written in the video; this fit() call fails on LightGBM 4.0+, see the error discussed below
    model.fit(train_X, train_y, eval_set=[(test_X, test_y)], early_stopping_rounds=100, verbose=False)  # fit
    preds = model.predict(test_X)  # predict on the held-out set
    rmse = RMSE(test_y, preds)  # loss on the held-out set
    return rmse

————————————————————————————————————————————
Key code: call Optuna, create a study, tell it to minimize the loss, and give the study a name:

# Name the study and search for the minimum
study = optuna.create_study(direction='minimize', study_name='Optimize boosting hyperparameters')  # create an Optuna study object for hyperparameter optimization
# Pass in the objective function and the number of trials
study.optimize(objective, n_trials=100)  # run the objective defined above for 100 trials
# Print the best parameters
print('Best Trial: ', study.best_trial.params)  # study.best_trial is the trial with the lowest loss; params is a dict of the hyperparameters it used
lgbm_params = study.best_trial.params  # keep the best parameters so they can be passed to the LightGBM model later
————————————————————————————————————————————
The idea is fine, but we hit a strange error: fit() got an unexpected keyword argument 'early_stopping_rounds':

I tried Kimi's suggestion but still got the error, so it is probably not a mistake on our side.

The helpful people on Stack Overflow offer two solutions:
https://stackoverflow.com/questions/76895269/lgbmclassifier-fit-got-an-unexpected-keyword-argument-early-stopping-rounds


Note that it is best to install with pip; conda install did not seem to work for me.

But it did not help... The same error came back: early_stopping_rounds was not recognized, and after I deleted it, the next parameter, verbose, was not recognized either...
Changing the parameters according to the supplementary post above did not work either. That information seems to be a little out of date...

————————————————————————————————————————————
By the way, I came across a possible pitfall while searching: according to the post below, the maximum value of early_stopping_rounds seems to be 100.
https://blog.csdn.net/YangTinTin/article/details/120708391
————————————————————————————————————————————
I searched online and asked Kimi, but have not yet found an alternative that works for me. Since the main goal is to try out Optuna, we will simply delete these two troublesome parameters for now.

model.fit(train_X, train_y, eval_set=[(test_X, test_y)])  # fit

Complete objective code:

def objective(trial):
    param = {
        'metric': 'rmse',
        'random_state': trial.suggest_int('random_state', 2023, 2023),  # the seed is fixed, so the range is 2023-2023
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),  # number of boosting rounds, an integer in 50-300
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),  # sampled log-uniformly
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),  # replaces the deprecated suggest_loguniform
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1),  # float
        'subsample': trial.suggest_float('subsample', 0.5, 1),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 8, 64),  # integer
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 100),  # name fixed from the typo 'min_child_smaples'
    }
    model = LGBMRegressor(**param)  # build the model
    model.fit(train_X, train_y, eval_set=[(test_X, test_y)])  # fit
    preds = model.predict(test_X)  # predict on the held-out set
    rmse = RMSE(test_y, preds)  # loss on the held-out set
    return rmse

Then you can get the best trial:
————————————————————————————————————————————
Similarly, the original uploader also tuned XGBoost and CatBoost in the same way to find their best parameters.
It is easy to see that the three versions differ mainly in the search space defined in param and in the model class that is instantiated.
XGBoost:

XGBoost results:

————————————————————————————————————————————
CatBoost:

CatBoost results:

————————————————————————————————————————————
Finally, use K-fold cross-validation to get the best result:
Cross-validation is also a common term in machine learning.

def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred)/len(y_true)

kf is a KFold object, scikit-learn's tool for K-fold cross-validation. KFold divides the dataset into n_splits subsets; each subset takes a turn as the validation set while the rest serve as the training set.
for train_index, valid_index in kf.split(X): this line iterates over the splits produced by the KFold object, yielding two arrays each time, train_index and valid_index. train_index holds the positions of the rows used for training, and valid_index those used for validation. Indexing the full X and y with them gives the training and validation data for that round. It is like shuffling the class roster once, dividing it into groups, and then calling up one group at a time.
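
The index arrays are easiest to understand on a tiny example. The sketch below uses ten made-up samples and five folds (the tutorial itself uses ten folds on the real data) and checks that every sample is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples so the index arrays are easy to read
X_demo = np.arange(20).reshape(10, 2)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=2023)

seen_in_validation = []
for fold, (train_index, valid_index) in enumerate(kf_demo.split(X_demo)):
    print(f"fold {fold}: train={train_index}, valid={valid_index}")
    seen_in_validation.extend(valid_index.tolist())

# Every sample lands in the validation set exactly once across the folds
print(sorted(seen_in_validation))
```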
————————————————————————————————————————————
Now I ran into a common kind of bug, though I don't know how it arose or whether Python is to blame again. (The shapes of my X and y datasets are exactly the same as the author's, and nothing else is done to them afterwards...)

Don't be afraid when you run into problems, fix them! Trust Kimi! (By the way, the employees of the AI company where I interned also used Kimi, so Kimi is quite trustworthy, and it's free!!)

Don't forget to comment out the early_stopping_rounds and verbose parameters as well, to avoid errors.
This block of code is quite long and has many repetitive parts. Be careful not to leave anything out or make mistakes.

from sklearn.model_selection import KFold  # K-fold cross-validation splitter from scikit-learn
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def accuracy(y_true,y_pred):
    return np.sum(y_true==y_pred)/len(y_true)
print("start fit.")
folds = 10  # split the data into 10 folds
y=train_df['Survived']
X=train_df.drop(['Survived'],axis=1)

train_accuracy=[]
valid_accuracy=[]
# list that stores the fitted models
models = []

# shuffle the dataset and split it into `folds` parts
kf = KFold(n_splits=folds, shuffle=True, random_state=2023)

# split X into training and validation parts at a 9:1 ratio and get their indices
for train_index, valid_index in kf.split(X):

    # select the training and validation rows by index
    x_train_cv = X.iloc[train_index]
    y_train_cv = y.iloc[train_index]
    x_valid_cv = X.iloc[valid_index]
    y_valid_cv = y.iloc[valid_index]

    model = LGBMRegressor(**lgbm_params)

    # train on x_train_cv; evaluate on both x_train_cv and x_valid_cv
    model.fit(
        x_train_cv,
        y_train_cv,
        eval_set = [(x_train_cv, y_train_cv), (x_valid_cv, y_valid_cv)],
        #early_stopping_rounds=100,
        #verbose = 100, # print a result every 100 iterations
    )

    # predict on the training fold
    y_pred_train = model.predict(x_train_cv)
    # predict on the validation fold
    y_pred_valid = model.predict(x_valid_cv)

    # turn the regression output into 0/1 labels with a 0.5 threshold
    y_pred_train=(y_pred_train>=0.5)
    y_pred_valid=(y_pred_valid>=0.5)

    train_acc=accuracy(y_pred_train,y_train_cv)
    valid_acc=accuracy(y_pred_valid,y_valid_cv)

    train_accuracy.append(train_acc)
    valid_accuracy.append(valid_acc)

    # save the model in the list
    models.append(model)

    model = XGBRegressor(**xgb_params)

    # train on x_train_cv; evaluate on both x_train_cv and x_valid_cv
    model.fit(
        x_train_cv,
        y_train_cv,
        eval_set = [(x_train_cv, y_train_cv), (x_valid_cv, y_valid_cv)],
        #early_stopping_rounds=100,
        #verbose = 100, # print a result every 100 iterations
    )

    # predict on the training fold
    y_pred_train = model.predict(x_train_cv)
    # predict on the validation fold
    y_pred_valid = model.predict(x_valid_cv)

    y_pred_train=(y_pred_train>=0.5)
    y_pred_valid=(y_pred_valid>=0.5)

    train_acc=accuracy(y_pred_train,y_train_cv)
    valid_acc=accuracy(y_pred_valid,y_valid_cv)

    train_accuracy.append(train_acc)
    valid_accuracy.append(valid_acc)

    # save the model in the list
    models.append(model)

    model = CatBoostRegressor(**cat_params)

    # train on x_train_cv; evaluate on both x_train_cv and x_valid_cv
    model.fit(
        x_train_cv,
        y_train_cv,
        eval_set = [(x_train_cv, y_train_cv), (x_valid_cv, y_valid_cv)],
        #early_stopping_rounds=100,
        #verbose = 100, # print a result every 100 iterations
    )

    # predict on the training fold
    y_pred_train = model.predict(x_train_cv)
    # predict on the validation fold
    y_pred_valid = model.predict(x_valid_cv)

    y_pred_train=(y_pred_train>=0.5)
    y_pred_valid=(y_pred_valid>=0.5)

    train_acc=accuracy(y_pred_train,y_train_cv)
    valid_acc=accuracy(y_pred_valid,y_valid_cv)

    train_accuracy.append(train_acc)
    valid_accuracy.append(valid_acc)

    # save the model in the list
    models.append(model)

    print(f"train_accuracy:{train_accuracy}, valid_accuracy:{valid_accuracy}")

train_accuracy=np.array(train_accuracy)
valid_accuracy=np.array(valid_accuracy)

print(f"mean_train_accuracy: {np.mean(train_accuracy)}")
print(f"mean_valid_accuracy: {np.mean(valid_accuracy)}")

My intermediate output differs slightly from the original author's, and I don't know what causes the difference.

Fortunately, the results so far are not too bad:

————————————————————————————————————————————
Run each saved model on the test set:

test_X = test_df.drop(['Survived'], axis = 1).values

preds_test = []

# predict test_X once with every saved model; the predictions are averaged afterwards
for model in models:
    pred = model.predict(test_X)
    preds_test.append(pred)

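# convert the list of predictions to an np.array of shape (n_models, n_samples)
preds_test_np = np.array(preds_test)

# average across models (axis=0), giving one value per test sample
test_pred = preds_test_np.mean(axis=0)
test_pred = (test_pred >= 0.5).astype(np.int64)
# compare the averaged prediction with 0.5 (True if >= 0.5, otherwise False), then use astype(np.int64) to turn the Booleans into 64-bit integers 1 and 0
test_pred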

Look at the output of test_pred.shape. Don't be fooled by the rectangular way the array is printed: it is still a one-dimensional array with 418 elements.
In NumPy, the shape of a one-dimensional array is written (N,), where N is the number of elements in the array.
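
The shapes involved in the averaging step can be checked on a small made-up example with three models and four passengers (the numbers are invented for illustration):

```python
import numpy as np

# Pretend three models each scored four passengers (values are made up)
preds_demo = [
    np.array([0.9, 0.2, 0.6, 0.4]),
    np.array([0.8, 0.1, 0.7, 0.3]),
    np.array([0.7, 0.3, 0.5, 0.2]),
]

stacked = np.array(preds_demo)        # shape (3, 4): one row per model
averaged = stacked.mean(axis=0)       # shape (4,): one value per passenger
labels = (averaged >= 0.5).astype(np.int64)  # Boolean mask converted to 0/1

print(stacked.shape, averaged.shape, labels.shape)
print(labels)
```

With the real data the same steps produce shape (418,), one label per test passenger.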
————————————————————————————————————————————
This project contains many expressions like test_pred=(test_pred >= 0.5).astype(np.int64), which compare values against a threshold to produce Boolean True/False values (here converted on to integers).

————————————————————————————————————————————
Finally, save the result to a CSV file:

submission=pd.read_csv("D:/StudyFiles/Optuna_Titanic/data/gender_submission.csv")  # read the sample submission CSV into the variable submission
submission['Survived']=test_pred  # overwrite the 'Survived' column with the predicted 0/1 labels
submission.to_csv("submission.csv",index=False)  # save the updated DataFrame as "submission.csv"; index=False leaves out the row index
submission.head()
————————————————————————————————————————————
I will post my code package to the CSDN homepage. Friends in need are welcome to download it by themselves. See you in the next tutorial.
————————————————————————————————————————————
Other related learning content:
1. 9.1 Model Parameter Adjustment [Stanford 21 Fall: Practical Machine Learning, Chinese Edition]: Li Mu's video introduces some of the theory behind hyperparameter tuning. If you are new to tuning, watch it to learn the basics.
https://www.bilibili.com/video/BV1vQ4y1e7LF/?spm_id_from=333.788.recommend_more_video.1&vd_source=cdfd0a0810bcc0bcdbcf373dafdf6a82
2. "This automatic parameter adjustment tool is really powerful! It can fully meet the daily use of machine learning and deep learning parameter adjustment! A must-have tool for beginners!": this video is not as good as the demonstration video I followed. It mainly gives a brief overview of Optuna without any practical examples.
https://www.bilibili.com/video/BV1Zs421K7Qj/?spm_id_from=333.788.recommend_more_video.6&vd_source=cdfd0a0810bcc0bcdbcf373dafdf6a82
However, I am very interested in the book this uploader introduced, because as a newcomer who lacks experience and likes to look for patterns, I would really like a manual that lays out some general formulas and directions.
