2024-07-12
1. Event Introduction
With the rapid development of the global economy and the acceleration of urbanization, the power system is facing increasing challenges. Accurate forecasting of power demand is crucial for the stable operation of the power grid, the effective management of energy, and the integration of renewable energy.
2. Event Mission
Given the relevant sequence data of electricity consumption history of multiple houses for N days, predict the electricity consumption of the houses.
3. Task 2: Advancing LightGBM and starting feature engineering
(1) Import modules: this section contains the modules required by the code.
- import numpy as np
- import pandas as pd
- import lightgbm as lgb
- from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error
- import tqdm
- import sys
- import os
- import gc
- import argparse
- import warnings
- warnings.filterwarnings('ignore')
(2) Data preparation
In the data preparation stage, the main work is to read the training data and test data and perform a basic inspection of them.
- train = pd.read_csv('./data/train.csv')
- test = pd.read_csv('./data/test.csv')
A brief introduction to the data: id is the house id; dt is the day identifier (the minimum dt in the training data is 11, and different ids have sequences of different lengths); type is the house type, and overall consumption generally differs quite a lot between house types; target is the actual power consumption, which is also the prediction target of this competition. The following simple visual analysis helps us get a first feel for the data.
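Before plotting, a quick tabular inspection is often useful. A minimal sketch, assuming the columns are exactly id, dt, type, and target as described above:
- # Quick look at the raw tables: sizes, first rows, and type distribution
- print(train.shape, test.shape)
- print(train.head())
- print(train['type'].value_counts())
- # Sequence length (number of days) and dt range available for each house id
- print(train.groupby('id')['dt'].agg(['min', 'max', 'count']).head())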
Bar chart of the average target for each house type
- import matplotlib.pyplot as plt
- # Bar chart of the average target for each type
- type_target_df = train.groupby('type')['target'].mean().reset_index()
- plt.figure(figsize=(8, 4))
- plt.bar(type_target_df['type'], type_target_df['target'], color=['blue', 'green'])
- plt.xlabel('Type')
- plt.ylabel('Average Target Value')
- plt.title('Bar Chart of Target by Type')
- plt.show()
Line chart of target over dt for the house with id 00037f39cf
- specific_id_df = train[train['id'] == '00037f39cf']
- plt.figure(figsize=(10, 5))
- plt.plot(specific_id_df['dt'], specific_id_df['target'], marker='o', linestyle='-')
- plt.xlabel('DateTime')
- plt.ylabel('Target Value')
- plt.title("Line Chart of Target for ID '00037f39cf'")
- plt.show()
(3) Feature Engineering
Here we mainly construct historical shift (lag) features and window statistic features; the reasoning behind each is described below:
History shift features: historical shifting passes information from an earlier stage forward. For example, the information at time d-1 can be given to time d, and the information at time d can be given to time d+1, which realizes feature construction with a shift of one unit.
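A tiny toy example of this one-step shift (the ids and values are made up purely for illustration; the rows are already ordered by id and descending dt, matching the sort used in the feature code below):
- # groupby('id').shift(1) copies each id's previous row value onto the current row
- toy = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
-                     'dt': [3, 2, 1, 2, 1],
-                     'target': [5, 6, 7, 1, 2]})
- toy['last1_target'] = toy.groupby('id')['target'].shift(1)
- print(toy)  # last1_target is NaN, 5, 6 for id 'a' and NaN, 1 for id 'b'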
Window statistics features: window statistics can be built with different window sizes, computing the mean, maximum, minimum, median, and variance over the window to reflect how the data has changed in the recent period. For example, the values of the three time steps before time d can be aggregated and attached to time d. The baseline below only uses a hand-written 3-day mean; a sketch of further window statistics follows the feature engineering code.
- # Concatenate the training and test data, then sort by id and dt in descending order
- data = pd.concat([test, train], axis=0, ignore_index=True)
- data = data.sort_values(['id','dt'], ascending=False).reset_index(drop=True)
-
- # Historical shift (lag) features
- for i in range(10,30):
-     data[f'last{i}_target'] = data.groupby(['id'])['target'].shift(i)
-
- # Window statistics: mean of the three most recent available lags (10, 11 and 12 days back)
- data['win3_mean_target'] = (data['last10_target'] + data['last11_target'] + data['last12_target']) / 3
-
- # Split back into training and test data
- train = data[data.target.notnull()].reset_index(drop=True)
- test = data[data.target.isnull()].reset_index(drop=True)
-
- # Select the input feature columns
- train_cols = [f for f in data.columns if f not in ['id','target']]
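As mentioned above, window statistics are not limited to a single mean. A minimal sketch of further rolling statistics over the lag column, which would go into the block above right after the win3_mean_target line (i.e. before the data is split back into train and test, so the new columns are picked up by train_cols); the window sizes and the roll-prefixed feature names are illustrative assumptions rather than part of the original baseline:
- # Rolling mean/max/min/std of the 10-day lag within each id (window sizes are illustrative)
- for win in [3, 7]:
-     grouped = data.groupby('id')['last10_target']
-     data[f'roll{win}_mean_target'] = grouped.transform(lambda x: x.rolling(win, min_periods=1).mean())
-     data[f'roll{win}_max_target'] = grouped.transform(lambda x: x.rolling(win, min_periods=1).max())
-     data[f'roll{win}_min_target'] = grouped.transform(lambda x: x.rolling(win, min_periods=1).min())
-     data[f'roll{win}_std_target'] = grouped.transform(lambda x: x.rolling(win, min_periods=1).std())
Rolling over the already-lagged column keeps these features consistent with the 10-day gap used by the shift features above.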
(4) Model training and test set prediction
The LightGBM model is chosen; it is commonly used as the baseline model in data mining competitions and gives relatively stable scores without extensive parameter tuning. In addition, note that because the data has a time series relationship, the training set and the validation set are split strictly along the time axis: within the original training data, records with dt greater than 30 are used for training and records with dt of 30 or less are used for validation, which ensures there is no data crossing problem (future data is not used to predict historical data).
- def time_model(clf, train_df, test_df, cols):
-     # Split into training and validation sets along the time axis
-     trn_x, trn_y = train_df[train_df.dt>=31][cols], train_df[train_df.dt>=31]['target']
-     val_x, val_y = train_df[train_df.dt<=30][cols], train_df[train_df.dt<=30]['target']
-     # Build the LightGBM Dataset objects
-     train_matrix = clf.Dataset(trn_x, label=trn_y)
-     valid_matrix = clf.Dataset(val_x, label=val_y)
-     # LightGBM parameters
-     lgb_params = {
-         'boosting_type': 'gbdt',
-         'objective': 'regression',
-         'metric': 'mse',
-         'min_child_weight': 5,
-         'num_leaves': 2 ** 5,
-         'lambda_l2': 10,
-         'feature_fraction': 0.8,
-         'bagging_fraction': 0.8,
-         'bagging_freq': 4,
-         'learning_rate': 0.05,
-         'seed': 2024,
-         'nthread' : 16,
-         'verbose' : -1,
-     }
-     # Train the model with early stopping on the validation set
-     model = clf.train(lgb_params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
-                       categorical_feature=[], verbose_eval=500, early_stopping_rounds=500)
-     # Predict on the validation and test sets
-     val_pred = model.predict(val_x, num_iteration=model.best_iteration)
-     test_pred = model.predict(test_df[cols], num_iteration=model.best_iteration)
-     # Offline score evaluation
-     score = mean_squared_error(val_pred, val_y)
-     print(score)
-
-     return val_pred, test_pred
-
- lgb_oof, lgb_test = time_model(lgb, train, test, train_cols)
-
- # Save the submission file locally
- test['target'] = lgb_test
- test[['id','dt','target']].to_csv('submit.csv', index=None)
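One practical note: the verbose_eval and early_stopping_rounds arguments were removed from lgb.train in LightGBM 4.0, so with a recent LightGBM the training call above will fail. A minimal sketch of the equivalent call using callbacks (only the clf.train line inside time_model changes):
- # For lightgbm >= 4.0: logging and early stopping are passed as callbacks
- model = clf.train(lgb_params, train_matrix, 50000,
-                   valid_sets=[train_matrix, valid_matrix],
-                   categorical_feature=[],
-                   callbacks=[lgb.log_evaluation(500), lgb.early_stopping(500)])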
4. Advanced
The importance of feature engineering is self-evident. I added some more features, and I could have added more, but Colab ran out of memory. With millions of rows, it is not feasible to keep adding features indefinitely; let's see how to optimize this later.
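A simple first step for the memory problem is to downcast the numeric columns of the merged frame and trigger garbage collection (gc is already imported in the import section). A minimal sketch, assuming the merged frame is still called data:
- # Downcast float64 feature columns to float32 to roughly halve their memory footprint
- float_cols = data.select_dtypes(include=['float64']).columns
- data[float_cols] = data[float_cols].astype('float32')
- gc.collect()  # free memory held by intermediate objects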