Technology Sharing

Linear Regression Model

2024-07-12


Linear Regression

1. Theoretical Part

Linear regression: given limited data, fit a straight line by adjusting parameters, and use this line (the model) to predict unknown data.

General form of a straight line:

$$y = w \cdot x + b$$

The state of the whole line is decided by w and b: w determines the slope of the line (i.e. its tilt angle), and b determines the intercept on the y-axis (it controls the vertical translation of the line and is also called the bias). Therefore, we only need to work out the values of w and b to determine a specific line. Next, we study how to find w and b.

(Figure: the influence of w and b on the shape of the line)
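
As a quick illustration, here is a minimal matplotlib sketch that draws a few lines with different w and b (the particular values are arbitrary examples, not taken from the text):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
fig, ax = plt.subplots(figsize=(8, 6))
for w, b in [(1, 0), (2, 0), (1, 3), (-1, 3)]:    # arbitrary example values
    ax.plot(x, w * x + b, label=f"w={w}, b={b}")  # w sets the slope, b shifts the line vertically
ax.legend()
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()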

Proper Nouns

  • Training set: the dataset used to train the model.
  • Test set: the dataset used to evaluate how good the model is.
  • Training sample: a single data point in the training set.
  • Feature: the data fed into the model; it can be numerical values, category labels, image pixel values, etc.
  • Label: the target value the model needs to predict. For classification problems the label is usually a class name; for regression problems it is a continuous value.

1. Univariate Linear Regression

1) Data preprocessing

Let's first study univariate linear regression. Univariate linear regression refers to a linear function with only one independent variable, such as $y = w \cdot x + b$. With only one input variable x, the line can be drawn on a two-dimensional plane (the horizontal axis is x and the vertical axis is y).

When we get a set of undivided data, we usually split it into a training set and a test set. A simple way to do this is to take the first 80% of the samples as the training set and the remaining 20% as the test set.
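
A minimal pandas sketch of this 80/20 split (the DataFrame name and example values are illustrative):

import pandas as pd

# Illustrative data: in the article this would be the loaded ex1data1.txt samples
data = pd.DataFrame({"Population": [6.1, 5.5, 8.5, 7.0, 5.9],
                     "Profit":     [17.6, 9.1, 13.7, 11.9, 6.8]})

# Simple 80/20 split: the first 80% of rows for training, the rest for testing
split = int(len(data) * 0.8)
train_set = data.iloc[:split]
test_set = data.iloc[split:]
print(len(train_set), len(test_set))   # 4 1
# If the rows are ordered, shuffling first (data.sample(frac=1)) gives a fairer split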

2) Define the cost function

Assuming we have found w and b, we have determined a line and can use it to make predictions. To judge how large the error between the predicted value y′ and the true value y is, we need to define a ruler that measures it. Here we use the mean squared error to define the cost function.

$$J(w,b) = \frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
Formula deconstruction

$f_{w,b}(x^{(i)}) - y^{(i)}$: here $f_{w,b}(x^{(i)})$ is the value predicted by the trained model and $y^{(i)}$ is the true value of the i-th training sample; their difference is the error between the model's prediction and the truth.

Why square the error?

Across the whole sample set, each individual error may be positive or negative, so positive and negative errors can partly cancel during summation. Even when the individual errors are large (e.g. -100, +90, -25, +30), the sum can end up small (-5), which would lead to an incorrect judgment.

$\frac{1}{2m}$: averages the sum of all the squared errors (this average can, in a sense, represent the error of the whole model), giving the mean squared error.

Why divide by 2?

Because when taking derivatives during gradient descent later, the exponent 2 comes down as a coefficient. For a large amount of data a constant factor has little effect on the model, so the formula divides by 2 in advance to cancel that coefficient and simplify the derivative.

Knowing the cost function, we just need to find a way to reduce the cost. The lower the cost, the closer our predicted value is to the true value.
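
A minimal NumPy sketch of this cost function for the univariate case, where x and y are equal-length 1-D arrays (the names and example values are illustrative):

import numpy as np

def cost(w, b, x, y):
    """Mean squared error cost J(w, b) = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(x)
    errors = (w * x + b) - y          # prediction minus true value for every sample
    return np.sum(errors ** 2) / (2 * m)

# Example: a perfect fit has zero cost
x = np.array([1.0, 2.0, 3.0])
y = 2 * x + 1
print(cost(2.0, 1.0, x, y))   # 0.0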

Looking at the cost function, we can see that it is a quadratic function, i.e. a convex function. A property of a convex function is that its extreme point is also its optimum. Since the cost function is a quadratic that opens upward (expanding the formula shows that the coefficient of the squared term is greater than 0), this convex function has only a minimum. We therefore only need to find the point at which the cost function reaches its minimum. For the cost function $J(w,b)$, the expansion can be written as:
$$J(w,b) = \frac{1}{2m}\sum_{i = 1}^{m}\left((wx^{(i)}+b) - y^{(i)}\right)^2$$
The value of J depends on the parameters w and b, which can be solved for by gradient descent. The shape of the cost function is roughly as follows:

(Figure: the surface of the cost function J(w, b))

3) Gradient Descent

The idea of gradient descent mainly relies on taking partial derivatives, which is very similar to the control-variable method in experiments: for example, update b without changing w (treat w as a constant). The update formula is $w' = w - \alpha \frac{\partial J(w)}{\partial w}$, applied to each parameter in turn. Here α is the learning rate, which indicates the step length and can be understood as the speed of descent. $\frac{\partial J(w)}{\partial w}$ is the partial derivative of J with respect to w, i.e. the slope of the tangent line on the convex w-J curve (weights w against cost J); it gives the direction in which the function value decreases fastest, and multiplying the two means moving one step in that direction. The step length needs to be tuned to the data set: if α is too large (the step is too long), w may jump straight over the lowest point to the high ground on the other side and never approach the minimum; if α is too small (the step is too short), w approaches the minimum more and more slowly, wasting computation.


Learning rate (α)

  1. First set a small α, for example 0.001.

  2. Then increase it by a factor of 10 each time, up to 1.

  3. After settling on a value, for example 0.01,

  4. refine it by factors of 3, e.g. $0.01 \times 3 = 0.03$, $0.03 \times 3 = 0.09$ (the purpose is to make convergence faster). A small sketch of the resulting candidate grid follows this list.
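
As a minimal sketch of this procedure (the chosen value 0.01 is just the example from the list above):

# Candidate learning rates: start small, multiply by 10 up to 1, then refine by factors of 3
coarse = [0.001, 0.01, 0.1, 1.0]            # steps 1-2: powers of 10
chosen = 0.01                               # step 3: value picked after the coarse sweep (example)
refined = [chosen, chosen * 3, chosen * 9]  # step 4: 0.01, 0.03, 0.09
print(coarse, refined)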


The process of solving partial derivatives (finding the direction of gradient descent):

  • Compute $\frac{\partial J(w,b)}{\partial w}$:

$$\frac{\partial J(w,b)}{\partial w} = \frac{\partial}{\partial w}\frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

$$= \frac{\partial}{\partial w}\frac{1}{2m}\sum_{i = 1}^{m}\left((wx^{(i)}+b) - y^{(i)}\right)^2$$

$$= \frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)\cdot 2x^{(i)}$$

$$= \frac{1}{m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

  • Compute $\frac{\partial J(w,b)}{\partial b}$:

$$\frac{\partial J(w,b)}{\partial b} = \frac{\partial}{\partial b}\frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

$$= \frac{\partial}{\partial b}\frac{1}{2m}\sum_{i = 1}^{m}\left((wx^{(i)}+b) - y^{(i)}\right)^2$$

$$= \frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)\cdot 2$$

$$= \frac{1}{m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

Find the specific values of w and b:

Repeat until convergence:

$$w^* = w - \alpha \frac{\partial}{\partial w}J(w,b)$$

$$b^* = b - \alpha \frac{\partial}{\partial b}J(w,b)$$

$$w = w^*, \quad b = b^*$$

Initially, we pick values for w and b at random and then iterate. We can stop gradient descent when the error drops below a certain threshold, or fix the number of iterations in advance. After a finite number of steps we obtain fairly good values of w and b.
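
Putting the two update rules together, here is a minimal NumPy sketch of batch gradient descent for the univariate case (the learning rate, iteration count and synthetic data are illustrative assumptions):

import numpy as np

def gradient_descent_1d(x, y, alpha=0.01, iters=5000):
    """Fit y ~ w*x + b by batch gradient descent on the MSE cost."""
    m = len(x)
    w, b = 0.0, 0.0                               # arbitrary initial values
    for _ in range(iters):
        error = (w * x + b) - y                   # f_{w,b}(x^(i)) - y^(i) for all samples
        dw = np.sum(error * x) / m                # dJ/dw
        db = np.sum(error) / m                    # dJ/db
        w, b = w - alpha * dw, b - alpha * db     # simultaneous update
    return w, b

# Example on synthetic data generated from y = 2x + 1
x = np.linspace(0, 10, 50)
y = 2 * x + 1
print(gradient_descent_1d(x, y))   # approaches (2, 1)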

2. Multivariate Linear Regression

Multivariate linear regression extends the model to three or even more dimensions, for example $y = w_1 x_1 + w_2 x_2 + b$. This can be pictured with $x_1$ and $x_2$ on the two horizontal axes and y on the vertical axis, a three-dimensional setting. Each training sample is a point in three-dimensional space, and we need to find a suitable plane (the analogue of the straight line) to fit the sample points.

Method: perform gradient descent on every parameter ($w_1, w_2, \dots, w_n, b$).

Key point: in multivariate linear regression, different features have different value ranges; for example, an age feature ranges over 0 ~ 100 while an area feature ranges over 0 m² ~ 10000 m². There may also be singular samples (values far away from the rest of the data). Singular sample data increases training time and may even prevent convergence, so before training the data should be normalized, i.e. the features should be scaled; conversely, if there is no singular sample data, normalization can be omitted.

1) Data preprocessing

Data normalization

There are three common ways to implement data normalization (a NumPy sketch appears at the end of this subsection):

  1. Divide by the maximum value:

    Divide every value of a feature by the maximum value of that feature.

  2. Mean normalization:

    Subtract the mean of the feature from each value, then divide by the difference between the feature's maximum and minimum values.

  3. Z-score normalization:

    Compute the mean and standard deviation of the feature; subtract the mean from each value, then divide by the standard deviation.

If normalization is not performed, the cost function becomes "flat" (elongated). During gradient descent the gradient direction then deviates from the direction toward the minimum and takes a lot of detours, which makes training take too long.

After normalization, the cost function becomes relatively "round", which greatly speeds up training and avoids many detours.

Benefits of data normalization:

  1. Normalization speeds up gradient descent in finding the optimal solution, i.e. it accelerates the convergence of training.
  2. Normalization may improve accuracy.
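
A minimal pandas sketch of the three scaling methods applied to one feature column (the column name and values are illustrative, taken loosely from ex1data2.txt):

import pandas as pd

df = pd.DataFrame({"Size": [852.0, 1416.0, 2104.0, 3000.0, 4478.0]})   # illustrative values

# 1) Divide by the maximum value
df["Size_max"] = df["Size"] / df["Size"].max()

# 2) Mean normalization: (x - mean) / (max - min)
df["Size_mean"] = (df["Size"] - df["Size"].mean()) / (df["Size"].max() - df["Size"].min())

# 3) Z-score normalization: (x - mean) / standard deviation
df["Size_z"] = (df["Size"] - df["Size"].mean()) / df["Size"].std()

print(df)
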
2) Cost function (same as the cost function for univariate linear regression)

$$J(w,b) = \frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

3) Gradient Descent

Gradient Descent for Multiple Linear Regression

$$w_1 = w_1 - \alpha \frac{1}{m}\sum_{i = 1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_1^{(i)}$$

$$\vdots$$

$$w_n = w_n - \alpha \frac{1}{m}\sum_{i = 1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_n^{(i)}$$

$$b = b - \alpha \frac{1}{m}\sum_{i = 1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$

Explanation: $w_1, \dots, w_n$ are the coefficients of the variables, and b is the constant term of the linear function.
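
These n + 1 updates can be performed in one shot in vectorized form; a minimal NumPy sketch, assuming X is an m×n feature matrix and y a length-m target vector (the names and synthetic data are illustrative):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, iters=2000):
    """Fit y ~ X @ w + b by batch gradient descent, updating all weights simultaneously."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        error = X @ w + b - y              # f(x^(i)) - y^(i) for every sample
        w -= alpha * (X.T @ error) / m     # all w_j updated at once
        b -= alpha * np.sum(error) / m
    return w, b

# Example on synthetic data: y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 5.0
print(gradient_descent_multi(X, y))   # roughly ([3, -2], 5)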

3. Normal equations

1) Data preprocessing

Omit...

2) Cost Function

Mathematical derivation:

Here x denotes the m×(n+1) design matrix whose rows are the training samples (each with a leading 1), θ the parameter vector and y the vector of labels:

$$J = \frac{1}{2m}\sum_{i = 1}^{m}\left(\vec{\theta}^{\,T}\vec{x}^{(i)} - y^{(i)}\right)^2$$

$$= \frac{1}{2m}\,\lVert x\vec{\theta} - y\rVert^2$$

$$= \frac{1}{2m}\left(x\vec{\theta} - y\right)^T\left(x\vec{\theta} - y\right)$$

$$= \frac{1}{2m}\left(\vec{\theta}^{\,T}x^T - y^T\right)\left(x\vec{\theta} - y\right)$$

$$= \frac{1}{2m}\left(\vec{\theta}^{\,T}x^Tx\vec{\theta} - y^Tx\vec{\theta} - \vec{\theta}^{\,T}x^Ty + y^Ty\right)$$

3) Setting the gradient to zero

Taking the partial derivative with respect to θ gives:

$$\Delta = \frac{\partial J}{\partial \theta} = \frac{1}{2m}\left(\frac{\partial\, \vec{\theta}^{\,T}x^Tx\vec{\theta}}{\partial \theta} - \frac{\partial\, y^Tx\vec{\theta}}{\partial \theta} - \frac{\partial\, \vec{\theta}^{\,T}x^Ty}{\partial \theta} + \frac{\partial\, y^Ty}{\partial \theta}\right)$$

Matrix derivative rules:

  1. $\frac{\partial\, \theta^{T}A\theta}{\partial \theta} = (A + A^T)\theta$

  2. $\frac{\partial\, X^{T}A}{\partial X} = A$

  3. $\frac{\partial\, AX}{\partial X} = A^T$

  4. $\frac{\partial\, A}{\partial X} = 0$

Applying these rules:

$$\Delta = \frac{1}{2m}\left(\frac{\partial\, \vec{\theta}^{\,T}x^Tx\vec{\theta}}{\partial \theta} - \frac{\partial\, y^Tx\vec{\theta}}{\partial \theta} - \frac{\partial\, \vec{\theta}^{\,T}x^Ty}{\partial \theta} + \frac{\partial\, y^Ty}{\partial \theta}\right) = \frac{1}{2m}\left(2x^Tx\theta - 2x^Ty\right) = \frac{1}{m}\left(x^Tx\theta - x^Ty\right)$$

Setting $\Delta = 0$ gives $x^Tx\theta = x^Ty$, i.e. $\theta = (x^Tx)^{-1}x^Ty$, from which the value of θ can be computed directly.

Comparison of Gradient Descent and Normal Equations

  • Gradient descent: requires choosing the learning rate α and many iterations, but it remains suitable when the number of features n is large, and it applies to many kinds of models.

  • Normal equation: no learning rate α to choose and the result is computed in one pass, but it requires computing $(x^Tx)^{-1}$. When the number of features n is large the computational cost is high, because the time complexity of matrix inversion is $O(n^3)$; it is usually still acceptable when n is below about 10,000. It only applies to linear models and is not suitable for other models such as logistic regression.

4. Polynomial Regression

In some cases, it is difficult for a straight line to fit all the data, so a curve is needed to fit the data, such as a quadratic model, a cubic model, and so on.

In general, the true regression function of the data is unknown, and even if it were known it would be difficult to convert it into a linear model with a simple transformation. The common practice is therefore polynomial regression (Polynomial Regression), that is, fitting the data with a polynomial function.

How to choose the degree of a polynomial

There are many kinds of polynomial functions. Generally, we first observe the shape of the data and then decide what form of polynomial to use. For example, if the scatter plot of the data shows one "bend", consider a quadratic polynomial (i.e. add the square of the feature); if it shows two "bends", consider a cubic polynomial (add the cube of the feature); if three "bends", a quartic polynomial (add the fourth power of the feature), and so on.

Although the true regression function is not necessarily a polynomial of a certain degree, it is feasible to use an appropriate polynomial to approximate the true regression function as long as the fit is good.
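
As a minimal sketch, polynomial regression can reuse the linear machinery by adding powers of the feature as new columns; here a quadratic is fitted with the normal equation (the synthetic data and degree are illustrative):

import numpy as np

# Noisy quadratic data (illustrative)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.3, size=x.shape)

degree = 2                                               # one "bend" -> quadratic
X = np.column_stack([x**p for p in range(degree + 1)])   # columns: 1, x, x^2
theta = np.linalg.solve(X.T @ X, X.T @ y)                # normal equation: (X^T X) theta = X^T y
print(theta)   # roughly [2, -1, 0.5]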

2. Experimental Part

The appendix at the end of the article contains all the original data used in the experiments: ex1data1.txt records the relationship between population and profit, and ex1data2.txt records the effect of house size and number of bedrooms on house price.

1. Univariate Linear Regression

1) Loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
path = "ex1data1.txt"
data = pd.read_csv(path,header = None,names=['Population','Profit'])
data.head()   # preview the data


2) View the data
data.describe()    # more detailed statistical description of the data


# visualize the training data
data.plot(kind = 'scatter',x = 'Population',y = 'Profit',figsize = (12,8))
plt.show()


3) Define the cost function
def computerCost(X,y,theta):    # define the cost function
    inner = np.power(((X*theta.T)-y),2)   # theta.T is the transpose of theta
    return np.sum(inner)/(2*len(X))
data.insert(0,"One",1)  # 表示在第0列前面插入一列数,其表头为One,其值为1

A column of 1s is inserted as the first column of the dataset to make the matrix calculation convenient: the multiplication involves both the weight w and the bias b, and since b is not multiplied by any variable, an extra 1 is added for b to multiply with.

4) Split the data
cols = data.shape[1]
X = data.iloc[:,0:cols - 1]  # ":" before "," selects all rows; after "," we take columns [0, cols-1) (left-closed, right-open), dropping the last column, which holds the target values

y = data.iloc[:,cols - 1:cols]  # take only the last column, i.e. the target values

X.head()


y.head()


X = np.matrix(X.values)
y = np.matrix(y.values)  # convert only the table values to matrices, excluding the index and headers

# initialize theta
theta = np.matrix(np.array([0,0]))  # first a one-dimensional array, then converted to a two-dimensional matrix
5) Initialize the parameters
theta  
# => matrix([[0, 0]])
X.shape,theta.shape,y.shape  # theta is a 1x2 row vector here and needs to be transposed when multiplying
# => ((97, 2), (1, 2), (97, 1))
computerCost(X,y,theta)
# => 32.072733877455676
6) Define the gradient descent function
def gradientDecent(X,y,theta,alpha,iters):   # iters is the number of iterations
    temp = np.matrix(np.zeros(theta.shape))   # a zero matrix of the same shape as theta, used to store the updated theta
    parameters = int(theta.ravel().shape[1])    # .ravel() flattens a multi-dimensional array to 1-D; used to count the parameters to be learned
    cost = np.zeros(iters)   # an array of iters zeros, recording the cost after every iteration
    
    for i in range(iters):
        error = (X * theta.T - y)     # the error of every sample, as a column vector
        for j in range(parameters):    # update each parameter; j indexes the parameter
            term = np.multiply(error,X[:,j])   # np.multiply is element-wise multiplication, here with the j-th column of X
            temp[0,j] = theta[0,j] - ((alpha/len(X))*np.sum(term))  # store the updated theta; .sum() sums all entries of the matrix
        
        theta = temp      # update theta
        cost[i] = computerCost(X,y,theta)  # compute the current cost and record it
    return theta,cost
7) Initialize hyperparameters
alpha  = 0.01		# learning rate
iters = 1000		# number of iterations
8) Gradient Descent
g,cost = gradientDecent(X,y,theta,alpha,iters)
g
# => matrix([[-3.24140214,  1.1272942 ]])
9) Calculate the cost
computerCost(X,y,g)
# => 4.515955503078914
10) Plotting a linear model
x = np.linspace(data.Population.min(),data.Population.max(),100) # take 100 evenly spaced points between the minimum and maximum of Population
f = g[0,0] + (g[0,1] * x)  # the fitted line: f = b + w*x

fig,ax = plt.subplots(figsize = (12,8))    # figsize sets the figure size
ax.plot(x,f,'r',label = "Prediction")    # plot the fitted line: x values, y values, colour, label
ax.scatter(data.Population,data.Profit,label = 'Training data')   # plot the training points
ax.legend(loc = 4)  # legend position
ax.set_xlabel('Population')  # x-axis label
ax.set_ylabel('Profit')   # y-axis label
ax.set_title('Predicted Profit vs. Population Size')  # figure title
plt.show()


11) Draw the cost change curve
fig,ax = plt.subplots(figsize = (12,8))
ax.plot(np.arange(iters),cost,'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title("Error vs. Training Epoch")
plt.show()


2. Multivariate Linear Regression

1) Loading data
path = "ex1data2.txt"
data2 = pd.read_csv(path,header = None,names=["Size","Bedroom","Price"])
data2.head()


2) Data normalization
data2 = (data2 - data2.mean())/data2.std()
data2.head()


3) Split the data
data2.insert(0,'Ones',1)  # insert a column of 1s as the first column of X

cols = data2.shape[1]   # number of columns
X2 = data2.iloc[:,0:cols-1]  # assign X2 (all columns except the last)
y2 = data2.iloc[:,cols-1:cols]  # assign y2 (the last column)

X2 = np.matrix(X2.values)  # convert X2 to a matrix
y2 = np.matrix(y2.values)  # convert y2 to a matrix
theta2 = np.matrix(np.array([0,0,0]))  # initialize theta2 to a zero matrix
computerCost(X2, y2, theta2)
# => 0.48936170212765967
4) Gradient Descent
g2,cost2 = gradientDecent(X2,y2,theta2,alpha,iters)   # record the returned values g2 (theta2) and cost2
g2
# => matrix([[-1.10868761e-16,  8.78503652e-01, -4.69166570e-02]])
5) Calculate the cost
computerCost(X2,y2,g2)
# => 0.13070336960771892
6) Draw the cost change curve
fig,ax = plt.subplots(figsize = (12,8))
ax.plot(np.arange(iters),cost2,'x')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
plt.show()


3. Normal equations

# normal equation
def normalEqn(X,y):
    theta = np.linalg.inv(X.T@X)@X.T@y   # np.linalg contains linear-algebra routines, e.g. matrix inverse (inv) and eigenvalues; @ is matrix multiplication
    return theta
final_theta2 = normalEqn(X,y)
final_theta2
# => matrix([[-3.89578088], [ 1.19303364]])

3. Conclusion

General steps for training a model

  1. Data preprocessing.
  2. Choose a model based on your specific problem.
  3. Set the cost function.
  4. The gradient descent algorithm is used to find the optimal parameters.
  5. Evaluate the model and adjust hyperparameters.
  6. Use the model to make predictions.

4. Appendix

1. ex1data1.txt

6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
8.3829,11.886
7.4764,4.3483
8.5781,12
6.4862,6.5987
5.0546,3.8166
5.7107,3.2522
14.164,15.505
5.734,3.1551
8.4084,7.2258
5.6407,0.71618
5.3794,3.5129
6.3654,5.3048
5.1301,0.56077
6.4296,3.6518
7.0708,5.3893
6.1891,3.1386
20.27,21.767
5.4901,4.263
6.3261,5.1875
5.5649,3.0825
18.945,22.638
12.828,13.501
10.957,7.0467
13.176,14.692
22.203,24.147
5.2524,-1.22
6.5894,5.9966
9.2482,12.134
5.8918,1.8495
8.2111,6.5426
7.9334,4.5623
8.0959,4.1164
5.6063,3.3928
12.836,10.117
6.3534,5.4974
5.4069,0.55657
6.8825,3.9115
11.708,5.3854
5.7737,2.4406
7.8247,6.7318
7.0931,1.0463
5.0702,5.1337
5.8014,1.844
11.7,8.0043
5.5416,1.0179
7.5402,6.7504
5.3077,1.8396
7.4239,4.2885
7.6031,4.9981
6.3328,1.4233
6.3589,-1.4211
6.2742,2.4756
5.6397,4.6042
9.3102,3.9624
9.4536,5.4141
8.8254,5.1694
5.1793,-0.74279
21.279,17.929
14.908,12.054
18.959,17.054
7.2182,4.8852
8.2951,5.7442
10.236,7.7754
5.4994,1.0173
20.341,20.992
10.136,6.6799
7.3345,4.0259
6.0062,1.2784
7.2259,3.3411
5.0269,-2.6807
6.5479,0.29678
7.5386,3.8845
5.0365,5.7014
10.274,6.7526
5.1077,2.0576
5.7292,0.47953
5.1884,0.20421
6.3557,0.67861
9.7687,7.5435
6.5159,5.3436
8.5172,4.2415
9.1802,6.7981
6.002,0.92695
5.5204,0.152
5.0594,2.8214
5.7077,1.8451
7.6366,4.2959
5.8707,7.2029
5.3054,1.9869
8.2934,0.14454
13.394,9.0551
5.4369,0.61705

2. ex1data2.txt

2104,3,399900
1600,3,329900
2400,3,369000
1416,2,232000
3000,4,539900
1985,4,299900
1534,3,314900
1427,3,198999
1380,3,212000
1494,3,242500
1940,4,239999
2000,3,347000
1890,3,329999
4478,5,699900
1268,3,259900
2300,4,449900
1320,2,299900
1236,3,199900
2609,4,499998
3031,4,599000
1767,3,252900
1888,2,255000
1604,3,242900
1962,4,259900
3890,3,573900
1100,3,249900
1458,3,464500
2526,3,469000
2200,3,475000
2637,3,299900
1839,2,349900
1000,1,169900
2040,4,314900
3137,3,579900
1811,4,285900
1437,3,249900
1239,3,229900
2132,4,345000
4215,4,549000
2162,4,287000
1664,2,368500
2238,3,329900
2567,4,314000
1200,3,299000
852,2,179900
1852,4,299900
1203,3,239500