2024-07-12
Logistic regression is a commonly used statistical learning method, mainly used to solve classification problems. Although the name contains "regression", it is actually a classification algorithm.
This is the most basic and common use of logistic regression. It can predict whether an event will occur, and the output is yes or no.
For example:
These examples all have one thing in common, which is that they only have two results, true (1) and false (0).
Logistic regression can be extended to multi-class classification problems through methods such as One-vs-Rest or softmax.
For example:
These examples all have one thing in common, which is that the same object has multiple possible results, similar to our common multiple-choice questions, where there are multiple options, but only one option is the most appropriate answer.
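As a concrete sketch of the multi-class case, scikit-learn's LogisticRegression handles targets with more than two classes out of the box (the Iris dataset below is just a convenient stand-in; in recent scikit-learn versions the default multi-class handling is a multinomial/softmax formulation):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three classes (0, 1, 2), so this is a multi-class problem
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba returns one probability per class; like a multiple-choice
# question, the class with the highest probability is the chosen answer
probs = model.predict_proba(X[:1])
print(probs)  # three probabilities, one per class, summing to 1
print(model.predict(X[:1]))
```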
Logistic regression not only gives classification results, but also outputs probability values, which is very useful in many scenarios.
For example:
These examples all have one thing in common, which is prediction, that is, using known results to infer unknown results.
Imagine you are a doctor who needs to determine whether a patient has a certain disease. Logistic regression is like an intelligent assistant that helps you make this judgment. Just like a doctor would look at the various physical examination indicators of a patient, logistic regression will consider multiple relevant factors (we call them features). Some indicators may be more important than others. Logistic regression assigns a "weight" to each factor to reflect its importance. In the end, it does not simply say "yes" or "no", but gives a probability. For example, "there is a 70% chance that this patient has the disease." You can set a standard, such as "yes" if it is more than 50%, otherwise it is "no". Logistic regression "learns" from a large number of known cases. Just like a doctor accumulates experience through a large number of cases.
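The doctor analogy can be sketched in a few lines. The indicator names, values, and weights below are invented purely for illustration, not taken from any real medical model:

```python
import math

# Hypothetical physical-examination indicators for one patient
indicators = {"blood_pressure": 1.2, "blood_sugar": 0.8, "age_factor": 0.5}

# Hypothetical "weights" reflecting how important each indicator is
weights = {"blood_pressure": 0.9, "blood_sugar": 1.1, "age_factor": 0.4}
bias = -1.0

# Weighted sum of the indicators, plus a bias term
z = bias + sum(weights[k] * v for k, v in indicators.items())

# The sigmoid turns the score into a probability between 0 and 1
probability = 1.0 / (1.0 + math.exp(-z))
print(f"Probability of disease: {probability:.0%}")

# Apply the 50% standard described in the text
print("yes" if probability > 0.5 else "no")
```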
Of course, the role of logistic regression is far more than this, but due to space limitations (Actually, I just want to be lazy.), I won't introduce it in detail.
Personally, I don't like to dump a pile of math formulas on you, tell you that these are the underlying principles, and leave you to work through them on your own. What I hope to do instead is pick apart a few core formulas and explain why they are enough. That is what I hope to make clear in this article.
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
If you don't understand the code, that's fine; just look at the plot.
```python
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.font_manager as fm
from sklearn.linear_model import LinearRegression

# Generate some simulated house data
np.random.seed(0)
area = np.random.rand(100) * 200 + 50                  # House area (square meters)
price = 2 * area + 5000 + np.random.randn(100) * 500   # House price (ten thousand yuan)

# Fit the data using linear regression
model = LinearRegression()
model.fit(area.reshape(-1, 1), price)

# Get the regression coefficients
b0 = model.intercept_
b1 = model.coef_[0]

# Plot the scatter plot
plt.scatter(area, price, label="House Data")

# Plot the regression line
plt.plot(area, b0 + b1 * area, color="red", label="Linear Regression")

# Set the plot title
plt.title("Linear Regression of House Area and Price")

# Set the font to Kaiti (楷体); replace the path with your own font file
font_prop = fm.FontProperties(fname=r"C:\Windows\Fonts\simkai.ttf", size=12)

plt.xlabel("House Area (Square Meters)", fontproperties=font_prop)
plt.ylabel("House Price (Ten Thousand Yuan)", fontproperties=font_prop)

# Add legend
plt.legend()

# Show the plot
plt.show()
```
1. Generate 100 simulated house data points

```python
np.random.seed(0)
area = np.random.rand(100) * 200 + 50
price = 2 * area + 5000 + np.random.randn(100) * 500
```

2. Fit the data using linear regression

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(area.reshape(-1, 1), price)
```

3. Obtain the regression coefficients (b0 ~ bn)

```python
b0 = model.intercept_
b1 = model.coef_[0]
```

4. Draw the scatter plot

```python
plt.scatter(area, price, label="House Data")
```

5. Draw the regression line

```python
plt.plot(area, b0 + b1 * area, color="red", label="Linear Regression")
```

6. Set the plot title

```python
plt.title("Linear Regression of House Area and Price")
```

7. Set the font to Kaiti and the font size (if needed)

```python
font_prop = fm.FontProperties(fname=r"C:\Windows\Fonts\simkai.ttf", size=12)
plt.xlabel("House Area (Square Meters)", fontproperties=font_prop)
plt.ylabel("House Price (Ten Thousand Yuan)", fontproperties=font_prop)
```

8. Add a legend

```python
plt.legend()
```

9. Display the plot

```python
plt.show()
```
This code uses linear regression to fit the relationship between house area and price (the axis titles are written in English to avoid font-rendering errors).
Someone may ask: Why do we generate so much data?
Good question!
With this data, we can roughly estimate the coefficients in y = b0 + b1*x1 + b2*x2 + ... + bn*xn.
Specifically:
By collecting a large amount of data, we can use a linear regression model to calculate these coefficients and build a predictive model. This model helps us understand the impact of different factors on the target variable and predict its value for new inputs (similar to y = kx + b in mathematics: with specific k and b, we can predict y from any x; the difference is simply that here there are more coefficients).
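As a quick sanity check (a minimal sketch reusing the simulated data from above), we can confirm that the fitted coefficients land near the true values used to generate the data, b1 = 2 and b0 = 5000:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the simulated house data from the earlier example
np.random.seed(0)
area = np.random.rand(100) * 200 + 50
price = 2 * area + 5000 + np.random.randn(100) * 500

model = LinearRegression()
model.fit(area.reshape(-1, 1), price)

b0 = model.intercept_
b1 = model.coef_[0]
print(b0, b1)  # roughly 5000 and 2, up to the injected noise
```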
σ(x) = 1 / (1 + exp(-x))
The graph of the sigmoid function is shown below:
As we mentioned before, the most basic use of logistic regression is to solve the problem of binary classification.
The goal of logistic regression is to convert the output of a linear model (which can be any real number) into a probability value, which represents the possibility of an event occurring, and the probability value should naturally be between 0 and 1.
The sigmoid function accomplishes this task perfectly: it compresses the output of the linear model to between 0 and 1, and as the input value increases, the output value gradually increases, which is consistent with the changing trend of the probability value.
If you look at the sigmoid function, as its input approaches positive infinity the output gets infinitely close to 1, and as the input approaches negative infinity the output gets infinitely close to 0. Doesn't that fit our either-0-or-1 problem perfectly?
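These properties are easy to verify numerically with a few lines (implementing the function directly from its formula, not any particular library's API):

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # exactly 0.5, the midpoint
print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
```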
Someone will object: that's not quite right. Although the two ends get infinitely close to 0 and 1, the values in the middle are not so clear-cut. For example, 0.5 sits between 0 and 1, so is it closer to 0 or to 1?
Even though the middle values are ambiguous, we can set a rule ourselves~
For example, put values >= 0.5 into class 1 and values < 0.5 into class 0, and the problem is solved~
So 0.5 is not a special point of the function itself, but a threshold we set for classification.
p = σ(b0 + b1*x1 + b2*x2 + ... + bn*xn) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ... + bn*xn)))
Everything we've said so far has been building up to this formula.
Does it give you a headache just looking at it? It gives me one too, so let's simplify it: write z = b0 + b1*x1 + b2*x2 + ... + bn*xn, so the formula becomes p = σ(z).
You see, isn't this much cleaner? Like a handsome guy who doesn't usually dress up: he tidies himself up a little, and you suddenly realize, wow, he's really handsome~
Ahem, back on topic~ So logistic regression is actually linear regression + the sigmoid function: the z inside the sigmoid function is replaced by the linear regression output y = b0 + b1*x1 + b2*x2 + ... + bn*xn.
So what does this do?
OK, let's take a look~
The essence of logistic regression is to predict the probability of an event occurring. It does not directly classify the data; instead, it maps the result of linear regression to the interval 0~1 through the sigmoid function, and values in this interval represent the probability that the event occurs.
The basis of logistic regression is linear regression. Linear regression builds a linear model and tries to fit the data with a linear function to get a predicted value. This predicted value can be any value and is not limited to the interval of 0~1.
The Sigmoid function is a "magic" function that converts the predicted value obtained by linear regression to the interval of 0 to 1, and the value in this interval can be interpreted as the probability of an event occurring.
In order to perform binary classification, we need to set a threshold, usually 0.5. If the predicted probability is greater than the threshold, it is judged as a positive class, otherwise it is judged as a negative class.
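The steps above can be sketched end-to-end with scikit-learn (a minimal example; make_classification is just a convenient generator of synthetic binary-class data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: two classes, 0 and 1
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression()
model.fit(X, y)

# Step 1: the linear part plus the sigmoid gives a probability in (0, 1)
p = model.predict_proba(X[:1])[0, 1]  # probability of the positive class

# Step 2: compare with the 0.5 threshold to get the class label
label = 1 if p > 0.5 else 0
print(p, label)
```

model.predict applies essentially this same 0.5 threshold internally, so in practice you rarely write the comparison yourself.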
For example:
The function on the left can be seen as the linear regression function, and the function on the right is the sigmoid function it is mapped through.