Technology Sharing

Logistic regression (pure theory)

2024-07-12

1. What is logistic regression?

Logistic regression is a commonly used statistical learning method, mainly used to solve classification problems. Although the name contains "regression", it is actually a classification algorithm.

2. Why do we need to use logistic regression in machine learning?

1. Binary Classification

This is the most basic and common use of logistic regression. It can predict whether an event will occur, and the output is yes or no.

For example:

  • Predict whether a user will click on an ad
  • Determine whether the email is spam
  • Diagnose whether a patient has a disease

These examples all have one thing in common: there are only two possible outcomes, true (1) and false (0).
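As a quick sketch of what this looks like in practice, here is a minimal scikit-learn example for the spam case (the feature values and labels below are made up purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features for each email: [number of suspicious words, number of links]
X = np.array([[8, 5], [7, 4], [6, 6], [1, 0], [0, 1], [2, 1]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[5, 3]]))  # likely [1], i.e. classified as spam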

2. Multi-class classification

Logistic regression can be extended to multi-class classification problems through methods such as One-vs-Rest or softmax.

For example:

  • Object Classification in Image Recognition
  • Text classification (news classification, sentiment analysis, etc.)

These examples all have one thing in common: each sample has several possible outcomes, much like a multiple-choice question with several options, only one of which is the most appropriate answer.
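As a rough sketch, scikit-learn can extend logistic regression to multiple classes via One-vs-Rest, training one binary classifier per class (the synthetic data below simply stands in for real image or text features):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data standing in for real features
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# One-vs-Rest: one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))  # predicted class labels for the first 5 samples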

3. Probabilistic prediction

Logistic regression not only gives classification results, but also outputs probability values, which is very useful in many scenarios.

For example:

  • Predict the probability of a customer purchasing a product
  • Assessing the loan applicant’s probability of default risk

These examples all have one thing in common: prediction, that is, using known outcomes to estimate the likelihood of unknown ones.
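As a small sketch, scikit-learn's predict_proba returns exactly this kind of probability (the customer features and purchase labels below are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical past customers: [age, monthly income (thousand yuan)], 1 = purchased
X = np.array([[25, 3], [35, 8], [45, 12], [22, 2], [52, 15], [30, 4]])
y = np.array([0, 1, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# Estimated probability that a new customer (age 40, income 10) will purchase
print(clf.predict_proba([[40, 10]])[0, 1])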

If you still don’t understand the role of logistic regression, it’s okay. Let me give you an easy-to-understand example.

Imagine you are a doctor who needs to determine whether a patient has a certain disease. Logistic regression is like an intelligent assistant that helps you make this judgment. Just like a doctor would look at the various physical examination indicators of a patient, logistic regression will consider multiple relevant factors (we call them features). Some indicators may be more important than others. Logistic regression assigns a "weight" to each factor to reflect its importance. In the end, it does not simply say "yes" or "no", but gives a probability. For example, "there is a 70% chance that this patient has the disease." You can set a standard, such as "yes" if it is more than 50%, otherwise it is "no". Logistic regression "learns" from a large number of known cases. Just like a doctor accumulates experience through a large number of cases.

Of course, the role of logistic regression is far more than this, but due to space limitations (Actually, I just want to be lazy.), I won't introduce it in detail.

3. OK, let's introduce the formula for logistic regression

I personally don't like to dump a pile of math formulas on you, tell you that these formulas are the underlying principles, and then leave you to work them out on your own. What I hope to do instead is analyze a few core formulas and explain why those formulas are enough. That is what I hope to make clear in this article.

1. Linear regression formula

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

  • y is the dependent variable, the value we want to predict.
  • b0 is the intercept, which indicates the value of the dependent variable when all independent variables are 0.
  • b1, b2, ..., bn are the regression coefficients, which indicate the influence of each independent variable on the dependent variable.
  • x1, x2, ..., xn are the independent variables, used to predict the value of the dependent variable.

Here is an example of linear regression:

It doesn’t matter if you don’t understand the code, just look at the pictures

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.font_manager as fm
# Generate some simulated house data
np.random.seed(0)
area = np.random.rand(100) * 200 + 50  # House area (square meters)
price = 2 * area + 5000 + np.random.randn(100) * 500  # House price (ten thousand yuan)
# Fit the data using linear regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(area.reshape(-1, 1), price)
# Get the regression coefficients
b0 = model.intercept_
b1 = model.coef_[0]
# Plot the scatter plot
plt.scatter(area, price, label="House Data")
# Plot the regression line
plt.plot(area, b0 + b1 * area, color="red", label="Linear Regression")
# Set the plot title and axis labels
plt.title("Linear Regression of House Area and Price")
# Set the font to KaiTi (楷体)
font_prop = fm.FontProperties(fname=r"C:\Windows\Fonts\simkai.ttf", size=12)  # Replace with your KaiTi font path
plt.xlabel("House Area (Square Meters)", fontproperties=font_prop)
plt.ylabel("House Price (Ten Thousand Yuan)", fontproperties=font_prop)
# Add legend
plt.legend()
# Show the plot
plt.show()

1. Generate 100 simulated house data points

np.random.seed(0)
area = np.random.rand(100) * 200 + 50
price = 2 * area + 5000 + np.random.randn(100) * 500

2. Fit the data using linear regression

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(area.reshape(-1, 1), price)

3. Obtain regression coefficients (b0 ~ bn)

b0 = model.intercept_
b1 = model.coef_[0]

4. Draw a scatter plot

plt.scatter(area, price, label="House Data")

5. Draw the regression line

plt.plot(area, b0 + b1*area, color="red", label="Linear Regression")

6. Set the plot title

plt.title("Linear Regression of House Area and Price")

7. Set the axis-label font to KaiTi and the font size

font_prop = fm.FontProperties(fname=r"C:\Windows\Fonts\simkai.ttf", size=12)
plt.xlabel("House Area (Square Meters)", fontproperties=font_prop)
plt.ylabel("House Price (Ten Thousand Yuan)", fontproperties=font_prop)

8. Add a legend

plt.legend()

9. Display charts

plt.show()

This code uses linear regression to fit the relationship between house area and price (the title and axis labels are written in English to avoid font-rendering errors).

The results are as follows:

Someone may ask: Why do we generate so much data?

Good question!

Because with this data we can roughly estimate the coefficients in y = b0 + b1*x1 + b2*x2 + ... + bn*xn.

Specifically:

  • y represents the target variable we want to predict, such as house prices.
  • x1, x2, ... xn represent factors that affect the target variable, such as house size, number of rooms, geographical location, etc.
  • b0, b1, b2, ... bn represent the influence of each factor on the target variable; these are the coefficients we want to calculate.

By collecting a large amount of data, we can use a linear regression model to calculate these coefficients and build a predictive model. This model helps us understand the impact of different factors on the target variable and predict its value in the future (much like y = kx + b in mathematics: once we have concrete k and b, we can predict y from any x; the difference is that here there is more than one coefficient k).
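Going back to the house example above, here is a small sketch of reading off the learned coefficients and plugging a new x into y = b0 + b1*x1 (the 120-square-meter query is just an arbitrary example):

import numpy as np
from sklearn.linear_model import LinearRegression

# Same simulated house data as before
np.random.seed(0)
area = np.random.rand(100) * 200 + 50
price = 2 * area + 5000 + np.random.randn(100) * 500

model = LinearRegression().fit(area.reshape(-1, 1), price)
b0, b1 = model.intercept_, model.coef_[0]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # the learned coefficients
print(f"Predicted price for 120 square meters: {b0 + b1 * 120:.2f}")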

2. Sigmoid function formula

σ(x) = 1 / (1 + exp(-x))

The graph of the sigmoid function is shown below:

Question 1: Why choose the sigmoid function?

As we mentioned before, the most basic use of logistic regression is to solve the problem of binary classification.

The goal of logistic regression is to convert the output of a linear model (which can be any real number) into a probability value, which represents the possibility of an event occurring, and the probability value should naturally be between 0 and 1.

The sigmoid function accomplishes this task perfectly: it compresses the output of the linear model to between 0 and 1, and as the input value increases, the output value gradually increases, which is consistent with the changing trend of the probability value.

If you look at the sigmoid function, as the input approaches positive infinity the output gets arbitrarily close to 1, and as the input approaches negative infinity the output gets arbitrarily close to 0. Doesn't that fit our either-0-or-1 problem perfectly?
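Here is a tiny sketch of the sigmoid function in NumPy, just to see the squashing behavior at a few sample points:

import numpy as np

def sigmoid(x):
    # Squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# roughly [0.00005, 0.27, 0.5, 0.73, 0.99995]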

Question 2: How to classify the sigmoid function?

Someone will say: that's not quite right. Although the two ends can get arbitrarily close to 0 and 1, the values in the middle aren't pinned down. For example, 0.5 sits right between 0 and 1, so is 0.5 closer to 0 or to 1?

Although the values in the middle don't clearly belong to either side, we can simply set a cutoff ourselves~

For example, I put the numbers >= 0.5 into the 1 category, and the numbers < 0.5 into the 0 category, so the problem is solved~

Therefore, 0.5 is not some inherent critical point of the function, but a threshold we set ourselves for classification.
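In code, applying this threshold is a one-liner (the probability values below are made up):

import numpy as np

probs = np.array([0.12, 0.49, 0.50, 0.73, 0.95])  # hypothetical predicted probabilities
labels = (probs >= 0.5).astype(int)                # threshold at 0.5
print(labels)  # [0 0 1 1 1]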

3. Logistic regression formula

p = σ(b0 + b1*x1 + b2*x2 + ... + bn*xn) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ... + bn*xn)))

We have said so much before just to introduce this formula.

Does it give you a headache just looking at it? It gives me a headache too. Let's simplify it: write z = b0 + b1*x1 + b2*x2 + ... + bn*xn, so the formula becomes p = σ(z) = 1 / (1 + exp(-z)).

You see, isn't this much more refreshing? Just like a handsome guy who doesn't like to dress up, he tidies up his appearance a little bit, and then you find that wow, this person is so handsome~

Ahem, off topic~ So logistic regression is actually linear regression + the sigmoid function.

The z in the sigmoid function is replaced by the linear regression output: z = b0 + b1*x1 + b2*x2 + ... + bn*xn.
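Putting the two pieces together by hand, with some made-up coefficients b0, b1, b2 and a single sample (x1, x2), a minimal sketch looks like this:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical learned coefficients and one sample's feature values
b0 = -4.0
b = np.array([0.05, 0.8])   # b1, b2
x = np.array([60.0, 3.0])   # x1, x2

z = b0 + np.dot(b, x)       # the linear regression part
p = sigmoid(z)              # map it to a probability
print(z, p)                 # z = 1.4, p is roughly 0.80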

So what does this do?

OK, let's take a look~

4. The nature and function of logistic regression

The essence of logistic regression is predicting the probability of an event occurring. It does not directly classify the data, but maps the result of linear regression into the interval 0~1 through a function (the sigmoid function). The values in this interval represent the probability of the event occurring.

The basis of logistic regression is linear regression. Linear regression builds a linear model and tries to fit the data with a linear function to get a predicted value. This predicted value can be any value and is not limited to the interval of 0~1.

The Sigmoid function is a "magic" function that converts the predicted value obtained by linear regression to the interval of 0 to 1, and the value in this interval can be interpreted as the probability of an event occurring.

In order to perform binary classification, we need to set a threshold, usually 0.5. If the predicted probability is greater than the threshold, it is judged as a positive class, otherwise it is judged as a negative class.
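Here is a small end-to-end sketch on synthetic data: fit a logistic regression, read out the predicted probabilities, and apply the 0.5 threshold ourselves (the result should agree with what predict returns):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: one feature, classes roughly separated around x = 5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)

clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X[:5])[:, 1]    # probability of the positive class
manual = (probs >= 0.5).astype(int)       # apply the 0.5 threshold ourselves
print(probs)
print(manual)
print(clf.predict(X[:5]))                 # should match the manual thresholding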

For example:

The function on the left can be seen as a linear regression function, and the function on the right is the mapped sigmoid function.

All pictures in this article are from 【Machine Learning】Learn Logistic Regression in 10 Minutes, Easy to Understand (Contains Spark Solution Process) on Bilibili.