
Summary of interview questions for large models/NLP/algorithms, No. 6: Why do gradients vanish and explode?

2024-07-12


Gradient vanishing and gradient explosion are common problems in deep learning. They mainly occur during the training of neural networks, especially when the backpropagation algorithm is used to update the weights. The following is a detailed analysis of the causes of these two problems:

1. Causes of Gradient Vanishing

  1. Deep network structure
    • When a neural network has too many layers, the gradient undergoes repeated multiplication during backpropagation. If the gradient at each layer is less than 1 (for example, the derivative of the sigmoid function is at most 0.25), then as the number of layers increases the gradient decays exponentially toward 0, causing the gradient to vanish (see the sketch after this list).
  2. Inappropriate activation function
    • The derivatives of some activation functions (such as sigmoid and tanh) become very small when the input is far from the origin. This makes the gradient values shrink rapidly during backpropagation and thus causes the gradient to vanish.
  3. Improper weight initialization
    • If the network weights are initialized too small, the gradient values during backpropagation may also be too small, which in turn causes the gradient to vanish.
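
A minimal sketch of the vanishing-gradient effect, assuming PyTorch (the depth, width, and dummy input below are illustrative choices, not values from the article): a deep stack of sigmoid layers makes the gradient norms of the weights shrink sharply toward the input side.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A 20-layer fully connected network with sigmoid activations
depth, width = 20, 64
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

# Backpropagate a dummy loss and inspect per-layer gradient norms
x = torch.randn(8, width)
model(x).sum().backward()

for i, m in enumerate(model):
    if isinstance(m, nn.Linear):
        # Norms near the input (small i) are many orders of magnitude smaller
        print(f"layer {i:2d}  grad norm = {m.weight.grad.norm():.3e}")
```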

2. Causes of Gradient Explosion

  1. Deep network structure
    • Similar to the vanishing gradient problem, a deep network structure can also lead to gradient explosion. The gradient still undergoes repeated multiplication during backpropagation, but if the gradient at each layer is greater than 1, then as the number of layers increases the gradient value grows exponentially to a very large value, resulting in gradient explosion.
  2. Inappropriate activation function
    • Activation functions themselves do not necessarily cause exploding gradients, but in some cases (for example, with the ReLU activation function when the input stays positive) the gradient is passed through unattenuated or may keep growing, which increases the risk of gradient explosion.
  3. Improper weight initialization
    • If the network weights are initialized too large, the gradient values may grow rapidly to very large values during backpropagation, resulting in gradient explosion (see the sketch after this list).
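
A minimal sketch of the exploding-gradient effect, again assuming PyTorch (the depth, width, and oversized standard deviation are illustrative assumptions): the same kind of deep network, but with weights initialized far larger than He initialization would suggest, produces extremely large gradient norms (compare with the vanishing case above).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 20, 64
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.ReLU()]
model = nn.Sequential(*layers)

# Deliberately oversized initialization: He initialization would use
# std = sqrt(2 / width) ≈ 0.18, far smaller than the std = 1.0 used here.
for m in model:
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=1.0)
        nn.init.zeros_(m.bias)

x = torch.randn(8, width)
model(x).sum().backward()

for i, m in enumerate(model):
    if isinstance(m, nn.Linear):
        # Gradient norms explode to astronomically large values
        print(f"layer {i:2d}  grad norm = {m.weight.grad.norm():.3e}")
```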

3. Root Cause

The root cause of gradient vanishing and gradient explosion lies in the backpropagation algorithm itself: in deep networks, different layers learn at very different speeds. In practice, the layers close to the output learn well, while the layers close to the input learn very slowly; sometimes, even after long training, the weights of the first few layers remain almost identical to their randomly initialized values. This is mainly caused by the cumulative multiplication of gradients during backpropagation, as the schematic expansion below makes explicit.
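
As a schematic illustration (assuming a plain feed-forward network with pre-activations z_l = W_l h_{l-1} + b_l and activations h_l = σ(z_l)), the gradient reaching the first hidden layer is a product of per-layer factors:

```latex
\frac{\partial \mathcal{L}}{\partial h_1}
  = \left( \prod_{l=2}^{L} W_l^{\top} \, \mathrm{diag}\!\big(\sigma'(z_l)\big) \right)
    \frac{\partial \mathcal{L}}{\partial h_L}
```

If each factor has norm below 1, the product shrinks exponentially with depth (vanishing); if above 1, it grows exponentially (explosion). Layers near the input sit at the end of the longest product, which is why they learn slowest.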

4. Solutions

To address the gradient vanishing and gradient explosion problems, the following strategies can be adopted (several of them are combined in the code sketch after this list):

  1. Choose the right activation function
    • Use activation functions such as ReLU or Leaky ReLU, whose derivative is 1 for positive inputs (and a small constant for Leaky ReLU on negative inputs), so repeated multiplication does not shrink the gradient and the vanishing gradient problem is effectively alleviated.
  2. Reasonable weight initialization
    • Use initialization methods such as Xavier or He, which scale the initial weight range according to each layer's number of input and output units (fan-in/fan-out), thereby reducing the risk of gradient vanishing and gradient explosion.
  3. Using Batch Normalization
    • The BN layer normalizes the input of each layer so that the input distribution across layers stays stable, thereby reducing the risk of gradient vanishing and gradient explosion.
  4. Residual Network (ResNet)
    • By introducing skip (cross-layer) connections, a residual network can be made much deeper while alleviating the vanishing gradient problem.
  5. Gradient Clipping
    • During the gradient update, if the gradient norm is too large it can be clipped, which prevents gradient explosion.
  6. Use a more suitable optimizer
    • Optimizers such as Adam automatically adapt the learning rate and update the parameters according to the first- and second-order moments of the gradient, thereby reducing the risk of gradient vanishing and gradient explosion.
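
A minimal sketch, assuming PyTorch, that combines several of the remedies above: ReLU activations, He (Kaiming) initialization, batch normalization, gradient clipping, and the Adam optimizer. The layer sizes, clipping threshold, and dummy data are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def block(in_f, out_f):
    linear = nn.Linear(in_f, out_f)
    nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")  # He initialization
    nn.init.zeros_(linear.bias)
    return nn.Sequential(linear, nn.BatchNorm1d(out_f), nn.ReLU())

model = nn.Sequential(block(64, 128), block(128, 128), nn.Linear(128, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on dummy data
x = torch.randn(32, 64)
y = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Gradient clipping: rescale gradients whose global norm exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```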