2024-07-12
I met BatchNorm and LayerNorm a long time ago. Back then, I only roughly knew that they perform standardization:
$$\bm{x} = \frac{\bm{x} - \bm{\mu}}{\bm{\sigma}}$$

This is of course simple, but real data is more complicated. For CV tasks on image data, a batch usually has shape [B, C, H, W]; for NLP tasks on sentence data, a batch usually has shape [B, T, D].
So how is each XXNorm actually computed? It is certainly not a single $\bm{\mu}$ and $\bm{\sigma}$ computed over every number in the entire batch at once, and the two methods differ from each other.
So I looked up a lot of material, and it confirmed what my mentor said: don't read those messy blogs, their level is no better than yours! Foreigners really are good at this, explaining the problem clearly just by answering a question in a comment section: 《Why do transformers use layer norm instead of batch norm?》
This section explains in detail, from both theoretical and code perspectives, how BatchNorm and LayerNorm are computed.
>>>>> Tongyi Qianwen >>>>>
Q: What is the difference between LayerNorm and BatchNorm? Explain with formulas.
A: Layer Normalization (LayerNorm) and Batch Normalization (BatchNorm) are both techniques for accelerating neural network training. They normalize a layer's inputs or activations so that learning is more stable and efficient, but they differ in implementation and applicable scenarios.
Batch Normalization (BatchNorm)
BatchNorm normalizes over a mini-batch of data during training. Its main goal is to reduce internal covariate shift, i.e., the phenomenon that the input distribution of intermediate layers changes over training iterations during forward propagation. BatchNorm normalizes the input using the mean and variance of the current mini-batch, with the following formula:
Let $\bm{x}$ be an input vector within a mini-batch, and let $\bm{\mu}_B$ and $\bm{\sigma}^2_B$ be the mean and variance of the mini-batch, respectively. The BatchNorm output $\bm{y}$ can then be expressed as:
$$\hat{\bm{x}} = \frac{\bm{x} - \bm{\mu}_B}{\sqrt{\bm{\sigma}^2_B + \epsilon}}, \qquad \bm{y} = \bm{\gamma}\hat{\bm{x}} + \bm{\beta}$$

where $\bm{\gamma}$ and $\bm{\beta}$ are learnable parameters, and $\epsilon$ is a small constant used for numerical stability.
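To make the formulas concrete, here is a minimal sketch of the two equations above in PyTorch, for a mini-batch of shape (N, C); the variable names mirror the symbols in the formulas and the shapes are made up:

import torch

N, C = 8, 4
x = torch.randn(N, C)                 # a mini-batch of N samples with C features
mu_B = x.mean(dim=0)                  # per-feature mean over the mini-batch
var_B = x.var(dim=0, unbiased=False)  # per-feature (biased) variance
eps = 1e-5                            # small constant for numerical stability
gamma = torch.ones(C)                 # learnable scale, initialized to 1
beta = torch.zeros(C)                 # learnable shift, initialized to 0
x_hat = (x - mu_B) / torch.sqrt(var_B + eps)
y = gamma * x_hat + beta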
Layer Normalization (LayerNorm)
LayerNorm normalizes at the level of each individual sample, without relying on mini-batch statistics. It is better suited to sequence models (such as LSTM or Transformer), where there may be dependencies between samples, or where the mini-batch size may not be fixed. LayerNorm computes the mean and variance of each sample's features, with the following formula:
Let $\bm{x}$ be the feature vector of a single sample, and let $\bm{\mu}_L$ and $\bm{\sigma}^2_L$ be the mean and variance of that sample's features, respectively. The LayerNorm output $\bm{y}$ can then be expressed as:
$$\hat{\bm{x}} = \frac{\bm{x} - \bm{\mu}_L}{\sqrt{\bm{\sigma}^2_L + \epsilon}}, \qquad \bm{y} = \bm{\gamma}\hat{\bm{x}} + \bm{\beta}$$

Here too, $\bm{\gamma}$ and $\bm{\beta}$ are learnable parameters, and $\epsilon$ is a very small constant.
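The same kind of sketch for the LayerNorm equations; the only change is the dimension over which the statistics are taken (per sample instead of per mini-batch), again with made-up shapes:

import torch

N, D = 8, 4
x = torch.randn(N, D)                                # N samples, each with D features
mu_L = x.mean(dim=-1, keepdim=True)                  # per-sample mean over its features
var_L = x.var(dim=-1, unbiased=False, keepdim=True)  # per-sample (biased) variance
eps = 1e-5
gamma = torch.ones(D)                                # learnable scale
beta = torch.zeros(D)                                # learnable shift
x_hat = (x - mu_L) / torch.sqrt(var_L + eps)
y = gamma * x_hat + beta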
The differences
BatchNorm depends on mini-batch statistics, while LayerNorm depends on the feature statistics of a single sample. BatchNorm is better suited to computer vision tasks, while LayerNorm is better suited to natural language processing tasks, especially when dealing with sequence data. BatchNorm may be more efficient in parallel computation because it can exploit parallelism within the mini-batch, whereas LayerNorm's computation is independent for each sample, which makes it better suited to sequences of different lengths. Each method has its own advantages and disadvantages, and the choice usually depends on the specific application scenario and data characteristics.
<<<<< Tongyi Qianwen <<<<<
Do you understand? I was still confused. What exactly is "the mean and variance of the current mini-batch"? What is "the mean and variance of each sample's features"? The foreigner's answer is better:
First clarify the basic concepts:
In terms of the data: [B, C, H, W] is a batch, [C, H, W] is an instance, and [C] is a pixel containing C features; [B, T, L] is a batch, [T, L] is an instance, and [L] is a word containing L features. As shown below:
Viewed from the batch dimension, each small square extending backwards is one element, such as the purple strip in the left figure: the RGB features of a pixel, or a word vector. LayerNorm computes a mean and variance for each element, giving B×T means and variances (or B×H×W); each element is then standardized independently. The purple slice in the right figure is one feature, namely the first feature of all words in the batch. BatchNorm computes a mean and variance for each such feature, giving L means and variances (or C); each feature is then standardized independently.
Note that Transformer does not follow the LayerNorm computation described above; instead, it computes a mean and variance for each instance, giving B means and variances, and then each instance is standardized independently, exactly as shown in the second picture.
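A sketch of this per-instance variant (shapes are made up); in PyTorch, normalizing over the last two dims of [B, T, L] does exactly this:

import torch
from torch import nn

B, T, L = 2, 5, 8
x = torch.randn(B, T, L)
inst_mean = x.mean(dim=(1, 2))                # one mean per instance: shape (B,)
inst_var = x.var(dim=(1, 2), unbiased=False)  # one variance per instance: shape (B,)
ln = nn.LayerNorm([T, L])                     # normalizes each instance over (T, L)
y = ln(x)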
BatchNorm and LayerNorm
BatchNorm
In PyTorch, BatchNorm comes in three variants, nn.BatchNorm1d, nn.BatchNorm2d, and nn.BatchNorm3d, for data of different dimensionalities:
nn.BatchNorm1d: (N, C) or (N, C, L)
nn.BatchNorm2d: (N, C, H, W)
nn.BatchNorm3d: (N, C, D, H, W)
View source code:
class BatchNorm1d(_BatchNorm):
    r"""
    Args:
        num_features: number of features or channels `C` of the input
    Shape:
        - Input: `(N, C)` or `(N, C, L)`, where `N` is the batch size,
          `C` is the number of features or channels, and `L` is the sequence length
        - Output: `(N, C)` or `(N, C, L)` (same shape as input)
    """
    def _check_input_dim(self, input):
        if input.dim() != 2 and input.dim() != 3:
            raise ValueError(f"expected 2D or 3D input (got {input.dim()}D input)")
Examples:
>>> m = nn.BatchNorm1d(100) # C=100 # With Learnable Parameters
>>> m = nn.BatchNorm1d(100, affine=False) # Without Learnable Parameters
>>> input = torch.randn(20, 100) # (N, C)
>>> output = m(input)
>>> # or
>>> input = torch.randn(20, 100, 30) # (N, C, L)
>>> output = m(input)
$\bm{\gamma}, \bm{\beta}$ are learnable parameters with shape=(C,); their parameter names are .weight and .bias:
>>> m = nn.BatchNorm1d(100)
>>> m.weight
Parameter containing:
tensor([1., 1., ..., 1.], requires_grad=True)
>>> m.weight.shape
torch.Size([100])
>>> m.bias
Parameter containing:
tensor([0., 0., ..., 0.], requires_grad=True)
BatchNorm2d and BatchNorm3d are the same, differing only in _check_input_dim(input):
class BatchNorm2d(_BatchNorm):
    r"""
    Args:
        num_features: `C` from an expected input of size `(N, C, H, W)`
    Shape:
        - Input: :math:`(N, C, H, W)`
        - Output: :math:`(N, C, H, W)` (same shape as input)
    """
    def _check_input_dim(self, input):
        if input.dim() != 4:
            raise ValueError(f"expected 4D input (got {input.dim()}D input)")
Examples:
>>> m = nn.BatchNorm2d(100)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)
class BatchNorm3d(_BatchNorm):
    r"""
    Args:
        num_features: `C` from an expected input of size `(N, C, D, H, W)`
    Shape:
        - Input: :math:`(N, C, D, H, W)`
        - Output: :math:`(N, C, D, H, W)` (same shape as input)
    """
    def _check_input_dim(self, input):
        if input.dim() != 5:
            raise ValueError(f"expected 5D input (got {input.dim()}D input)")
Examples:
>>> m = nn.BatchNorm3d(100)
>>> input = torch.randn(20, 100, 35, 45, 10)
>>> output = m(input)
LayerNorm
Unlike BatchNorm(num_features), the argument of LayerNorm(normalized_shape) is the last x dims of input.shape. For example, for an input of shape [B, T, L], passing the last two dimensions [T, L] standardizes each sentence independently; passing L or [L] standardizes each word vector independently.
NLP Example
>>> batch, sentence_length, embedding_dim = 20, 5, 10
>>> embedding = torch.randn(batch, sentence_length, embedding_dim)
>>> layer_norm = nn.LayerNorm(embedding_dim)
>>> layer_norm(embedding) # apply the module
Image Example
>>> N, C, H, W = 20, 5, 10, 10
>>> input = torch.randn(N, C, H, W)
>>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
>>> layer_norm = nn.LayerNorm([C, H, W])
>>> output = layer_norm(input)
That is to say, LayerNorm covers not only "standardize each element independently" and "standardize each instance independently", but also normalization over any last x dimensions.
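To verify this by hand, here is a manual LayerNorm check in the same spirit as the BatchNorm2d check that follows (a sketch; shapes are made up):

import torch
from torch import nn

# >>> Manual LayerNorm over the last dimension >>>
x = 10 * torch.randn(2, 5, 8) + 100
mean = x.mean(dim=-1, keepdim=True)                # one mean per word
std = x.std(dim=-1, keepdim=True, unbiased=False)  # one std per word
y = (x - mean) / std                               # default gamma=1, beta=0
ln = nn.LayerNorm(8)
z = ln(x)
print(torch.norm(z - y, p=1))  # near zero, up to eps
# <<< Manual LayerNorm over the last dimension <<<

The same manual check for BatchNorm2d: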
import torch
from torch import nn
# >>> Manual BatchNorm2d computation >>>
weight = torch.ones([1, 3, 1, 1])  # gamma, broadcastable over (N, C, H, W)
bias = torch.zeros([1, 3, 1, 1])   # beta
x = 10 * torch.randn(2, 3, 4, 4) + 100  # fake images: N=2, C=3, H=W=4
mean = x.mean(dim=[0, 2, 3], keepdim=True)                # one mean per channel
std = x.std(dim=[0, 2, 3], keepdim=True, unbiased=False)  # one (biased) std per channel
print(x)
print(mean)
print(std)
y = (x - mean) / std  # standardize each channel independently
y = y * weight + bias
print(y)
# <<< Manual BatchNorm2d computation <<<
# >>> nn.BatchNorm2d >>>
bnm2 = nn.BatchNorm2d(3)
z = bnm2(x)
print(z)
# <<< nn.BatchNorm2d <<<
print(torch.norm(z - y, p=1))
You will find that the manual calculation and nn.BatchNorm2d give almost identical results, up to a small discrepancy caused by $\epsilon$. Note that unbiased=False here hides a few details, as described in the official documentation:
"""
At train time in the forward pass, the standard-deviation is calculated via the biased estimator,
equivalent to `torch.var(input, unbiased=False)`.
However, the value stored in the moving average of the standard-deviation is calculated via
the unbiased estimator, equivalent to `torch.var(input, unbiased=True)`.
Also by default, during training this layer keeps running estimates of its computed mean and
variance, which are then used for normalization during evaluation.
"""
Here I only want to verify the calculation process, not to focus on unbiased. Just a brief note:
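A quick sketch of the two estimators mentioned in the docstring:

import torch

x = torch.randn(100)
print(torch.var(x, unbiased=False))  # biased: divides by N (used in the forward pass)
print(torch.var(x, unbiased=True))   # unbiased: divides by N - 1 (used for the running stats)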