Technology Sharing

BatchNorm

2024-07-12


0. Abstract

I first came across BatchNorm and LayerNorm a long time ago. At that time I only roughly knew that they standardize the data: $\bm{x} = \frac{\bm{x} - \bm{\mu}}{\bm{\sigma}}$. That much is simple, but real data is more complicated. For CV tasks, a batch of image data usually has the shape [B, C, H, W]; for NLP tasks, a batch of sentence data usually has the shape [B, T, D]. So how is XXNorm actually calculated? It certainly does not compute a single $\bm{\mu}$ and $\bm{\sigma}$ over all the numbers in the batch at once, and the two norms differ in which numbers they group together.

So I looked up a lot of material, and it really confirmed what my mentor said: don't read those messy blogs, their level is not as good as yours! The answers abroad are really good; one reply in the comment section explains the problem clearly: "Why do transformers use layer norm instead of batch norm?"

1. How to calculate?

This section explains in detail how BatchNorm and LayerNorm are calculated, from both the theoretical and the code perspective.

1.1 Theory first: asking Tongyi Qianwen

>>>>> Tongyi Qianwen >>>>>
Question: What is the difference between layer norm and batch norm? Explain with formulas.

Answer: Layer Normalization (LayerNorm) and Batch Normalization (BatchNorm) are both techniques for accelerating neural network training. They normalize a layer's inputs or activations, which makes learning more stable and efficient. However, they differ in implementation and in the scenarios they suit.

Batch Normalization (BatchNorm)
BatchNorm normalizes over a mini-batch of data, mainly during training. Its main goal is to reduce internal covariate shift, that is, the phenomenon that the input distribution of intermediate layers changes across training iterations during forward propagation. BatchNorm normalizes the input using the mean and variance of the current mini-batch, with the following formula:
Let $\bm{x}$ be an input vector within a mini-batch, and let $\bm{\mu}_B$ and $\bm{\sigma}^2_B$ be the mean and variance of the mini-batch. Then the BatchNorm output $\bm{y}$ can be expressed as:
$$\hat{\bm{x}} = \frac{\bm{x} - \bm{\mu}_B}{\sqrt{\bm{\sigma}_B^2 + \epsilon}}, \qquad \bm{y} = \gamma \hat{\bm{x}} + \beta$$
where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant used for numerical stability.

Layer Normalization (LayerNorm)
LayerNorm normalizes at the level of each individual sample, without relying on mini-batch statistics. It is better suited to sequence models (such as LSTM or Transformer), where there may be dependencies between samples or the mini-batch size may not be fixed. LayerNorm computes the mean and variance over the features of each sample, with the following formula:
Let $\bm{x}$ be the feature vector of a single sample, and let $\bm{\mu}_L$ and $\bm{\sigma}^2_L$ be the mean and variance of that sample's features. Then the LayerNorm output $\bm{y}$ can be expressed as:
$$\hat{\bm{x}} = \frac{\bm{x} - \bm{\mu}_L}{\sqrt{\bm{\sigma}_L^2 + \epsilon}}, \qquad \bm{y} = \gamma \hat{\bm{x}} + \beta$$
Here too, $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a very small constant.

Differences

  • Data dependence: BatchNorm depends on mini-batch statistics, while LayerNorm depends on the feature statistics of a single sample;
  • Application scenario: BatchNorm is better suited to computer vision tasks, while LayerNorm is better suited to natural language processing tasks, especially when dealing with sequence data;
  • Computational efficiency: BatchNorm may be more efficient in parallel computing because it can exploit the parallelism within the mini-batch. LayerNorm's calculation on each sample is independent, which makes it better suited to sequence data of varying lengths.

Both methods have their own advantages and disadvantages, and the choice usually depends on the specific application scenario and data characteristics.
<<<<< Tongyi Qianwen <<<<<

Do you understand? I was confused: what exactly is "the mean and variance of the current mini-batch"? What is "the mean and variance over the features of each sample"? The answer from abroad is better:

First clarify the basic concepts:

  • instance: a sample; an image for CV, a sentence for NLP;
  • element: a component of the sample; a pixel for an image, a word for a sentence;
  • feature: a feature of an element; the RGB values for a pixel, the embedding for a word.

This is reflected in the data:

  • [B, C, H, W] is a batch, [C, H, W] is an instance, [C] is a pixel, containing C features.
  • [B, T, L] is a batch, [T, L] is an instance, [L] is a word, containing L features.
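The mapping from these terms to tensor indices can be sketched as follows. This is only an illustration; the concrete sizes and the variable names are assumptions chosen for the example.

```python
import torch

B, C, H, W = 2, 3, 4, 4   # image batch: batch size, channels, height, width
T, L = 5, 8               # sentence length and embedding size (assumed values)

images = torch.randn(B, C, H, W)
sentences = torch.randn(B, T, L)

image_instance = images[0]         # one instance (image):   [C, H, W]
pixel = images[0, :, 1, 2]         # one element (pixel):    [C], its C features
sentence_instance = sentences[0]   # one instance (sentence): [T, L]
word = sentences[0, 3]             # one element (word):     [L], its L features

print(image_instance.shape, pixel.shape)
print(sentence_instance.shape, word.shape)
```

Note that for images the feature axis sits at dim 1, while for sentences it is the last axis. This is why the PyTorch modules later in the article expect different layouts.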

As shown below:

Looking from the batch-dimension side, each strip of small squares extending backwards represents one element, such as the purple strip in the left figure: the RGB features of a pixel, or a word vector. LayerNorm computes a mean and variance for each element, which gives B×T means and variances (or B×H×W for images). Each element is then standardized independently.

The purple slice in the right figure is one feature: the first feature of all words in the batch. Each such slice is a feature. BatchNorm computes a mean and variance for each feature, which gives L means and variances (or C for images). Each feature is then standardized independently.
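The two ways of grouping the numbers can be checked with a minimal sketch on a [B, T, L] batch. The sizes are assumptions, the affine parameters are disabled so that only the statistics matter, and `nn.BatchNorm1d` expects the feature axis at dim 1, so the input is transposed before the call.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, L = 4, 5, 8
x = torch.randn(B, T, L)
eps = 1e-5  # default eps of both modules

# LayerNorm view: one mean/variance per element (word) -> B*T statistics
ln_mean = x.mean(dim=-1, keepdim=True)                    # [B, T, 1]
ln_var = x.var(dim=-1, keepdim=True, unbiased=False)
ln_manual = (x - ln_mean) / torch.sqrt(ln_var + eps)

# BatchNorm view: one mean/variance per feature -> L statistics
bn_mean = x.mean(dim=(0, 1), keepdim=True)                # [1, 1, L]
bn_var = x.var(dim=(0, 1), keepdim=True, unbiased=False)
bn_manual = (x - bn_mean) / torch.sqrt(bn_var + eps)

# Reference results from the PyTorch modules
ln = nn.LayerNorm(L, elementwise_affine=False)
bn = nn.BatchNorm1d(L, affine=False)                      # training mode by default
bn_torch = bn(x.transpose(1, 2)).transpose(1, 2)          # (N, C, L_seq) layout and back

print(ln_mean.numel(), bn_mean.numel())                   # B*T and L
print(torch.allclose(ln(x), ln_manual, atol=1e-5))
print(torch.allclose(bn_torch, bn_manual, atol=1e-5))
```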


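It should be noted that LayerNorm is not always calculated per element as described above. It can also compute the mean and variance for each instance, which gives B means and variances, so that each instance is standardized independently. Standard Transformer implementations, such as PyTorch's nn.TransformerEncoderLayer, normalize over the last (feature) dimension only, which is the per-element form. To be exact, the per-instance variant looks like the following picture: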

1.2 BatchNorm and LayerNorm in PyTorch
1.2.1 BatchNorm

In PyTorch, BatchNorm is split into nn.BatchNorm1d, nn.BatchNorm2d and nn.BatchNorm3d, for data of different dimensions:

  • nn.BatchNorm1d: (N, C) or (N, C, L)
  • nn.BatchNorm2d: (N, C, H, W)
  • nn.BatchNorm3d: (N, C, D, H, W)
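Whatever the input rank, the three variants compute one mean and variance per channel `C`, reducing over every other dimension. The following sketch checks this against a manual computation; the tensor sizes are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
C = 6
cases = [
    (nn.BatchNorm1d(C), torch.randn(8, C)),           # (N, C)
    (nn.BatchNorm1d(C), torch.randn(8, C, 10)),       # (N, C, L)
    (nn.BatchNorm2d(C), torch.randn(8, C, 7, 7)),     # (N, C, H, W)
    (nn.BatchNorm3d(C), torch.randn(8, C, 3, 7, 7)),  # (N, C, D, H, W)
]

max_errors = []
for module, x in cases:
    # Reduce over every dimension except the channel dimension (dim 1)
    reduce_dims = [d for d in range(x.dim()) if d != 1]
    mean = x.mean(dim=reduce_dims, keepdim=True)
    var = x.var(dim=reduce_dims, keepdim=True, unbiased=False)
    manual = (x - mean) / torch.sqrt(var + module.eps)
    with torch.no_grad():
        max_errors.append((module(x) - manual).abs().max().item())
    print(type(module).__name__, tuple(x.shape), "stats shape:", tuple(mean.shape))

print(max_errors)
```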

View source code:

class BatchNorm1d(_BatchNorm):
	r"""
	Args:
		num_features: number of features or channels `C` of the input

	Shape:
		- Input: `(N, C)` or `(N, C, L)`, where `N` is the batch size,
		  `C` is the number of features or channels, and `L` is the sequence length
		- Output: `(N, C)` or `(N, C, L)` (same shape as input)
	"""
	def _check_input_dim(self, input):
		if input.dim() != 2 and input.dim() != 3:
			raise ValueError(f"expected 2D or 3D input (got {input.dim()}D input)")

Examples:

>>> m = nn.BatchNorm1d(100)  # C=100, with learnable parameters
>>> m = nn.BatchNorm1d(100, affine=False)  # without learnable parameters
>>> input = torch.randn(20, 100)  # (N, C)
>>> output = m(input)
>>> # or
>>> input = torch.randn(20, 100, 30)  # (N, C, L)
>>> output = m(input)

$\bm{\gamma}$ and $\bm{\beta}$ are learnable parameters with shape=(C,); the parameter names are .weight and .bias:

>>> m = nn.BatchNorm1d(100)
>>> m.weight
Parameter containing:
tensor([1., 1., ..., 1.], requires_grad=True) 
>>> m.weight.shape
torch.Size([100])
>>> m.bias
Parameter containing:
tensor([0., 0., ..., 0.], requires_grad=True)
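To see what the per-channel `.weight` and `.bias` do, here is a small sketch. The parameter values are set by hand purely for illustration (normally they are learned). After normalization each channel has mean about 0 and standard deviation about 1, so the output mean should land near `bias` and the output standard deviation near `weight`.

```python
import torch
from torch import nn

torch.manual_seed(0)
m = nn.BatchNorm1d(3)
with torch.no_grad():
    m.weight.copy_(torch.tensor([1.0, 2.0, 3.0]))   # gamma, one value per channel
    m.bias.copy_(torch.tensor([0.0, 10.0, -5.0]))   # beta, one value per channel

x = 5 * torch.randn(1000, 3) + 7                    # (N, C)
with torch.no_grad():
    y = m(x)                                        # training mode: batch statistics

channel_mean = y.mean(dim=0)                        # close to beta
channel_std = y.std(dim=0, unbiased=False)          # close to gamma
print(channel_mean)
print(channel_std)
```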

BatchNorm2d and BatchNorm3d are the same, except for _check_input_dim(input):

class BatchNorm2d(_BatchNorm):
	r"""
	Args:
		num_features: `C` from an expected input of size `(N, C, H, W)`
	Shape:
		- Input: :math:`(N, C, H, W)`
		- Output: :math:`(N, C, H, W)` (same shape as input)
	"""
	def _check_input_dim(self, input):
		if input.dim() != 4:
			raise ValueError(f"expected 4D input (got {input.dim()}D input)")

Examples:

>>> m = nn.BatchNorm2d(100)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)

class BatchNorm3d(_BatchNorm):
	r"""
	Args:
		num_features: `C` from an expected input of size `(N, C, D, H, W)`
	Shape:
		- Input: :math:`(N, C, D, H, W)`
		- Output: :math:`(N, C, D, H, W)` (same shape as input)
	"""
	def _check_input_dim(self, input):
		if input.dim() != 5:
			raise ValueError(f"expected 5D input (got {input.dim()}D input)")

Examples:

>>> m = nn.BatchNorm3d(100)
>>> input = torch.randn(20, 100, 35, 45, 10)
>>> output = m(input)

1.2.2 LayerNorm

Unlike BatchNorm(num_features), the argument of LayerNorm(normalized_shape) is the last x dimensions of input.shape. For example, for [B, T, L], passing the last two dimensions [T, L] standardizes each sentence independently; passing L or [L] normalizes each word vector independently.

NLP Example

>>> batch, sentence_length, embedding_dim = 20, 5, 10
>>> embedding = torch.randn(batch, sentence_length, embedding_dim)
>>> layer_norm = nn.LayerNorm(embedding_dim)
>>> layer_norm(embedding)  # Activate module

Image Example

>>> N, C, H, W = 20, 5, 10, 10
>>> input = torch.randn(N, C, H, W)
>>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
>>> layer_norm = nn.LayerNorm([C, H, W])
>>> output = layer_norm(input)

That is to say, it covers not only "standardize each element independently" and "standardize each instance independently", but also any "normalize over the last x dimensions".
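A minimal sketch of "normalize over the last x dimensions", checked against nn.LayerNorm. The helper `manual_layer_norm` is hypothetical (written for this example, not part of PyTorch), and the affine parameters are disabled for the comparison.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, L = 4, 5, 8
x = torch.randn(B, T, L)

def manual_layer_norm(x, num_last_dims, eps=1e-5):
    """Hypothetical helper: normalize over the last `num_last_dims` dimensions."""
    dims = tuple(range(-num_last_dims, 0))
    mean = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, keepdim=True, unbiased=False)   # biased variance
    return (x - mean) / torch.sqrt(var + eps)

per_word = nn.LayerNorm(L, elementwise_affine=False)           # each element independently
per_sentence = nn.LayerNorm([T, L], elementwise_affine=False)  # each instance independently

print(torch.allclose(per_word(x), manual_layer_norm(x, 1), atol=1e-5))
print(torch.allclose(per_sentence(x), manual_layer_norm(x, 2), atol=1e-5))

# With the affine parameters enabled, weight and bias take the shape of normalized_shape
print(nn.LayerNorm([T, L]).weight.shape)
```

This is also a difference from BatchNorm: the learnable parameters of LayerNorm have the full `normalized_shape`, not just one value per channel.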

1.3 Verifying the calculation process

import torch
from torch import nn

# >>> Manual BatchNorm2d computation >>>
weight = torch.ones([1, 3, 1, 1])   # gamma, broadcast over (N, C, H, W)
bias = torch.zeros([1, 3, 1, 1])    # beta

x = 10 * torch.randn(2, 3, 4, 4) + 100
# Per-channel statistics: reduce over the batch and spatial dimensions
mean = x.mean(dim=[0, 2, 3], keepdim=True)
std = x.std(dim=[0, 2, 3], keepdim=True, unbiased=False)
print(x)
print(mean)
print(std)

# No eps is added here, so expect a tiny difference from nn.BatchNorm2d
y = (x - mean) / std
y = y * weight + bias
print(y)
# <<< Manual BatchNorm2d computation <<<

# >>> nn.BatchNorm2d >>>
bnm2 = nn.BatchNorm2d(3)
z = bnm2(x)
print(z)
# <<< nn.BatchNorm2d <<<
print(torch.norm(z - y, p=1))   # L1 distance between the two results

You will find that the manual calculation and nn.BatchNorm2d give almost identical results; the small remaining difference probably comes from $\epsilon$, which the manual version omits. Note that unbiased=False here involves some details, as described in the official documentation:

"""
At train time in the forward pass, the standard-deviation is calculated via the biased estimator,
equivalent to `torch.var(input, unbiased=False)`.
However, the value stored in the moving average of the standard-deviation is calculated via
the unbiased  estimator, equivalent to `torch.var(input, unbiased=True)`.

Also by default, during training this layer keeps running estimates of its computed mean and
variance, which are then used for normalization during evaluation.
"""

Here I only want to verify the calculation process, not dwell on unbiased. Briefly:

  • The variance used for normalization in the training phase is the biased estimate, while the variance stored in the moving average is the unbiased estimate;
  • During training, moving averages of mean and var are kept and then used in the testing phase.
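The two points above can be checked with a short sketch. It assumes the PyTorch defaults for nn.BatchNorm2d (momentum=0.1, running mean initialized to 0 and running variance to 1) and runs a single training step.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)                       # momentum defaults to 0.1
x = 10 * torch.randn(2, 3, 4, 4) + 100

bn.train()
with torch.no_grad():
    _ = bn(x)                                # one training step updates the running stats

batch_mean = x.mean(dim=[0, 2, 3])
biased_var = x.var(dim=[0, 2, 3], unbiased=False)
unbiased_var = x.var(dim=[0, 2, 3], unbiased=True)

# running = (1 - momentum) * old + momentum * new, starting from mean=0, var=1
expected_mean = 0.9 * torch.zeros(3) + 0.1 * batch_mean
expected_var = 0.9 * torch.ones(3) + 0.1 * unbiased_var
wrong_var = 0.9 * torch.ones(3) + 0.1 * biased_var

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-4))
print(torch.allclose(bn.running_var, expected_var, atol=1e-4))
print(torch.allclose(bn.running_var, wrong_var, atol=1e-4))

# In eval mode the stored running statistics replace the batch statistics
bn.eval()
with torch.no_grad():
    z = bn(x)
manual_eval = (x - bn.running_mean.view(1, 3, 1, 1)) / torch.sqrt(
    bn.running_var.view(1, 3, 1, 1) + bn.eps
)
print(torch.allclose(z, manual_eval, atol=1e-4))
```

After only one step the running statistics are still far from the batch statistics, so the eval-mode output differs noticeably from the training-mode output. In practice they converge over many iterations.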