
BatchNorm

2024-07-12


0. Abstract

I had run into BatchNorm and LayerNorm before, but only vaguely understood that they both perform a standardization:

$$\bm{x} = \frac{\bm{x} - \bm{\mu}}{\bm{\sigma}}$$

That much is simple, but real data is more complicated. In CV tasks the data are images, and a batch usually has shape [B, C, H, W]; in NLP tasks the data are sentences, and a batch usually has shape [B, L, D] (batch, sequence length, embedding dimension). So how does each XXNorm compute $\bm{\mu}$ and $\bm{\sigma}$? This is exactly where they differ.

So I went through a lot of material, and I now wholeheartedly agree with the saying: no amount of reading other people's posts beats working it out yourself! Recommended thread: Why do Transformers use LayerNorm instead of BatchNorm?

1. How are they calculated?

This section explains in detail how BatchNorm and LayerNorm are computed, from both a theoretical and a code perspective.

1.1 Theory first: asking Tongyi Qianwen

>>>>> Tongyi Qianwen >>>>>
Question: What is the difference between LayerNorm and BatchNorm? Explain with formulas.

Answer: Layer Normalization (LayerNorm) and Batch Normalization (BatchNorm) are both techniques that accelerate neural network training by normalizing intermediate activations, making training more stable and efficient. They differ in how they are computed and in the scenarios they suit.

Batch Normalization (BatchNorm)
BatchNorm is mainly used during training to normalize over a mini-batch of data. Its main goal is to reduce internal covariate shift, i.e. the change in the distribution of the inputs to intermediate layers as training iterations proceed. BatchNorm normalizes the input using the mean and variance of the current mini-batch, with the following formula:
Suppose $\bm{x}$ is an input vector within the mini-batch, and $\bm{\mu}_B$ and $\bm{\sigma}^2_B$ are the mean and variance of the mini-batch, respectively. The BatchNorm output $\bm{y}$ can then be expressed as:

$$\hat{\bm{x}} = \frac{\bm{x} - \bm{\mu}_B}{\sqrt{\bm{\sigma}^2_B + \epsilon}}, \qquad \bm{y} = \gamma \hat{\bm{x}} + \beta$$

where $\gamma$ and $\beta$ are learnable parameters, and $\epsilon$ is a small constant for numerical stability.

Layer Normalization (LayerNorm)
LayerNorm instead normalizes at the level of each individual sample, without relying on mini-batch statistics. It is better suited to sequence models (such as LSTM or Transformer), where there may be dependencies between samples or the mini-batch size cannot be fixed. LayerNorm computes the mean and variance over the features of each sample, with the following formula:
Suppose $\bm{x}$ is the feature vector of a single sample, and $\bm{\mu}_L$ and $\bm{\sigma}^2_L$ are the mean and variance over that sample's features, respectively. The LayerNorm output $\bm{y}$ can then be expressed as:

$$\hat{\bm{x}} = \frac{\bm{x} - \bm{\mu}_L}{\sqrt{\bm{\sigma}^2_L + \epsilon}}, \qquad \bm{y} = \gamma \hat{\bm{x}} + \beta$$

As before, $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant.

Differences

  • Data dependence: BatchNorm depends on mini-batch statistics, while LayerNorm depends on the statistics of a single sample;
  • Application scenarios: BatchNorm is better suited to computer vision tasks, while LayerNorm is better suited to natural language processing tasks, especially when handling sequence data;
  • Computational efficiency: BatchNorm can be more efficient to compute in parallel because it exploits parallelism within the mini-batch, whereas LayerNorm's computation is independent for each sample, which suits sequences of varying lengths.

Both methods have their advantages and disadvantages; which one to choose usually depends on the specific application scenario and the characteristics of the data.
<<<<< Tongyi Qianwen <<<<<

All clear? What exactly are "the mean and variance of the current mini-batch"? What are "the mean and variance of each sample's features"? Someone else's answer puts it better:

Let's first pin down the key concepts:

  • instance: a sample; an image in CV, a sentence in NLP;
  • element: a unit of a sample; a pixel in an image, a word in a sentence;
  • feature: an attribute of an element; the RGB values of a pixel, the embedding of a word.

Reflected in the data:

  • [B, C, H, W] is the batch; [C, H, W] is an instance; [C] is a pixel, containing C features.
  • [B, L, D] is the batch; [L, D] is an instance; [D] is a word, containing D features.

As the figure below shows:

Looking along the batch dimension, each small bar extending toward the back represents one element, like the long purple bar in the left picture: the RGB features of a pixel, or one word vector. LayerNorm computes a mean and variance for each element, yielding B×L means and variances (or B×H×W for images); each element is normalized independently. The purple slab in the right picture is one feature, e.g. the first feature of every word in the batch. BatchNorm computes a mean and variance for each feature, yielding D means and variances (or C for images); each feature is normalized independently.
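A minimal sketch of the two kinds of statistics just described (the shapes B, L, D are arbitrary values chosen for illustration):

import torch

B, L, D = 2, 5, 8                             # batch, sequence length, embedding dim
x = torch.randn(B, L, D)

# LayerNorm view: one (mean, var) per element (per word) -> B*L statistics
elem_mean = x.mean(dim=-1)                    # shape [B, L]
elem_var = x.var(dim=-1, unbiased=False)      # shape [B, L]

# BatchNorm view: one (mean, var) per feature -> D statistics
feat_mean = x.mean(dim=(0, 1))                # shape [D]
feat_var = x.var(dim=(0, 1), unbiased=False)  # shape [D]

print(elem_mean.shape)  # torch.Size([2, 5])
print(feat_mean.shape)  # torch.Size([8])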


Note that the Transformer does not follow the LayerNorm scheme described above; rather, it computes a mean and variance for each instance, yielding B means and variances, and then each instance is normalized independently.
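That per-instance variant is easy to write by hand; as a sketch (same illustrative shapes as above), it coincides with nn.LayerNorm applied over the last two dimensions:

import torch
from torch import nn

B, L, D = 2, 5, 8
x = torch.randn(B, L, D)

# one (mean, var) per instance -> B statistics
inst_mean = x.mean(dim=(1, 2), keepdim=True)                # shape [B, 1, 1]
inst_var = x.var(dim=(1, 2), unbiased=False, keepdim=True)  # shape [B, 1, 1]
y_manual = (x - inst_mean) / torch.sqrt(inst_var + 1e-5)    # 1e-5 is LayerNorm's default eps

# the same normalization via nn.LayerNorm over the last two dims
ln = nn.LayerNorm([L, D], elementwise_affine=False)
print(torch.allclose(ln(x), y_manual, atol=1e-5))           # True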

1.2 BatchNorm and LayerNorm in PyTorch
1.2.1 BatchNorm

In PyTorch, BatchNorm comes as nn.BatchNorm1d, nn.BatchNorm2d and nn.BatchNorm3d, for inputs of different shapes:

  • nn.BatchNorm1d: (N, C) or (N, C, L)
  • nn.BatchNorm2d: (N, C, H, W)
  • nn.BatchNorm3d: (N, C, D, H, W)

Looking at the source code:

class BatchNorm1d(_BatchNorm):
	r"""
	Args:
		num_features: number of features or channels `C` of the input

	Shape:
		- Input: `(N, C)` or `(N, C, L)`, where `N` is the batch size,
		  `C` is the number of features or channels, and `L` is the sequence length
		- Output: `(N, C)` or `(N, C, L)` (same shape as input)
	"""
	def _check_input_dim(self, input):
		if input.dim() != 2 and input.dim() != 3:
			raise ValueError(f"expected 2D or 3D input (got {input.dim()}D input)")

Examples:

>>> m = nn.BatchNorm1d(100)  # C=100, with learnable parameters
>>> m = nn.BatchNorm1d(100, affine=False)  # Without Learnable Parameters
>>> input = torch.randn(20, 100)  # (N, C)
>>> output = m(input)
>>> # or
>>> input = torch.randn(20, 100, 30)  # (N, C, L)
>>> output = m(input)
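Consistent with the _check_input_dim shown above, an input with the wrong number of dimensions is rejected, e.g.:

>>> m = nn.BatchNorm1d(100)
>>> m(torch.randn(20, 100, 35, 45))  # 4D input
Traceback (most recent call last):
  ...
ValueError: expected 2D or 3D input (got 4D input)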
$\gamma$ and $\beta$ are the learnable parameters, with shape=(C,); inside the module they are named .weight and .bias:

>>> m = nn.BatchNorm1d(100)
>>> m.weight
Parameter containing:
tensor([1., 1., ..., 1.], requires_grad=True) 
>>> m.weight.shape
torch.Size([100])
>>> m.bias
Parameter containing:
tensor([0., 0., ..., 0.], requires_grad=True)

BatchNorm2d and BatchNorm3d differ only in _check_input_dim(input):

class BatchNorm2d(_BatchNorm):
	r"""
	Args:
		num_features: `C` from an expected input of size `(N, C, H, W)`
	Shape:
		- Input: :math:`(N, C, H, W)`
		- Output: :math:`(N, C, H, W)` (same shape as input)
	"""
	def _check_input_dim(self, input):
		if input.dim() != 4:
			raise ValueError(f"expected 4D input (got {input.dim()}D input)")

Examples:

>>> m = nn.BatchNorm2d(100)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)
class BatchNorm3d(_BatchNorm):
	r"""
	Args:
		num_features: `C` from an expected input of size `(N, C, D, H, W)`
	Shape:
		- Input: :math:`(N, C, D, H, W)`
		- Output: :math:`(N, C, D, H, W)` (same shape as input)
	"""
	def _check_input_dim(self, input):
		if input.dim() != 5:
			raise ValueError(f"expected 5D input (got {input.dim()}D input)")

Examples:

>>> m = nn.BatchNorm3d(100)
>>> input = torch.randn(20, 100, 35, 45, 10)
>>> output = m(input)
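As a quick sanity check (my own addition, not from the docs): in training mode, the output should have roughly zero mean and unit variance per channel, since each channel is normalized over all remaining dimensions:

>>> output.mean(dim=[0, 2, 3, 4])                 # per-channel means: ~0
>>> output.var(dim=[0, 2, 3, 4], unbiased=False)  # per-channel variances: ~1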
1.2.2 LayerNorm

Unlike BatchNorm(num_features), the argument of LayerNorm(normalized_shape) is the last x dimensions of input.shape. For an input of shape [B, L, D]: pass the last two dimensions [L, D], and each sentence is normalized independently; pass D (or [D]), and each word vector is normalized independently.

NLP example

>>> batch, sentence_length, embedding_dim = 20, 5, 10
>>> embedding = torch.randn(batch, sentence_length, embedding_dim)
>>> layer_norm = nn.LayerNorm(embedding_dim)
>>> layer_norm(embedding)  # Activate module

Image example

>>> N, C, H, W = 20, 5, 10, 10
>>> input = torch.randn(N, C, H, W)
>>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
>>> layer_norm = nn.LayerNorm([C, H, W])
>>> output = layer_norm(input)
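Again as a sanity check (my own addition), each instance of the output now has roughly zero mean and unit variance:

>>> output.mean(dim=[1, 2, 3])                 # one mean per instance: ~0
>>> output.var(dim=[1, 2, 3], unbiased=False)  # one variance per instance: ~1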

In other words, LayerNorm is not limited to "normalize each element independently" or "normalize each instance independently"; what it actually computes is "normalize over the last x dimensions".
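To confirm that reading, here is a minimal manual check (a sketch of mine, not from the PyTorch docs) that nn.LayerNorm normalizes exactly over the dimensions given by normalized_shape:

import torch
from torch import nn

x = torch.randn(20, 5, 10)                         # [B, L, D]

# manual: normalize over the last dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + 1e-5)     # 1e-5 is LayerNorm's default eps

ln = nn.LayerNorm(10, elementwise_affine=False)    # normalized_shape = D
print(torch.allclose(ln(x), y_manual, atol=1e-5))  # True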

1.3 Examining the calculation process
import torch
from torch import nn

# >>> manually compute BatchNorm2d >>>
weight = torch.ones([1, 3, 1, 1])  # gamma, one per channel
bias = torch.zeros([1, 3, 1, 1])   # beta, one per channel

x = 10 * torch.randn(2, 3, 4, 4) + 100      # (N, C, H, W), mean ~100, std ~10
mean = x.mean(dim=[0, 2, 3], keepdim=True)  # one mean per channel
std = x.std(dim=[0, 2, 3], keepdim=True, unbiased=False)  # biased estimator
print(x)
print(mean)
print(std)

y = (x - mean) / std
y = y * weight + bias
print(y)
# <<< manually compute BatchNorm2d <<<

# >>> nn.BatchNorm2d >>>
bnm2 = nn.BatchNorm2d(3)
z = bnm2(x)
print(z)
# <<< nn.BatchNorm2d <<<
print(torch.norm(z - y, p=1))  # tiny: only the eps term differs

You will find that the manual calculation and nn.BatchNorm2d compute things in almost exactly the same way, differing only by the $\epsilon$ term. Note the unbiased=False; the official documentation explains:

"""
At train time in the forward pass, the standard-deviation is calculated via the biased estimator,
equivalent to `torch.var(input, unbiased=False)`.
However, the value stored in the moving average of the standard-deviation is calculated via
the unbiased  estimator, equivalent to `torch.var(input, unbiased=True)`.

Also by default, during training this layer keeps running estimates of its computed mean and
variance, which are then used for normalization during evaluation.
"""

I only want to understand the calculation process here, so I won't dig into unbiased; in brief:

  • During training, the standard deviation used for normalization is the biased estimator, while the value stored in the running average of the variance uses the unbiased estimator;
  • During training, the running mean and variance are saved, and they are then used for normalization at evaluation time.
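A short sketch of those two points (values illustrative): the running estimates start at 0 and 1, are updated by each forward pass in training mode, and are the statistics used in eval mode:

import torch
from torch import nn

bn = nn.BatchNorm2d(3)
print(bn.running_mean, bn.running_var)  # initialized to zeros and ones

bn.train()
x = 10 * torch.randn(2, 3, 4, 4) + 100
_ = bn(x)               # forward pass updates the running estimates
print(bn.running_mean)  # 0.9 * 0 + 0.1 * batch_mean ~ 10 (default momentum=0.1)

bn.eval()
z = bn(x)               # eval mode: normalizes with running_mean / running_var,
                        # not with the statistics of this batch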