Technology Sharing

Regularization Techniques in Deep Learning - Noise Robustness

2024-07-12


Preface

With the rapid development of deep learning, model performance and generalization ability have become a central focus of research. However, data in practical applications are often accompanied by various kinds of noise, which arise not only from hardware limitations during data collection but also from factors such as environmental interference and transmission errors. The presence of noise seriously degrades the training and prediction accuracy of deep learning models, especially in tasks such as speech recognition and image classification. Improving the noise robustness of deep learning models, that is, enhancing their stability and recognition ability in noisy environments, has therefore become an important direction of current research. By designing more effective data preprocessing algorithms, optimizing model structures, and introducing techniques such as noise-augmented training, a model's resistance to noise can be significantly improved, promoting its application in more complex scenarios.

Noise robustness

  • In Regularization Techniques in Deep Learning - Dataset Enhancement, we were inspired to apply noise to the input as a dataset augmentation strategy. For some models, adding noise with infinitesimal variance to the input is equivalent to imposing a norm penalty on the weights (Bishop, 1995a,b). In general, noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Adding noise to hidden units is a topic that deserves a separate discussion.
  • Another way noise has been used to regularize models is by adding it to the weights. This technique is used primarily in recurrent neural networks (Jim et al., 1996; Graves, 2011). It can be interpreted as a stochastic implementation of Bayesian inference over the weights. The Bayesian view of learning treats the weights as uncertain quantities whose uncertainty can be represented by a probability distribution; adding noise to the weights is a practical, stochastic way of reflecting this uncertainty.
  • Under certain assumptions, the noise applied to the weights can be interpreted as equivalent to more traditional forms of regularization, encouraging stability in the function to be learned.
  • We study the regression case: training a function $\hat{y}(\boldsymbol{x})$ that maps a set of features $\boldsymbol{x}$ to a scalar, using the least squares cost function to measure the error between the model prediction $\hat{y}$ and the true value $y$:
    $$J=\mathbb{E}_{p(x,y)}\big[(\hat{y}(\boldsymbol{x})-y)^2\big]\qquad\text{(Formula 1)}$$
  • The training set contains $m$ labeled samples $\{(\boldsymbol{x}^{(1)},y^{(1)}),\dots,(\boldsymbol{x}^{(m)},y^{(m)})\}$.
  • Now assume that with each input presentation we also include a random perturbation of the network weights, $\epsilon_{\boldsymbol{W}}\sim\mathcal{N}(\boldsymbol{\epsilon};\boldsymbol{0},\eta\boldsymbol{I})$. Imagine we have a standard $l$-layer MLP. We denote the perturbed model as $\hat{y}_{\epsilon_{\boldsymbol{W}}}(\boldsymbol{x})$.
  • Despite the noise injection, we are still interested in minimizing the squared error of the network output. The objective function is therefore (a minimal training sketch implementing this objective follows this list):
    $$\begin{aligned}\hat{J}_{\boldsymbol{W}}&=\mathbb{E}_{p(x,y,\epsilon_{\boldsymbol{W}})}\big[(\hat{y}_{\epsilon_{\boldsymbol{W}}}(\boldsymbol{x})-y)^2\big]&&\text{(Formula 2)}\\&=\mathbb{E}_{p(x,y,\epsilon_{\boldsymbol{W}})}\big[\hat{y}^2_{\epsilon_{\boldsymbol{W}}}(\boldsymbol{x})-2y\,\hat{y}_{\epsilon_{\boldsymbol{W}}}(\boldsymbol{x})+y^2\big]&&\text{(Formula 3)}\end{aligned}$$
  • For small $\eta$, minimizing $J$ with added weight noise (with covariance $\eta\boldsymbol{I}$) is equivalent to minimizing $J$ with an additional regularization term $\eta\,\mathbb{E}_{p(x,y)}\big[\Vert\nabla_{\boldsymbol{W}}\hat{y}(\boldsymbol{x})\Vert^2\big]$.
  • This form of regularization encourages the parameters to move into regions of parameter space where small perturbations of the weights have relatively little effect on the output. In other words, it pushes the model into regions that are relatively insensitive to small changes in the weights, finding points that are not just minima but minima surrounded by flat regions (Hochreiter and Schmidhuber, 1995).
  • In the simplified case of linear regression (e.g., $\hat{y}(\boldsymbol{x})=\boldsymbol{w}^\top\boldsymbol{x}+b$), this regularization term degenerates to $\eta\,\mathbb{E}_{p(x)}\big[\Vert\boldsymbol{x}\Vert^2\big]$, which is not a function of the model parameters and therefore contributes nothing to the gradient of $\hat{J}_{\boldsymbol{W}}$ with respect to the model parameters.
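The sketch below illustrates, in PyTorch, one way of training with weight noise as described above: at every step a fresh perturbation $\epsilon_{\boldsymbol{W}}\sim\mathcal{N}(\boldsymbol{0},\eta\boldsymbol{I})$ is added to the weights, the squared error of Formula 2 is estimated under the perturbed weights, and the perturbation is removed before the gradient update. The network size, learning rate, value of $\eta$, and the synthetic data are illustrative assumptions, not values from the text.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions, not from the text).
eta = 1e-3          # variance of the weight noise, eps_W ~ N(0, eta * I)
lr = 1e-2
steps = 1000

# A small nonlinear regression model; for a purely linear model the
# induced regularizer would not depend on the parameters (see above).
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=lr)
mse = nn.MSELoss()

# Synthetic training data, purely for illustration.
x = torch.randn(256, 10)
y = torch.sin(x.sum(dim=1, keepdim=True))

for step in range(steps):
    # Sample eps_W and add it to the weights temporarily.
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            eps = torch.randn_like(p) * eta ** 0.5
            p.add_(eps)
            perturbations.append(eps)

    # Squared error under the perturbed weights
    # (a one-sample estimate of Formula 2).
    opt.zero_grad()
    loss = mse(model(x), y)
    loss.backward()

    # Remove the perturbation so the update is applied
    # to the underlying, noise-free weights.
    with torch.no_grad():
        for p, eps in zip(model.parameters(), perturbations):
            p.sub_(eps)

    opt.step()
```

For small $\eta$, this stochastic procedure behaves like adding the penalty $\eta\,\mathbb{E}_{p(x,y)}\big[\Vert\nabla_{\boldsymbol{W}}\hat{y}(\boldsymbol{x})\Vert^2\big]$ directly to the loss, nudging the optimizer toward flat minima.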

Injecting noise into the output target

  • Most datasets have some number of errors in their $y$ labels. When $y$ is wrong, maximizing $\log p(y\mid\boldsymbol{x})$ is harmful.
  • One way to prevent this is to explicitly model the noise on the labels.
    • For example, we can assume that for some small constant $\epsilon$, the training set label $y$ is correct with probability $1-\epsilon$, and otherwise any of the other possible labels might be correct.
    • This assumption can be easily incorporated into the cost function analytically without explicitly taking noise samples.
    • For example, label smoothing regularizes a model based on a softmax with $k$ output values by replacing the hard $0$ and $1$ classification targets with $\displaystyle\frac{\epsilon}{k-1}$ and $1-\epsilon$, respectively (see the sketch after this list).
  • The standard cross entropy loss can then be used with these soft targets. Maximum likelihood learning with a softmax and hard targets may in fact never converge: the softmax can never predict a probability of exactly $0$ or $1$, so it continues to learn larger and larger weights, making its predictions ever more extreme. Other regularization strategies, such as weight decay, can prevent this. The advantage of label smoothing is that it prevents the model from pursuing hard probabilities without discouraging correct classification. This strategy has been used since the 1980s and continues to feature prominently in modern neural networks (Szegedy et al., 2015).
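Below is a minimal sketch of label smoothing with a $k$-way softmax: hard one-hot targets are replaced by $\epsilon/(k-1)$ and $1-\epsilon$, and the standard cross entropy is evaluated against these soft targets. The function names and the choices $k=5$, $\epsilon=0.1$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets: torch.Tensor, k: int, epsilon: float) -> torch.Tensor:
    """Replace hard one-hot targets: epsilon/(k-1) off the true class,
    1 - epsilon on the true class."""
    soft = torch.full((targets.size(0), k), epsilon / (k - 1))
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon)
    return soft

def cross_entropy_soft(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    # Standard cross entropy between the soft targets and the softmax of the logits.
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# Illustration: k = 5 classes, epsilon = 0.1 (both values are arbitrary).
logits = torch.randn(4, 5)                 # model outputs for a batch of 4
hard_targets = torch.tensor([0, 2, 1, 4])  # integer class labels
loss = cross_entropy_soft(logits, smooth_labels(hard_targets, k=5, epsilon=0.1))
```

Recent versions of PyTorch also provide a `label_smoothing` argument on `nn.CrossEntropyLoss`, which implements a closely related smoothing scheme.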

Summary

  • Improving noise robustness is key to ensuring that deep learning models work stably in real-world environments. Through techniques such as data augmentation, noise injection during training, and model structure optimization, we can effectively improve a model's tolerance to noise and its recognition accuracy. These efforts not only advance deep learning technology itself, but also bring more reliable and efficient solutions to practical applications in fields such as speech recognition, image recognition, and natural language processing.
  • In the future, with the deepening of research and the continuous advancement of technology, we have reason to believe that the noise robustness of deep learning models will be further improved, bringing revolutionary changes to more fields.

Previous posts

Regularization Techniques in Deep Learning - Dataset Enhancement