Summary of interview questions for large models/NLP/algorithms 9 — Will switching from normal attention to multi-head attention lead to a surge in parameters?
2024-07-11
Switching from ordinary attention to multi-head attention usually does not cause a surge in the total number of parameters. On the contrary, with common implementations and configurations the increase in parameters is modest, and in some cases the parameter count can be kept essentially unchanged by choosing the dimensions appropriately.
Parameter analysis
- Basic composition:
- Ordinary attention: typically consists of a set of linear transformation matrices for computing the queries (Q), keys (K), and values (V), plus a matrix for the output transformation.
- Multi-head attention: the input features are split into multiple "heads"; each head independently computes its own queries, keys, and values and runs its own attention. The outputs of all heads are then concatenated and passed through an additional linear transformation to produce the final output.
- Change in parameter count:
- In multi-head attention, each head has its own query, key, and value transformation matrices (W_q, W_k, W_v), and there is one linear transformation matrix for the final output (W_o). However, although the number of heads increases, the size of each head's matrices (i.e., the dimensions of each linear transformation) is usually reduced accordingly, so the overall parameter count stays controllable.
- For example, if the query, key, and value transformation matrices in the original single-head attention have dimension d_model, then in multi-head attention with h heads, each head's query, key, and value matrices are typically reduced to dimension d_model/h (or a close value, depending on whether the overall dimensionality needs to stay consistent). The dimension of the final output transformation matrix W_o is adjusted to match; a concrete parameter count is worked out in the sketch after this list.
- Advantages of parallel computing:
- One of the main advantages of the multi-head attention mechanism is that all heads can be processed in parallel, which speeds up computation. Although increasing the number of heads may appear to raise the computational cost, the per-head dimensions shrink correspondingly and the heads can be evaluated together in batched matrix multiplications, so overall efficiency can actually improve; a minimal implementation illustrating this is sketched at the end of the article.
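To make the comparison concrete, the sketch below counts the weight parameters in both variants under the common convention that each head's dimension is d_model / h. The values d_model = 512 and h = 8 are illustrative assumptions, and bias terms are ignored.

```python
# Minimal parameter-count sketch (plain Python); d_model = 512 and h = 8
# are illustrative assumptions, and bias terms are ignored.
d_model = 512
h = 8
d_head = d_model // h  # per-head dimension, 64 here

# Single-head attention: W_q, W_k, W_v, W_o are each d_model x d_model.
single_head_params = 4 * d_model * d_model

# Multi-head attention: each head has W_q, W_k, W_v of shape d_model x d_head;
# the concatenated output (h * d_head = d_model) goes through one
# d_model x d_model output matrix W_o.
multi_head_params = 3 * h * (d_model * d_head) + d_model * d_model

print(single_head_params)  # 1048576
print(multi_head_params)   # 1048576 -> identical when d_head = d_model / h
```

With this convention the two counts are exactly equal: the heads do not multiply the parameters, they partition the same d_model-sized projections.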
Conclusion
Therefore, when switching from ordinary attention to multi-head attention, additional matrices are indeed introduced (mainly each head's query, key, and value transformations), but this does not have to translate into a surge in parameters. By reducing the per-head dimensions appropriately and sizing the final output transformation matrix W_o to match, the overall parameter count can be kept essentially the same. In addition, the parallelism of the multi-head mechanism helps keep computation efficient.
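As a rough illustration of the parallelism mentioned above, the following NumPy sketch (an illustrative example, not the formulation of any particular library) evaluates all h heads with batched matrix multiplications after a single reshape; d_model = 512, h = 8, the sequence length, and the random weights are placeholders chosen for demonstration.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // h

    # Project once, then split the last dimension into h heads:
    # (seq_len, d_model) -> (h, seq_len, d_head)
    def split_heads(m):
        return m.reshape(seq_len, h, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # All h heads are computed in one batched matmul -- this is the
    # parallelism the bullet above refers to.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # (h, seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    heads = weights @ v                                       # (h, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model) and apply W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Example usage with random weights (illustrative sizes only).
d_model, h, seq_len = 512, 8, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (0.02 * rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, h).shape)  # (10, 512)
```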