2024-07-12
In the vast universe of artificial intelligence, natural language processing (NLP) has always been a field full of challenges and opportunities. With the development of technology, we have witnessed the evolution from traditional rules to statistical machine learning, and then to deep learning and pre-trained models. Today, we are standing on the threshold of large language models (LLMs), which are redefining the way we communicate with machines. This article will take a deep look at the development history of LLMs, the technical routes, and their impact on the future of AI.
The goal of natural language processing (NLP) is to enable machines to understand, interpret, and generate human language. The field has gone through several important stages, each marking a leap in the depth of language understanding: from early rule-based systems, to statistical learning methods, to deep learning models, to today's LLMs, each step builds on and surpasses the one before it.
In the early days of NLP, researchers relied on hand-written rules to process language. The technology stack at this stage included finite state machines and rule-based systems. For example, Apertium is a rule-based machine translation system that shows how early researchers achieved automatic translation of languages by manually organizing dictionaries and writing rules.
Over time, researchers began to turn to statistical learning methods, using tools such as support vector machines (SVM), hidden Markov models (HMM), maximum entropy models (MaxEnt), and conditional random fields (CRF). This stage is characterized by the combination of a small amount of manually annotated domain data and manual feature engineering, marking the transition from manually written rules to machines automatically learning knowledge from data.
The emergence of deep learning brought revolutionary changes to NLP. Techniques such as the encoder-decoder architecture, long short-term memory networks (LSTM), attention mechanisms, and embeddings enable models to process much larger datasets while requiring almost no manual feature engineering. Google's neural machine translation system (2016) is a representative work of this stage.
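To make the attention mechanism mentioned above concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. This is an illustrative toy, not any production implementation: each output row is a weighted sum of the value vectors, weighted by how similar the corresponding query is to each key.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: shift by the max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weight V by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# 3 query vectors attending over 4 key/value pairs, dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise saturate the softmax.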
The emergence of pre-trained models marks another leap forward for NLP. The technology stack, with the Transformer and its attention mechanism at the core, uses massive amounts of unlabeled data for self-supervised learning to build general knowledge, then adapts to specific tasks through fine-tuning. This stage marked a sharp break because it expanded the range of usable data from labeled data to unlabeled data.
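The key trick of self-supervised learning is that the training signal is manufactured from the unlabeled text itself. A minimal sketch of the data side of a BERT-style masked-language-modeling objective (the masking rate and token names here are illustrative, not any library's actual API):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Return (corrupted_input, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok   # the label comes from the text itself
        else:
            corrupted.append(tok)
    return corrupted, targets

inp, tgt = mask_tokens("the cat sat on the mat".split(), mask_prob=0.3)
print(inp)
print(tgt)
```

A model trained to recover the masked tokens needs no human annotation at all, which is what makes the entire web usable as training data.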
LLMs represent the latest development in language models. They typically use a decoder-based architecture built on Transformers, combined with reinforcement learning from human feedback (RLHF). This phase is characterized by a two-stage process: pre-training and human alignment. The pre-training phase uses massive unlabeled data plus domain data to build knowledge through self-supervised learning; the human-alignment phase aligns the model with human preferences and values, enabling it to adapt to a wide variety of tasks.
Looking back at each stage of development, we can see the following trends:
Data: from data to knowledge, with more and more data being utilized. Future: more text data, then more data in other forms → any data.
Algorithms: increasingly expressive, increasingly large-scale, increasingly autonomous in learning, and moving from specialized to general. Future: the Transformer seems sufficient for now; new models (which should emphasize learning efficiency)? → AGI?
Human-machine relationship: the human role has shifted backward, from instructor to supervisor. Future: human-machine collaboration; machines learn from humans → humans learn from machines? → machines expand the boundaries of human knowledge.
In the past few years, the development of LLM technology has shown a variety of paths, including BERT mode, GPT mode and T5 mode, etc. Each mode has its own characteristics and applicable scenarios.
The BERT mode suits natural language understanding tasks through a two-stage process: bidirectional language model pre-training followed by task fine-tuning. Pre-training extracts general knowledge from general data, while fine-tuning extracts domain knowledge from domain data.
Applicable scenarios: better suited to natural language understanding and to specific tasks within a single scenario; focused but lightweight.
The GPT mode follows a one-stage process: unidirectional (autoregressive) language model pre-training, with tasks then specified via zero-shot/few-shot prompts or instructions. It suits natural language generation tasks. GPT-mode models are usually the largest current LLMs, and they can handle a wider range of tasks.
Applicable scenarios: better suited to natural language generation tasks. The largest current LLMs all follow this mode: the GPT series, PaLM, LaMDA, and so on. For generation tasks or general-purpose models, the GPT mode is recommended.
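The zero-shot/few-shot idea above means the task is specified entirely in the input text, with no gradient updates. A hypothetical sketch of how a few-shot prompt is assembled for a GPT-style model (the format and helper name are illustrative, not any particular API):

```python
def few_shot_prompt(instruction, examples, query):
    """Build a prompt from an instruction, (input, output) demos, and a query."""
    lines = [instruction, ""]
    for x, y in examples:
        lines += [f"Input: {x}", f"Output: {y}", ""]
    lines += [f"Input: {query}", "Output:"]  # the model continues from here
    return "\n".join(lines)

demos = [("I loved this film", "positive"), ("Terribly boring", "negative")]
prompt = few_shot_prompt("Classify the sentiment of each review.",
                         demos, "A delightful surprise")
print(prompt)
```

The same pre-trained model handles classification, translation, or summarization just by changing this string, which is why the GPT mode generalizes so widely.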
The T5 mode combines features of BERT and GPT and suits both generation and understanding tasks. It is a two-stage approach: language model pre-training followed by fine-tuning. Its span-corruption pre-training task is an effective objective that performs well on natural language understanding tasks.
Features: GPT-like in form, BERT-like in spirit.
Applicable scenarios: capable of both generation and understanding; in terms of performance, better suited to natural language understanding tasks. Many large LLMs in China adopt this mode. For a natural language understanding task in a single domain, the T5 mode is recommended.
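The span-corruption objective mentioned above can be sketched in a few lines: contiguous spans of the input are replaced by sentinel tokens, and the target sequence reconstructs the dropped spans after matching sentinels. Span positions are fixed here for clarity; T5 itself samples them randomly.

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) half-open ranges to drop."""
    inp, tgt, cursor = [], [], 0
    for sid, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{sid}>"
        inp += tokens[cursor:s] + [sentinel]   # keep text, drop the span
        tgt += [sentinel] + tokens[s:e]        # target reproduces the span
        cursor = e
    inp += tokens[cursor:]
    tgt.append(f"<extra_id_{len(spans)}>")     # end-of-targets sentinel
    return inp, tgt

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(toks, [(1, 3), (6, 7)])
print(" ".join(inp))  # Thank <extra_id_0> inviting me to <extra_id_1> party last week
print(" ".join(tgt))  # <extra_id_0> you for <extra_id_1> your <extra_id_2>
```

Because the target is itself a short generated sequence, this objective trains generation machinery while the bidirectional encoder view gives it BERT-like understanding, matching the "GPT-like in form, BERT-like in spirit" characterization.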
Super-large LLMs: the pursuit of zero-shot/few-shot/instruct performance.
Conclusions of current research (when the model is not large):
Conclusions of current research (at super-large scale):
Fact: almost all LLMs larger than 100B parameters adopt the GPT mode.
Possible reasons:
1. The bidirectional attention in an Encoder-Decoder impairs zero-shot ability (check).
2. When an Encoder-Decoder generates a token, the decoder can only attend to the top layer of the encoder; a Decoder-only model can attend to every layer when generating a token, so the information it uses is finer-grained.
3. Encoder-Decoder training (fill-in-the-blank) is inconsistent with generation (predicting the next token); in a Decoder-only model, the training and generation procedures are consistent.
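The structural difference behind these points is visible in the attention masks themselves. A small illustration: a decoder-only model uses a causal (lower-triangular) mask, so each token sees only its predecessors, exactly as at generation time; an encoder is fully bidirectional.

```python
import numpy as np

n = 4  # sequence length
causal = np.tril(np.ones((n, n), dtype=bool))    # token i attends to tokens 0..i
bidirectional = np.ones((n, n), dtype=bool)      # every token attends to every token

print(causal.astype(int))
# Row i has ones only up to column i: training on next-token prediction
# uses the same visibility pattern as left-to-right generation.
```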
As model size grows, researchers face the challenge of using the parameter space effectively. The Chinchilla work shows that, given sufficient data, current LLMs may be larger than the ideal size, wasting parameter space. At the same time, the scaling laws indicate that the larger the model, the more data, and the more thorough the training, the better the LLM performs. A more practical path is to start small (GPT-3 need not have been so large) and then scale up (once the model's parameters are fully utilized, continue to grow it).
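The Chinchilla result is often summarized by two rough rules of thumb: a compute-optimal model wants on the order of 20 training tokens per parameter, and training costs roughly 6·N·D FLOPs for N parameters and D tokens. A back-of-the-envelope check of the "GPT-3 was too large for its data" claim under those approximations:

```python
def chinchilla_estimate(n_params):
    """Rough compute-optimal data and training cost (Chinchilla rules of thumb)."""
    optimal_tokens = 20 * n_params           # ~20 tokens per parameter
    train_flops = 6 * n_params * optimal_tokens
    return optimal_tokens, train_flops

# GPT-3 scale: 175B parameters, but it was trained on only ~300B tokens
tokens, flops = chinchilla_estimate(175e9)
print(f"compute-optimal tokens: {tokens:.1e}")  # 3.5e+12, far more than 300B
```

By this estimate a 175B-parameter model would want trillions of tokens, which is why a smaller model trained on more data (Chinchilla itself: 70B parameters, 1.4T tokens) can match a much larger, under-trained one.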
Of course, given that multimodal LLMs require richer perception of the real-world environment, they also place higher demands on LLM parameter capacity.
Multimodal LLM: visual input (pictures, videos), auditory input (audio), tactile input (pressure)
Problems: multimodal LLMs seem to work well, but they rely heavily on manually curated large datasets. For example, ALIGN uses 1.8B images and LAION provides 5.8B image-text pairs (filtered by CLIP; currently the largest image dataset); for now, this is essentially text paired with images.
Image processing: the self-supervised route is being explored (contrastive learning / MAE) but has not yet fully landed; if it does, it will be another huge technological breakthrough in AI.
If this works, some current image-understanding tasks (semantic segmentation, recognition, etc.) will likely be absorbed into LLMs and disappear as standalone tasks.
Although current LLMs have some ability to perform simple reasoning, they still fall short on complex reasoning. For example, tasks such as multi-digit addition remain a challenge. Researchers are exploring how to distill complex reasoning ability into smaller models through techniques such as semantic decomposition.
Of course, this problem can also be circumvented by outsourcing capabilities, for example by combining the LLM with tools: exact computation (an external calculator), fresh information lookup (a search engine), and other capabilities can be delegated to external tools.
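A toy sketch of this outsourcing idea: route sub-tasks the model is weak at (exact arithmetic here) to an external tool, keeping the LLM as the dispatcher. The `CALC[...]` call format and the tool registry are hypothetical stand-ins, not any real framework's protocol.

```python
import re

def calculator(expr):
    """External exact-arithmetic tool; only digits, whitespace, + - * ( ) allowed."""
    if not re.fullmatch(r"[\d\s+\-*()]+", expr):
        raise ValueError("unsupported expression")
    return eval(expr)  # tolerable in a toy because the input is whitelisted

TOOLS = {"CALC": calculator}

def dispatch(model_output):
    """Execute tool calls of the form TOOL[args] emitted by the model."""
    m = re.fullmatch(r"(\w+)\[(.+)\]", model_output.strip())
    if m and m.group(1) in TOOLS:
        return TOOLS[m.group(1)](m.group(2))
    return model_output  # plain text: no tool needed

# The model answers multi-digit addition by emitting a tool call
print(dispatch("CALC[123456789 + 987654321]"))  # 1111111110
```

The LLM never needs to carry the digits itself; it only needs to learn when to emit a tool call, which is a much easier pattern to acquire than exact arithmetic.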
The concept of embodied intelligence combines LLMs with robotics, acquiring embodied intelligence through interaction with the physical world via reinforcement learning. For example, Google's PaLM-E model combines the 540B-parameter PaLM with a 22B-parameter ViT, demonstrating the potential of LLMs in a multimodal setting.
This article has explored the development history of LLMs, their technical routes, and their impact on the future of AI. The development of LLMs is not only a technological advance but also a profound rethinking of how machines understand us. From rules to statistics, to deep learning and pre-training, each step has given us new perspectives and tools. Today, we stand at the threshold of a new era of large language models, facing unprecedented opportunities and challenges.