Technology Sharing

Language Model Evolution: The Journey from NLP to LLM

2024-07-12


In the vast universe of artificial intelligence, natural language processing (NLP) has always been a field full of challenges and opportunities. With the development of technology, we have witnessed the evolution from traditional rules to statistical machine learning, and then to deep learning and pre-trained models. Today, we are standing on the threshold of large language models (LLMs), which are redefining the way we communicate with machines. This article will take a deep look at the development history of LLMs, the technical routes, and their impact on the future of AI.

Introduction

The goal of natural language processing (NLP) is to enable machines to understand, interpret, and generate human language. The development of this field has gone through several important stages, each of which marks a leap in the depth of language understanding. From early rule-based systems, to statistical learning methods, to deep learning models, to today's large language models (LLMs), each step is a transcendence of the previous stage.

From rules to statistics: Early exploration of NLP

Rule phase (1956-1992)

In the early days of NLP, researchers relied on hand-written rules to process language. The technology stack at this stage included finite state machines and rule-based systems. For example, Apertium is a rule-based machine translation system that shows how early researchers achieved automatic translation by manually building dictionaries and writing rules.
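To make the rule-based approach concrete, here is a toy sketch of translation by dictionary lookup plus one hand-written reordering rule. The vocabulary and the single transfer rule are invented for illustration and are not taken from Apertium:

```python
# Toy rule-based translator: a bilingual dictionary plus one reordering rule.
# All entries are invented for illustration; real systems like Apertium use
# large hand-built dictionaries and many morphological/transfer rules.

LEXICON = {"the": "le", "cat": "chat", "black": "noir", "sleeps": "dort"}
ADJECTIVES = {"black"}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    # Transfer rule: English "adjective noun" -> French "noun adjective".
    reordered = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES:
            reordered += [words[i + 1], words[i]]
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Dictionary lookup, leaving unknown words untranslated.
    return " ".join(LEXICON.get(w, w) for w in reordered)

print(translate("the black cat sleeps"))  # -> le chat noir dort
```

The appeal of this approach was full transparency; its weakness, which motivated the statistical turn, was that every new phenomenon needed another hand-written rule.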

Statistical Machine Learning Stage (1993-2012)

Over time, researchers began to turn to statistical learning methods, using tools such as support vector machines (SVM), hidden Markov models (HMM), maximum entropy models (MaxEnt), and conditional random fields (CRF). This stage is characterized by the combination of a small amount of manually annotated domain data and manual feature engineering, marking the transition from manually written rules to machines automatically learning knowledge from data.
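As an illustration of this stage, here is a minimal Viterbi decoder for a toy HMM part-of-speech tagger. All probabilities are invented for illustration; in real systems of that era they were estimated from manually annotated corpora:

```python
# Minimal Viterbi decoding for a toy HMM POS tagger.
# All probabilities below are made up for illustration.

states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5,  "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9},
}

def viterbi(words):
    # V[t][s] = probability of the best tag sequence ending in state s at step t.
    V = [{s: start_p[s] * emit_p[s][words[0]] for s in states}]
    back = []
    for word in words[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] * trans_p[p][s])
            col[s] = V[-1][best_prev] * trans_p[best_prev][s] * emit_p[s][word]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # Trace back the best path from the most probable final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # -> ['DET', 'NOUN', 'VERB']
```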

Deep learning breakthroughs: opening a new era

Deep learning stage (2013-2018)

The emergence of deep learning has brought revolutionary changes to NLP. Technologies such as encoder-decoder, long short-term memory network (LSTM), attention mechanism and embedding enable models to process larger data sets and require almost no manual feature engineering. Google's neural machine translation system (2016) is a representative work of this stage.
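The attention mechanism mentioned above can be sketched in a few lines of NumPy. This is single-head scaled dot-product attention in its simplest form, not any specific system's implementation:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
out, w = attention(q, q, q)   # self-attention over 4 positions, dim 8
print(out.shape)              # (4, 8)
print(w.sum(axis=-1))         # each attention row sums to 1
```

Each output position is a weighted mixture of all value vectors, which is what lets these models relate distant words without hand-crafted features.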

The rise of pre-trained models: self-discovery of knowledge

Pre-training phase (2018-2022)

The emergence of pre-trained models marks another leap forward for NLP. The technology stack, with the Transformer and attention mechanism at its core, combines massive unlabeled data with self-supervised learning to acquire general knowledge, which is then adapted to specific tasks through fine-tuning. This stage was a sharp break with what came before, because it expanded the range of usable data from labeled data to unlabeled data.

The new era of LLM: the fusion of intelligence and versatility

LLM stage (2023-?)

LLMs represent the latest development in language models. They typically use a decoder-only architecture built on the Transformer, combined with reinforcement learning from human feedback (RLHF). This phase is characterized by a two-stage process: pre-training and human alignment. The pre-training stage uses massive unlabeled data plus domain data to acquire knowledge through self-supervised learning; the human alignment stage aligns the model with human habits and values, enabling it to adapt to a wide variety of tasks.
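As a sketch of the alignment stage, here is the pairwise preference loss commonly used to train RLHF reward models, -log σ(r_chosen − r_rejected). The scalar scores below are made-up numbers standing in for a reward model's outputs on two candidate replies:

```python
import math

# Pairwise preference loss for an RLHF reward model:
# loss = -log(sigmoid(r_chosen - r_rejected)).
# The loss shrinks as the human-preferred reply is scored higher.
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(2.0, 0.0))  # small: the ranking agrees with the label
print(preference_loss(0.0, 2.0))  # large: the ranking contradicts the label
```

The trained reward model then scores whole generations, and reinforcement learning pushes the policy toward replies that score highly.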
Looking back at each stage of development, we can see the following trends:

Data: from data to knowledge, with ever more data being utilized. Future: more text data, more data in other forms → any data.
Algorithms: increasingly expressive, increasingly large-scale, increasingly autonomous in their learning, and moving from specialized to general. Future: the Transformer seems sufficient for now; new models (which should emphasize learning efficiency)? → AGI?
Human-machine relationship: humans have moved back in the loop, from instructor to supervisor. Future: human-machine collaboration, with machines learning from humans → humans learning from machines? → machines expanding the boundaries of human knowledge.


LLM technology development route: diverse paths

Over the past few years, LLM technology has developed along a variety of paths, including the BERT mode, the GPT mode, and the T5 mode. Each mode has its own characteristics and applicable scenarios.

BERT Mode (Encoder-Only)

The BERT mode uses a two-stage process: bidirectional language model pre-training followed by task fine-tuning. Pre-training extracts general knowledge from general data, while fine-tuning extracts domain knowledge from domain data, which makes this mode well suited to natural language understanding tasks.
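The pre-training half of this two-stage process can be sketched as a masked-language-model objective: hide some tokens and train the model to recover them, so the labels come from the raw text itself with no human annotation. Tokenization is simplified to whitespace words here; real BERT uses WordPiece subwords:

```python
import random

# Sketch of the masked-LM objective behind BERT pre-training: labels are
# derived from the unlabeled text itself. Simplified to whitespace tokens.
def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)     # no loss at unmasked positions
    return inputs, labels

inputs, labels = make_mlm_example("the cat sat on the mat".split(), mask_prob=0.5)
print(inputs)
print(labels)
```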
Applicable scenarios: better suited to natural language understanding and to specific tasks within a single scenario; focused and lightweight.

GPT Mode (Decoder-Only)

The GPT mode uses a one-stage process: unidirectional language model pre-training followed by zero-shot/few-shot prompting or instructions. It is suited to natural language generation tasks. GPT-mode models are usually the largest LLMs at present, and they can handle a wider range of tasks.
Applicable scenarios: better suited to natural language generation tasks. The largest current LLMs all follow this mode: the GPT series, PaLM, LaMDA, and so on. For generation tasks or general-purpose models, the GPT mode is recommended.

T5 Mode (Encoder-Decoder)

The T5 mode combines features of BERT and GPT and suits both generation and understanding tasks. Its span-corruption task is an effective pre-training method that performs well on natural language understanding tasks. It is a two-stage process: language model pre-training followed by fine-tuning.
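The span-corruption objective can be illustrated concretely: selected spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the span it replaced. The spans are hard-coded here for clarity, whereas T5 samples them randomly:

```python
# Sketch of T5's span-corruption pre-training objective. Sentinel tokens
# follow T5's <extra_id_n> naming; span positions are fixed for clarity.
def span_corrupt(tokens, spans):
    """spans: non-overlapping (start, end) index pairs to corrupt, in order."""
    inputs, target = [], []
    pos, sentinel = 0, 0
    for start, end in spans:
        inputs += tokens[pos:start] + [f"<extra_id_{sentinel}>"]
        target += [f"<extra_id_{sentinel}>"] + tokens[start:end]
        pos, sentinel = end, sentinel + 1
    inputs += tokens[pos:]
    target += [f"<extra_id_{sentinel}>"]  # final sentinel closes the target
    return inputs, target

tokens = "thank you for inviting me to your party last week".split()
inputs, target = span_corrupt(tokens, [(1, 2), (6, 8)])
print(" ".join(inputs))  # thank <extra_id_0> for inviting me to <extra_id_1> last week
print(" ".join(target))  # <extra_id_0> you <extra_id_1> your party <extra_id_2>
```

The encoder reads the corrupted input bidirectionally while the decoder generates the target, which is why this mode sits between BERT and GPT.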
Features: resembles GPT in form but BERT in spirit.
Applicable scenarios: capable of both generation and understanding; in terms of performance it is better suited to natural language understanding tasks, and many large Chinese LLMs adopt this mode. For a natural language understanding task in a single domain, the T5 mode is recommended.

Why are all large LLMs in GPT mode?

Super-large LLMs pursue strong zero-shot/few-shot/instruct performance.

Conclusions of current research (when the model is not large):

  • Natural language understanding: T5 mode works best.
  • Natural language generation category: GPT mode works best.
  • Zero shot: GPT mode works best.
    If multi-task fine-tuning is introduced after pre-training, the T5 mode performs better (though this conclusion is questionable: the Encoder-Decoder in those experiments had twice as many parameters as the Decoder-only model, so is the comparison reliable?).

Current research conclusions (super large scale):
Fact: almost all LLMs larger than 100B parameters adopt the GPT mode.

Possible reasons:
1. The bidirectional attention in the Encoder-Decoder structure impairs zero-shot capability (to be verified).
2. When an Encoder-Decoder generates a token, it can attend only to the Encoder's top layer; a Decoder-only model can attend to every layer, so the information it draws on is more fine-grained.
3. Encoder-Decoder training is a "fill in the blank" task while generation predicts the next token, which is inconsistent; in a Decoder-only structure, training and generation work the same way.
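Point 3 can be made concrete with a sketch of how decoder-only training data is laid out: every position is trained to predict the next token, which is exactly what generation does:

```python
# Decoder-only training: one sequence yields a (context -> next token)
# prediction at every position, via inputs and targets shifted by one.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it must predict at each position

for i in range(len(inputs)):
    context = inputs[: i + 1]
    print(f"{' '.join(context):<24} -> {targets[i]}")
```

Generation simply continues this same process: append the predicted token to the context and predict again, so there is no train/inference mismatch.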

Challenges and opportunities of super large LLM

As model sizes grow, researchers face the challenge of using the parameter space effectively. Research on the Chinchilla model suggests that, given sufficient data, current LLMs may be larger than the ideal size, wasting parameter space. The scaling laws, however, also indicate that larger models, more data, and more thorough training yield better LLMs. A more feasible path is to start small (GPT-3 need not have been so large) and then scale up (growing the model further only after its parameters are fully utilized).
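The Chinchilla finding is often summarized by the rule of thumb that training compute C ≈ 6·N·D (N parameters, D tokens), with the compute-optimal point at roughly D ≈ 20·N. A sketch under those approximations; the "20 tokens per parameter" figure is the commonly cited rough ratio, not an exact law:

```python
# Compute-optimal sizing under two common approximations:
#   training FLOPs  C ≈ 6 * N * D
#   optimal ratio   D ≈ 20 * N   (tokens per parameter, rough rule of thumb)
def compute_optimal(C_flops, tokens_per_param=20):
    N = (C_flops / (6 * tokens_per_param)) ** 0.5  # parameters
    D = tokens_per_param * N                       # training tokens
    return N, D

# A budget of ~5.88e23 FLOPs lands near Chinchilla's reported configuration
# of roughly 70B parameters trained on roughly 1.4T tokens.
N, D = compute_optimal(5.88e23)
print(f"params ~ {N/1e9:.0f}B, tokens ~ {D/1e9:.0f}B")  # params ~ 70B, tokens ~ 1400B
```

Under this rule, a model trained on far fewer than ~20 tokens per parameter is oversized for its data budget, which is exactly the "wasted parameter space" point above.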

Of course, given that multimodal LLM requires richer real-world environment perception capabilities, higher requirements are also placed on LLM parameters.
Multimodal LLM: visual input (pictures, videos), auditory input (audio), tactile input (pressure)
Problems: multimodal LLMs seem to work well, but they rely heavily on large, manually curated datasets.

For example, ALIGN (1.8B images) and LAION (5.8B images, filtered with CLIP, currently the largest image dataset) are paired image-text datasets.

Image processing: the self-supervised route (contrastive learning, MAE) is being tried but has not yet fully succeeded; if it can be made to work, it will be another huge technological breakthrough in the field of AI.

If this works, some current image understanding tasks (semantic segmentation, recognition, etc.) will probably be absorbed into LLMs and disappear as separate tasks.


Improving the complex reasoning ability of LLM

Although current LLMs have some capacity for simple reasoning, they still fall short on complex reasoning; tasks such as multi-digit addition remain a challenge. Researchers are exploring how to distill complex reasoning capabilities into smaller models through techniques such as semantic decomposition.
Of course, this problem can also be circumvented by outsourcing capabilities to external tools: arithmetic can be delegated to a calculator, querying new information to a search engine, and so on.
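A minimal sketch of the calculator idea: the model emits a tool call instead of doing the arithmetic itself, and a post-processor substitutes the exact result. The `<<expr>>` tag convention is invented for illustration, and `eval` is gated by a character whitelist since this is only a toy:

```python
import re

# Toy tool-use post-processor: replace <<expr>> tool calls in model output
# with exact calculator results. The tag convention is invented here.
def run_calculator_calls(text: str) -> str:
    def evaluate(match):
        expr = match.group(1)
        # Only allow digits, whitespace, and basic arithmetic before eval.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            return match.group(0)  # leave unrecognized calls untouched
        return str(eval(expr))
    return re.sub(r"<<(.+?)>>", evaluate, text)

draft = "The total is <<123456 + 654321>> units."
print(run_calculator_calls(draft))  # The total is 777777 units.
```

The model only needs to learn *when* to call the tool; the hard, error-prone arithmetic never touches its weights.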

Interaction of LLM with the physical world

The concept of embodied intelligence combines LLMs with robotics, acquiring intelligence through interaction with the physical world via reinforcement learning. For example, Google's PaLM-E model combines the 540B-parameter PaLM with a 22B-parameter ViT, demonstrating the potential of LLMs in a multimodal setting.

Other research directions

  1. Acquiring new knowledge: currently difficult, though some methods exist (LLM + retrieval).
  2. Correcting old knowledge: some research results exist but still need optimization.
  3. Integrating private-domain knowledge: fine-tuning?
  4. Better instruction following: still needs optimization (models can still produce confident-sounding nonsense).
  5. Reducing training and inference costs: expect rapid progress over the next one to two years.
  6. Building Chinese evaluation datasets: a test of capability. English has evaluation suites such as HELM and BigBench, but Chinese lacks multi-task, high-difficulty, multi-angle evaluation datasets.
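The LLM + retrieval idea in point 1 can be sketched as a word-overlap retriever whose top passage is prepended to the prompt, letting the model draw on knowledge added after its training cutoff. The documents and scoring below are invented for illustration; real systems use learned embeddings rather than word overlap:

```python
# Toy retrieval-augmented prompting: pick the document with the most words
# in common with the question and prepend it as context. All documents here
# are made up for illustration.
DOCS = [
    "The 2024 conference will be held in Vienna in October.",
    "The library cafeteria opens at 8am on weekdays.",
    "Model v2 was released in March with a larger context window.",
]

def retrieve(query: str, docs, k=1):
    q_words = set(query.lower().split())
    scored = sorted(docs, reverse=True,
                    key=lambda d: len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("When was model v2 released?"))
```

Because the knowledge lives in the retrieved text rather than in the weights, updating it is just a matter of editing the document store.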

Conclusion

This article explores in depth the development history, technical routes and impact of LLM on the future of AI. The development of LLM is not only a technological advancement, but also a profound reflection on our ability to understand machines. From rules to statistics, to deep learning and pre-training, each step provides us with new perspectives and tools. Today, we are standing at the threshold of a new era of large language models, facing unprecedented opportunities and challenges.