2024-07-12
The following is my personal summary after reading the entire GPT-2 paper; it covers the paper's main content. If you are short on time, reading only the [Paper Summary] section is enough.
The authors built their own web crawler to construct the WebText dataset. Part of the crawled pages are the outbound links shared on the social platform Reddit, kept only if they received at least 3 karma, so the pages were effectively filtered by human readers; WebText contains about 45,000,000 such links. Another part comes from news websites. As of December 2017 the dataset had reached about 8,000,000 documents, roughly 40 GB of text in total. The paper also notes that Wikipedia documents were removed from the training set to avoid overlap with common evaluation data. In this sense, millions of people around the world took part in creating and cleaning the dataset used to train GPT-2.
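As a rough illustration of this crowd-filtered crawl, here is a minimal sketch that assumes we already have a list of (url, karma) records extracted from Reddit posts. The karma threshold of 3 and the removal of Wikipedia pages match the paper; the data format and helper names are hypothetical.

```python
# Minimal sketch of karma-based link filtering and de-duplication for a
# WebText-style crawl. The input format and helper names are hypothetical;
# only the ">= 3 karma" rule and the Wikipedia removal come from the paper.
from urllib.parse import urlparse

def filter_links(records, min_karma=3):
    """records: iterable of (url, karma) pairs scraped from Reddit posts."""
    seen = set()
    kept = []
    for url, karma in records:
        if karma < min_karma:
            continue                      # drop links few readers found useful
        host = urlparse(url).netloc.lower()
        if host.endswith("wikipedia.org"):
            continue                      # the paper removes Wikipedia documents
        if url in seen:
            continue                      # simple exact-URL de-duplication
        seen.add(url)
        kept.append(url)
    return kept

if __name__ == "__main__":
    demo = [
        ("https://example.com/article", 17),
        ("https://en.wikipedia.org/wiki/Language_model", 120),
        ("https://example.com/article", 17),     # duplicate
        ("https://example.org/low-quality", 1),  # below karma threshold
    ]
    print(filter_links(demo))  # ['https://example.com/article']
```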
For the input representation, the authors designed a hybrid of word-level and byte-level representations: a byte-level Byte Pair Encoding (BPE). Word-level representations benefit from strong priors about frequent tokens, while byte-level representations can encode any string and therefore generalize better. To keep the vocabulary efficient, BPE is prevented from merging across character categories, which removes the many near-duplicate variants of common words (e.g. "dog", "dog!", "dog?") that would otherwise fill the word-level vocabulary.
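To make the idea concrete, below is a minimal, self-contained sketch of byte-level BPE training on a toy corpus. It is not OpenAI's tokenizer; the merge loop, merge count, and corpus are illustrative only.

```python
# Toy byte-level BPE: start from raw bytes (256 base symbols) and repeatedly
# merge the most frequent adjacent pair. Illustrative only, not GPT-2's tokenizer.
from collections import Counter

def train_bpe(text: str, num_merges: int = 50):
    # Represent the corpus as a list of byte values (ints 0..255).
    tokens = list(text.encode("utf-8"))
    merges = {}                       # (left, right) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                     # nothing worth merging
        merges[(a, b)] = next_id
        # Replace every occurrence of the pair with the new token id.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return merges

if __name__ == "__main__":
    merges = train_bpe("low lower lowest low low lower", num_merges=10)
    print(len(merges), "merges learned")  # frequent byte pairs become single tokens
```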
Several modifications were made relative to GPT-1 (see the sketch after this list):
1. Layer normalization is moved to the input of each sub-block (pre-norm).
2. An additional layer normalization is added after the final self-attention block.
3. The initialization is modified: at initialization, the weights of the residual layers are scaled by a factor of 1/√N, where N is the number of residual layers.
4. The vocabulary is expanded to 50,257 tokens, the context window is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
5. GPT-1 has 117,000,000 (117M) parameters, while the largest GPT-2 model has 1,542,000,000 (about 1.5B) parameters.
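Items 1-3 are easiest to see in code. Below is a minimal PyTorch sketch of a pre-norm transformer block with a final layer norm and the 1/√N residual-weight scaling; the layer sizes and the use of torch.nn.MultiheadAttention are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the official GPT-2 code) of points 1-3 above:
# pre-norm sub-blocks, a final LayerNorm, and 1/sqrt(N) residual-weight scaling.
# Causal masking is omitted for brevity.
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # LayerNorm moved to the sub-block input
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                          # residual connection around attention
        x = x + self.mlp(self.ln2(x))      # residual connection around the MLP
        return x

class TinyPreNormStack(nn.Module):
    def __init__(self, d_model=64, n_head=4, n_layer=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            [PreNormBlock(d_model, n_head) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(d_model)  # extra LayerNorm after the final block
        # Point 3: scale residual-projection weights by 1/sqrt(N) at init.
        scale = 1.0 / math.sqrt(n_layer)
        with torch.no_grad():
            for block in self.blocks:
                block.attn.out_proj.weight.mul_(scale)
                block.mlp[2].weight.mul_(scale)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.ln_f(x)

if __name__ == "__main__":
    model = TinyPreNormStack()
    out = model(torch.randn(2, 16, 64))    # (batch, sequence, d_model)
    print(out.shape)                       # torch.Size([2, 16, 64])
```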
Because the model is trained only once but evaluated across many NLP sub-fields, all of the experiments can be classified as zero-shot learning: no task-specific fine-tuning is performed, and each task is specified purely through the input text (a prompting sketch follows the results table below).
Evaluation | Capability tested | Result |
---|---|---|
Children's Book Test | Predicting different categories of words (nouns, named entities, etc.) | Accuracy on common nouns improved from 85.7% (previous state of the art) to 93.3% |
LAMBADA | Modeling long-range dependencies in text | Perplexity drops from 99.8 (previous state of the art) to 8.63 |
Winograd Schema Challenge | Commonsense reasoning | Accuracy rises from 63.7% to 70.7% |
Reading comprehension (CoQA) | Answering questions about a passage, which requires some memory of the context | Matches or exceeds 3 of 4 baseline systems without using the training data |
Summarization | Summarizing news articles | Only begins to approach classic neural baselines |
Translation | Translation ability picked up automatically by the large model | English-to-French is poor; French-to-English reaches the level of other unsupervised baselines |
Question answering | Answering factual questions correctly | The largest model answers correctly about 5.3 times more often than the smallest one |
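The prompt formats that turn a plain language model into these zero-shot task solvers are simple text patterns; the paper, for example, induces summarization by appending "TL;DR:" after the article. Below is a minimal sketch using the Hugging Face transformers library (not part of the original paper) to reproduce that prompting style; the article text and generation length are illustrative.

```python
# Minimal zero-shot prompting sketch with the public GPT-2 checkpoint via
# Hugging Face transformers (not the exact setup used in the paper).
# Summarization is induced purely by appending "TL;DR:" to the article text.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

article = (
    "A new study shows that city parks reduce summer temperatures by several "
    "degrees, easing pressure on power grids during heat waves."
)
prompt = article + "\nTL;DR:"            # the task is specified only through the prompt

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,                   # illustrative length limit
    do_sample=True,
    top_k=2,                             # the paper samples summaries with top-k, k=2
    pad_token_id=tokenizer.eos_token_id, # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```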
The core content of the GPT-2 paper can be summarized in one sentence: starting from the GPT architecture, the authors increased the model size and the training-set size, and found that GPT-2 automatically adapts to and learns the task objectives of many different NLP sub-fields without task-specific training.
For example, suppose we feed a language model a dataset containing both everyday-conversation text and news-report text. If the dataset is large enough, the model is large enough, and training runs long enough, the resulting model will be able to distinguish between everyday conversation and news reporting, and it will also acquire new capabilities automatically, such as writing news summaries.
This shows that large language models have strong generalization capabilities, but it also suggests that large language models could develop a potential form of autonomous consciousness.
The rest of the paper presents the experimental results for the independent task areas listed above.
Whereas the GPT paper only spoke of a large dataset, the GPT-2 paper begins to speak of an LLM (Large Language Model).
Original paper address: https://cdn.openai.com/better-language-models/language_models_are_uns