2024-07-12
The following is my personal summary after reading the entire GPT-2 paper; it covers the paper's main content. If you are short on time, reading only the [Paper Summary] section is enough.
The authors built their own web crawler to construct the WebText dataset. Part of the crawled pages are the outbound links shared on the social platform Reddit, kept only if they received at least 3 karma, so the pages were effectively filtered by human readers; WebText contains about 45,000,000 such links. Another part comes from news websites. As of December 2017 the dataset had reached about 8,000,000 documents, roughly 40 GB of text in total. The paper also notes that Wikipedia documents were removed from the training set to avoid overlap with common evaluation data. In this sense, millions of people around the world took part in creating and cleaning the dataset used to train GPT-2.
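As a rough illustration of this crowd-filtered crawl, here is a minimal sketch that assumes we already have a list of (url, karma) records extracted from Reddit posts. The karma threshold of 3 and the removal of Wikipedia pages match the paper; the data format and helper names are hypothetical.

```python
# Minimal sketch of karma-based link filtering and de-duplication for a
# WebText-style crawl. The input format and helper names are hypothetical;
# only the ">= 3 karma" rule and the Wikipedia removal come from the paper.
from urllib.parse import urlparse

def filter_links(records, min_karma=3):
    """records: iterable of (url, karma) pairs scraped from Reddit posts."""
    seen = set()
    kept = []
    for url, karma in records:
        if karma < min_karma:
            continue                      # drop links few readers found useful
        host = urlparse(url).netloc.lower()
        if host.endswith("wikipedia.org"):
            continue                      # the paper removes Wikipedia documents
        if url in seen:
            continue                      # simple exact-URL de-duplication
        seen.add(url)
        kept.append(url)
    return kept

if __name__ == "__main__":
    demo = [
        ("https://example.com/article", 17),
        ("https://en.wikipedia.org/wiki/Language_model", 120),
        ("https://example.com/article", 17),     # duplicate
        ("https://example.org/low-quality", 1),  # below karma threshold
    ]
    print(filter_links(demo))  # ['https://example.com/article']
```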
For the input representation, the authors designed a hybrid of word-level and byte-level representations: a byte-level Byte Pair Encoding (BPE). Word-level representations benefit from strong priors about frequent tokens, while byte-level representations can encode any string and therefore generalize better. To keep the vocabulary efficient, BPE is prevented from merging across character categories, which removes the many near-duplicate variants of common words (e.g. "dog", "dog!", "dog?") that would otherwise fill the word-level vocabulary.
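To make the idea concrete, below is a minimal, self-contained sketch of byte-level BPE training on a toy corpus. It is not OpenAI's tokenizer; the merge loop, merge count, and corpus are illustrative only.

```python
# Toy byte-level BPE: start from raw bytes (256 base symbols) and repeatedly
# merge the most frequent adjacent pair. Illustrative only, not GPT-2's tokenizer.
from collections import Counter

def train_bpe(text: str, num_merges: int = 50):
    # Represent the corpus as a list of byte values (ints 0..255).
    tokens = list(text.encode("utf-8"))
    merges = {}                       # (left, right) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                     # nothing worth merging
        merges[(a, b)] = next_id
        # Replace every occurrence of the pair with the new token id.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return merges

if __name__ == "__main__":
    merges = train_bpe("low lower lowest low low lower", num_merges=10)
    print(len(merges), "merges learned")  # frequent byte pairs become single tokens
```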
Several modifications were made relative to GPT-1 (see the sketch after this list):
1. Layer normalization is moved to the input of each sub-block (pre-norm).
2. An additional layer normalization is added after the final self-attention block.
3. The initialization is modified: at initialization, the weights of the residual layers are scaled by a factor of 1/√N, where N is the number of residual layers.
4. The vocabulary is expanded to 50,257 tokens, the context window is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
5. GPT-1 has 117,000,000 (117M) parameters, while the largest GPT-2 model has 1,542,000,000 (about 1.5B) parameters.
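Items 1-3 are easiest to see in code. Below is a minimal PyTorch sketch of a pre-norm transformer block with a final layer norm and the 1/√N residual-weight scaling; the layer sizes and the use of torch.nn.MultiheadAttention are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the official GPT-2 code) of points 1-3 above:
# pre-norm sub-blocks, a final LayerNorm, and 1/sqrt(N) residual-weight scaling.
# Causal masking is omitted for brevity.
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # LayerNorm moved to the sub-block input
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                          # residual connection around attention
        x = x + self.mlp(self.ln2(x))      # residual connection around the MLP
        return x

class TinyPreNormStack(nn.Module):
    def __init__(self, d_model=64, n_head=4, n_layer=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            [PreNormBlock(d_model, n_head) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(d_model)  # extra LayerNorm after the final block
        # Point 3: scale residual-projection weights by 1/sqrt(N) at init.
        scale = 1.0 / math.sqrt(n_layer)
        with torch.no_grad():
            for block in self.blocks:
                block.attn.out_proj.weight.mul_(scale)
                block.mlp[2].weight.mul_(scale)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.ln_f(x)

if __name__ == "__main__":
    model = TinyPreNormStack()
    out = model(torch.randn(2, 16, 64))    # (batch, sequence, d_model)
    print(out.shape)                       # torch.Size([2, 16, 64])
```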
Because the model is trained only once but evaluated across many NLP sub-fields, all of the experiments can be classified as zero-shot learning: no task-specific fine-tuning is performed, and each task is specified purely through the input text (a prompting sketch follows the results table below).
Evaluation | Capability tested | Result |
---|---|---|
Children's Book Test | Predicting different categories of words (nouns, named entities, etc.) | Accuracy on common nouns improved from 85.7% (previous state of the art) to 93.3% |
LAMBADA | Modeling long-range dependencies in text | Perplexity drops from 99.8 (previous state of the art) to 8.63 |
Winograd Schema Challenge | Commonsense reasoning | Accuracy rises from 63.7% to 70.7% |
Reading comprehension (CoQA) | Answering questions about a passage, which requires some memory of the context | Matches or exceeds 3 of 4 baseline systems without using the training data |
Summarization | Summarizing news articles | Only begins to approach classic neural baselines |
Translation | Translation ability picked up automatically by the large model | English-to-French is poor; French-to-English reaches the level of other unsupervised baselines |
Question answering | Answering factual questions correctly | The largest model answers correctly about 5.3 times more often than the smallest one |
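The prompt formats that turn a plain language model into these zero-shot task solvers are simple text patterns; the paper, for example, induces summarization by appending "TL;DR:" after the article. Below is a minimal sketch using the Hugging Face transformers library (not part of the original paper) to reproduce that prompting style; the article text and generation length are illustrative.

```python
# Minimal zero-shot prompting sketch with the public GPT-2 checkpoint via
# Hugging Face transformers (not the exact setup used in the paper).
# Summarization is induced purely by appending "TL;DR:" to the article text.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

article = (
    "A new study shows that city parks reduce summer temperatures by several "
    "degrees, easing pressure on power grids during heat waves."
)
prompt = article + "\nTL;DR:"            # the task is specified only through the prompt

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,                   # illustrative length limit
    do_sample=True,
    top_k=2,                             # the paper samples summaries with top-k, k=2
    pad_token_id=tokenizer.eos_token_id, # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```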
The core content of the GPT-2 paper can be summarized in one sentence: starting from the GPT architecture, the authors increased the model size and the training-set size, and found that GPT-2 automatically adapts to and learns the task objectives of many different NLP sub-fields without task-specific training.
For example, suppose we feed a language model a dataset containing both everyday-conversation text and news-report text. If the dataset is large enough, the model is large enough, and training runs long enough, the resulting model will be able to distinguish between everyday conversation and news reporting, and it will also acquire new capabilities automatically, such as writing news summaries.
This shows that large language models have strong generalization capabilities, but it also suggests that large language models could develop a potential form of autonomous consciousness.
The rest of the paper presents the experimental results for the independent task areas listed above.
Whereas the GPT paper only spoke of a large dataset, the GPT-2 paper begins to speak of an LLM (Large Language Model).
Original paper address: https://cdn.openai.com/better-language-models/language_models_are_uns