2024-07-12
AGI, or Artificial General Intelligence, is an artificial intelligence system with human-level intelligence. It can not only perform specific tasks but also understand, learn, and apply knowledge to a wide range of problems, with a high degree of autonomy and adaptability. AGI's capabilities include, but are not limited to, self-learning, self-improvement, and self-adjustment, and it can solve a variety of complex problems without human intervention.
- AGI can do a wide range of things:
Cross-domain task execution: AGI can handle tasks in multiple fields and is not limited to specific application scenarios.
Autonomous learning and adaptation: AGI is able to learn from experience and adapt to new environments and situations.
Creative thinking: AGI is able to think innovatively and come up with new solutions.
Social Interaction: AGI is able to engage in complex social interactions with humans, understanding emotions and social signals.
- Regarding the future development prospects of AGI, it is considered one of the ultimate goals of artificial intelligence research and has great transformative potential:
Technological innovation: With the advancement of technologies such as machine learning and neural networks, the realization of AGI may be getting closer.
Interdisciplinary integration: Realizing AGI requires integrating knowledge from multiple disciplines such as computer science, neuroscience, and psychology.
Ethical and social considerations: The development of AGI needs to consider ethical and social issues such as privacy, security, and employment.
Enhanced learning and adaptive capabilities: Future AGI systems may use advanced algorithms to learn from their environment and optimize their behavior.
Multimodal interaction: AGI will have multiple perception and interaction modes to interact with humans and other systems.
As one of the most popular open-source machine learning communities and platforms in the world, Hugging Face plays an important role in the AGI era. It provides a wealth of pre-trained models and dataset resources, promoting progress in the field of machine learning. Hugging Face is characterized by ease of use and openness: through its Transformers library, it gives users a convenient way to process text with models. As AI technology develops, the Hugging Face community will continue to play an important role in advancing AI technology and its applications, especially multimodal AI. The community will keep expanding the diversity of its models and datasets to include multimodal data such as images, audio, and video.
- In the AGI era, Hugging Face might work in the following ways:
Model Sharing: As a model sharing platform, Hugging Face will continue to promote the sharing and collaboration of advanced AGI models.
Open source ecosystem: Hugging Face's open source ecosystem will help accelerate the development and innovation of AGI technology.
Tools and Services: Provide a variety of tools and services to support developers and researchers in the research and application of AGI.
Ethics and social responsibility: Hugging Face pays attention to AI ethics and will promote the development and application of responsible AGI models to ensure that technological progress complies with ethical standards.
As an advanced form of artificial intelligence in the future, AGI has broad application prospects. As an open source community, Hugging Face will play a key role in promoting the development and application of AGI.
(Note: to run the following code, you may need an Internet connection.)
You may have had to summarize a document before: a research article, a financial earnings report, or a thread of emails. If you think about it, this requires a range of capabilities, including understanding long passages, reasoning about their content, and then producing fluent text that captures the main topics of the original document. Moreover, accurately summarizing a news article is very different from summarizing a legal contract, so strong domain generalization is required. For these reasons, text summarization is a difficult task for neural language models, including Transformer models. Despite these challenges, text summarization can significantly accelerate the workflows of domain experts. Enterprises can use it to condense internal knowledge, summarize contracts, automatically generate social media posts, and more, which makes summarization a valuable NLP task.
To help you understand the challenges involved, this section explores how to use pre-trained Transformer models for text summarization. Summarization is a classic sequence-to-sequence (seq2seq) task with an input text and a target text.
Text summarization is a natural language processing task that aims to extract the most important information from a long text and produce a short version of it. Text summarization falls into two main types: extractive summarization and abstractive (generative) summarization.
- Extractive Summarization
Extractive summarization selects important sentences or paragraphs from the original text and directly extracts these contents as summaries. This method does not change the words and sentence structures in the original text.
Implementation principle:
- Feature extraction: First, it is necessary to extract various features of the text, such as word frequency, sentence position, keywords, named entities, etc.
- Importance Scoring: Based on the extracted features, a score is calculated for each sentence to determine its importance.
- Sentence selection: Based on the importance score, the most important sentences are selected to construct the summary.
Difficulties:
- Importance measurement: how to accurately measure the relative importance of sentences.
- Redundancy elimination: avoiding the selection of sentences with repetitive content.
Implementation methods:
- Rule-based methods: use predefined rules and statistical features to select sentences (a minimal sketch follows this list).
- Machine learning methods: use a supervised learning algorithm to learn how to select important sentences from training data.
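To make the rule-based extractive approach concrete, here is a minimal, purely illustrative sketch that scores sentences by the average frequency of their words and keeps the highest-scoring ones (the helper name extractive_summary and the scoring rule are my own choices for illustration, not a standard library API):

import re
from collections import Counter

def extractive_summary(text, num_sentences=3):
    # Naive sentence split on ., !, ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        # Score a sentence by the average frequency of its words
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentences by score, keep the top ones, and restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    selected = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in selected)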
- Generative Summarization
Generative summarization works by understanding the original text and generating new sentences to summarize its content. This approach can create more natural and coherent summaries, but it is also more complex.
Implementation principle:
- Encoder-Decoder Architecture: Uses a sequence-to-sequence (Seq2Seq) model where the encoder encodes the input text into a context vector and the decoder generates a summary based on the context vector.
- Attention Mechanism: During decoding, the model can focus on different parts of the input text to generate more relevant content.
- Pre-trained models: Use pre-trained language models (such as BERT, GPT, etc.) to improve the quality of generated summaries.
Difficulties:
- Content coherence: The generated summary needs to maintain logical coherence and avoid content discontinuity.
- Information completeness: Ensure that the generated summary contains the key information from the original text.
- Model complexity: Generative summarization models are usually more complex than extractive summarization models and require more computing resources and training data.
Implementation methods:
- Classic Seq2Seq model: such as the LSTM-based encoder-decoder model.
- Pre-trained Transformer models: such as BERTSUM, T5, BART, etc.
- Text Summarization in Hugging Face
Hugging Face provides a variety of pre-trained models and tools to easily implement text summarization tasks. The following are some commonly used text summarization models and how to use them:
- Summarization using pre-trained models
The following is the sample code for text summarization using the BART model provided by Hugging Face:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the pre-trained BART model and its tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Input text
input_text = """Your text to summarize goes here."""

# Tokenize the input text and build the model inputs
inputs = tokenizer([input_text], max_length=1024, return_tensors='pt')

# Generate the summary with the model
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)

# Convert the generated token sequence back into text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
- Supported summary models
Hugging Face provides a variety of pre-trained models for text summarization, including but not limited to:
- BART (facebook/bart-large-cnn)
- T5 (t5-small, t5-base, t5-large, t5-3b, t5-11b)
- PEGASUS (google/pegasus-xsum, google/pegasus-cnn_dailymail)
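Any of these checkpoints can also be used through the high-level summarization pipeline, which takes care of tokenization and generation. A minimal sketch (the checkpoint chosen here is just one example from the list above):

from transformers import pipeline

# Build a summarization pipeline; any of the checkpoints listed above can be substituted here
summarizer = pipeline("summarization", model="google/pegasus-xsum")

text = "Your text to summarize goes here."
print(summarizer(text)[0]["summary_text"])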
- Train your own summarization model
If you need to better adapt to the text summarization task in a specific field, you can fine-tune the pre-trained model using your own dataset. The following is a simple fine-tuning example:
from transformers import Trainer, TrainingArguments, BartForConditionalGeneration, BartTokenizer
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Load the pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Data preprocessing
def preprocess_function(examples):
    inputs = [doc for doc in examples['article']]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    # Use the highlights as the target summaries
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['highlights'], max_length=150, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Train with the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
trainer.train()

Text summarization is a complex and challenging natural language processing task. By using the pre-trained models and tools provided by Hugging Face, the implementation process can be greatly simplified. Users can choose an appropriate model for their specific needs and fine-tune it to obtain the best summarization results.
In this section, we will build our own encoder-decoder model to compress multi-person conversations into concise summaries. But before that, let's take a look at a classic dataset in the field of summarization: the CNN/DailyMail corpus.
Now we have everything we need to fully evaluate our model: we have the CNN/DailyMail test set, the ROUGE metric for evaluation, and a summarization model.
# Import the required libraries
import matplotlib.pyplot as plt  # matplotlib.pyplot, for plotting figures
import pandas as pd  # pandas, for data handling
from datasets import load_dataset, load_metric  # load_dataset and load_metric from the datasets library
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # AutoModelForSeq2SeqLM and AutoTokenizer from transformers

# Load the CNN/DailyMail dataset, version 3.0.0
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Load the ROUGE metric, used to score the quality of generated summaries
rouge_metric = load_metric("rouge", cache_dir=None)

# Names of the ROUGE scores we want to compute
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
We just need to put these pieces together. First, we evaluate the performance of our three-sentence baseline model:
# Define a function to evaluate the summaries produced by the baseline
def evaluate_summaries_baseline(dataset, metric, column_text="article", column_summary="highlights"):
    # Generate a summary for every article with the three_sentence_summary function
    summaries = [three_sentence_summary(text) for text in dataset[column_text]]

    # Add the generated summaries and the reference summaries to the metric
    metric.add_batch(predictions=summaries, references=dataset[column_summary])

    # Compute the metric scores
    score = metric.compute()

    # Return the scores
    return score
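The three_sentence_summary helper used above is the baseline itself: it simply returns the first three sentences of each article. If it is not already defined in your session, a minimal sketch (using NLTK's sentence tokenizer as one possible splitter) is:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer data

def three_sentence_summary(text):
    # Baseline summary: the first three sentences of the article
    return "\n".join(sent_tokenize(text)[:3])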
We then apply the function to a subset of the data. Since the test portion of the CNN/DailyMail dataset contains about 10,000 examples, generating summaries for all of these articles will take a lot of time. Recall from Chapter 5 that each generated token requires a forward pass through the model. Generating 100 tokens for each example will require 1 million forward passes, and if we use beam search, this number needs to be multiplied by the number of beams. To make the calculations faster, we will subsample the test set and use 1,000 examples for evaluation. This allows us to evaluate the PEGASUS model in less than an hour on a single GPU and get a stable score estimate:
# Sample 1,000 examples at random from the test set for evaluation
test_sampled = dataset["test"].shuffle(seed=42).select(range(1000))

# Generate summaries with the baseline and evaluate their quality
score = evaluate_summaries_baseline(test_sampled, rouge_metric)

# Store the metric scores in a dictionary
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

# Convert the scores into a DataFrame and transpose it for display
pd.DataFrame.from_dict(rouge_dict, orient="index", columns=["baseline"]).T
Output:

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.38928 | 0.171296 | 0.245061 | 0.354239 |
The scores are mostly worse than the previous example, but still better than what GPT-2 achieves! Now let’s do the same thing to evaluate the PEGASUS model:
# Import tqdm to display a progress bar
from tqdm import tqdm
# Import torch to run the computation on GPU or CPU
import torch

# Use the GPU if available, otherwise the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(list_of_elements, batch_size):
    """Split list_of_elements into chunks of size batch_size."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries_pegasus(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    """Evaluate summaries generated with the PEGASUS model."""

    # Split the articles and the reference summaries into batches
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))

    # Iterate over the article batches and the corresponding reference batches, showing a progress bar
    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        # Tokenize the article batch and convert it into model input tensors
        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                           padding="max_length", return_tensors="pt")

        # Generate the summaries with the PEGASUS model
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=8, max_length=128)

        # Decode the generated summaries from tensors back into strings
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]

        # Add the generated summaries and the reference summaries to the metric
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    # Compute the metric scores
    score = metric.compute()
    return score
Let's walk through this evaluation code. First, we split the dataset into smaller batches that we can process at the same time. Then, for each batch, we tokenize the input articles and feed them to the generate() function to produce summaries with beam search, using the same generation parameters as in the paper. The length_penalty parameter ensures that the model does not generate sequences that are too long. Finally, we decode the generated texts, replace the <n> token (which PEGASUS uses to mark newlines) with a space, and add the decoded texts together with the reference texts to the metric. At the end, we compute and return the ROUGE scores. Now we load the model again, this time with the AutoModelForSeq2SeqLM class used for seq2seq generation tasks, and evaluate it:
# Import the model and tokenizer classes for sequence-to-sequence tasks
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Model checkpoint: Google's PEGASUS model fine-tuned on the CNN/DailyMail dataset
model_ckpt = "google/pegasus-cnn_dailymail"

# Load the tokenizer and the model from the checkpoint and move the model to the selected device (CPU or GPU)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

# Evaluate the summaries generated by PEGASUS with evaluate_summaries_pegasus;
# the arguments are the test sample, the ROUGE metric, the model, the tokenizer, and the batch size
score = evaluate_summaries_pegasus(test_sampled, rouge_metric,
                                   model, tokenizer, batch_size=8)

# Extract the ROUGE scores as a dictionary mapping each ROUGE metric name to its F-measure
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

# Convert the ROUGE score dictionary into a DataFrame indexed by "pegasus"
pd.DataFrame(rouge_dict, index=["pegasus"])
Output:
(A temporary environment error occurred here:
TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto
The reference results are as follows.)

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| pegasus | 0.43438 | 0.210883 | 0.307195 | 0.373231 |
These numbers are very close to the results in the paper. One thing to note here is that the loss and per-token accuracy are somewhat decoupled from the ROUGE score: the loss is independent of the decoding strategy, whereas the ROUGE score is strongly coupled to it.
Since ROUGE and BLEU correlate better with human judgment than loss or accuracy do, you should focus on them when building text generation models, and carefully explore and choose a decoding strategy. These metrics are far from perfect, however, so human evaluation should always be considered as well.
Now that we have our evaluation function, we can train our own summarization model.
Now that we have gone through many of the details of text summarization and evaluation, let’s use this knowledge to train a custom text summarization model! For our custom application, we will use the SAMSum dataset developed by Samsung (https://oreil.ly/n1ggq), which contains a series of conversations and short summaries. These conversations can represent the interaction between customers and the customer service center, and can be used to generate accurate summaries to help improve customer service and detect common patterns in customer requests. Let's load the dataset and look at a sample:
# Import the function for loading datasets from the datasets library
from datasets import load_dataset

# Load the SAMSum dataset, which contains dialogues and their summaries
dataset_samsum = load_dataset("samsum", trust_remote_code=True)

# Get the length of each split (train, validation, test) and store them in split_lengths
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

# Print the length of each split
print(f"Split lengths: {split_lengths}")

# Print the column names (features) of the training split
print(f"Features: {dataset_samsum['train'].column_names}")

# Print the first dialogue sample from the test split
print("\nDialogue:")
print(dataset_samsum["test"][0]["dialogue"])

# Print the summary of the first dialogue sample from the test split
print("\nSummary:")
print(dataset_samsum["test"][0]["summary"])
(Note: you may need to install py7zr first: pip install py7zr.)
Output:

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.
The conversation looks just like you would chat via SMS or WhatsApp, complete with emojis and placeholders for GIFs. The dialogue field contains the full text, while the summary field is a summary of the conversation. Can a model fine-tuned on the CNN/DailyMail dataset handle this dataset? Let’s find out!
First, we will run the same summary generation pipeline using PEGASUS to see the output. We can reuse the code for CNN/DailyMail summary generation:
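Note that pipe below refers to the summarization pipeline built from the CNN/DailyMail PEGASUS checkpoint. If it is not already defined in your session, a minimal sketch to create it (reusing the model_ckpt defined above) is:

from transformers import pipeline

# Summarization pipeline using the PEGASUS checkpoint fine-tuned on CNN/DailyMail
pipe = pipeline("summarization", model=model_ckpt)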
# Summarize the first dialogue sample from the test set with the loaded summarization pipeline
pipe_out = pipe(dataset_samsum["test"][0]["dialogue"])

# Print a heading for the generated summary
print("Summary:")

# Print the generated summary, inserting a line break after each sentence
# (replacing " ." with ".\n" to make the output easier to read)
print(pipe_out[0]["summary_text"].replace(" .", ".\n"))
Output:
Summary: Hannah asks Amanda for Betty's number. Amanda can't find it. Hannah asks Larry. Amanda asks Larry to text him. Hannah says she'll text him back. Hannah calls it a day and says she's going to go home. Hannah: "Bye bye"
We can see that the model mostly tries to summarize by extracting key sentences from the conversation. This may work relatively well on the CNN/DailyMail dataset, but the summaries in SAMSum are more abstractive, so it is unlikely to work as well here. We can confirm this by running a full ROUGE evaluation on the test set:
# Evaluate summary generation on the SAMSum test set with evaluate_summaries_pegasus;
# the arguments are the dataset, the metric, the model, the tokenizer, the text and summary column names, and the batch size
score = evaluate_summaries_pegasus(dataset_samsum["test"], rouge_metric, model,
                                   tokenizer, column_text="dialogue",
                                   column_summary="summary", batch_size=8)

# Build a dictionary with the mid F-measure value of each ROUGE score
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

# Convert the ROUGE score dictionary into a DataFrame indexed by "pegasus"
pd.DataFrame(rouge_dict, index=["pegasus"])
Output:
(A temporary environment error occurred here:
TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto
The reference results are as follows.)

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| pegasus | 0.29617 | 0.087803 | 0.229604 | 0.229514 |
While the results aren't great, this isn't unexpected given how far this data is from the CNN/DailyMail distribution. Nonetheless, setting up the evaluation pipeline before training has two advantages: we can measure the success of training directly with the metric, and we have a good baseline. Fine-tuning the model on our dataset should immediately improve the ROUGE scores; if it doesn't, we know something is wrong with our training loop.
Before we train on our data, let's take a quick look at the distribution of input and output lengths:
# Tokenize the dialogues and summaries in the training set and record their lengths
d_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["dialogue"]]
s_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["summary"]]

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)

# Plot the histogram of dialogue token lengths
axes[0].hist(d_len, bins=20, color="C0", edgecolor="C0")
axes[0].set_title("Dialogue Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")

# Plot the histogram of summary token lengths
axes[1].hist(s_len, bins=20, color="C0", edgecolor="C0")
axes[1].set_title("Summary Token Length")
axes[1].set_xlabel("Length")

# Tighten the subplot layout
plt.tight_layout()

# Show the figure
plt.show()
Output:
(A temporary environment error occurred here:
TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto
The reference run produces two histograms, of dialogue and summary token lengths, described below.)
We can see that most conversations are much shorter than CNN/DailyMail articles, with about 100-200 tokens per conversation. Similarly, the summaries are much shorter, with about 20-40 tokens (the same length as the average tweet).
Let's remember these results for later use. First, we need to tokenize the dataset. We set the maximum length of conversations and summaries to 1024 and 128 respectively:
def convert_examples_to_features(example_batch):
    """
    Convert a batch of examples into model input features.

    Args:
        example_batch (dict): a batch of examples containing dialogues and summaries.

    Returns:
        dict: the converted features, including the input encodings and the target encodings.
    """
    # Tokenize the dialogue texts to produce the input encodings
    input_encodings = tokenizer(example_batch["dialogue"], max_length=1024,
                                truncation=True)

    # Tokenize the summary texts as targets to produce the target encodings
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch["summary"], max_length=128,
                                     truncation=True)

    # Return a dictionary with the input ids, attention mask, and target labels
    return {
        "input_ids": input_encodings["input_ids"],
        "attention_mask": input_encodings["attention_mask"],
        "labels": target_encodings["input_ids"]
    }

# Apply the conversion to the whole SAMSum dataset with map
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features,
                                       batched=True)

# Set the dataset format to PyTorch tensors for the specified columns
columns = ["input_ids", "labels", "attention_mask"]
dataset_samsum_pt.set_format(type="torch", columns=columns)
Output:
(A temporary environment error occurred here:
TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto)
There is something new in the tokenization step: the tokenizer.as_target_tokenizer() context. Some models require special tokens in the decoder inputs, so it is important to separate the tokenization of the encoder inputs from that of the decoder inputs. Inside the with statement (called a context manager), the tokenizer knows that it is tokenizing text for the decoder.
Now we need to create the data collator. In most cases we can use the default collator, which collects all the tensors in a batch and simply stacks them. For the summarization task we need not only to stack the inputs but also to prepare the targets on the decoder side. PEGASUS is an encoder-decoder Transformer and therefore has a classic seq2seq architecture. In a seq2seq setup, a common approach is to apply teacher forcing in the decoder. With this strategy, the decoder receives input tokens (as in a decoder-only model such as GPT-2) that consist of the labels shifted one position to the right, in addition to the encoder outputs. So when predicting the next token, the decoder gets the ground-truth labels shifted one position to the right as its input, as the following table illustrates:
# Example text sequence and the label produced at each decoding step
text = ['PAD', 'Transformers', 'are', 'awesome', 'for', 'text', 'summarization']

# List to store the row for each step
rows = []

# Build a row for each decoding step
for i in range(len(text)-1):
    rows.append({
        'step': i+1,                  # step number, starting at 1
        'decoder_input': text[:i+1],  # decoder input sequence, from the start of the text up to the current position
        'label': text[i+1]            # label: the next word after the current position
    })

# Build a DataFrame indexed by the step number
pd.DataFrame(rows).set_index('step')
Output:

| step | decoder_input | label |
|---|---|---|
| 1 | [PAD] | Transformers |
| 2 | [PAD, Transformers] | are |
| 3 | [PAD, Transformers, are] | awesome |
| 4 | [PAD, Transformers, are, awesome] | for |
| 5 | [PAD, Transformers, are, awesome, for] | text |
| 6 | [PAD, Transformers, are, awesome, for, text] | summarization |
We shift the labels one position to the right so that the decoder only sees the previous ground-truth tokens and never the current or future ones. Shifting alone is enough because the decoder's masked self-attention mechanism masks all current and future positions.
So when we prepare a batch, we set up the decoder inputs by shifting the labels one position to the right. After that, we make sure the padding tokens in the labels are ignored by the loss function by setting them to -100. We don't actually have to do any of this manually, because DataCollatorForSeq2Seq does it all for us:
# Import the data collator for seq2seq tasks
from transformers import DataCollatorForSeq2Seq

# Create the seq2seq data collator
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
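To see what the collator does, we can call it on a couple of hand-made feature dictionaries (the token IDs below are arbitrary and purely illustrative) and inspect how the shorter label sequence is padded with -100:

# Illustration only: two examples with input and label sequences of different lengths
features = [
    {"input_ids": [42, 43, 44], "attention_mask": [1, 1, 1], "labels": [7, 8]},
    {"input_ids": [42, 43], "attention_mask": [1, 1], "labels": [7, 8, 9, 10]},
]
batch = seq2seq_data_collator(features)
print(batch["labels"])  # the shorter label sequence is padded with -100, which the loss ignores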
Then, as usual, we set up a TrainingArguments for training:
# Import the training arguments and trainer classes
from transformers import TrainingArguments, Trainer

# Define the training arguments
training_args = TrainingArguments(
    output_dir='pegasus-samsum',     # output directory for the model
    num_train_epochs=1,              # number of training epochs
    warmup_steps=500,                # number of learning-rate warmup steps
    per_device_train_batch_size=1,   # training batch size per device
    per_device_eval_batch_size=1,    # evaluation batch size per device
    weight_decay=0.01,               # weight decay rate
    logging_steps=10,                # log every 10 training steps
    push_to_hub=True,                # whether to push the model to the Hub
    evaluation_strategy='steps',     # evaluation strategy
    eval_steps=500,                  # evaluate every 500 steps
    save_steps=1e6,                  # checkpoint saving interval in steps
    gradient_accumulation_steps=16   # number of gradient accumulation steps
)
What's different from our previous settings is the new gradient_accumulation_steps parameter. Because the model is very large, we had to set the batch size to 1. However, a batch size that is too small can hurt convergence. To solve this problem we can use a neat trick called gradient accumulation: instead of computing the gradients of a full batch at once, we compute the gradients of smaller batches and accumulate them, and once enough gradients have been accumulated, we run the optimization step. This is naturally slower than doing it in one pass, but it saves us a lot of GPU memory.
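Conceptually, gradient accumulation amounts to the following training-loop pattern (a sketch for illustration only; the Trainer does this for us, and optimizer, train_dataloader, and model here are assumed to exist in your own loop):

accumulation_steps = 16
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss / accumulation_steps   # scale the loss so the accumulated gradients average correctly
    loss.backward()                            # gradients accumulate across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # run the optimizer only after enough gradients have accumulated
        optimizer.zero_grad()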
Now, we log into Hugging Face so that we can push the model to the Hub after training:
from huggingface_hub import notebook_login

notebook_login()
Now we have everything we need to initialize the trainer: the model, the tokenizer, the training arguments, the data collator, and the training and evaluation datasets:
from transformers import TrainingArguments, Trainer

# Create a Trainer instance for training the sequence-to-sequence model
trainer = Trainer(
    model=model,                                  # the seq2seq model to train
    args=training_args,                           # the training arguments defined above
    tokenizer=tokenizer,                          # the tokenizer used to preprocess the inputs
    data_collator=seq2seq_data_collator,          # the data collator used to batch the data
    train_dataset=dataset_samsum_pt["train"],     # the training dataset
    eval_dataset=dataset_samsum_pt["validation"]  # the evaluation dataset
)
Output:
(A temporary environment error occurred here:
TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto)
We are ready to train. Once training is complete, we can run the evaluation function directly on the test set to see how the model performs:
from transformers import TrainingArguments, Trainer

# Start training the model
trainer.train()

# Evaluate the quality of the summaries generated by the fine-tuned PEGASUS model
score = evaluate_summaries_pegasus(
    dataset_samsum["test"], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text="dialogue", column_summary="summary")

# Extract the ROUGE results
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

# Display the ROUGE scores as a DataFrame
pd.DataFrame(rouge_dict, index=["pegasus"])
Output:
(A temporary environment error occurred here:
TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto
The reference results are as follows.)

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| pegasus | 0.42761 | 0.200571 | 0.340648 | 0.340738 |
We can see that the ROUGE score is significantly improved over the model without fine-tuning, so even though the previous model was also trained for summary generation, it did not adapt well to the new domain. Let's push our model to the Hub:
# Push the trained model to the Hub
trainer.push_to_hub("Training complete!")
Next we will use this model to generate some summaries for us.
You can also evaluate generated summaries as part of the training loop: use the extension of TrainingArguments called Seq2SeqTrainingArguments and set predict_with_generate=True. Pass it to the dedicated Seq2SeqTrainer, which uses the generate() function instead of a plain forward pass of the model to create predictions for evaluation. Try it yourself!
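A sketch of what that setup might look like (the hyperparameter values simply mirror the TrainingArguments used above and are illustrative, not prescriptive):

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

seq2seq_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-samsum-seq2seq",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    evaluation_strategy="steps",
    eval_steps=500,
    predict_with_generate=True,  # use generate() instead of a plain forward pass for evaluation predictions
)

seq2seq_trainer = Seq2SeqTrainer(
    model=model,
    args=seq2seq_args,
    tokenizer=tokenizer,
    data_collator=seq2seq_data_collator,
    train_dataset=dataset_samsum_pt["train"],
    eval_dataset=dataset_samsum_pt["validation"],
)
# seq2seq_trainer.train()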
Judging from the loss and ROUGE scores, the model appears to show significant improvement over the original model trained only on CNN/DailyMail. The generated summary from a sample in the test set is shown below:
import transformers
from transformers import pipeline

# Set the transformers logging level to error to reduce log output
transformers.logging.set_verbosity_error()

# Generation parameters for the summaries
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}

# Pick one example from the test set
sample_text = dataset_samsum["test"][0]["dialogue"]
reference = dataset_samsum["test"][0]["summary"]

# Create a summarization pipeline with the fine-tuned pegasus-samsum model
pipe = pipeline("summarization", model="transformersbook/pegasus-samsum")

# Print the dialogue and the reference summary
print("Dialogue:")
print(sample_text)
print("\nReference Summary:")
print(reference)

# Generate a summary with the model and print it
print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])
Output:

Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda can't find Betty's number. Larry called Betty last time they were at the park together. Hannah wants Amanda to text Larry instead of calling Betty.
This is very similar to the reference summary. It looks like the model has learned to synthesize the conversation into summaries instead of just extracting paragraphs. Now for the final test: how does the model perform on custom input?
# A custom dialogue example
custom_dialogue = """
Thom: Hi guys, have you heard of transformers?
Lewis: Yes, I used them recently!
Leandro: Indeed, there is a great library by Hugging Face.
Thom: I know, I helped build it ;)
Lewis: Cool, maybe we should write a book about it. What do you think?
Leandro: Great idea, how hard can it be?!
Thom: I am in!
Lewis: Awesome, let's do it together!
"""

# Generate a summary with the fine-tuned pegasus-samsum model and print it
print(pipe(custom_dialogue, **gen_kwargs)[0]["summary_text"])
Output:
Thom and Lewis wanted to write a book about transformers. They came up with the idea with the help of Hugging Face's Leandro. The book will be called "Transformers: The Power of Transformers" and will be published in 2015. The project is currently in the planning stages.
The generated summary of the custom conversation makes sense. It captures well that all the participants in the discussion want to write the book together, rather than just extracting individual sentences; for example, it synthesizes the third and fourth utterances into one logical statement.