Technology Sharing

[MindSpore Learning Check-in] Application Practice-LLM Principles and Practice-Realizing BERT Dialogue Emotion Recognition Based on MindSpore

2024-07-12


In today's natural language processing (NLP) field, emotion recognition is a very important application scenario. Whether in intelligent customer service, social media analysis, or in the field of emotional computing, accurately identifying users' emotions can greatly improve user experience and the intelligence level of the system. As a powerful pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers) has demonstrated its excellent performance in multiple NLP tasks. In this blog, we will introduce in detail how to use the BERT model to implement conversation emotion recognition based on the MindSpore framework. Through step-by-step code examples and detailed explanations, we will help you master this technology.

Model Introduction

BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional encoder representation model based on Transformer. It mainly captures word and sentence level representations through two pre-training tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

  • Masked Language Model: Randomly mask 15% of the words in the corpus, and the model needs to predict these masked words.
  • Next Sentence Prediction: The model needs to predict whether there is a sequential relationship between two sentences.

After BERT pre-training, it can be used for a variety of downstream tasks, such as text classification, similarity judgment, reading comprehension, etc.
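
To make the Masked Language Model objective concrete, here is a toy sketch (illustrative only, not part of the tutorial code) that masks roughly 15% of the tokens in a sentence; during pre-training, BERT is asked to predict the original tokens at the masked positions:

import numpy as np

# Toy illustration of the Masked Language Model objective (not part of the
# tutorial code): pick roughly 15% of the token positions, replace them with
# [MASK], and train the model to recover the original tokens there.
tokens = ["我", "见", "到", "你", "很", "高", "兴"]
rng = np.random.default_rng(42)
num_masked = max(1, int(0.15 * len(tokens)))
mask_positions = set(rng.choice(len(tokens), size=num_masked, replace=False).tolist())
masked = ["[MASK]" if i in mask_positions else tok for i, tok in enumerate(tokens)]
print(masked)  # one position is replaced by [MASK]; the model must predict it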

Dataset preparation

In the dataset preparation step, we download and unzip the robot chat dataset provided by the Baidu PaddlePaddle team. This dataset has been preprocessed and contains sentiment labels: each row consists of a label and a segmented text, where the label represents the sentiment category (0 for negative, 1 for neutral, 2 for positive) and the text is the user's conversation content. This structured format makes the sentiment classification task straightforward to set up.

# Download and extract the dataset
!wget https://baidu-nlp.bj.bcebos.com/emotion_detection-dataset-1.0.0.tar.gz -O emotion_detection.tar.gz
!tar xvf emotion_detection.tar.gz

The dataset format is as follows (the label and text columns are tab-separated in the .tsv files):

label	text_a
0	谁骂人了?我从来不骂人,我骂的都不是人,你是人吗 ？
1	我有事等会儿就回来和你聊
2	我见到你很高兴谢谢你帮我

Data loading and preprocessing

Data loading and preprocessing is a crucial step in the machine learning workflow. We use GeneratorDataset to load the data (the SentimentDataset wrapper that feeds it is sketched below) and apply map operations to convert the text into a format the model accepts. Specifically, BertTokenizer converts the text into vocabulary IDs, and padding ensures that all input sequences have the same length, which improves training efficiency and model performance.
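
The SentimentDataset class used in the code below is not shown in this excerpt. Here is a minimal sketch of what it could look like, assuming the tab-separated label/text format described above (the names and details are illustrative, not necessarily the tutorial's exact implementation):

class SentimentDataset:
    """Minimal dataset wrapper (sketch): reads a .tsv file with a header row
    and tab-separated label/text columns, and yields (label, text) pairs."""
    def __init__(self, path):
        self.path = path
        self._labels, self._text_a = [], []
        self._load()

    def _load(self):
        # Assumes the first line is the header "label<TAB>text_a"
        with open(self.path, "r", encoding="utf-8") as f:
            lines = f.read().strip().split("\n")
        for line in lines[1:]:
            label, text_a = line.split("\t")
            self._labels.append(int(label))
            self._text_a.append(text_a)

    def __getitem__(self, index):
        return self._labels[index], self._text_a[index]

    def __len__(self):
        return len(self._labels)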

import mindspore
import numpy as np
from mindspore.dataset import text, GeneratorDataset, transforms
from mindnlp.transformers import BertTokenizer

def process_dataset(source, tokenizer, max_seq_len=64, batch_size=32, shuffle=True):
    is_ascend = mindspore.get_context('device_target') == 'Ascend'
    column_names = ["label", "text_a"]
    
    dataset = GeneratorDataset(source, column_names=column_names, shuffle=shuffle)
    type_cast_op = transforms.TypeCast(mindspore.int32)
    
    def tokenize_and_pad(text):
        if is_ascend:
            tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)
        else:
            tokenized = tokenizer(text)
        return tokenized['input_ids'], tokenized['attention_mask']
    
    dataset = dataset.map(operations=tokenize_and_pad, input_columns="text_a", output_columns=['input_ids', 'attention_mask'])
    dataset = dataset.map(operations=[type_cast_op], input_columns="label", output_columns='labels')
    
    if is_ascend:
        dataset = dataset.batch(batch_size)
    else:
        dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id), 'attention_mask': (None, 0)})

    return dataset

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
dataset_train = process_dataset(SentimentDataset("data/train.tsv"), tokenizer)
dataset_val = process_dataset(SentimentDataset("data/dev.tsv"), tokenizer)
dataset_test = process_dataset(SentimentDataset("data/test.tsv"), tokenizer, shuffle=False)

Model building

In the model building part, we use BertForSequenceClassification to perform the sentiment classification task. This pre-trained model has been trained on a large-scale corpus and has strong language understanding capabilities; by loading the pre-trained weights, we can significantly improve performance on sentiment classification. We also enable automatic mixed precision, which speeds up training and reduces device memory usage, allowing more efficient training under limited hardware resources.

Optimizers and evaluation metrics are important components in model training. We chose the Adam optimizer because it performs well when processing large-scale data and complex models. In terms of evaluation metrics, we used accuracy to measure the performance of the model. With these settings, we can ensure that the model is continuously optimized during training and achieves good performance on the validation set.

Callbacks play an important role during training. We set up two of them: CheckpointCallback, which periodically saves the model weights, and BestModelCallback, which automatically loads the best-performing weights. With these callbacks, we ensure that important model parameters are not lost during training and that the best-performing model is always used for inference and evaluation.

from mindnlp.transformers import BertForSequenceClassification
from mindspore import nn
from mindnlp._legacy.amp import auto_mixed_precision
from mindnlp._legacy.engine import Trainer, Evaluator
from mindnlp._legacy.engine.callbacks import CheckpointCallback, BestModelCallback
from mindnlp._legacy.metrics import Accuracy

model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)
model = auto_mixed_precision(model, 'O1')

optimizer = nn.Adam(model.trainable_params(), learning_rate=2e-5)
metric = Accuracy()
ckpoint_cb = CheckpointCallback(save_path='checkpoint', ckpt_name='bert_emotect', epochs=1, keep_checkpoint_max=2)
best_model_cb = BestModelCallback(save_path='checkpoint', ckpt_name='bert_emotect_best', auto_load=True)

trainer = Trainer(network=model, train_dataset=dataset_train,
                  eval_dataset=dataset_val, metrics=metric,
                  epochs=5, optimizer=optimizer, callbacks=[ckpoint_cb, best_model_cb])
trainer.run(tgt_columns="labels")

Model Validation

In the model validation part, we evaluate the trained model on the held-out test set. Computing the accuracy on data the model has not seen during training tells us about its generalization ability and practical effectiveness, and helps us spot problems that may have arisen during training so we can adjust and optimize accordingly.

evaluator = Evaluator(network=model, eval_dataset=dataset_test, metrics=metric)
evaluator.run(tgt_columns="labels")

Model Inference

The model inference section shows how to use the trained model to classify the sentiment of new data. We define a predict function that tokenizes an input text, runs it through the model, and prints the predicted sentiment. This step demonstrates the practical applicability of the model and gives another view of its generalization performance.

from mindspore import Tensor

dataset_infer = SentimentDataset("data/infer.tsv")

def predict(text, label=None):
    # label names are printed in Chinese: 0 = negative, 1 = neutral, 2 = positive
    label_map = {0: "消极", 1: "中性", 2: "积极"}
    text_tokenized = Tensor([tokenizer(text).input_ids])
    logits = model(text_tokenized)
    predict_label = logits[0].asnumpy().argmax()
    info = f"inputs: '{text}', predict: '{label_map[predict_label]}'"
    if label is not None:
        info += f" , label: '{label_map[label]}'"
    print(info)

for label, text in dataset_infer:
    predict(text, label)

Customizing inference data

Finally, we show how to use the model to perform emotion recognition on custom input. This demonstrates how the model behaves on arbitrary, user-provided text and gives a further sense of its generalization ability in practice.

predict("家人们咱就是说一整个无语住了 绝绝子叠buff")