2024-07-12
// I wrote this note in Obsidian and copied it here. The odd formatting in this note is due to the lack of Obsidian plugins.
tags:
Target: run the baseline, experience the process of solving an NLP problem, get a basic grasp of the competition's requirements, and understand the competition scenario
Difficulty: very low
Recommended steps:
Machine Translation (MT) is an important branch of natural language processing: automatically converting text from one language to another.
Machine translation methods: rule-based -> statistics-based -> deep learning
Rule-driven -> Data-driven -> Intelligent-driven
Rule-based machine translation (1950s–1980s): early machine translation systems mainly adopted rule-based methods, i.e., translating with grammar rules and dictionaries written by linguists. This approach requires a deep understanding of the grammar and vocabulary of both the source and target languages, but it is inflexible, adapts poorly, and has difficulty handling complex language structures and polysemous words.
Statistics-based machine translation (1990s–2000s): with improving computer performance and the emergence of large-scale parallel corpora, statistical machine translation began to rise. This method automatically learns correspondences between the source and target languages by analyzing large amounts of bilingual text, thereby achieving translation. Statistical machine translation handles polysemous words and language variation better, but because it relies on large amounts of training data, it offers poor support for resource-poor languages.
Neural network-based machine translation (2010s–present): the application of neural networks to machine translation can be traced back to the 1980s and 1990s. However, limited by the computing resources and data sizes of the time, neural methods performed unsatisfactorily and their development stagnated for many years. In recent years, the rapid development of deep learning has driven the rise of neural machine translation (NMT). NMT uses deep neural network models, such as Long Short-Term Memory (LSTM) networks and the Transformer, which automatically learn the complex mapping between source and target languages without manually designed features or rules. NMT has made significant progress in translation quality, speed, and adaptability, and has become the mainstream approach in machine translation today.
In machine learning and deep learning projects, datasets are usually split into three parts: a training set, a development set (Development Set, also often called the validation set), and a test set (Test Set). A minimal split sketch follows the list below.
- Training set: trains the model
- Development set: prevents the model from overfitting to the training set
- Test set: simulates real data to test the final effect
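Here is a minimal sketch of such a split, assuming the parallel corpus is stored as two line-aligned plain-text files (the file names and split sizes are made up for illustration):

```python
import random

# Hypothetical file names; one sentence per line, line-aligned across files.
with open("train.en", encoding="utf-8") as f_en, \
     open("train.zh", encoding="utf-8") as f_zh:
    pairs = list(zip(f_en.read().splitlines(), f_zh.read().splitlines()))

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(pairs)

n_dev, n_test = 1000, 1000           # arbitrary sizes for illustration
dev   = pairs[:n_dev]                # tune hyperparameters here
test  = pairs[n_dev:n_dev + n_test]  # evaluate once, never tune on this
train = pairs[n_dev + n_test:]       # everything else trains the model

print(f"train={len(train)}, dev={len(dev)}, test={len(test)}")
```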
At present, neural machine translation technology has made great breakthroughs, but in certain fields or industries it cannot guarantee terminology consistency, so translation results fall short of ideal. Inaccurate machine translations of terminology, personal names, and place names can be corrected via a terminology dictionary, avoiding confusion or ambiguity and maximizing translation quality.
The Machine Translation Challenge Based on Terminology Dictionary Intervention uses English as the source language and Chinese as the target language. Besides English-to-Chinese bilingual data, the competition also provides an English-Chinese terminology dictionary. Participating teams need to build and train machine translation models from the provided training data samples, and produce final translation results based on the test set and the terminology dictionary.
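One lightweight form of dictionary intervention (my assumption, not necessarily what the baseline does) is to inject the fixed Chinese term directly into the English source before translation, so the model just copies it through. A sketch, assuming the dictionary file is a TSV of `english_term<TAB>chinese_term` (file name hypothetical):

```python
import re

def load_term_dict(path: str) -> dict[str, str]:
    """Load an English->Chinese terminology dictionary from a TSV file
    (assumed format: one 'english_term<TAB>chinese_term' pair per line)."""
    terms = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            en, zh = line.rstrip("\n").split("\t")
            terms[en.lower()] = zh
    return terms

def inject_terms(src: str, terms: dict[str, str]) -> str:
    """Replace known English terms in the source with their fixed Chinese
    translations, so the MT model can copy them through unchanged."""
    for en, zh in terms.items():
        # Naive case-insensitive, word-boundary replacement; real systems
        # would also handle overlapping and multi-word terms carefully.
        src = re.sub(rf"\b{re.escape(en)}\b", zh, src, flags=re.IGNORECASE)
    return src

# Usage (file name hypothetical):
# terms = load_term_dict("en-zh.dict.tsv")
# print(inject_terms("The transformer architecture dominates NMT.", terms))
```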
//RAG🤗
[!info] 🐵
- The training set is used to run your learning algorithm.
- The development set is used to tune parameters, select features, and make other decisions about the learning algorithm. It is sometimes also called the hold-out cross-validation set.
- The test set is used to evaluate the performance of the algorithm, but the learning algorithm and its parameters are not changed based on it.
For the test-set translation result files submitted by participating teams, the automatic evaluation metric BLEU-4 is used, computed with the open-source version of sacrebleu.
[!info] 📘
What is BLEU-4?
BLEU, full name Bilingual Evaluation Understudy, is a metric for evaluating generated sentences.
The BLEU score was proposed in the 2002 paper "BLEU: a Method for Automatic Evaluation of Machine Translation" by Kishore Papineni et al.
In the field of machine translation, BLEU (Bilingual Evaluation Understudy) is a commonly used automatic evaluation metric that measures the similarity between a machine-generated translation and a set of reference translations. The metric pays special attention to exact n-gram matches (sequences of n consecutive words), which can be thought of as a statistical estimate of the accuracy and fluency of the translation. When computing the BLEU score, the frequencies of n-grams in the generated text are counted first, and then compared against the n-grams in the reference text. If the generated translation contains an n-gram that also appears in a reference translation, it counts as a match. The final BLEU score is a value between 0 and 1, where 1 indicates a perfect match with the reference translations and 0 indicates no match at all.
BLEU-4 specifically means that n-grams up to length 4 (i.e., up to four consecutive words) are considered in the calculation.
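A minimal sketch of computing the score with sacrebleu's Python API (the sentences below are made up; note that sacrebleu reports BLEU on a 0–100 scale rather than 0–1):

```python
import sacrebleu  # pip install sacrebleu

# System outputs, one string per test sentence.
hyps = ["今天天气很好", "我喜欢机器翻译"]
# References: a list of reference streams, each parallel to hyps.
refs = [["今天天气真好", "我喜欢机器翻译"]]

# tokenize="zh" uses sacrebleu's built-in Chinese tokenizer; BLEU's
# n-grams are meaningless for unsegmented Chinese text without it.
bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="zh")
print(bleu.score)  # corpus-level BLEU-4, on a 0-100 scale
```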
Characteristics of the BLEU evaluation metric:
In addition to translation, BLEU scores combined with deep learning methods can be applied to other language generation problems, such as text generation, image caption generation, text summarization, and speech recognition.
I'll use ModelScope (魔搭) from now on; my 8GB laptop can't handle it.
I just looked at the code and data, but I don't understand it very well.
My guess: during translation, for each word several candidate options are retrieved from the dictionary, and the combination with the highest probability becomes the translation result?
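A toy illustration of this guess (purely hypothetical: the lexicon and probabilities are made up, and this is certainly not the actual baseline):

```python
from itertools import product

# Toy lexicon: each source word maps to candidate translations with
# made-up probabilities, only to illustrate the guess above.
lexicon = {
    "machine":     {"机器": 0.8, "机械": 0.2},
    "translation": {"翻译": 0.9, "转换": 0.1},
}

def best_combination(words):
    """Score every combination of per-word candidates by the product of
    their probabilities and return the highest-scoring one."""
    choices = [lexicon[w].items() for w in words]
    scored = []
    for combo in product(*choices):
        prob = 1.0
        for _, p in combo:
            prob *= p
        scored.append((prob, "".join(tok for tok, _ in combo)))
    return max(scored)

print(best_combination(["machine", "translation"]))  # (0.72, '机器翻译')
```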