Technology Sharing

Unstructured domain text knowledge extraction based on BERT

2024-07-12



Paper address: https://arxiv.org/abs/2103.00728

Summary

With the development of knowledge graph technology and the growth of commercial applications, the demand for extracting knowledge graph entities and relationship data from unstructured domain texts of all kinds is increasing, which makes automated knowledge extraction for domain texts quite meaningful. This paper proposes a BERT-based knowledge extraction method for automatically extracting knowledge points from unstructured texts in a specific domain (such as insurance clauses in the insurance industry), in order to save manpower in building knowledge graphs. Unlike the commonly used knowledge point extraction methods based on rules, templates, or entity extraction models, this paper converts the knowledge points of domain texts into question-answer pairs, takes the text surrounding the answer as the context, and fine-tunes BERT for the reading comprehension task in the manner of the SQuAD data. The fine-tuned model was then used to automatically extract knowledge points from more insurance clauses, achieving good results.

Method

In recent years, with the deepening of digital transformation across industries, the number of related electronic texts has increased dramatically. At the same time, more and more companies have begun to pay attention to data analysis and mining and to the development and utilization of data resources. Computer application systems such as knowledge graphs and intelligent dialogue have become the basis on which enterprises and institutions provide services internally and externally. Such applications often need the structured information contained in various unstructured domain texts in order to build digital knowledge bases. Data is the foundation of computer products and services, and supplying data to computers has become a new task for enterprises and institutions in the new era. The existing business documents of enterprises and institutions contain rich knowledge and information, but they were written for human reading, and compared with the needs of computer programs they contain a lot of redundant information. At present, applying such data generally requires a lot of manpower: people read the documents, manually extract the required information, and express it in a form that a computer can read ("understand"). This incurs considerable extra learning cost and human resource consumption. How to use automated means to discover knowledge from unstructured text data, as a data resource for the intelligent applications that rely on it, is a research hotspot in the field of knowledge extraction. This paper takes unstructured text in a specific domain as its research object and proposes a method for extracting knowledge from it with a deep-learning-based language understanding model.
This method presents the knowledge points to be extracted as question-answer pairs, uses manually annotated data as training data, performs transfer learning based on a pre-trained model, and obtains, through fine-tuning, a model that automatically extracts knowledge points from texts in the same domain.

For documents with uniform structural specifications, knowledge extraction can be performed by constructing rules. Rule construction is usually completed through manual induction and summarization, that is, reading a large number of texts in the same field, selecting from them, and summarizing the final extraction rules. Jyothi et al. used a rule-based approach to extract useful information from a large number of personal resumes and build a database. JunJun et al. used a similar method to extract academic concept knowledge from academic literature. The advantage of this approach is that it requires no model training and is simple and efficient; the disadvantage is equally obvious: the constructed rules apply only to texts with the same structure, which must follow strict format specifications. Once the text structure changes even slightly, new extraction rules must be constructed by hand, so the method is not portable.

One task of knowledge extraction is entity extraction, which extracts content with predefined labels, such as time and place, from text; the specific labels depend on the application. Its most common form is named entity recognition (NER). Entity extraction can be solved directly as a sequence labeling task and processed with traditional statistical learning methods such as hidden Markov models (HMM) or conditional random fields (CRF). In recent years, deep learning models have also been applied to such problems; for example, sequence labeling combining BiLSTM and CRF has achieved good results. Lample et al. proposed a new network structure that uses stacked LSTMs to represent a stack structure, directly constructs representations of multi-word chunks, and compared it with the LSTM-CRF model. Ma et al. proposed an end-to-end sequence labeling model based on BiLSTM-CNN-CRF. In addition, a fine-tuned BERT model can also achieve good results on sequence labeling tasks.

In addition to extracting entities from text, the relationships between entities are also a focus of knowledge extraction. Entities and their relationships are usually grouped into triplets ⟨E1, R, E2⟩; the task goal is then to extract from the text all possible entity-relationship triplets whose relationships fall within a pre-set schema. Zeng et al. designed a CNN to classify relationships, though not whole triplets. Makoto et al. built a stacked network based on BiLSTM and Bi-TreeLSTM that extracts entities and detects relationships simultaneously, achieving end-to-end prediction of entity relationships. Li et al. used a two-layer LSTM with an encoder-decoder architecture to build a knowledge extraction model that is not limited to triplets and can predict structured knowledge in a fixed format. Zheng et al. converted the entity and relationship extraction task into a sequence labeling task through a labeling strategy, then built a Bi-LSTM model similar to the one above to handle it. Luan et al. designed a multi-task learning framework for identifying entities and relationships in scientific literature to build a scientific knowledge graph; the model outperforms existing models without any prior domain knowledge.

Beyond the extraction paradigms above, a different angle is to treat the knowledge point itself as a question, the content of the knowledge point as the answer to that question, and the text segment containing the knowledge point as the context of this question-answer pair; the knowledge extraction model can then be built as a question-answering model. In recent years, the emergence of pre-trained models such as GPT and BERT has made such question-answering reading comprehension tasks well suited as their downstream tasks: with only simple modifications to the original network structure and some fine-tuning, very good results can be obtained. Wang et al. improved the original BERT's performance on the SQuAD dataset with a multi-paragraph prediction approach. Alberti et al. built on BERT and SQuAD and applied their improved model to NQ, a more difficult question-answering dataset, raising the F1 score by 30% over the previous baseline. This question-answering form of knowledge extraction handles knowledge of different structures more flexibly: one only needs to define them as different questions, without designing a new network structure for each form of knowledge.
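To make the question-answer formulation concrete, here is a minimal sketch of casting one knowledge point as a SQuAD-style record. The field names follow the SQuAD v2.0 JSON layout; the helper function, question text, and context below are illustrative, not taken from the paper's dataset.

```python
# A minimal sketch: one knowledge point becomes a SQuAD-style QA record.
# An empty answer marks "knowledge point not present" (SQuAD v2.0 style).

def make_qa_record(question, context, answer):
    """Build one SQuAD-style example; empty answer means 'not present'."""
    start = context.find(answer) if answer else -1
    if answer and start == -1:
        raise ValueError("answer must appear verbatim in the context")
    return {
        "question": question,
        "context": context,
        "answers": [{"text": answer, "answer_start": start}] if answer else [],
        "is_impossible": not answer,
    }

record = make_qa_record(
    question="What is the statute of limitations of this clause?",
    context=("The statute of limitations for the beneficiary to request us to "
             "pay the insurance money is 2 years, calculated from the date on "
             "which he knows or should know that the insurance accident has "
             "occurred."),
    answer="2 years",
)
```

Defining each knowledge point as a question in this way is what lets the same network handle differently structured knowledge without per-schema redesign.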

Texts in different industries have different characteristics. Some industry-specific documents (such as medical instructions) not only have a strict structure but also very strict requirements on terminology and wording, and are well suited to rule-based knowledge extraction. In other industries, the texts (such as news reports and interviews) differ little from general texts, and general extraction techniques can be applied directly. Texts in some other fields lie between the two: they have a certain degree of professionalism but are not extremely strict. Similar texts from different companies are alike in structure and wording but show some differences, while the use and presentation of terms within the same company is relatively uniform. Insurance clause documents in the insurance industry belong to this third category. Insurance clauses are provisions, jointly agreed upon, on the rights and obligations of the two parties to an insurance contract: the insurer (insurance company) and the policyholder. An insurance clause generally consists of three parts:

  1. Basic information, i.e. the information of the clause itself, including: insurer, clause name, clause abbreviation, clause type, term type, cooling-off period, statute of limitations, filing number and filing time, whether it can be sold as the main insurance, etc.;
  2. Purchase conditions, i.e. the objective conditions that the insured must meet to be covered by this clause, including: the insured's annual salary, gender, occupation/job requirements, physical examination requirements, social security requirements, personal circumstances that must be disclosed truthfully, etc.;
  3. Insurance liability, i.e. the scope of liability and the compensation content of this clause.

Although insurance clauses involve a certain amount of professional vocabulary, its use is mostly not standardized across the industry (for example, the "hesitation period" may also be called the "cooling-off period"), and a clause document is delivered to the insured for reading, so most of the knowledge points to be extracted are embedded in natural language expressions. This makes extraction based on static rules unsuitable. Although the knowledge points could be located by entity extraction, the values corresponding to them are often embedded in a natural language expression and cannot be extracted together with the knowledge point description. For example, the statute of limitations of a certain clause is 2 years, and this "2 years" may appear in the following description: "The statute of limitations for the beneficiary to request us to pay the insurance money or waive the insurance premium is 2 years, calculated from the date on which he knows or should know that the insurance accident has occurred." Therefore, when basic information, purchase conditions, insurance liability, and other knowledge points need to be extracted from insurance clauses, we directly rule out rule-based and entity-based extraction methods. If schema-based extraction were used to convert the knowledge points into triplets, the required training data and annotation volume would be relatively large, and for our purpose the cost would outweigh the benefit. We therefore chose the question-answering-based knowledge extraction method.

In recent years, learning by fine-tuning pre-trained models has achieved great success in Natural Language Processing (NLP), and the BERT model is an important representative. BERT is a bidirectional encoder representation model based on Transformers; its topology is a multi-layer bidirectional Transformer network. BERT is a typical fine-tuning-based application, that is, its construction involves two steps: pre-training and fine-tuning. First, in the pre-training stage, the model is trained on a large amount of unlabeled corpus data with several training tasks, transferring the knowledge in the corpus into the pre-trained model's text embeddings. In the fine-tuning stage, only an additional output layer then needs to be added to the network to adapt the pre-trained model: fine-tuning initializes the model with the pre-trained parameters and then tunes it with labeled data from the downstream task. Given our need to extract knowledge points from insurance documents, we only need to fine-tune BERT on the question-answering task with insurance clause data to meet the needs of insurance clause knowledge extraction.
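The output layer added during fine-tuning can be sketched in a few lines. The example below is a minimal NumPy stand-in, not the paper's implementation: random matrices replace the real BERT token encodings, and a single fully connected layer maps each token to start and end logits, which is how span-based QA heads on top of BERT typically work.

```python
import numpy as np

# Minimal sketch of the span-prediction head: a single fully connected
# layer maps each token's BERT encoding to two logits (answer start,
# answer end). The "encodings" here are random stand-ins for real BERT
# output; hidden size 768 matches BERT-base.

rng = np.random.default_rng(0)
seq_len, hidden = 32, 768

H = rng.standard_normal((seq_len, hidden))   # token encodings from BERT
W = rng.standard_normal((hidden, 2)) * 0.02  # the extra output layer
b = np.zeros(2)

logits = H @ W + b                           # shape (seq_len, 2)
start_idx = int(np.argmax(logits[:, 0]))     # predicted answer start
end_idx = int(np.argmax(logits[:, 1]))       # predicted answer end

# In practice the span (start_idx, end_idx) with the highest combined
# score subject to start_idx <= end_idx is returned; an empty answer is
# predicted when the [CLS] position (index 0) wins.
```

Only `W` and `b` are new parameters; the rest of the network is initialized from the pre-trained checkpoint and tuned jointly.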

[Figure: fine-tuning BERT for question answering, with an additional fully connected layer predicting the answer's position in the context]

The insurance clause knowledge extraction process first turns the manually annotated insurance clause knowledge points into ⟨question, answer⟩ pairs, and then uses a text parser to parse an insurance clause document into a document tree, where the main title is the root node, each subsequent level of heading is a child node of the previous level, and each paragraph of text is read in as a leaf node. The answer in each question-answer pair is matched to the leaf node that contains it, and the full text of that leaf node is used as the context of the question-answer pair, yielding a dataset of ⟨question, answer, context⟩ triples. Finally, the BERT pre-trained model was fine-tuned on this dataset in the same way as on the SQuAD data for reading comprehension tasks to obtain the final knowledge extraction model. As shown in the figure above, for the question-answering task we only need to add an extra fully connected layer after the encoding vectors output by BERT to predict the position of the answer in the context. At test time, for new insurance clauses, we parse out the contexts of the different knowledge points in the same way and feed the ⟨question, context⟩ pairs to the model. This method handles the same type of insurance clause from the same company well, because such clauses share a consistent structure and the same program can parse out the contexts. However, for insurance clauses from different companies and of different types, the original parsing program fails because the wording and structure differ, and rewriting a text parsing program for every clause is not feasible, so the model needs to be improved.
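The document-tree and context-matching step can be sketched as follows. This is a simplified illustration: the heading-detection rule (lines beginning with a chapter-style number) and the sample document are assumptions standing in for the paper's actual parser.

```python
import re

# Sketch of the document-tree step: headings open nested nodes and each
# body paragraph becomes a leaf; the answer's context is the full text
# of the leaf that contains it. The heading test below is a stand-in
# for the paper's real text parser.

def parse_tree(lines):
    """Return leaf paragraphs, each with the heading path leading to it."""
    path, leaves = [], []
    for line in lines:
        m = re.match(r"^(\d+(?:\.\d+)*)\s", line)  # e.g. "2" or "2.1"
        if m:
            depth = m.group(1).count(".") + 1
            path = path[: depth - 1] + [line]       # open/replace node
        elif line.strip():
            leaves.append({"path": list(path), "text": line.strip()})
    return leaves

def context_for(leaves, answer):
    """The context of an answer is the text of the leaf containing it."""
    for leaf in leaves:
        if answer in leaf["text"]:
            return leaf["text"]
    return None

doc = [
    "1 Basic Information",
    "The cooling-off period of this clause is 15 days.",
    "2 Insurance Liability",
    "The statute of limitations is 2 years.",
]
ctx = context_for(parse_tree(doc), "2 years")
```

The heading path kept with each leaf is what would let a richer parser scope the search for a knowledge point to the relevant section of the clause.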

To make the knowledge extraction process more versatile, we first modified the prediction process: the original text of a new clause is segmented by word count, with each segment about 300 words (avoiding breaking up sentences where possible), and each text segment is then used as a possible context for every knowledge point and fed to the model. If the output answer is empty, the segment contains no corresponding knowledge point; otherwise, the outputs for each knowledge point over all text segments are considered together, and the one with the highest probability is selected as that knowledge point's answer. This new prediction method is versatile for any clause and requires no additional text parsing. We tested the clauses of several different companies with this method, and the results showed that the old model did not work well with it: accuracy dropped significantly. The reason is that, before the improvement, the context of each knowledge point was located precisely from the document structure during training and there were few negative samples, so the model could only make predictions from a precisely located context. Once the text organization or title format changes, the original parsing program can no longer locate the question context accurately, generating a lot of interference data and degrading the model's performance. The training process therefore also needs modification. We add segmented text data, that is, we segment each clause in the training set in the same way; if a segment contains the answer annotated for a knowledge point, it becomes a new positive sample, otherwise a negative sample (with an empty answer). In practice, adding all of these new samples to the training set produces too much training data, with negative samples far outnumbering positive ones.
To balance this, we make the following further improvements. For each knowledge point question, if the clause itself does not contain the knowledge point (knowledge points are defined uniformly for all insurance clauses, so a specific clause may not contain all of them), each segment is used as a negative sample for the question with probability 10%. If the clause does contain the knowledge point, there are two cases: if the current text segment contains the target knowledge point, it is used as a positive sample; otherwise it is selected as a negative sample with probability 50%. A new training set is constructed in this way to obtain a new model. The idea is: if a knowledge point is contained in the clause, increasing the number of negative samples related to it helps the model handle the interference of similar segments and improves answer accuracy; if the clause does not contain the knowledge point, the fit between its text segments and the knowledge point is poor anyway, and a small number of negative samples suffices. In testing, the new model improves greatly over the old one and fits the new prediction method better; it can serve as a more general insurance clause knowledge extraction model.
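The segmentation and negative-sampling scheme described above can be sketched as follows. This is an illustrative reading of the text: the segment length is measured in characters rather than words and the sentence-delimiter handling is an assumption, while the 10% and 50% sampling rates come from the description above.

```python
import random

# Sketch of the two training-set tweaks: (1) cut a clause into roughly
# 300-character segments without splitting sentences, and (2) keep a
# segment as a negative sample with probability 10% when the clause
# lacks the knowledge point, or 50% when it contains it elsewhere.

def segment(text, target=300):
    """Greedily pack whole sentences into segments of about `target` chars."""
    raw = text.replace("!", ".").replace("?", ".").split(".")
    sentences = [s.strip() + "." for s in raw if s.strip()]
    segments, buf = [], ""
    for s in sentences:
        if buf and len(buf) + len(s) > target:
            segments.append(buf)
            buf = s
        else:
            buf += s
    if buf:
        segments.append(buf)
    return segments

def sample_segments(segments, answer, clause_has_answer, rng):
    """Turn segments into (context, answer) samples per the scheme above."""
    samples = []
    for seg in segments:
        if answer and answer in seg:
            samples.append((seg, answer))          # positive sample
        else:
            p = 0.5 if clause_has_answer else 0.1  # negative-sampling rate
            if rng.random() < p:
                samples.append((seg, ""))          # negative sample
    return samples
```

Keeping more negatives for knowledge points the clause does contain is what teaches the model to reject similar but wrong segments.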

Experiment

Our dataset consists of the insurance clauses of an insurance company, each with manually annotated knowledge points such as the cooling-off period, the statute of limitations, and the insured amount. In the experiments, the training set and test set consist of 251 and 98 clauses respectively. Across these clauses there are 309 possible knowledge-point questions in total, and on average each clause has 45 knowledge points to extract. During testing, we segment the clause text, attempt to extract each knowledge point k_i from every segment, and select the text with the highest model-output probability as the answer for that knowledge point. If the final output is an empty string, the clause does not contain that knowledge point. Since the knowledge points extracted from any one clause are only a small fraction of the 309, most knowledge-point outputs should be empty, so we ignore these empty knowledge points in evaluation and focus on two metrics: the proportion of the model's output knowledge points that are correct, i.e. the precision P, and the proportion of the knowledge points that should be extracted that are indeed correctly extracted, i.e. the recall R. Let knowledge point k_i be annotated as y_i and the model's output be ỹ_i; then P and R can be defined as:

P = |{i : ỹ_i = y_i, ỹ_i ≠ ∅}| / |{i : ỹ_i ≠ ∅}|,  R = |{i : ỹ_i = y_i, y_i ≠ ∅}| / |{i : y_i ≠ ∅}|
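A direct reading of the precision and recall described above, with the empty string standing for an absent knowledge point. The sample gold and predicted answers are illustrative only.

```python
# P: fraction of the model's non-empty outputs that are correct.
# R: fraction of the knowledge points that should be extracted that
#    were indeed correctly extracted. "" means "not present".

def precision_recall(gold, pred):
    output = [q for q in pred if pred[q]]                  # non-empty outputs
    correct = [q for q in output if pred[q] == gold.get(q, "")]
    should = [q for q in gold if gold[q]]                  # points to extract
    found = [q for q in should if pred.get(q, "") == gold[q]]
    P = len(correct) / len(output) if output else 0.0
    R = len(found) / len(should) if should else 0.0
    return P, R

gold = {"cooling_off": "15 days", "limitation": "2 years", "premium": ""}
pred = {"cooling_off": "15 days", "limitation": "3 years", "premium": ""}
P, R = precision_recall(gold, pred)  # one correct of two output; one of two found
```

Empty knowledge points count toward neither metric, matching the evaluation choice above.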

We use Google's open-source Chinese BERT pre-trained model, BERT_chinese_L-12_H-768_A-12, and conduct the subsequent tests on this basis. For parameter settings, the initial learning rate is 3e-5, the batch size is 4, the number of training epochs is 4, and the remaining parameters use the model's default configuration. Our experiments include two parts. The first is a test of the benchmark model, trained as follows: first use the text parsing program to parse the structure of the insurance clauses and extract the contexts of the corresponding knowledge points, then combine them into a training set to fine-tune the BERT model. The second is a test of the new model, trained as follows: add new samples on top of the benchmark model's training set by segmenting the corresponding insurance clauses by word count, with each text segment about 300 words, and construct a training set for each knowledge point question to train the new model. The test results are averages over the 98 insurance clauses in the test set, as shown in the following table:

[Table: average precision P, recall R, and F1 over the 98 test clauses for the benchmark model and the new model]

It can be seen that the model trained after adding a limited number of negative samples in the way described above clearly outperforms the benchmark model: P improved by about 40% and R by about 20%. The improvement in P is quite significant. In the benchmark model's training set, the context of each knowledge point was located precisely by the text parsing program, so the model only learned to extract the corresponding knowledge point from a correctly located context and could not recognize invalid contexts; as a result, the benchmark model produced a large proportion of invalid outputs. After adding negative samples proportionally, the new model's invalid outputs dropped sharply, and more than 60% of the knowledge points it output were valid and correct. Meanwhile, because positive samples made of coarser-grained contexts (text segments) were added relative to the benchmark model, the model can better extract target knowledge points from arbitrarily cut text segments, so the recall R also improved substantially. The final F1 score improved by about 30%.

The experimental results show that, under the text segmentation prediction method, the new model trained on our optimized training set performs better than the original benchmark model and can be used for more general insurance clause knowledge extraction tasks. At the same time, the current model still has considerable room for improvement:

  • Due to practical constraints (the amount of data annotation), our training covers only 251 clauses, all from the same insurer. After expanding the dataset to include clauses formulated by more insurers, the model's performance should improve further.
  • Currently, our data annotation contains only the content of the clauses' knowledge points, while the corresponding contexts in the training data are obtained through a self-written text parsing program, so a small portion of these contexts contain errors. We can optimize the manual annotation strategy to annotate the knowledge points and their contexts at the same time, making the data more accurate.