2024-07-12
Research Background: Deep learning vulnerability detection tools have achieved promising results in recent years. State-of-the-art models report F1 scores of up to 0.9 and outperform static analyzers. These results are exciting because deep learning could revolutionize software assurance. As a result, companies such as IBM, Google, and Amazon have taken a strong interest and invested heavily in developing such tools and datasets.
Existing Problems: Although deep learning vulnerability detection is promising, it has not yet reached the maturity of computer vision and natural language processing. Most current research focuses on applying emerging deep learning models to datasets such as Devign and MSR. However, we know very little about the models themselves: what types of programs they can handle effectively, whether we should build a model per vulnerability type or a unified model for all types, what a good training dataset looks like, and what information a model uses when making decisions. Answers to these questions would help us better develop, debug, and apply models, but given the black-box nature of deep learning, they are difficult to obtain. The purpose of this paper is not to solve these problems completely, but to make progress toward answering them.
Scientific question: In this paper, we survey and reproduce a range of state-of-the-art deep learning vulnerability detection models and formulate research questions to understand them, aiming to derive lessons and guidance for designing and debugging future models. We organize the research questions into three areas: model capabilities, training data, and model interpretation. Specifically, the primary goal of the paper is to understand the capabilities of deep learning for vulnerability detection, with a particular focus on what types of programs the models handle effectively and whether vulnerability types should be modeled separately or jointly.
The second research focus is training data. The goal is to understand whether and how the size and composition of the training data affect model performance; the paper formulates its research questions around these two factors.
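To make this concrete, the kind of experiment these questions suggest is a training-set-size ablation: train the same model on nested subsets of the data and compare test F1 at each size. Below is a minimal sketch of such a harness; `eval_fn` is a hypothetical placeholder for training and scoring a model, not code from the paper.

```python
import random

def size_ablation(train_set, eval_fn, fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Train on nested subsets of train_set and report test F1 at each size.

    eval_fn is a hypothetical callable that trains a fresh model on the
    given subset and returns its F1 on a fixed held-out test set.
    """
    rng = random.Random(seed)
    shuffled = list(train_set)
    rng.shuffle(shuffled)
    # Nested subsets: the 10% slice is contained in the 25% slice, and so on,
    # so differences reflect data volume rather than a different sample.
    return {frac: eval_fn(shuffled[: int(len(shuffled) * frac)])
            for frac in fractions}
```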
Finally, the third research area is model interpretation. The paper applies state-of-the-art model interpretation tools to investigate what information the models use when making their decisions.
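As a simple illustration of what such tools estimate, a leave-one-out (occlusion) attribution asks how much the model's vulnerability score drops when each input token is masked. The sketch below assumes a hypothetical `model_score` callable; it illustrates feature attribution in general, not the specific SOTA explainers the paper uses.

```python
def token_importance(model_score, tokens, mask_token="<mask>"):
    """Leave-one-out attribution: score each token by how much masking it
    lowers the model's predicted probability that the code is vulnerable.

    model_score is a hypothetical callable mapping a token list to a
    probability in [0, 1].
    """
    base = model_score(tokens)
    drops = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        drops.append((tok, base - model_score(masked)))
    # Tokens whose removal hurts the score most are the most "important".
    return sorted(drops, key=lambda pair: pair[1], reverse=True)
```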
Research content: To answer these questions, the paper surveyed state-of-the-art deep learning models and successfully reproduced 11 of them on their original datasets. These models adopt a range of deep learning architectures, including GNNs, RNNs, LSTMs, CNNs, and Transformers. To compare the models, the paper ran 9 of them on two popular datasets, Devign and MSR. These two datasets were chosen because: (1) both contain real-world projects and vulnerabilities; (2) most of the models were evaluated and tuned on the Devign dataset in their original papers; and (3) the MSR dataset covers 310 projects and annotates each example with its vulnerability type, which is crucial for the research questions. Through carefully designed experiments that account for threats to validity, the paper obtained answers to six research questions. Overall, the paper contributes a systematic survey and reproduction of these models, a comparison across the two datasets, and empirical answers to the six research questions.
To collect state-of-the-art deep learning models, the paper surveyed publications from 2018 to 2022 and consulted Microsoft's CodeXGLUE leaderboard and IBM's D2A defect detection leaderboard. The paper used every model with available open-source code and successfully reproduced 11 of them. The paper's replication package contains the complete list of models and the reasons some models could not be reproduced.
As shown in the table above, the reproduced models cover a variety of deep learning architectures. Devign and ReVeal apply GNNs to property graphs that integrate control flow, data dependences, and ASTs. ReGVD applies a GNN to tokens. Code2Vec uses a multi-layer perceptron (MLP) over ASTs. VulDeeLocator and SySeVR are sequence models based on RNNs and bidirectional LSTMs. The most recent detectors use pre-trained Transformers, including CodeBERT, VulBERTa-CNN, VulBERTa-MLP, PLBART, and LineVul.
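For the Transformer family, the typical recipe is to fine-tune a pre-trained code model with a binary classification head on function-level examples. The sketch below, using the Hugging Face `transformers` API with the public `microsoft/codebert-base` checkpoint, shows the inference shape of such a detector; it is a minimal illustration, not the authors' code, and the classification head is untrained until fine-tuned on a dataset such as Devign or MSR.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder plus a freshly initialized 2-class head
# (0 = non-vulnerable, 1 = vulnerable); meaningful only after fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

func = 'void copy(char *dst, const char *src) { strcpy(dst, src); }'
inputs = tokenizer(func, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prob_vulnerable = torch.softmax(logits, dim=-1)[0, 1].item()
```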
The paper selected the Devign and MSR datasets for its research questions. It examined the datasets used by the 11 models in their original papers, shown in the table above, and found that the Devign dataset was used to evaluate and tune 8 of the models. Devign is a balanced dataset containing roughly equal numbers of vulnerable and non-vulnerable examples, 27,318 data points in total (each example is also called a data point). LineVul used the MSR dataset, which became available more recently. MSR is imbalanced, containing 10,900 vulnerable and 177,736 non-vulnerable examples. Each example is annotated with its source project and a Common Weakness Enumeration (CWE) entry indicating the vulnerability type. The paper uses these dataset characteristics to formulate some of its research questions.
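The imbalance matters for evaluation: on MSR, a trivial classifier that labels every function non-vulnerable is about 94% accurate while detecting nothing. A quick back-of-the-envelope check using the counts above:

```python
# MSR dataset counts from above.
vul, non_vul = 10_900, 177_736
total = vul + non_vul                                   # 188,636 examples
print(f"vulnerable fraction: {vul / total:.1%}")        # ~5.8%
print(f"all-negative accuracy: {non_vul / total:.1%}")  # ~94.2%
# For that baseline, precision and recall on the vulnerable class are both 0,
# so F1 is 0, which is why accuracy alone is not a sufficient metric here.
```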
The paper reproduces the models' results using their original datasets and settings, as shown in the table above. The columns A, P, R, and F report metrics commonly used in deep learning vulnerability detection: accuracy, precision, recall, and F1 score. The reproduced results are generally within 2% of those in the original papers. Two special cases are ReVeal, where the authors confirmed that the paper's results correct a data leakage error in the original paper, and Devign, where the paper used third-party reproduction code (published by Chakraborty et al.) because the original Devign code is not open source.
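For reference, the A/P/R/F columns follow the standard definitions over a binary confusion matrix, with the vulnerable class as the positive class; a small helper makes the formulas explicit:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from a binary confusion matrix
    (vulnerable = positive class)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```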