Technology Sharing

An Empirical Study of Deep Learning Models for Vulnerability Detection

2024-07-12


1 Introduction

Research Background: Deep learning vulnerability detection tools have achieved promising results in recent years. State-of-the-art models report F1 scores of up to 0.9 and outperform static analyzers. These results are exciting because deep learning could revolutionize software assurance. As a result, companies such as IBM, Google, and Amazon have shown strong interest and invested heavily in developing such tools and datasets.

Existing Problems: Although deep learning vulnerability detection is promising, it has not yet reached the maturity of computer vision or natural language processing. Most current research focuses on applying emerging deep learning models to datasets such as Devign or MSR. However, we know very little about the models themselves: what types of programs they handle well, whether we should build a model per vulnerability type or a single unified model for all types, what a good training dataset looks like, and what information a model uses when making decisions. Answering these questions would help us better develop, debug, and apply the models, but given the black-box nature of deep learning, they are difficult to answer. The purpose of this paper is not to solve these problems completely, but to make progress toward these goals.

Scientific Questions: In this paper, we survey and reproduce a range of state-of-the-art deep learning vulnerability detection models and formulate research questions to understand them, aiming to draw lessons and guidance for designing and debugging future models. We organize the research questions into three areas, namely Model Capabilities, Training Data, and Model Interpretation. Specifically, the primary goal of the paper is to understand the capabilities of deep learning for vulnerability detection, with a particular focus on the following research questions:

  • Question 1: Do different models agree on vulnerability detection? What are the differences between them?
  • Question 2: Are certain types of vulnerabilities easier to detect? Should we build a model for each vulnerability type, or one model that detects all types?
  • Question 3: Are there code patterns that are difficult for the models to predict? If so, what are those patterns?

The second research focus of the paper is training data. The goal is to understand whether and how the size and composition of the training data affect model performance. Specifically, the paper poses the following research questions:

  • Question 4: Does increasing dataset size help improve model performance for vulnerability detection?
  • Question 5: How does the composition of items in the training dataset affect the performance of the model?

Finally, the third research area of the paper is model interpretation. The paper uses SOTA model interpretation tools to investigate:

  • Question 6: What source code information do the models use to make predictions? Do the models agree on which features are important?

Research Content: To answer the above questions, the paper surveyed state-of-the-art deep learning models and successfully reproduced 11 of them on their original datasets. These models adopt a variety of deep learning architectures, including GNN, RNN, LSTM, CNN, and Transformer. To compare the models, the paper ran 9 of them on two popular datasets, Devign and MSR. These two datasets were chosen because: (1) both contain real-world projects and vulnerabilities; (2) most of the models were evaluated and tuned on the Devign dataset in their original papers; (3) the MSR dataset spans 310 projects and annotates each example with its vulnerability type, which is crucial to our research questions. Through carefully designed experiments and consideration of threats to validity, the paper obtained results for all 6 research questions. Overall, the research contributions of the paper include:

  • Contribution 1: The paper conducts a comprehensive survey of deep learning vulnerability detection models.
  • Contribution 2: The paper provides a code repository containing trained models and datasets for 11 SOTA deep learning frameworks under various research settings.
  • Contribution 3: The paper designs 6 research questions to understand model capabilities, training data, and model interpretation.
  • Contribution 4: The paper carries out the study and obtains experimental results for the research questions raised.
  • Contribution 5: The paper prepares interesting examples and data for further study of model interpretability.

2 Model Reproduction

To collect state-of-the-art deep learning models, the paper reviewed papers from 2018 to 2022 and consulted Microsoft's CodeXGLUE leaderboard and IBM's D2A defect detection leaderboard. The paper obtained all available open-source models and successfully reproduced 11 of them. The paper's replication package contains the complete list of models and the reasons why some models could not be reproduced.

As shown in the table above, the reproduced models cover a variety of deep learning architectures. Devign and ReVeal apply GNNs to code property graphs that integrate control flow, data dependencies, and ASTs. ReGVD applies a GNN to tokens. Code2Vec applies a multi-layer perceptron (MLP) to ASTs. VulDeeLocator and SySeVR are sequence models based on RNNs and Bi-LSTMs. Recent deep learning detectors use pre-trained Transformers, including CodeBERT, VulBERTa-CNN, VulBERTa-MLP, PLBART, and LineVul.

The paper selected the Devign and MSR datasets for the research questions. The paper examined the datasets used by the 11 models in their original papers, shown in the table above, and found that the Devign dataset has been used to evaluate and tune 8 models. It is a balanced dataset containing approximately equal numbers of vulnerable and non-vulnerable examples, with 27,318 data points in total (each example is also called a data point). LineVul used the MSR dataset, which became available more recently. This dataset is imbalanced, containing 10,900 vulnerable and 177,736 non-vulnerable examples. Each example is annotated with its source project and a Common Weakness Enumeration (CWE) entry indicating the vulnerability type. The paper uses these dataset properties to formulate some of the research questions.
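The contrast between the two datasets is easy to quantify from the counts above. The following sketch (hypothetical illustration, not code from the paper's replication package) computes the fraction of vulnerable examples in MSR:

```python
# Sketch: quantify class imbalance using the MSR counts quoted above
# (10,900 vulnerable vs 177,736 non-vulnerable). Illustrative only.

def vulnerable_ratio(n_vuln: int, n_nonvuln: int) -> float:
    """Fraction of examples labeled vulnerable."""
    return n_vuln / (n_vuln + n_nonvuln)

msr = vulnerable_ratio(10_900, 177_736)
print(f"MSR: {msr:.1%} vulnerable")  # roughly 5.8% -- heavily imbalanced
```

By comparison, Devign's vulnerable ratio is close to 50%, which is one reason results on the two datasets are not directly comparable.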

The paper reproduced the models' results on their original datasets and settings, as shown in the table above. The columns A, P, R, and F report the metrics commonly used in deep learning vulnerability detection: accuracy, precision, recall, and F1 score. The reproduced results are generally within 2% of those in the original papers. Two special cases are ReVeal, where the authors confirmed that our results correct a data leakage error in the original paper, and Devign, where the paper used third-party reproduction code (published by Chakraborty et al.) because the original Devign code is not open source.
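For reference, the four metrics in columns A, P, R, and F can be derived from confusion-matrix counts as in the sketch below; the tp/fp/fn/tn values are made up for illustration, not taken from the paper:

```python
# Sketch: accuracy, precision, recall, and F1 from confusion-matrix counts.
# The counts below are hypothetical, chosen only to demonstrate the formulas.

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of flagged, how many vulnerable
    recall = tp / (tp + fn) if tp + fn else 0.0      # of vulnerable, how many flagged
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of P and R
    return {"A": accuracy, "P": precision, "R": recall, "F": f1}

print(metrics(tp=80, fp=20, fn=40, tn=60))  # A=0.70, P=0.80, R~0.67, F1~0.73
```

On an imbalanced dataset like MSR, accuracy alone is misleading (a model predicting "non-vulnerable" everywhere scores ~94%), which is why precision, recall, and F1 are reported alongside it.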