1. Introduction
Security vulnerabilities are pervasive in software. It is estimated that up to 98% of software contains a security vulnerability that could harm users [1]. Furthermore, up to 91% of web software is vulnerable to data breaches [1], which could cost companies an average of 4.88 million USD as of 2024 [2]. Despite the prevalence and cost of security vulnerabilities, detecting software defects that could lead to these vulnerabilities remains an open problem. Current methods have drawbacks, such as a high rate of false positives and an inability to detect a wide range of CWEs [3].
At the same time, modern large language models (LLMs) have made significant advances in code understanding on tasks such as program repair [4] and bug detection [5]. Despite these advances, studies of LLMs for security analysis still find that they struggle to reason across entire codebases, which is precisely what typical security defects require [6], [7].
2. Existing Work
The existing body of work on static application security testing (SAST) is extensive, ranging from methods that use no machine learning at all to those dominated by large language models. One classic technique enumerates paths from source to sink, where sources and sinks are determined by hand-designed rules operating on bytecode [8]. There are also deep learning methods that do not rely on large language models, using CNNs and RNNs instead [9]; however, these are limited to a small subset of security vulnerabilities, and their authors note that more work is needed.
Recently, techniques leveraging pre-trained LLMs have emerged. An early technique involved directly querying the LLM with a code snippet to determine if it contained a vulnerability [10]. The authors noted that ChatGPT showed impressive results, but its ability to detect vulnerabilities directly was still inferior to prior SAST methods. The authors of DiverseVul [11] made a similar observation about the LLMs they trained for this task: good performance, but not good enough. The consensus is that LLMs alone are not yet ready for this use case [7], as shown in Figure 1.
Figure 1. A comparison between LLMs at detecting vulnerabilities across datasets shows authors noting low F1 scores, even with frontier models.
Recently, efforts have combined traditional methods that analyze code structure with LLM methods to better understand that code. One such effort is the IRIS framework [12], which combines LLM techniques with CodeQL [13]. This approach operates similarly to classical techniques, identifying sources and sinks and detecting unsanitized paths between them. However, it uses LLMs to label potential sources and sinks and to prune false positives when identifying unsanitized paths from a source to a sink. This approach leverages the strengths of classical techniques in scanning an entire code base while utilizing LLMs’ ability to understand isolated snippets.
3. Proposed Approach
This approach builds on the IRIS framework [12] by incorporating a recent technique from general static analysis, the Intelligent Code Analysis Agent (ICAA) [14]. We deconstruct the IRIS framework into its core components, using CodeQL to scan the codebase and LLMs to decide which parts to scan, and combine this with the ReAct agent from the ICAA paper. To our knowledge, this is the first agent-based model used solely for static security analysis. A diagram of the architecture is shown in Figure 2. We expand on the IRIS paper in two ways. First, our agent's behavior encompasses all of IRIS's capabilities, since it has access to all of IRIS's constituent parts. Second, we leverage LLMs to guide traditional static analysis tools toward a better understanding of large codebases, addressing a traditional weakness of machine learning techniques: the LLM can write CodeQL queries directly to refine its search over the entire codebase.
Figure 2. Adapted ICAA framework.
To measure the progress of our work, we use the CWE-Bench-Java benchmark from Li et al. [12]. It consists of a set of Java repositories with CVE-fixing commits and labeled vulnerabilities.
4. Detailed Design
At a high level, this framework takes a description of the codebase, uses it to generate a plan for employing CodeQL to understand the codebase, and then detects vulnerabilities. This plan is sent to a ReAct-style agent that, on each turn, can generate a thought, request a CodeQL query, or report a vulnerability. Because writing CodeQL expressions is complex, we delegate that task to an agent separate from the ReAct loop; this agent tries to generate an expression that fulfills the ReAct agent's request. Once it has a query that runs and produces results, those results are passed to a summarizer agent, and the summary is returned to the ReAct agent, which continues until it reaches its turn limit or decides it has found all vulnerabilities. Once finished, the ReAct agent returns all reported vulnerabilities as a JSON report.
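To make the final output concrete, the sketch below shows one plausible shape for the JSON report, with each entry carrying the fields the reporting tool requires (filepath, explanation, CWE; see Section 4.2). The field names, the example path, and the example finding are illustrative, not the exact format produced by our implementation.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ReportedVulnerability:
    filepath: str      # primary filepath where the vulnerability exists
    explanation: str   # brief overview of how the vulnerability works
    cwe: str           # closest-matching CWE identifier, e.g. "CWE-22"

def to_json_report(findings: list[ReportedVulnerability]) -> str:
    """Serialize everything the ReAct agent reported into the final JSON report."""
    return json.dumps({"vulnerabilities": [asdict(f) for f in findings]}, indent=2)

# Purely illustrative example entry:
print(to_json_report([ReportedVulnerability(
    filepath="src/main/java/com/example/UploadHandler.java",
    explanation="An unsanitized request parameter flows into a file path.",
    cwe="CWE-22",
)]))
```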
4.1. Task Planning
Since it is impractical to pass the entire codebase into an LLM to generate a task plan, the task planning agent divides the work into two steps: generating a code briefing, then using that briefing to generate a task plan.
4.1.1. Code Briefing
To generate the briefing, the agent starts by scraping the codebase, examining the file structure and the README file. We then pass this information to an LLM with instructions to create a brief summary of the codebase focused on potential security concerns. Snippets from a sample codebase summary are included in Figure 3.
Figure 3. A snippet from a sample codebase summary from the codebase summarizing step of the task planning agent.
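As a concrete illustration, the sketch below shows one way the briefing prompt could be assembled. The depth limit on the file listing, the README truncation, and the prompt wording are our own illustrative choices rather than the exact prompts used; the resulting (system, user) pair would then be sent to the chat model.

```python
from pathlib import Path

def file_tree(repo_root: str, max_depth: int = 3) -> str:
    """Render a shallow listing of the repository's file structure."""
    root = Path(repo_root)
    lines = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if len(rel.parts) > max_depth or any(p.startswith(".") for p in rel.parts):
            continue
        lines.append("  " * (len(rel.parts) - 1) + rel.name)
    return "\n".join(lines)

def briefing_prompt(repo_root: str) -> tuple[str, str]:
    """Assemble the (system, user) prompt pair for the code-briefing step."""
    readme = ""
    for name in ("README.md", "README.txt", "README"):
        candidate = Path(repo_root) / name
        if candidate.exists():
            readme = candidate.read_text(errors="ignore")[:8000]  # keep the prompt small
            break
    system = "You are a security engineer summarizing an unfamiliar codebase."
    user = (
        "File structure:\n" + file_tree(repo_root) + "\n\n"
        "README:\n" + readme + "\n\n"
        "Write a brief summary of this codebase, focusing on potential security "
        "concerns such as entry points, untrusted input, and file or network I/O."
    )
    return system, user
```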
4.1.2. Task Plan
Given this summary, we ask the task planning agent to generate a plan. Our prompting strategy assigns each agent the persona of a software engineer at a given seniority level via the system prompt. For the task plan, we tell the LLM that it is a principal engineer hosting office hours, then prompt it with the code briefing and ask how it would use CodeQL to identify security flaws.
Figure 4. A snippet of the task plan generated by the task-planning agent.
Since this output is consumed by our ReAct agent, which can be long-running and quickly reach its context limit, we are mindful of the number of tokens the plan consumes. To keep the summary concise, we generate a bulleted list with sections. This is achieved by specifying a tool that requires the LLM to return valid JSON conforming to a schema, which is then formatted into a bulleted list. A snippet of a sample plan is shown in Figure 4.
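One plausible shape for such a tool is sketched below, using the Anthropic tool-use format; the tool name, field names, and formatter are illustrative rather than the exact schema in our implementation.

```python
# A tool specification that forces the planner to emit structured JSON
# (Anthropic tool-use format). Field names are illustrative.
PLAN_TOOL = {
    "name": "submit_task_plan",
    "description": "Return the CodeQL investigation plan as structured JSON.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sections": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "bullets": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["title", "bullets"],
                },
            }
        },
        "required": ["sections"],
    },
}

def format_plan(plan: dict) -> str:
    """Format the tool's JSON output into the concise bulleted list passed to the ReAct agent."""
    lines = []
    for section in plan["sections"]:
        lines.append(section["title"])
        lines.extend(f"  - {bullet}" for bullet in section["bullets"])
    return "\n".join(lines)
```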
4.2. ReAct Agent
The ReAct agent [15] is central to this technique. The intention is to create an agent that can request CodeQL queries, analyze their results, and issue follow-up queries based on what it finds. To facilitate this, when generating a response the agent chooses from a list of four tools. The tools, their required fields, and a brief description of how the LLM should fill those fields are listed below, followed by a sketch of how they might be expressed as tool definitions.
• Thought
- Description: Explain your thoughts and the next steps you plan to take in as much detail as possible, using clear and straightforward language.
• CodeQL
- Description: Provide a comprehensive overview of a CodeQL query you wish to execute on the codebase. Avoid including any code. Instead, offer a detailed explanation to assist a junior developer, who may not be familiar with CodeQL, in crafting the query.
• Report Vulnerability
- Filepath: The primary filepath where the vulnerability exists
- Explanation: A brief overview of how the vulnerability works
- CWE: The CWE that most closely matches the detected vulnerability
• No More Vulnerabilities
- Recap: Provide a recap of the work completed by the agent so far to determine if it can be terminated early when no longer useful.
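The sketch below expresses the four tools above in the Anthropic tool-use format. The tool and field names are paraphrased from our prompts rather than copied verbatim, and the small `tool` helper is our own convenience function.

```python
def tool(name: str, description: str, properties: dict) -> dict:
    """Build an Anthropic-style tool definition with all fields required."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }

REACT_TOOLS = [
    tool("thought",
         "Explain your reasoning and planned next steps in detail.",
         {"description": {"type": "string"}}),
    tool("codeql",
         "Describe, in plain English and without code, a CodeQL query to run.",
         {"description": {"type": "string"}}),
    tool("report_vulnerability",
         "Report a vulnerability found in the codebase.",
         {"filepath": {"type": "string"},
          "explanation": {"type": "string"},
          "cwe": {"type": "string"}}),
    tool("no_more_vulnerabilities",
         "Terminate early once no further vulnerabilities are expected.",
         {"recap": {"type": "string"}}),
]
```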
Figure 5. A sample output of the agent using the CodeQL tool to describe a CodeQL query it would like to run.
When the agent generates a thought or reports a vulnerability, we add it to the history and pass it back to the agent for another response. When the agent chooses the CodeQL tool, we send the description to a separate code-writing agent that returns a valid CodeQL expression; we summarize the query results and pass the summary back to the ReAct agent. Each request for the agent to generate more text constitutes a turn, and we cap the number of turns to prevent infinite loops. This cap can be adjusted to trade off precision and recall. To inform the agent of its turn limit and encourage progress, we prepend a turn counter to each input (e.g., 12/50). A sample turn of the agent is shown in Figure 5.
We instruct the LLM, acting as a senior engineer, to assist junior engineers in using CodeQL to identify security flaws. This approach encourages the LLM to be more descriptive, aiding in progress. We provide the task plan from the task planning phase.
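A minimal sketch of this loop is given below. The helpers `next_action` (one LLM call returning a tool choice), `run_codeql_request` (the code-writing agent of Section 4.3), and `summarize` (the summarizer agent of Section 4.4) are hypothetical stand-ins, and the default turn budget of 30 follows the setting reported in Section 5.1.

```python
from dataclasses import dataclass

@dataclass
class Action:
    tool: str      # "thought" | "codeql" | "report_vulnerability" | "no_more_vulnerabilities"
    payload: dict  # the fields required by the chosen tool

def react_loop(task_plan, next_action, run_codeql_request, summarize, max_turns=30):
    """Drive the ReAct agent until it terminates early or exhausts its turns."""
    history = [f"Task plan:\n{task_plan}"]
    reports = []
    for turn in range(1, max_turns + 1):
        # Prepend a turn counter (e.g. "12/30") so the agent knows its budget.
        action = next_action(f"[turn {turn}/{max_turns}]", history)
        if action.tool == "no_more_vulnerabilities":
            break  # early termination via the recap tool
        if action.tool == "report_vulnerability":
            reports.append(action.payload)
            history.append(f"Reported vulnerability: {action.payload}")
        elif action.tool == "codeql":
            # Delegate to the code-writing agent, then summarize the results.
            results = run_codeql_request(action.payload["description"])
            history.append(f"Query results (summarized): {summarize(results)}")
        else:  # "thought"
            history.append(f"Thought: {action.payload['description']}")
    return reports  # serialized into the final JSON report
```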
4.3. Code Writing Agent
We noticed that even advanced models struggle to produce valid CodeQL on their first attempt, likely because CodeQL is not a popular language and had few online resources available at the LLMs' training time. However, passing the errors back to the LLM often enables it to generate a valid CodeQL expression. This back-and-forth is expensive in tokens and pollutes the ReAct agent's history. To remedy this, we offload the code-writing process to a separate agent that needs only a plain-English description of a CodeQL query.
Figure 6. A sample valid CodeQL expression generated by the code-writing LLM agent.
In this agent, we pass errors back to the LLM until it generates a valid CodeQL expression. Even then, the task is not necessarily complete: a CodeQL query can produce results that exceed the LLM's context window, making them too long to return to the ReAct agent. To address this, we allow the code-writing agent to run the query and count the number of tokens in the results; if it exceeds a certain threshold, we return the task plan to the code-writing agent and ask it to write a query that reduces the number of results. A sample valid CodeQL expression generated by the agent is provided in Figure 6.
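The retry loop could be sketched as follows. The helpers `generate_query` (LLM call) and `compile_and_run` (CodeQL CLI invocation that raises on compilation or runtime errors) are hypothetical, the 10-round budget follows the figure discussed in Section 5.4.3, and the result threshold here is measured in characters per the 50k limit described in Section 5.1.

```python
def write_and_run_query(request, task_plan, generate_query, compile_and_run,
                        max_rounds=10, result_char_limit=50_000):
    """Retry loop: feed errors back until a valid query runs and its results fit."""
    feedback = ""
    for _ in range(max_rounds):
        query = generate_query(request, task_plan, feedback)
        try:
            results = compile_and_run(query)       # CodeQL compile + run
        except Exception as err:                   # compilation or runtime failure
            feedback = f"The previous query failed:\n{err}\nPlease fix it."
            continue
        if len(results) > result_char_limit:
            # Too many results to hand back to the ReAct agent: ask for a
            # narrower query rather than truncating the output.
            feedback = ("The previous query returned too many results. "
                        "Write a more selective query that narrows the output.")
            continue
        return query, results
    return None, None  # give up; the ReAct agent is told the request failed
```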
4.4. Summarizer Agent
For many of the reasons discussed earlier, simply piping the CodeQL results back into the ReAct agent often runs into the context-window limit. Therefore, a simple LLM query is run to summarize the results before returning them to the ReAct agent. A sample output is shown in Figure 7.
Figure 7. A sample output of the summarizer agent using parsed CodeQL results.
5. Results
5.1. Experimental Setup
For this design, we chose Anthropic's Claude Sonnet 3.5 and 3.7 as the models underpinning the agent framework. We experimented with smaller models, but these did not produce acceptable results for agent use. We set max_tokens to 4096 and temperature to 0.3; these values follow [16], which recommends a low temperature (less than 1.0) for structured outputs such as code. As for max tokens, CodeQL expressions can be quite verbose, so we allowed the LLM to produce long outputs. We frequently observed Claude stopping early, so we were confident it would use the extra tokens only when useful.
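As a sketch of how a single call might be configured with these parameters, the snippet below uses the Anthropic Python SDK. The model identifier string, system prompt, and user message are illustrative placeholders, not the exact values from our runs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # or the Claude Sonnet 3.7 equivalent
    max_tokens=4096,                   # CodeQL expressions can be verbose
    temperature=0.3,                   # low temperature for structured output [16]
    system="You are a senior engineer helping junior engineers use CodeQL.",
    messages=[{"role": "user", "content": "How should we start scanning this codebase?"}],
)
print(response.content[0].text)
```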
We allotted the ReAct agent 30 turns, each of which allows it to use one tool, before cutting it off. During testing we observed a weak negative correlation between turn count and detection rate: by around 30 turns the agent had generally completed its productive work and tended to meander afterward.
Finally, we set a 50k-character limit on the size of CodeQL query results before asking for a query that produces fewer results. Although our model could accommodate larger limits, we wanted to avoid a "needle in a haystack" problem in which the LLM struggles to find the useful results among many.
5.2. Dataset
To measure the efficacy of our technique, we chose the CWE-Bench-Java dataset from Li et al. [12]. This dataset consists of real-world Java repositories with known and patched vulnerabilities. The patches are used to annotate file names and function names of vulnerable code. This is a hand-curated and validated dataset. By relying on real-world datasets, it tests the ability of our agent framework to work with complicated codebases.
We were unable to generate CodeQL databases for all repositories and were limited to 95 of the 120. Therefore, our results may not be directly comparable to those from the IRIS paper; we include them here for reference.
One challenge is that using CodeQL requires an instrumented build of the Java application. While this might not be difficult for teams familiar with their builds, it is challenging for security researchers working on unfamiliar projects. Generating an instrumented build with CodeQL is also more temperamental than building the project as is (see Appendix).
5.3. Evaluation Metrics
We analyzed three main metrics to compare our framework’s performance with previous work. For our agent, we consider a vulnerability detected if the filename, description, and reported CWE are all correct.
• Detection Rate
• Average False Detection Rate (FDR)
• Average F1 Score
We compute the false detection rate and F1 score for each repository in the dataset and then average these values across all repositories.
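The sketch below shows one way these per-repository metrics could be computed before averaging, treating detection rate as recall and a report as correct only if filename, description, and CWE all match. The matching itself is abstracted away, and whether the overall detection rate in Table 1 is averaged per repository or aggregated across the benchmark is our own assumption here.

```python
def repo_metrics(num_true, num_reported, num_correct):
    """Per-repository detection rate, false detection rate, and F1.
    A report counts as correct only if filename, description, and CWE all match."""
    detection_rate = num_correct / num_true if num_true else 0.0              # recall
    fdr = (num_reported - num_correct) / num_reported if num_reported else 0.0
    precision = 1.0 - fdr
    f1 = (2 * precision * detection_rate / (precision + detection_rate)
          if (precision + detection_rate) else 0.0)
    return detection_rate, fdr, f1

def macro_average(per_repo_metrics):
    """Average each metric across repositories, as reported in Table 1."""
    n = len(per_repo_metrics)
    return tuple(sum(m[i] for m in per_repo_metrics) / n for i in range(3))
```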
5.4. Evaluation
Given our experimental setup, we observed performance comparable to IRIS in terms of F1 score. On this dataset, we outperformed both CodeQL and IRIS on false detection rate; combined with the small number of reported vulnerabilities per repository, this should reduce the manual work needed to vet findings. However, IRIS detected more vulnerabilities overall than our technique. Our results are shown in Table 1.
Table 1. Comparison of static security analysis techniques. As mentioned, we cannot directly compare our results to those from the IRIS paper because we were unable to generate CodeQL databases for 25 out of 120 repositories in the dataset. Our work is labeled as ReAct agent for Static Analysis.
Method | F1 Score | FDR | Detection Rate
ReAct agent for Static Analysis (+ Claude Sonnet 3.5) | 0.1281 | 0.8491 | 0.1591
ReAct agent for Static Analysis (+ Claude Sonnet 3.7) | 0.0754 | 0.5696 | 0.0857
CodeQL | 0.0760 | 0.9003 | 0.2250
IRIS (+ ChatGPT 4) | 0.1770 | 0.8482 | 0.4583
5.4.1. Breakdown by CWE
We can break down the detected vulnerabilities by their CWE ID. Looking at Figure 8, we can see that the detection rates are relatively uniform across the different CWE classes.
Figure 8. Detection Rate by CWE ID with the Claude Sonnet 3.5 Run.
5.4.2. Effect of Number of Turns on Performance
As mentioned above, we measure the performance of the agent framework against three criteria: F1 score, detection rate, and false detection rate. In Figure 9, we examine how the number of turns used by the ReAct agent affects performance as measured by these criteria.
One trend in this data is that the number of turns correlates negatively with performance. In every case where our framework correctly identified vulnerabilities, the ReAct agent terminated early via the No More Vulnerabilities tool.
Figure 9. Performance versus the number of turns used by the ReAct agent: We observe a weak negative relationship between the number of turns used and performance in vulnerability detection.
In Figure 10, we see that early termination is the most common outcome; however, the ReAct agent does not behave this way for all repositories in the dataset.
Figure 10. Comparison of the number of repositories where the ReAct agent terminated early versus those where it used all available turns.
5.4.3. Impact of the Code Writing Agent on Performance
Figure 11. The code writing agent’s success rate positively affects the framework’s ability to detect vulnerabilities. We observe a positive relationship between the agent’s success rate and the framework’s performance.
One factor that heavily influences the agent's ability to understand the code and find vulnerabilities is whether it can write the CodeQL queries it needs. In Figure 11, we see that most of the repositories where the framework performed well are those where the code-writing agent produced valid queries more than half of the time.
Figure 12. Distribution of the number of rounds the code-writing agent used to produce a working query, including rounds spent fixing compilation errors, optimizing the query to run within the allotted time, and rewriting the query to avoid returning too many results. Only successful query generations are included in this chart.
Figure 13. The number of times the code-writing agent produced a valid CodeQL query versus the times it failed to do so.
One interesting aspect of the code-writing agent's performance is its binary nature: it either works or it doesn't. In Figure 12, we see the distribution of the number of rounds used by the agent to produce a valid query; in our dataset, needing more than three rounds was rare when a query was eventually produced. However, Figure 13 shows the rate at which successful queries were produced: roughly a third of the time, the code-writing agent was unable to produce a valid CodeQL query despite being given 10 rounds.
There are multiple reasons the code-writing agent could mark a round as failed. In Figure 14, we see that most failures were caused by generated code that failed to compile. Interestingly, the agent is almost never able to recover from runtime errors or from cases where the output is too long. Enabling it to recover from these scenarios could be a topic of future research; one idea would be to give an agent access to the CodeQL documentation to help the code-writing agent troubleshoot runtime errors.
Figure 14. The number of times we observed a particular reason for the CodeQL agent failing to produce a valid expression. This counts the attempts made within the code-writing agent’s retry loop. As the code-writing agent gets up to 10 rounds to produce a valid CodeQL expression, each request from the ReAct agent to the code-writing agent can result in up to 10 failures on this chart.
6. Conclusions and Future Work
This project positions the LLM at the forefront of automated static security analysis: we demonstrated that it can write useful CodeQL queries and steadily advance its understanding of a codebase. In doing so, we built a framework with a significantly improved F1 score over baseline CodeQL, approaching the previous state of the art in LLM-enabled static security analysis, IRIS. Furthermore, on false detection rate, a key pain point for many developers, the agent framework significantly outperforms previous work.
6.1. Future Work
6.1.1. False Pruner Agent
One key strength of this work was its relatively low rate of false positives, which we believe could be reduced further with a false-positive pruning agent. One approach would be to send the reported vulnerabilities back through the ReAct agent, allowing it to write CodeQL queries that scrutinize each finding further.
6.1.2. Comparison of Frontier Models
After some experimentation, we decided against models such as LLaMA 7B and DeepSeek Coder, which we found unsuitable for acting as agents or writing CodeQL. Ultimately, we based much of our work on Anthropic's Claude family of models; future work could explore comparable models from OpenAI and others.
6.1.3. Additional Tools
The design of this agent framework only allows the model to interact with the codebase via CodeQL. While this is a powerful tool, access to the actual source code might also be beneficial; for example, a tool wrapping the cat CLI utility could help, especially when finalizing reported vulnerabilities.
6.2. Analysis of LLM Data Contamination
One concern when working with publicly known vulnerabilities, such as those in this dataset, is that the LLM may have encountered information about them during training. We would like to apply this agent framework, as well as the IRIS framework, to private datasets, that is, code the LLM could not have seen before, to determine whether these methods remain as effective in that setting.
Appendix
A1. Sample Correctly Detected Vulnerabilities
A1.1. Jolokia: CVE-2018-1000129
Here is a sample output where our tool successfully detected a real vulnerability. Compared to the diff linked below, the description is correct and quite instructive.
GitHub diff for the security patch: https://github.com/jolokia/jolokia/commit/5895d5c137c335e6b473e9dcb9baf748851bbc5f#diff-f19898247eddb55de6400489bff748ad
A1.2. Vertx-Web: CVE-2019-17640
Figure A1. A sample vulnerability detected by our tool in the Jolokia repository. All output is AI-generated by our LLM agent.
Compared to the description from https://bugs.eclipse.org/bugs/show_bug.cgi?id=567416 ("Eclipse Vert.x StaticHandler doesn't correctly process backslashes"), all fields are correct: the description, the marked CWE, and the explanation are all accurate.
Figure A2. Another sample vulnerability, detected in the vertx-web repository.
A2. Incorrectly Marked CWEs
For a few vulnerabilities, the agent found the right file but marked the wrong CWE. After manual examination, we decided not to count these as correctly identified vulnerabilities.
Jenkinsci Perfecto-Plugin: CVE-2020-2261
https://github.com/jenkinsci/perfecto-plugin
A3. Some Examples of the Code Writing Agent
A3.1. Jolokia: CVE-2018-1000129
Figure A3. This description, while pointing in the right direction, seemed too loosely related to actual vulnerabilities to effectively guide developers.
A3.1.1. Prompt
Figure A4. The plain English description of a CodeQL query.
A3.1.2. CodeQL Output
Figure A5. CodeQL output.
A3.1.3. Analysis of CodeQL Results
Figure A6. Summary of CodeQL Query Results by the summarizer agent.