From ae54e61f155763174a18e585bf5765bd4421d124 Mon Sep 17 00:00:00 2001 From: matteogreek Date: Sun, 27 Aug 2023 15:40:42 +0200 Subject: [PATCH 01/83] Update README.md --- prospector/README.md | 45 ++++++++++++++++++++++---------------------- 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/prospector/README.md b/prospector/README.md index de74ce53f..b0957ead3 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -19,29 +19,30 @@ Given an advisory expressed in natural language, Prospector processes the commit To quickly set up Prospector: 1. Clone the project KB repository -``` -git clone https://github.com/sap/project-kb -``` + ``` + git clone https://github.com/sap/project-kb + ``` 2. Navigate to the *prospector* folder -``` -cd project-kb/prospector -``` - -3. Execute the bash script *run_prospector.sh* specifying the *-h* flag. This will display a list of options that you can use to customize the execution of Prospector. -``` -./run_prospector.sh -h -``` - -The bash script builds and starts the required Docker containers. Once the building step is completed, the script will show the list of available options. - -4. Try the following example: -``` -./run_prospector.sh CVE-2020-1925 --repository https://github.com/apache/olingo-odata4 -``` - -By default, Prospector saves the results in a HTML file named *prospector-report.html*. - -Open this file in a web browser to view what Prospector was able to find! + ``` + cd project-kb/prospector + ``` +3. Rename the *config-sample.yaml* file in *config.yaml*.
Optionally adjust settings such as backend usage, NVD database preference, report format, and more. + ``` + mv config-sample.yaml config.yaml + ``` + +4. Execute the bash script *run_prospector.sh* specifying the *-h* flag.
This will display a list of options that you can use to customize the execution of Prospector. + ``` + ./run_prospector.sh -h + ``` + The bash script builds and starts the required Docker containers. Once the building step is completed, the script will show the list of available options. + +5. Try the following example: + ``` + ./run_prospector.sh CVE-2020-1925 --repository https://github.com/apache/olingo-odata4 + ``` + By default, Prospector saves the results in a HTML file named *prospector-report.html*. + Open this file in a web browser to view what Prospector was able to find! ## Development Setup From 2dc91e9540873c80f26f1cbfc465873a436047a7 Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Tue, 30 Jan 2024 09:30:14 +0100 Subject: [PATCH 02/83] Update README.md to include star history --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 4b412f79e..28665fa0c 100644 --- a/README.md +++ b/README.md @@ -95,6 +95,10 @@ scripts described in that paper](MSR2019) > If you wrote a paper that uses the data or the tools from this repository, please let us know (through an issue) and we'll add it to this list. +## Star History + +[![Star History Chart](https://api.star-history.com/svg?repos=sap/project-kb&type=Date)](https://star-history.com/#sap/project-kb&Date) + ## Credits ### EU-funded research projects From 27de8c0267638f0bc2864792a8711f1c92c29a0b Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Mon, 12 Feb 2024 14:22:43 +0100 Subject: [PATCH 03/83] Update README.md --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 28665fa0c..511e57e19 100644 --- a/README.md +++ b/README.md @@ -105,8 +105,9 @@ scripts described in that paper](MSR2019) The development of Project KB is partially supported by the following projects: -* [AssureMOSS](https://assuremoss.eu) (Grant No.952647). -* [Sparta](https://www.sparta.eu/) (Grant No.830892). 
+* [Sec4AI4Sec](https://www.sec4ai4sec-project.eu/) (Grant No. 101120393) +* [AssureMOSS](https://assuremoss.eu) (Grant No. 952647). +* [Sparta](https://www.sparta.eu/) (Grant No. 830892). ### Vulnerability data sources From 4e4322a860e837f1350d2d209a9d43ad8a10ea86 Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Fri, 1 Mar 2024 20:35:51 +0100 Subject: [PATCH 04/83] Update README.md --- README.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/README.md b/README.md index 511e57e19..0bfb210ba 100644 --- a/README.md +++ b/README.md @@ -37,11 +37,7 @@ in early 2019. In June 2020, we made a further step releasing the `kaybee` tool make the creation, aggregation, and consumption of vulnerability data much easier. In late 2020, we also released, as a proof-of-concept, the prototype `prospector`, whose goal is to automate the mapping of vulnerability advisories -onto their fix-commits. A technical description of the approach we implemented in -`prospector` can be found in this [preprint](https://arxiv.org/abs/2103.13375). -As of April 2021, together with our partners in the EU-funded project AssureMOSS, -we are reimplementing `prospector` to make it more robust, scalable, and user-friendly. -The reimplementation is carried out in the dedicate branch `prospector-assuremoss`. +onto their fix-commits. 
We hope this will encourage more contributors to join our efforts to build a collaborative, comprehensive knowledge base where each party remains in control From c520045c3c8d25f5eda8cff176c656f700cf1b88 Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Wed, 29 May 2024 18:13:43 +0200 Subject: [PATCH 05/83] Updated reference to FixFinder paper in README.md --- prospector/README.md | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/prospector/README.md b/prospector/README.md index b0957ead3..991409403 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -134,7 +134,9 @@ but fails without it) ## History The high-level structure of Prospector follows the approach of its -predecessor FixFinder, which is described in detail here: https://arxiv.org/pdf/2103.13375.pdf +predecessor FixFinder, which is described in: + +> Daan Hommersom, Antonino Sabetta, Bonaventura Coppola, Dario Di Nucci, and Damian A. Tamburri. 2024. Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories. ACM Trans. Softw. Eng. Methodol. March 2024. https://doi.org/10.1145/3649590 FixFinder is the prototype developed by Daan Hommersom as part of his thesis done in partial fulfillment of the requirements for the degree of Master of @@ -145,16 +147,17 @@ The main difference between FixFinder and Prospector (which has been implemented is that the former takes a definite data-driven approach and trains a ML model to perform the ranking, whereas the latter applies hand-crafted rules to assign a relevance score to each candidate commit. -The document that describes FixFinder can be cited as follows: - -@misc{hommersom2021mapping, - title = {Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories}, - author = {Hommersom, Daan and - Sabetta, Antonino and - Coppola, Bonaventura and - Dario Di Nucci and - Tamburri, Damian A. 
}, - year = {2021}, - month = {March}, - url = {https://arxiv.org/pdf/2103.13375.pdf} +The paper that describes FixFinder can be cited as follows: + +@article{10.1145/3649590, +author = {Hommersom, Daan and Sabetta, Antonino and Coppola, Bonaventura and Nucci, Dario Di and Tamburri, Damian A.}, +title = {Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories}, +year = {2024}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +issn = {1049-331X}, +url = {https://doi.org/10.1145/3649590}, +doi = {10.1145/3649590}, +journal = {ACM Trans. Softw. Eng. Methodol.}, +month = {mar}, } From 099b8c217db59cdb392a6e75d4fcd168b8886601 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 24 May 2024 08:15:31 +0000 Subject: [PATCH 06/83] Adds LLM support to obtain the repository URL through LLM providers. LLM providers can be accessed through third party APIs (such as OpenAI), or through the Genrative AI Hub in the SAP AI Core. --- .gitignore | 2 + prospector/README.md | 98 ++++++++-- prospector/cli/main.py | 25 ++- prospector/config-sample.yaml | 13 +- prospector/core/prospector.py | 14 +- prospector/datamodel/nlp.py | 31 +-- prospector/llm/llm_operations.py | 168 ++++++++++++++++ prospector/llm/models.py | 191 +++++++++++++++++++ prospector/llm/prompts.py | 45 +++++ prospector/llm/test_llm.py | 48 +++++ prospector/pyproject.toml | 3 +- prospector/requirements.in | 44 +++-- prospector/requirements.txt | 201 ++++++++++++++------ prospector/service/api/routers/endpoints.py | 2 +- prospector/service/api/routers/home.py | 17 +- prospector/service/api/routers/jobs.py | 2 +- prospector/service/main.py | 7 +- prospector/util/config_parser.py | 145 +++++++++++--- prospector/util/http.py | 16 +- 19 files changed, 908 insertions(+), 164 deletions(-) create mode 100644 prospector/llm/llm_operations.py create mode 100644 prospector/llm/models.py create mode 100644 prospector/llm/prompts.py create mode 100644 
prospector/llm/test_llm.py diff --git a/.gitignore b/.gitignore index 96042fb68..0b56f38a4 100644 --- a/.gitignore +++ b/.gitignore @@ -41,6 +41,7 @@ prospector/install_fastext.sh prospector/nvd.ipynb prospector/data/nvd.pkl prospector/data/nvd.csv +prospector/data_sources/reports .vscode/settings.json prospector/cov_html/* prospector/client/cli/cov_html/* @@ -51,6 +52,7 @@ prospector/.coverage **/cov_html prospector/cov_html .coverage +prospector/.venv prospector/prospector.code-workspace prospector/requests-cache.sqlite prospector/prospector-report.html diff --git a/prospector/README.md b/prospector/README.md index 991409403..f7729c7a8 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -5,18 +5,29 @@ currently under development: the instructions below are intended for development :exclamation: Please note that **Windows is not supported** while WSL and WSL2 are fine. -## Description +## Table of Contents + +1. [Description](#description) +2. [Quick Setup & Run](#setup--run) +3. [Development Setup](#development-setup) +4. [Contributing](#contributing) +5. [History](#history) + +## 📖 Description Prospector is a tool to reduce the effort needed to find security fixes for *known* vulnerabilities in open source software repositories. Given an advisory expressed in natural language, Prospector processes the commits found in the target source code repository, ranks them based on a set of predefined rules, and produces a report that the user can inspect to determine which commits to retain as the actual fix. -## Setup & Run +## ⚡️ Quick Setup & Run + +Prerequisites: -:warning: The tool requires Docker and Docker-compose, as it employes Docker containers for certain functionalities. Make sure you have Docker installed and running before proceeding with the setup and usage of Prospector. 
+* Docker (make sure you have Docker installed and running before proceeding with the setup) +* Docker-compose -To quickly set up Prospector: +To quickly set up Prospector, follow these steps. This will run Prospector in its containerised version. If you wish to debug or run Prospector's components individually, follow the steps below at [Development Setup](#development-setup). 1. Clone the project KB repository ``` @@ -44,7 +55,52 @@ To quickly set up Prospector: By default, Prospector saves the results in a HTML file named *prospector-report.html*. Open this file in a web browser to view what Prospector was able to find! -## Development Setup +### 🤖 LLM Support + +To use Prospector with LLM support, set the `use_llm_<...>` parameters in `config.yaml`. Additionally, you must specify required parameters for API access to the LLM. These parameters can vary depending on your choice of provider, please follow what fits your needs: + +
Use SAP AI CORE SDK + +You will need the following parameters in `config.yaml`: + +```yaml +llm_service: + type: sap + model_name: <model_name> +``` + +`<model_name>` refers to the model names available in the Generative AI Hub in SAP AI Core. [Here](https://github.tools.sap/I343697/generative-ai-hub-readme#1-supported-models) you can find an overview of available models. + +In `.env`, you must set the deployment URL as an environment variable following this naming convention: +```yaml +<model_name>_URL +``` + +
+ +
Use personal third party provider + +Implemented third party providers are **OpenAI**, **Google** and **Mistral**. + +1. You will need the following parameters in `config.yaml`: + ```yaml + llm_service: + type: third_party + model_name: <model_name> + ``` + + `<model_name>` refers to the model names available, for example `gpt-4o` for OpenAI. You can find a lists of available models here: + 1. [OpenAI](https://platform.openai.com/docs/models) + 2. [Google](https://ai.google.dev/gemini-api/docs/models/gemini) + 3. [Mistral](https://docs.mistral.ai/getting-started/models/) + +2. Make sure to add your OpenAI API key to your `.env` file as `[OPENAI|GOOGLE|MISTRAL]_API_KEY`. + +
+ +## 👩‍💻 Development Setup + +Following these steps allows you to run Prospector's components individually: [Backend database and worker containers](#starting-the-backend-database-and-the-job-workers), [RESTful Server](#starting-the-restful-server) for API endpoints, [Prospector CLI](#running-the-cli-version) and [Tests](#testing). Prerequisites: @@ -53,6 +109,8 @@ Prerequisites: * gcc g++ libffi-dev python3-dev libpq-dev * Docker & Docker-compose +### General + You can setup everything and install the dependencies by running: ``` make setup @@ -81,11 +139,13 @@ your editor so that autoformatting is enforced "on save". The pre-commit hook en black is run prior to committing anyway, but the auto-formatting might save you some time and avoid frustration. -If you use VSCode, this can be achieved by pasting these lines in your configuration file: +If you use VSCode, this can be achieved by installing the Black Formatter extension and pasting these lines in your configuration file: -``` - "python.formatting.provider": "black", - "editor.formatOnSave": true, +```json + "[python]": { + "editor.defaultFormatter": "ms-python.black-formatter", + "editor.formatOnSave": true, + } ``` ### Starting the backend database and the job workers @@ -94,17 +154,23 @@ If you run the client without running the backend you will get a warning and hav You can then start the necessary containers with the following command: -`make docker-setup` +```bash +make docker-setup +``` This also starts a convenient DB administration tool at http://localhost:8080 If you wish to cleanup docker to run a fresh version of the backend you can run: -`make docker-clean` +```bash +make docker-clean +``` ### Starting the RESTful server -`uvicorn api.main:app --reload` +```bash +uvicorn service.main:app --reload +``` Note, that it requires `POSTGRES_USER`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DBNAME` to be set in the .env file. 
@@ -113,7 +179,9 @@ You might also want to take a look at `http://127.0.0.1:8000/docs`. *Alternatively*, you can execute the RESTful server explicitly with: -`python api/main.py` +```bash +python api/main.py +``` which is equivalent but more convenient for debugging. @@ -127,11 +195,13 @@ Prospector makes use of `pytest`. :exclamation: **NOTE:** before using it please make sure to have running instances of the backend and the database. +## 🤝 Contributing + If you find a bug, please open an issue. If you can also fix the bug, please create a pull request (make sure it includes a test case that passes with your correction but fails without it) -## History +## 🕰️ History The high-level structure of Prospector follows the approach of its predecessor FixFinder, which is described in: diff --git a/prospector/cli/main.py b/prospector/cli/main.py index f29a4a59c..f619995df 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -7,6 +7,7 @@ from dotenv import load_dotenv +import llm.llm_operations as llm from util.http import ping_backend path_root = os.getcwd() @@ -32,10 +33,12 @@ def main(argv): # noqa: C901 with ConsoleWriter("Initialization") as console: config = get_configuration(argv) if not config: - logger.error("No configuration file found. Cannot proceed.") + logger.error( + "No configuration file found, or error in configuration file. Cannot proceed." + ) console.print( - "No configuration file found.", + "No configuration file found, or error in configuration file. Check logs.", status=MessageStatus.ERROR, ) return @@ -51,6 +54,16 @@ def main(argv): # noqa: C901 ) return + if not config.repository and not config.use_llm_repository_url: + logger.error( + "Either provide the repository URL or allow LLM usage to obtain it." 
+ ) + console.print( + "Either provide the repository URL or allow LLM usage to obtain it.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + # if config.ping: # return ping_backend(backend, get_level() < logging.INFO) @@ -63,6 +76,12 @@ def main(argv): # noqa: C901 logger.debug("Vulnerability ID: " + config.vuln_id) + # whether to use LLM support + if not config.repository: + config.repository = llm.get_repository_url( + llm_config=config.llm, vuln_id=config.vuln_id + ) + results, advisory_record = prospector( vulnerability_id=config.vuln_id, repository_url=config.repository, @@ -88,7 +107,7 @@ def main(argv): # noqa: C901 ) execution_time = execution_statistics["core"]["execution time"][0] - ConsoleWriter.print(f"Execution time: {execution_time:.3f}s") + ConsoleWriter.print(f"Execution time: {execution_time:.3f}s\n") return diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 92feb6596..b1dc8c1c1 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -1,5 +1,3 @@ - - # Wheter to preprocess only the repository's commits or fully run prospector preprocess_only: False @@ -12,7 +10,7 @@ fetch_references: False use_nvd: True # The NVD API token -nvd_token: Null +# nvd_token: # Wheter to use a backend or not: "always", "never", "optional" use_backend: optional @@ -30,6 +28,13 @@ database: redis_url: redis://redis:6379/0 +# LLM Usage (check README for help) +llm_service: + type: sap + model_name: gpt-4-turbo + +use_llm_repository_url: True # whether to use LLM's to obtain the repository URL + # Report file format: "html", "json", "console" or "all" # and the file name report: @@ -43,4 +48,4 @@ log_level: INFO git_cache: /tmp/gitcache # The GitHub API token -github_token: Null +# github_token: diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index d132eb9c0..b740d74d6 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -1,6 +1,7 @@ # flake8: noqa import logging 
+import os import re import sys import time @@ -36,7 +37,7 @@ ONE_YEAR = 365 * SECS_PER_DAY MAX_CANDIDATES = 2000 -DEFAULT_BACKEND = "http://localhost:8000" +DEFAULT_BACKEND = "http://backend:8000" core_statistics = execution_statistics.sub_collection("core") @@ -157,7 +158,14 @@ def prospector( # noqa: C901 exc_info=get_level() < logging.WARNING, ) if use_backend == "always": - print("Backend not reachable: aborting") + if backend_address == "http://localhost:8000" and os.path.exists( + "/.dockerenv" + ): + print( + "The backend address should be 'http://backend:8000' when running the containerised version of Prospector: aborting" + ) + else: + print("Backend not reachable: aborting") sys.exit(1) print("Backend not reachable: continuing") @@ -227,7 +235,7 @@ def preprocess_commits(commits: List[RawCommit], timer: ExecutionTimer) -> List[ def filter(commits: Dict[str, RawCommit]) -> Dict[str, RawCommit]: - with ConsoleWriter("\nCandidate filtering\n") as console: + with ConsoleWriter("\nCandidate filtering") as console: commits, rejected = filter_commits(commits) if rejected > 0: console.print(f"Dropped {rejected} candidates") diff --git a/prospector/datamodel/nlp.py b/prospector/datamodel/nlp.py index 1c5ed76e9..150f4203f 100644 --- a/prospector/datamodel/nlp.py +++ b/prospector/datamodel/nlp.py @@ -139,23 +139,24 @@ def extract_ghissue_references(repository: str, text: str) -> Dict[str, str]: id = result.group(1) url = f"{repository}/issues/{id}" content = fetch_url(url=url, extract_text=False) - gh_ref_data = content.find_all( - attrs={ - "class": ["comment-body", "markdown-title"], - }, - recursive=False, - ) - # TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules. 
- gh_ref_data.extend( - content.find_all( + if content is not None: + gh_ref_data = content.find_all( attrs={ - "id": re.compile(r"ref-issue|ref-pullrequest"), - } + "class": ["comment-body", "markdown-title"], + }, + recursive=False, + ) + # TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules. + gh_ref_data.extend( + content.find_all( + attrs={ + "id": re.compile(r"ref-issue|ref-pullrequest"), + } + ) + ) + refs[id] = " ".join( + [" ".join(block.get_text().split()) for block in gh_ref_data] ) - ) - refs[id] = " ".join( - [" ".join(block.get_text().split()) for block in gh_ref_data] - ) return refs diff --git a/prospector/llm/llm_operations.py b/prospector/llm/llm_operations.py new file mode 100644 index 000000000..2023bf168 --- /dev/null +++ b/prospector/llm/llm_operations.py @@ -0,0 +1,168 @@ +import sys +from typing import Dict + +import validators +from dotenv import dotenv_values +from langchain_core.language_models.llms import LLM +from langchain_google_vertexai import ChatVertexAI +from langchain_mistralai import ChatMistralAI +from langchain_openai import ChatOpenAI + +from cli.console import ConsoleWriter, MessageStatus +from datamodel.advisory import get_from_mitre +from llm.models import Gemini, Mistral, OpenAI +from llm.prompts import best_guess +from log.logger import logger + + +class ModelDef: + def __init__(self, access_info: str, _class: LLM): + self.access_info = ( + access_info # either deployment_url (for SAP) or API key (for Third Party) + ) + self._class = _class + + +env: Dict[str, str | None] = dotenv_values() + +SAP_MAPPING = { + "gpt-35-turbo": ModelDef(env.get("GPT_35_TURBO_URL", None), OpenAI), + "gpt-35-turbo-16k": ModelDef(env.get("GPT_35_TURBO_16K_URL", None), OpenAI), + "gpt-35-turbo-0125": ModelDef(env.get("GPT_35_TURBO_0125_URL", None), OpenAI), + "gpt-4": ModelDef(env.get("GPT_4_URL", None), OpenAI), + 
"gpt-4-32k": ModelDef(env.get("GPT_4_32K_URL", None), OpenAI), + # "gpt-4-turbo": env.get("GPT_4_TURBO_URL", None), # currently TBD: https://github.tools.sap/I343697/generative-ai-hub-readme + # "gpt-4o": env.get("GPT_4O_URL", None), # currently TBD: https://github.tools.sap/I343697/generative-ai-hub-readme + "gemini-1.0-pro": ModelDef(env.get("GEMINI_1_0_PRO_URL", None), Gemini), + "mistralai--mixtral-8x7b-instruct-v01": ModelDef( + env.get("MISTRALAI_MIXTRAL_8X7B_INSTRUCT_V01", None), Mistral + ), +} + +THIRD_PARTY_MAPPING = { + "gpt-4": ModelDef(env.get("OPENAI_API_KEY", None), ChatOpenAI), + "gpt-3.5-turbo": ModelDef(env.get("OPENAI_API_KEY", None), ChatOpenAI), + "gemini-pro": ModelDef(env.get("GOOGLE_API_KEY", None), ChatVertexAI), + "mistral-large-latest": ModelDef(env.get("MISTRAL_API_KEY", None), ChatMistralAI), +} + + +def create_model_instance(llm_config) -> LLM: + """Creates and returns the model object given the user's configuration. + + Args: + llm_config (dict): A dictionary containing the configuration for the LLM. Expected keys are: + - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for + external providers). + - 'model_name' (str): Which model to use, e.g. gpt-4. + + Returns: + LLM: An instance of the specified LLM model. 
+ """ + + def create_sap_provider(model_name: str): + d = SAP_MAPPING.get(model_name, None) + + if d is None: + raise ValueError(f"Model '{model_name}' is not available.") + + model = d._class( + model_name=model_name, + deployment_url=d.access_info, + ) + + return model + + def create_third_party_provider(model_name: str): + # obtain definition from main mapping + d = THIRD_PARTY_MAPPING.get(model_name, None) + + if d is None: + logger.error(f"Model '{model_name}' is not available.") + raise ValueError(f"Model '{model_name}' is not available.") + + model = d._class( + model=model_name, + api_key=d.access_info, + ) + + return model + + if llm_config is None: + raise ValueError( + "When using LLM support, please add necessary parameters to configuration file." + ) + + # LLM Instantiation + try: + match llm_config.type: + case "sap": + model = create_sap_provider(llm_config.model_name) + case "third_party": + model = create_third_party_provider(llm_config.model_name) + case _: + logger.error( + f"Invalid LLM type specified, '{llm_config.type}' is not available." + ) + raise ValueError( + f"Invalid LLM type specified, '{llm_config.type}' is not available." + ) + except Exception as e: + logger.error(f"Problem when initialising model: {e}") + raise ValueError(f"Problem when initialising model: {e}") + + return model + + +def get_repository_url(llm_config: Dict, vuln_id: str): + """Ask an LLM to obtain the repository URL given the advisory description and references. + + Args: + llm_config (dict): A dictionary containing the configuration for the LLM. Expected keys are: + - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for + external providers). + - 'model_name' (str): Which model to use, e.g. gpt-4. + vuln_id: The ID of the advisory, e.g. CVE-2020-1925. + + Returns: + The repository URL as a string. + + Raises: + ValueError if advisory information cannot be obtained or there is an error in the model invocation. 
+ """ + with ConsoleWriter("Invoking LLM") as console: + details, _ = get_from_mitre(vuln_id) + if details is None: + logger.error("Error when getting advisory information from Mitre.") + console.print( + "Error when getting advisory information from Mitre.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + + try: + model = create_model_instance(llm_config=llm_config) + chain = best_guess | model + + url = chain.invoke( + { + "description": details["descriptions"][0]["value"], + "references": details["references"], + } + ) + if not validators.url(url): + logger.error(f"LLM returned invalid URL: {url}") + console.print( + f"LLM returned invalid URL: {url}", + status=MessageStatus.ERROR, + ) + sys.exit(1) + except Exception as e: + logger.error(f"Prompt-model chain could not be invoked: {e}") + console.print( + "Prompt-model chain could not be invoked.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + + return url diff --git a/prospector/llm/models.py b/prospector/llm/models.py new file mode 100644 index 000000000..9026d3524 --- /dev/null +++ b/prospector/llm/models.py @@ -0,0 +1,191 @@ +import json +from typing import Any, List, Mapping, Optional + +import requests +from dotenv import dotenv_values +from langchain_core.language_models.llms import LLM + +from log.logger import logger + + +class SAPProvider(LLM): + model_name: str + deployment_url: str + + @property + def _llm_type(self) -> str: + return "custom" + + @property + def _identifying_params(self) -> Mapping[str, Any]: + """Get the identifying parameters.""" + return { + "model_name": self.model_name, + } + + def _call( + self, + prompt: str, + stop: Optional[List[str]] = None, + **kwargs: Any, + ) -> str: + """Run the LLM on the given input. + + Override this method to implement the LLM logic. + + Args: + prompt: The prompt to generate from. + stop: Stop words to use when generating. Model output is cut off at the + first occurrence of any of the stop substrings. 
+ If stop tokens are not supported consider raising NotImplementedError. + run_manager: Callback manager for the run. + **kwargs: Arbitrary additional keyword arguments. These are usually passed + to the model provider API call. + + Returns: + The model output as a string. Actual completions SHOULD NOT include the prompt. + """ + if self.deployment_url is None: + raise ValueError( + "Deployment URL not set. Maybe you forgot to create the environment variable." + ) + if stop is not None: + raise ValueError("stop kwargs are not permitted.") + return "" + + +class OpenAI(SAPProvider): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + # Model specific request data + endpoint = f"{self.deployment_url}/chat/completions?api-version=2023-05-15" + headers = get_headers() + data = { + "messages": [ + { + "role": "user", + "content": f"{prompt}", + } + ] + } + + response = requests.post(endpoint, headers=headers, json=data) + + if not response.status_code == 200: + logger.error( + f"Invalid response from AI Core API with error code {response.status_code}" + ) + raise Exception("Invalid response from AI Core API.") + + return self.parse(response.json()) + + def parse(self, message) -> str: + """Parse the returned JSON object from OpenAI.""" + return message["choices"][0]["message"]["content"] + + +class Gemini(SAPProvider): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + # Model specific request data + endpoint = f"{self.deployment_url}/models/{self.model_name}:generateContent" + headers = get_headers() + data = { + "generation_config": { + "maxOutputTokens": 1000, + "temperature": 0.0, + }, + "contents": [{"role": "user", "parts": [{"text": prompt}]}], + "safetySettings": [ + { + "category": 
"HARM_CATEGORY_DANGEROUS_CONTENT", + "threshold": "BLOCK_NONE", + }, + { + "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", + "threshold": "BLOCK_NONE", + }, + {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, + {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, + ], + } + + response = requests.post(endpoint, headers=headers, json=data) + + if not response.status_code == 200: + logger.error( + f"Invalid response from AI Core API with error code {response.status_code}" + ) + raise Exception("Invalid response from AI Core API.") + + return self.parse(response.json()) + + def parse(self, message) -> str: + """Parse the returned JSON object from OpenAI.""" + return message["candidates"][0]["content"]["parts"][0]["text"] + + +class Mistral(SAPProvider): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + # Model specific request data + endpoint = f"{self.deployment_url}/chat/completions" + headers = get_headers() + data = { + "model": "mistralai--mixtral-8x7b-instruct-v01", + "max_tokens": 100, + "temperature": 0.0, + "messages": [{"role": "user", "content": prompt}], + } + + response = requests.post(endpoint, headers=headers, json=data) + + if not response.status_code == 200: + logger.error( + f"Invalid response from AI Core API with error code {response.status_code}" + ) + raise Exception("Invalid response from AI Core API.") + + return self.parse(response.json()) + + def parse(self, message) -> str: + """Parse the returned JSON object from OpenAI.""" + return message["choices"][0]["message"]["content"] + + +def get_headers(): + """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. + + Returns: + The headers object needed to send requests to the SAP AI Core. 
+ """ + with open(dotenv_values()["AI_CORE_KEY_FILEPATH"]) as f: + sk = json.load(f) + + auth_url = f"{sk['url']}/oauth/token" + client_id = sk["clientid"] + client_secret = sk["clientsecret"] + # api_base_url = f"{sk['serviceurls']['AI_API_URL']}/v2" + + response = requests.post( + auth_url, + data={"grant_type": "client_credentials"}, + auth=(client_id, client_secret), + timeout=8000, + ) + + headers = { + "AI-Resource-Group": "default", + "Content-Type": "application/json", + "Authorization": f"Bearer {response.json()['access_token']}", + } + return headers diff --git a/prospector/llm/prompts.py b/prospector/llm/prompts.py new file mode 100644 index 000000000..57fd2444a --- /dev/null +++ b/prospector/llm/prompts.py @@ -0,0 +1,45 @@ +from langchain.prompts import FewShotPromptTemplate, PromptTemplate + +# example output for few-shot prompting +examples_without_num = [ + { + "cve_description": "Apache Olingo versions 4.0.0 to 4.7.0 provide the AsyncRequestWrapperImpl class which reads a URL from the Location header, and then sends a GET or DELETE request to this URL. It may allow to implement a SSRF attack. If an attacker tricks a client to connect to a malicious server, the server can make the client call any URL including internal resources which are not directly accessible by the attacker.", + "cve_references": "https://www.zerodayinitiative.com/advisories/ZDI-24-196/", + "result": "https://github.com/apache/olingo-odata4", + }, + { + "cve_description": "Open-source project Online Shopping System Advanced is vulnerable to Reflected Cross-Site Scripting (XSS). 
An attacker might trick somebody into using a crafted URL, which will cause a script to be run in user's browser.", + "cve_references": "https://cert.pl/en/posts/2024/05/CVE-2024-3579, https://cert.pl/posts/2024/05/CVE-2024-3579", + "result": "https://github.com/PuneethReddyHC/online-shopping-system-advanced", + }, + { + "cve_description": "The Hoppscotch Browser Extension is a browser extension for Hoppscotch, a community-driven end-to-end open-source API development ecosystem. Due to an oversight during a change made to the extension in the commit d4e8e4830326f46ba17acd1307977ecd32a85b58, a critical check for the origin list was missed and allowed for messages to be sent to the extension which the extension gladly processed and responded back with the results of, while this wasn't supposed to happen and be blocked by the origin not being present in the origin list.\n\nThis vulnerability exposes Hoppscotch Extension users to sites which call into Hoppscotch Extension APIs internally. This fundamentally allows any site running on the browser with the extension installed to bypass CORS restrictions if the user is running extensions with the given version. This security hole was patched in the commit 7e364b928ab722dc682d0fcad713a96cc38477d6 which was released along with the extension version `0.35`. As a workaround, Chrome users can use the Extensions Settings to disable the extension access to only the origins that you want. 
Firefox doesn't have an alternative to upgrading to a fixed version.", + "cve_references": "https://github.com/hoppscotch/hoppscotch-extension/commit/7e364b928ab722dc682d0fcad713a96cc38477d6, https://github.com/hoppscotch/hoppscotch-extension/commit/d4e8e4830326f46ba17acd1307977ecd32a85b58, https://github.com/hoppscotch/hoppscotch-extension/security/advisories/GHSA-jjh5-pvqx-gg5v, https://server.yadhu.in/poc/hoppscotch-poc.html", + "result": "https://github.com/hoppscotch/hoppscotch-extension", + }, +] + +# Formatter for the few-shot examples without CVE numbers +examples_prompt_without_num = PromptTemplate( + input_variables=["cve_description", "cve_references", "result"], + template=""" {cve_description} + {cve_references} + + {result} """, +) + +best_guess = FewShotPromptTemplate( + prefix="""You will be provided with the ID, description and references of a vulnerability advisory (CVE). Return nothing but the URL of the repository the given CVE is concerned with. + +Here are a few examples delimited with XML tags:""", + examples=examples_without_num, + example_prompt=examples_prompt_without_num, + suffix="""Here is the CVE information: + {description} + {references} + +If you cannot find the URL, return your best guess of what the repository URL could be. Use any hints (e.g. the mention of GitHub or GitLab) in the CVE description and references. Return nothing but the URL. 
+""", + input_variables=["description", "references"], + metadata={"name": "best_guess"}, +) diff --git a/prospector/llm/test_llm.py b/prospector/llm/test_llm.py new file mode 100644 index 000000000..abb3fe658 --- /dev/null +++ b/prospector/llm/test_llm.py @@ -0,0 +1,48 @@ +import pytest +import requests +from langchain_openai import ChatOpenAI + +from llm.llm_operations import create_model_instance, get_repository_url +from llm.models import Gemini, Mistral, OpenAI + + +# Mock the llm_service configuration object +class Config: + type: str = None + model_name: str = None + + def __init__(self, type, model_name): + self.type = type + self.model_name = model_name + + +# Vulnerability ID +vuln_id = "CVE-2024-32480" + + +class TestModel: + def test_sap_gpt35_instantiation(self): + config = Config("sap", "gpt-35-turbo") + model = create_model_instance(config) + assert isinstance(model, OpenAI) + + def test_sap_gpt4_instantiation(self): + config = Config("sap", "gpt-4") + model = create_model_instance(config) + assert isinstance(model, OpenAI) + + def test_thirdparty_gpt35_instantiation(self): + config = Config("third_party", "gpt-3.5-turbo") + model = create_model_instance(config) + assert isinstance(model, ChatOpenAI) + + def test_thirdparty_gpt4_instantiation(self): + config = Config("third_party", "gpt-4") + model = create_model_instance(config) + assert isinstance(model, ChatOpenAI) + + def test_invoke_fail(self): + with pytest.raises(SystemExit): + config = Config("sap", "gpt-35-turbo") + vuln_id = "random" + get_repository_url(llm_config=config, vuln_id=vuln_id) diff --git a/prospector/pyproject.toml b/prospector/pyproject.toml index 3511de431..87fd3e89b 100644 --- a/prospector/pyproject.toml +++ b/prospector/pyproject.toml @@ -9,7 +9,8 @@ testpaths = [ "api", "filtering", "stats", - "util" + "util", + "llm", ] [tool.isort] diff --git a/prospector/requirements.in b/prospector/requirements.in index bf42e07cb..6d7d7f4b3 100644 --- a/prospector/requirements.in +++ 
b/prospector/requirements.in @@ -1,19 +1,27 @@ -beautifulsoup4==4.11.1 -colorama==0.4.6 -datasketch==1.5.8 -fastapi==0.85.1 -Jinja2==3.1.2 -pandas==1.5.1 -plac==1.3.5 -psycopg2==2.9.5 -pydantic==1.10.2 -pytest==7.2.0 -python-dotenv==0.21.0 -python_dateutil==2.8.2 -redis==4.3.4 -requests==2.28.1 + +beautifulsoup4 +colorama +datasketch +fastapi +Jinja2 +langchain +langchain_openai +langchain_google_vertexai +langchain_mistralai +langchain_community +omegaconf +pandas +plac +psycopg2 +pydantic +pytest +python_dateutil +python-dotenv +redis +requests requests_cache==0.9.6 -rq==1.11.1 -spacy==3.4.2 -tqdm==4.64.1 -uvicorn==0.19.0 +rq +spacy +tqdm +uvicorn +validators diff --git a/prospector/requirements.txt b/prospector/requirements.txt index 16385f316..dc864b5d7 100644 --- a/prospector/requirements.txt +++ b/prospector/requirements.txt @@ -1,76 +1,153 @@ # -# This file is autogenerated by pip-compile with python 3.10 -# To update, run: +# This file is autogenerated by pip-compile with Python 3.10 +# by the following command: # # pip-compile --no-annotate --strip-extras # -anyio==3.6.2 +--extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/deploy-releases-pypi/simple +--extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/proxy-deploy-releases-hyperspace-pypi/simple +--trusted-host int.repositories.cloud.sap + +aiohttp==3.9.5 +aiosignal==1.3.1 +annotated-types==0.7.0 +antlr4-python3-runtime==4.9.3 +anyio==4.4.0 appdirs==1.4.4 -argparse==1.4.0 -async-timeout==4.0.2 -attrs==22.1.0 -beautifulsoup4==4.11.1 -blis==0.7.9 -catalogue==2.0.8 -cattrs==22.2.0 -certifi==2022.9.24 -charset-normalizer==2.1.1 -click==8.1.3 +async-timeout==4.0.3 +attrs==23.2.0 +beautifulsoup4==4.12.3 +blis==0.7.11 +cachetools==5.3.3 +catalogue==2.0.10 +cattrs==23.2.3 +certifi==2024.6.2 +charset-normalizer==3.3.2 +click==8.1.7 +cloudpathlib==0.18.1 colorama==0.4.6 -confection==0.0.3 -cymem==2.0.7 -datasketch==1.5.8 -deprecated==1.2.13 -exceptiongroup==1.0.0rc9 
-fastapi==0.85.1 +confection==0.1.5 +cymem==2.0.8 +dataclasses-json==0.6.6 +datasketch==1.6.5 +distro==1.9.0 +dnspython==2.6.1 +docstring-parser==0.16 +email-validator==2.1.1 +exceptiongroup==1.2.1 +fastapi==0.111.0 +fastapi-cli==0.0.4 +filelock==3.14.0 +frozenlist==1.4.1 +fsspec==2024.6.0 +google-api-core==2.19.0 +google-auth==2.29.0 +google-cloud-aiplatform==1.53.0 +google-cloud-bigquery==3.24.0 +google-cloud-core==2.4.1 +google-cloud-resource-manager==1.12.3 +google-cloud-storage==2.16.0 +google-crc32c==1.5.0 +google-resumable-media==2.7.0 +googleapis-common-protos==1.63.1 +greenlet==3.0.3 +grpc-google-iam-v1==0.13.0 +grpcio==1.64.1 +grpcio-status==1.62.2 h11==0.14.0 -idna==3.4 -iniconfig==1.1.1 -jinja2==3.1.2 -langcodes==3.3.0 -markupsafe==2.1.1 -murmurhash==1.0.9 -numpy==1.23.4 -packaging==21.3 -pandas==1.5.1 -pathy==0.6.2 -plac==1.3.5 -pluggy==1.0.0 -preshed==3.0.8 -psycopg2==2.9.5 -pydantic==1.10.2 -pyparsing==3.0.9 -pytest==7.2.0 -python-dateutil==2.8.2 -python-dotenv==0.21.0 -pytz==2022.5 -redis==4.3.4 -requests==2.28.1 +httpcore==1.0.5 +httptools==0.6.1 +httpx==0.27.0 +httpx-sse==0.4.0 +huggingface-hub==0.23.3 +idna==3.7 +iniconfig==2.0.0 +jinja2==3.1.4 +jsonpatch==1.33 +jsonpointer==2.4 +langchain==0.2.2 +langchain-community==0.2.3 +langchain-core==0.2.4 +langchain-google-vertexai==1.0.5 +langchain-mistralai==0.1.8 +langchain-openai==0.1.8 +langchain-text-splitters==0.2.1 +langcodes==3.4.0 +langsmith==0.1.74 +language-data==1.2.0 +marisa-trie==1.2.0 +markdown-it-py==3.0.0 +markupsafe==2.1.5 +marshmallow==3.21.3 +mdurl==0.1.2 +multidict==6.0.5 +murmurhash==1.0.10 +mypy-extensions==1.0.0 +numpy==1.26.4 +omegaconf==2.3.0 +openai==1.31.1 +orjson==3.10.3 +packaging==23.2 +pandas==2.2.2 +plac==1.4.3 +pluggy==1.5.0 +preshed==3.0.9 +proto-plus==1.23.0 +protobuf==4.25.3 +psycopg2==2.9.9 +pyasn1==0.6.0 +pyasn1-modules==0.4.0 +pydantic==2.7.3 +pydantic-core==2.18.4 +pygments==2.18.0 +pytest==8.2.2 +python-dateutil==2.9.0.post0 +python-dotenv==1.0.1 
+python-multipart==0.0.9 +pytz==2024.1 +pyyaml==6.0.1 +redis==5.0.5 +regex==2024.5.15 +requests==2.32.3 requests-cache==0.9.6 -rq==1.11.1 -scipy==1.9.3 +rich==13.7.1 +rq==1.16.2 +rsa==4.9 +scipy==1.13.1 +shapely==2.0.4 +shellingham==1.5.4 six==1.16.0 -smart-open==5.2.1 -sniffio==1.3.0 -soupsieve==2.3.2.post1 -spacy==3.4.2 -spacy-legacy==3.0.10 -spacy-loggers==1.0.3 -srsly==2.4.5 -starlette==0.20.4 -thinc==8.1.5 +smart-open==7.0.4 +sniffio==1.3.1 +soupsieve==2.5 +spacy==3.7.5 +spacy-legacy==3.0.12 +spacy-loggers==1.0.5 +sqlalchemy==2.0.30 +srsly==2.4.8 +starlette==0.37.2 +tenacity==8.3.0 +thinc==8.2.4 +tiktoken==0.7.0 +tokenizers==0.19.1 tomli==2.0.1 -tqdm==4.64.1 -typer==0.4.2 -typing-extensions==4.4.0 +tqdm==4.66.4 +typer==0.12.3 +typing-extensions==4.12.1 +typing-inspect==0.9.0 +tzdata==2024.1 +ujson==5.10.0 url-normalize==1.4.3 -urllib3==1.26.12 -uvicorn==0.19.0 -validators==0.20.0 -wasabi==0.10.1 -wrapt==1.14.1 -python-multipart==0.0.5 -omegaconf==2.2.3 +urllib3==2.2.1 +uvicorn==0.30.1 +uvloop==0.19.0 +validators==0.28.3 +wasabi==1.1.3 +watchfiles==0.22.0 +weasel==0.4.1 +websockets==12.0 +wrapt==1.16.0 +yarl==1.9.4 # The following packages are considered to be unsafe in a requirements file: # setuptools diff --git a/prospector/service/api/routers/endpoints.py b/prospector/service/api/routers/endpoints.py index d5cb0ff5c..9f446f86c 100644 --- a/prospector/service/api/routers/endpoints.py +++ b/prospector/service/api/routers/endpoints.py @@ -3,7 +3,6 @@ from datetime import datetime import redis -from api.rq_utils import get_all_jobs, queue from fastapi import APIRouter, FastAPI, Request from fastapi.responses import HTMLResponse from fastapi.templating import Jinja2Templates @@ -12,6 +11,7 @@ from starlette.responses import RedirectResponse from data_sources.nvd.job_creation import run_prospector +from service.api.rq_utils import get_all_jobs, queue from util.config_parser import parse_config_file # from core.report import generate_report diff --git 
a/prospector/service/api/routers/home.py b/prospector/service/api/routers/home.py index 46d80cdd6..020ea6469 100644 --- a/prospector/service/api/routers/home.py +++ b/prospector/service/api/routers/home.py @@ -1,14 +1,15 @@ -from api.rq_utils import queue, get_all_jobs -from fastapi import FastAPI, Request -from fastapi import APIRouter -from fastapi.templating import Jinja2Templates +import time + +import redis +from fastapi import APIRouter, FastAPI, Request from fastapi.responses import HTMLResponse -from starlette.responses import RedirectResponse -from util.config_parser import parse_config_file +from fastapi.templating import Jinja2Templates from rq import Connection, Queue from rq.job import Job -import redis -import time +from starlette.responses import RedirectResponse + +from service.api.rq_utils import get_all_jobs, queue +from util.config_parser import parse_config_file # from core.report import generate_report diff --git a/prospector/service/api/routers/jobs.py b/prospector/service/api/routers/jobs.py index 77f0f7667..0f759e485 100644 --- a/prospector/service/api/routers/jobs.py +++ b/prospector/service/api/routers/jobs.py @@ -5,9 +5,9 @@ from rq import Connection, Queue from rq.job import Job -from api.routers.nvd_feed_update import main from git.git import do_clone from log.logger import logger +from service.api.routers.nvd_feed_update import main from util.config_parser import parse_config_file config = parse_config_file() diff --git a/prospector/service/main.py b/prospector/service/main.py index 1756c2984..c33f41d1f 100644 --- a/prospector/service/main.py +++ b/prospector/service/main.py @@ -2,12 +2,13 @@ from fastapi import FastAPI from fastapi.middleware.cors import CORSMiddleware from fastapi.responses import HTMLResponse, RedirectResponse +from fastapi.staticfiles import StaticFiles -# from .dependencies import oauth2_scheme -from api.routers import jobs, nvd, preprocessed, users, endpoints, home from log.logger import logger + +# from 
.dependencies import oauth2_scheme +from service.api.routers import endpoints, home, jobs, nvd, preprocessed, users from util.config_parser import parse_config_file -from fastapi.staticfiles import StaticFiles api_metadata = [ {"name": "data", "description": "Operations with data used to train ML models."}, diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index 49b59e840..e1f7dfad6 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -1,15 +1,23 @@ import argparse import os import sys -from dataclasses import dataclass +from dataclasses import MISSING, dataclass +from typing import Optional from omegaconf import OmegaConf +from omegaconf.errors import ( + ConfigAttributeError, + ConfigKeyError, + ConfigTypeError, + MissingMandatoryValue, +) from log.logger import logger def parse_cli_args(args): parser = argparse.ArgumentParser(description="Prospector CLI") + parser.add_argument( "vuln_id", nargs="?", @@ -17,7 +25,9 @@ def parse_cli_args(args): help="ID of the vulnerability to analyze", ) - parser.add_argument("--repository", default="", type=str, help="Git repository url") + parser.add_argument( + "--repository", default=None, type=str, help="Git repository url" + ) parser.add_argument( "--preprocess-only", @@ -27,6 +37,7 @@ def parse_cli_args(args): parser.add_argument("--pub-date", type=str, help="Publication date of the advisory") + # Allow the user to manually supply advisory description parser.add_argument("--description", type=str, help="Advisory description") parser.add_argument( @@ -81,7 +92,6 @@ def parse_cli_args(args): parser.add_argument( "--use-backend", - default="always", choices=["always", "never", "optional"], type=str, help="Use the backend server", @@ -131,12 +141,71 @@ def parse_cli_args(args): def parse_config_file(filename: str = "config.yaml"): if os.path.isfile(filename): logger.info(f"Loading configuration from {filename}") + schema = OmegaConf.structured(MandatoryConfig) 
config = OmegaConf.load(filename) - return config + try: + merged_config = OmegaConf.merge(schema, config) + return merged_config + except ConfigAttributeError as e: + logger.error(f"Attribute error in {filename}: {e}") + except ConfigKeyError as e: + logger.error(f"Key error in {filename}: {e}") + except ConfigTypeError as e: + logger.error(f"Type error in {filename}: {e}") + except Exception as e: + # General exception catch block for any other exceptions + logger.error(f"An unexpected error occurred when parsing config.yaml: {e}") + else: + logger.error("No configuration file found, cannot proceed.") + + +# Schema class for "database" configuration +@dataclass +class DatabaseConfig: + user: str + password: str + host: str + port: int + dbname: str + + +# Schema class for "report" configuration +@dataclass +class ReportConfig: + format: str + name: str + - return None +# Schema class for "llm_service" configuration +@dataclass +class LLMServiceConfig: + type: str + model_name: str + + +# Schema class for config.yaml parameters +@dataclass +class MandatoryConfig: + redis_url: str = MISSING + preprocess_only: bool = MISSING + max_candidates: int = MISSING + fetch_references: bool = MISSING + use_nvd: bool = MISSING + use_backend: str = MISSING + backend: str = MISSING + use_llm_repository_url: bool = MISSING + report: ReportConfig = MISSING + log_level: str = MISSING + git_cache: str = MISSING + nvd_token: Optional[str] = None + database: DatabaseConfig = DatabaseConfig( + user="postgres", password="example", host="db", port=5432, dbname="postgres" + ) + llm_service: Optional[LLMServiceConfig] = None + github_token: Optional[str] = None +# Prospector's own Configuration object (combining args and config.yaml) @dataclass class Config: def __init__( @@ -154,17 +223,21 @@ def __init__( keywords: str, use_nvd: bool, fetch_references: bool, - backend: str, use_backend: str, + backend: str, + use_llm_repository_url: bool, report: str, report_filename: str, ping: bool, 
log_level: str, git_cache: str, ignore_refs: bool, + llm: str, ): self.vuln_id = vuln_id self.repository = repository + self.use_llm_repository_url = use_llm_repository_url + self.llm = llm self.preprocess_only = preprocess_only self.pub_date = pub_date self.description = description @@ -190,27 +263,39 @@ def get_configuration(argv): args = parse_cli_args(argv) conf = parse_config_file(args.config) if conf is None: - sys.exit("No configuration file found") - return Config( - vuln_id=args.vuln_id, - repository=args.repository, - preprocess_only=args.preprocess_only or conf.preprocess_only, - pub_date=args.pub_date, - description=args.description, - modified_files=args.modified_files, - keywords=args.keywords, - max_candidates=args.max_candidates or conf.max_candidates, - # tag_interval=args.tag_interval, - version_interval=args.version_interval, - filter_extensions=args.filter_extensions, - use_nvd=args.use_nvd or conf.use_nvd, - fetch_references=args.fetch_references or conf.fetch_references, - backend=args.backend or conf.backend, - use_backend=args.use_backend or conf.use_backend, - report=args.report or conf.report.format, - report_filename=args.report_filename or conf.report.name, - ping=args.ping, - git_cache=conf.git_cache, - log_level=args.log_level or conf.log_level, - ignore_refs=args.ignore_refs, - ) + sys.exit( + "No configuration file found, or error in configuration file. Check logs." 
+ ) + try: + config = Config( + vuln_id=args.vuln_id, + repository=args.repository, + use_llm_repository_url=conf.use_llm_repository_url, + llm=conf.llm_service, + preprocess_only=args.preprocess_only or conf.preprocess_only, + pub_date=args.pub_date, + description=args.description, + modified_files=args.modified_files, + keywords=args.keywords, + max_candidates=args.max_candidates or conf.max_candidates, + # tag_interval=args.tag_interval, + version_interval=args.version_interval, + filter_extensions=args.filter_extensions, + use_nvd=args.use_nvd or conf.use_nvd, + fetch_references=args.fetch_references or conf.fetch_references, + backend=args.backend or conf.backend, + use_backend=args.use_backend or conf.use_backend, + report=args.report or conf.report.format, + report_filename=args.report_filename or conf.report.name, + ping=args.ping, + git_cache=conf.git_cache, + log_level=args.log_level or conf.log_level, + ignore_refs=args.ignore_refs, + ) + return config + except MissingMandatoryValue as e: + logger.error(e) + sys.exit(f"'{e.full_key}' is missing in {args.config}.") + except Exception as e: + logger.error(f"Error in {args.config}: {e}.") + sys.exit(f"Error in {args.config}. Check logs.") diff --git a/prospector/util/http.py b/prospector/util/http.py index cfdb78f7d..5f3678594 100644 --- a/prospector/util/http.py +++ b/prospector/util/http.py @@ -9,6 +9,20 @@ def fetch_url(url: str, params=None, extract_text=True) -> Union[str, BeautifulSoup]: + """ + Fetches the content of a web page located at the specified URL and optionally extracts text from it. + + Parameters: + - url (str): The URL of the web page to fetch. + - params (dict, optional): Optional parameters to be sent with the request (default: None). + - extract_text (bool, optional): Whether to extract text content from the HTML (default: True). + + Returns: + - Union[str, BeautifulSoup]: If `extract_text` is True, returns the text content of the web page as a string. 
+ If `extract_text` is False, returns the parsed HTML content as a BeautifulSoup object. + + If an exception occurs during the HTTP request, `None` is returned. + """ try: session = requests_cache.CachedSession("requests-cache", expire_after=604800) if params is None: @@ -17,7 +31,7 @@ def fetch_url(url: str, params=None, extract_text=True) -> Union[str, BeautifulS content = session.get(url, params=params).content except Exception: logger.debug(f"cannot retrieve url content: {url}", exc_info=True) - return "" + return None soup = BeautifulSoup(content, "html.parser") if extract_text: From 6b872f958cc43f565527a830ccc3a3324487704b Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 7 Jun 2024 12:26:13 +0000 Subject: [PATCH 07/83] implements PR feedback changes --- prospector/cli/main.py | 2 +- prospector/core/prospector.py | 2 +- ...m_operations.py => model_instantiation.py} | 59 ----------------- prospector/llm/operations.py | 64 +++++++++++++++++++ prospector/llm/test_llm.py | 2 +- 5 files changed, 67 insertions(+), 62 deletions(-) rename prospector/llm/{llm_operations.py => model_instantiation.py} (64%) create mode 100644 prospector/llm/operations.py diff --git a/prospector/cli/main.py b/prospector/cli/main.py index f619995df..9755b0a68 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -7,7 +7,7 @@ from dotenv import load_dotenv -import llm.llm_operations as llm +import llm.operations as llm from util.http import ping_backend path_root = os.getcwd() diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index b740d74d6..a73a74a6d 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -37,7 +37,7 @@ ONE_YEAR = 365 * SECS_PER_DAY MAX_CANDIDATES = 2000 -DEFAULT_BACKEND = "http://backend:8000" +DEFAULT_BACKEND = "http://localhost:8000" core_statistics = execution_statistics.sub_collection("core") diff --git a/prospector/llm/llm_operations.py b/prospector/llm/model_instantiation.py similarity 
index 64% rename from prospector/llm/llm_operations.py rename to prospector/llm/model_instantiation.py index 2023bf168..f1960d5c6 100644 --- a/prospector/llm/llm_operations.py +++ b/prospector/llm/model_instantiation.py @@ -1,17 +1,12 @@ -import sys from typing import Dict -import validators from dotenv import dotenv_values from langchain_core.language_models.llms import LLM from langchain_google_vertexai import ChatVertexAI from langchain_mistralai import ChatMistralAI from langchain_openai import ChatOpenAI -from cli.console import ConsoleWriter, MessageStatus -from datamodel.advisory import get_from_mitre from llm.models import Gemini, Mistral, OpenAI -from llm.prompts import best_guess from log.logger import logger @@ -112,57 +107,3 @@ def create_third_party_provider(model_name: str): raise ValueError(f"Problem when initialising model: {e}") return model - - -def get_repository_url(llm_config: Dict, vuln_id: str): - """Ask an LLM to obtain the repository URL given the advisory description and references. - - Args: - llm_config (dict): A dictionary containing the configuration for the LLM. Expected keys are: - - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for - external providers). - - 'model_name' (str): Which model to use, e.g. gpt-4. - vuln_id: The ID of the advisory, e.g. CVE-2020-1925. - - Returns: - The repository URL as a string. - - Raises: - ValueError if advisory information cannot be obtained or there is an error in the model invocation. 
- """ - with ConsoleWriter("Invoking LLM") as console: - details, _ = get_from_mitre(vuln_id) - if details is None: - logger.error("Error when getting advisory information from Mitre.") - console.print( - "Error when getting advisory information from Mitre.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - - try: - model = create_model_instance(llm_config=llm_config) - chain = best_guess | model - - url = chain.invoke( - { - "description": details["descriptions"][0]["value"], - "references": details["references"], - } - ) - if not validators.url(url): - logger.error(f"LLM returned invalid URL: {url}") - console.print( - f"LLM returned invalid URL: {url}", - status=MessageStatus.ERROR, - ) - sys.exit(1) - except Exception as e: - logger.error(f"Prompt-model chain could not be invoked: {e}") - console.print( - "Prompt-model chain could not be invoked.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - - return url diff --git a/prospector/llm/operations.py b/prospector/llm/operations.py new file mode 100644 index 000000000..39a74ae27 --- /dev/null +++ b/prospector/llm/operations.py @@ -0,0 +1,64 @@ +import sys +from typing import Dict + +import validators + +from cli.console import ConsoleWriter, MessageStatus +from datamodel.advisory import get_from_mitre +from llm.model_instantiation import create_model_instance +from llm.prompts import best_guess +from log.logger import logger + + +def get_repository_url(llm_config: Dict, vuln_id: str): + """Ask an LLM to obtain the repository URL given the advisory description and references. + + Args: + llm_config (dict): A dictionary containing the configuration for the LLM. Expected keys are: + - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for + external providers). + - 'model_name' (str): Which model to use, e.g. gpt-4. + vuln_id: The ID of the advisory, e.g. CVE-2020-1925. + + Returns: + The repository URL as a string. 
+ + Raises: + SystemExit: If advisory information cannot be obtained, the model invocation fails, or the LLM returns an invalid URL. + """ + with ConsoleWriter("Invoking LLM") as console: + details, _ = get_from_mitre(vuln_id) + if details is None: + logger.error("Error when getting advisory information from Mitre.") + console.print( + "Error when getting advisory information from Mitre.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + + try: + model = create_model_instance(llm_config=llm_config) + chain = best_guess | model + + url = chain.invoke( + { + "description": details["descriptions"][0]["value"], + "references": details["references"], + } + ) + if not validators.url(url): + logger.error(f"LLM returned invalid URL: {url}") + console.print( + f"LLM returned invalid URL: {url}", + status=MessageStatus.ERROR, + ) + sys.exit(1) + except Exception as e: + logger.error(f"Prompt-model chain could not be invoked: {e}") + console.print( + "Prompt-model chain could not be invoked.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + + return url diff --git a/prospector/llm/test_llm.py b/prospector/llm/test_llm.py index abb3fe658..9ef091659 100644 --- a/prospector/llm/test_llm.py +++ b/prospector/llm/test_llm.py @@ -2,8 +2,8 @@ import requests from langchain_openai import ChatOpenAI -from llm.llm_operations import create_model_instance, get_repository_url from llm.models import Gemini, Mistral, OpenAI +from llm.operations import create_model_instance, get_repository_url # Mock the llm_service configuration object From 303de9cd7dd1c7393d67306cec039718a04fdf74 Mon Sep 17 00:00:00 2001 From: I748376 Date: Tue, 11 Jun 2024 08:30:22 +0000 Subject: [PATCH 08/83] adds temperature as optional parameter in config --- prospector/cli/main.py | 2 +- prospector/config-sample.yaml | 1 + prospector/llm/model_instantiation.py | 15 +++++++++++---- prospector/llm/models.py | 8 +++++--- prospector/util/config_parser.py | 13 +++++++------ 5 files changed, 25 insertions(+), 14 deletions(-) diff --git 
a/prospector/cli/main.py b/prospector/cli/main.py index 9755b0a68..a51fcdb70 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -79,7 +79,7 @@ def main(argv): # noqa: C901 # whether to use LLM support if not config.repository: config.repository = llm.get_repository_url( - llm_config=config.llm, vuln_id=config.vuln_id + llm_config=config.llm_service, vuln_id=config.vuln_id ) results, advisory_record = prospector( diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index b1dc8c1c1..0c79dbb04 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -32,6 +32,7 @@ redis_url: redis://redis:6379/0 llm_service: type: sap model_name: gpt-4-turbo + # temperature: 0.0 # optional, default is 0.0 use_llm_repository_url: True # whether to use LLM's to obtain the repository URL diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index f1960d5c6..2ca1560f1 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -50,12 +50,13 @@ def create_model_instance(llm_config) -> LLM: - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for external providers). - 'model_name' (str): Which model to use, e.g. gpt-4. + - 'temperature' (Optional(float)): The temperature for the model, default 0.0. Returns: LLM: An instance of the specified LLM model. 
""" - def create_sap_provider(model_name: str): + def create_sap_provider(model_name: str, temperature: float): d = SAP_MAPPING.get(model_name, None) if d is None: @@ -64,11 +65,12 @@ def create_sap_provider(model_name: str): model = d._class( model_name=model_name, deployment_url=d.access_info, + temperature=temperature, ) return model - def create_third_party_provider(model_name: str): + def create_third_party_provider(model_name: str, temperature: float): # obtain definition from main mapping d = THIRD_PARTY_MAPPING.get(model_name, None) @@ -79,6 +81,7 @@ def create_third_party_provider(model_name: str): model = d._class( model=model_name, api_key=d.access_info, + temperature=temperature, ) return model @@ -92,9 +95,13 @@ def create_third_party_provider(model_name: str): try: match llm_config.type: case "sap": - model = create_sap_provider(llm_config.model_name) + model = create_sap_provider( + llm_config.model_name, llm_config.temperature + ) case "third_party": - model = create_third_party_provider(llm_config.model_name) + model = create_third_party_provider( + llm_config.model_name, llm_config.temperature + ) case _: logger.error( f"Invalid LLM type specified, '{llm_config.type}' is not available." 
diff --git a/prospector/llm/models.py b/prospector/llm/models.py index 9026d3524..cb3fb0d81 100644 --- a/prospector/llm/models.py +++ b/prospector/llm/models.py @@ -11,6 +11,7 @@ class SAPProvider(LLM): model_name: str deployment_url: str + temperature: float @property def _llm_type(self) -> str: @@ -69,7 +70,8 @@ def _call( "role": "user", "content": f"{prompt}", } - ] + ], + "temperature": self.temperature, } response = requests.post(endpoint, headers=headers, json=data) @@ -99,7 +101,7 @@ def _call( data = { "generation_config": { "maxOutputTokens": 1000, - "temperature": 0.0, + "temperature": self.temperature, }, "contents": [{"role": "user", "parts": [{"text": prompt}]}], "safetySettings": [ @@ -143,7 +145,7 @@ def _call( data = { "model": "mistralai--mixtral-8x7b-instruct-v01", "max_tokens": 100, - "temperature": 0.0, + "temperature": self.temperature, "messages": [{"role": "user", "content": prompt}], } diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index e1f7dfad6..281b17bb3 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -141,7 +141,7 @@ def parse_cli_args(args): def parse_config_file(filename: str = "config.yaml"): if os.path.isfile(filename): logger.info(f"Loading configuration from {filename}") - schema = OmegaConf.structured(MandatoryConfig) + schema = OmegaConf.structured(ConfigSchema) config = OmegaConf.load(filename) try: merged_config = OmegaConf.merge(schema, config) @@ -181,11 +181,12 @@ class ReportConfig: class LLMServiceConfig: type: str model_name: str + temperature: float = 0.0 # Schema class for config.yaml parameters @dataclass -class MandatoryConfig: +class ConfigSchema: redis_url: str = MISSING preprocess_only: bool = MISSING max_candidates: int = MISSING @@ -226,18 +227,18 @@ def __init__( use_backend: str, backend: str, use_llm_repository_url: bool, - report: str, + report: ReportConfig, report_filename: str, ping: bool, log_level: str, git_cache: str, 
ignore_refs: bool, - llm: str, + llm_service: LLMServiceConfig, ): self.vuln_id = vuln_id self.repository = repository self.use_llm_repository_url = use_llm_repository_url - self.llm = llm + self.llm_service = llm_service self.preprocess_only = preprocess_only self.pub_date = pub_date self.description = description @@ -271,7 +272,7 @@ def get_configuration(argv): vuln_id=args.vuln_id, repository=args.repository, use_llm_repository_url=conf.use_llm_repository_url, - llm=conf.llm_service, + llm_service=conf.llm_service, preprocess_only=args.preprocess_only or conf.preprocess_only, pub_date=args.pub_date, description=args.description, From f8974ac2b35ed1fa484a378f067d34b0c62bc909 Mon Sep 17 00:00:00 2001 From: I748376 Date: Tue, 11 Jun 2024 09:29:39 +0000 Subject: [PATCH 09/83] makes check for correct backend address less brittle and changes use_backend magic strings to constants --- prospector/config-sample.yaml | 5 ++-- prospector/core/prospector.py | 45 ++++++++++++++++++++++++-------- prospector/docker/cli/Dockerfile | 3 +++ 3 files changed, 39 insertions(+), 14 deletions(-) diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 0c79dbb04..7208bc3dd 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -15,8 +15,7 @@ use_nvd: True # Wheter to use a backend or not: "always", "never", "optional" use_backend: optional -# Optional backend info to save/use already preprocessed data -#backend: http://backend:8000 +# Backend address; when in containerised version, use http://backend:8000, otherwise http://localhost:8000 backend: http://localhost:8000 database: @@ -30,7 +29,7 @@ redis_url: redis://redis:6379/0 # LLM Usage (check README for help) llm_service: - type: sap + type: sap # use "sap" or "third_party" model_name: gpt-4-turbo # temperature: 0.0 # optional, default is 0.0 diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index a73a74a6d..a95603aae 100644 --- 
a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -6,6 +6,7 @@ import sys import time from typing import Dict, List, Set, Tuple +from urllib.parse import urlparse import requests from tqdm import tqdm @@ -38,6 +39,9 @@ MAX_CANDIDATES = 2000 DEFAULT_BACKEND = "http://localhost:8000" +USE_BACKEND_ALWAYS = "always" +USE_BACKEND_OPTIONAL = "optional" +USE_BACKEND_NEVER = "never" core_statistics = execution_statistics.sub_collection("core") @@ -58,7 +62,7 @@ def prospector( # noqa: C901 use_nvd: bool = True, nvd_rest_endpoint: str = "", backend_address: str = DEFAULT_BACKEND, - use_backend: str = "always", + use_backend: str = USE_BACKEND_ALWAYS, git_cache: str = "/tmp/git_cache", limit_candidates: int = MAX_CANDIDATES, rules: List[str] = ["ALL"], @@ -146,7 +150,7 @@ def prospector( # noqa: C901 ) as timer: with ConsoleWriter("\nProcessing commits") as writer: try: - if use_backend != "never": + if use_backend != USE_BACKEND_NEVER: missing, preprocessed_commits = retrieve_preprocessed_commits( repository_url, backend_address, @@ -157,17 +161,13 @@ def prospector( # noqa: C901 "Backend not reachable", exc_info=get_level() < logging.WARNING, ) - if use_backend == "always": - if backend_address == "http://localhost:8000" and os.path.exists( - "/.dockerenv" - ): + if use_backend == USE_BACKEND_ALWAYS: + if not is_correct_backend_url(backend_address): print( - "The backend address should be 'http://backend:8000' when running the containerised version of Prospector: aborting" + "The backend address should be 'backend:8000' when running the containerised version of Prospector, and 'localhost:8000' otherwise: Aborting." 
) - else: - print("Backend not reachable: aborting") sys.exit(1) - print("Backend not reachable: continuing") + print("Backend not reachable: Continuing.") if "missing" not in locals(): missing = list(candidates.values()) @@ -202,7 +202,7 @@ def prospector( # noqa: C901 payload = [c.to_dict() for c in preprocessed_commits] - if len(payload) > 0 and use_backend != "never" and len(missing) > 0: + if len(payload) > 0 and use_backend != USE_BACKEND_NEVER and len(missing) > 0: save_preprocessed_commits(backend_address, payload) else: logger.warning("Preprocessed commits are not being sent to backend") @@ -428,6 +428,29 @@ def get_commits_no_tags(repository: Git, commit_ids: List[str]): return candidates +def is_correct_backend_url(backend_url: str) -> bool: + """Returns True if the backend URL set in the config file matches the way prospector is run. Returns False if + - Prospector is run containerised and backend_url is not 'backend:8000' + - Prospector is run locally and backend_url is not 'localhost:8000' + """ + parsed_config_url = urlparse(backend_url) + parsed_default_url = urlparse(DEFAULT_BACKEND) + + if parsed_config_url.port != 8000: + return False + + in_container = os.environ.get("IN_CONTAINER", "") == "1" + + if in_container: + if parsed_config_url.hostname != "backend": + return False + else: + if parsed_config_url.hostname != parsed_default_url.hostname: + return False + + return True + + # def prospector_find_twins( # advisory_record: AdvisoryRecord, # repository: Git, diff --git a/prospector/docker/cli/Dockerfile b/prospector/docker/cli/Dockerfile index 50df300d2..058e4bcdf 100644 --- a/prospector/docker/cli/Dockerfile +++ b/prospector/docker/cli/Dockerfile @@ -21,4 +21,7 @@ WORKDIR /clirun VOLUME ["/clirun"] ENV PYTHONPATH "${PYTHONPATH}:/clirun" +# check if Prospector is running containerised +ENV IN_CONTAINER=1 + ENTRYPOINT [ "python","cli/main.py" ] From 6f5ab2b55ed901c5cc2505507e26ae606abc12e9 Mon Sep 17 00:00:00 2001 From: I748376 Date: Tue, 11 
Jun 2024 15:32:56 +0000 Subject: [PATCH 10/83] changes code structure now model gets instantiated and llm functions can be called with this model. This is because only one instantiation of the model is needed throughout the whole runtime of prospector --- prospector/cli/main.py | 10 ++++++---- prospector/llm/operations.py | 10 +++------- 2 files changed, 9 insertions(+), 11 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index a51fcdb70..95b5ef723 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -8,6 +8,7 @@ from dotenv import load_dotenv import llm.operations as llm +from llm.model_instantiation import create_model_instance from util.http import ping_backend path_root = os.getcwd() @@ -54,6 +55,10 @@ def main(argv): # noqa: C901 ) return + # instantiate LLM model if set in config.yaml + if config.llm_service: + model = create_model_instance(llm_config=config.llm_service) + if not config.repository and not config.use_llm_repository_url: logger.error( "Either provide the repository URL or allow LLM usage to obtain it." 
@@ -76,11 +81,8 @@ def main(argv): # noqa: C901 logger.debug("Vulnerability ID: " + config.vuln_id) - # whether to use LLM support if not config.repository: - config.repository = llm.get_repository_url( - llm_config=config.llm_service, vuln_id=config.vuln_id - ) + config.repository = llm.get_repository_url(model=model, vuln_id=config.vuln_id) results, advisory_record = prospector( vulnerability_id=config.vuln_id, diff --git a/prospector/llm/operations.py b/prospector/llm/operations.py index 39a74ae27..f157e8590 100644 --- a/prospector/llm/operations.py +++ b/prospector/llm/operations.py @@ -2,22 +2,19 @@ from typing import Dict import validators +from langchain_core.language_models.llms import LLM from cli.console import ConsoleWriter, MessageStatus from datamodel.advisory import get_from_mitre -from llm.model_instantiation import create_model_instance from llm.prompts import best_guess from log.logger import logger -def get_repository_url(llm_config: Dict, vuln_id: str): +def get_repository_url(model: LLM, vuln_id: str): """Ask an LLM to obtain the repository URL given the advisory description and references. Args: - llm_config (dict): A dictionary containing the configuration for the LLM. Expected keys are: - - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for - external providers). - - 'model_name' (str): Which model to use, e.g. gpt-4. + model (LLM): The instantiated model (instantiated with create_model_instance()) vuln_id: The ID of the advisory, e.g. CVE-2020-1925. 
Returns: @@ -37,7 +34,6 @@ def get_repository_url(llm_config: Dict, vuln_id: str): sys.exit(1) try: - model = create_model_instance(llm_config=llm_config) chain = best_guess | model url = chain.invoke( From 591e2719f1fd5bbedd3224c174c24495e5df4eeb Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Thu, 20 Jun 2024 14:25:52 +0200 Subject: [PATCH 11/83] Update prospector/README.md with FixFinder information --- prospector/README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/prospector/README.md b/prospector/README.md index f7729c7a8..1d0cf72c5 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -213,9 +213,14 @@ done in partial fulfillment of the requirements for the degree of Master of Science in Data Science & Entrepreneurship at the Jheronimus Academy of Data Science during a graduation internship at SAP. +The source code of FixFinder can be obtained by checking out the tag [DAAN_HOMMERSOM_THESIS](https://github.com/SAP/project-kb/releases/tag/DAAN_HOMMERSOM_THESIS). + The main difference between FixFinder and Prospector (which has been implemented from scratch) is that the former takes a definite data-driven approach and trains a ML model to perform the ranking, -whereas the latter applies hand-crafted rules to assign a relevance score to each candidate commit. +whereas the latter is based on hand-crafted rules to assign a relevance score to each candidate commit. + +Recent versions of Prospector (2024) also use AI/ML; still that is done through suitable rules +that are based on the outcome of suitable requests to LLMs. 
The paper that describes FixFinder can be cited as follows: From bad9ed4b305e174e13cd1d1f8f78d9f2a7090910 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 13 Jun 2024 15:27:02 +0000 Subject: [PATCH 12/83] adjusts tests --- prospector/llm/test_llm.py | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/prospector/llm/test_llm.py b/prospector/llm/test_llm.py index 9ef091659..5d8cb9c1b 100644 --- a/prospector/llm/test_llm.py +++ b/prospector/llm/test_llm.py @@ -2,18 +2,21 @@ import requests from langchain_openai import ChatOpenAI -from llm.models import Gemini, Mistral, OpenAI -from llm.operations import create_model_instance, get_repository_url +from llm.model_instantiation import create_model_instance +from llm.models import OpenAI +from llm.operations import get_repository_url # Mock the llm_service configuration object class Config: type: str = None model_name: str = None + temperature: str = None - def __init__(self, type, model_name): + def __init__(self, type, model_name, temperature): self.type = type self.model_name = model_name + self.temperature = temperature # Vulnerability ID @@ -22,27 +25,28 @@ def __init__(self, type, model_name): class TestModel: def test_sap_gpt35_instantiation(self): - config = Config("sap", "gpt-35-turbo") + config = Config("sap", "gpt-35-turbo", "0.0") model = create_model_instance(config) assert isinstance(model, OpenAI) def test_sap_gpt4_instantiation(self): - config = Config("sap", "gpt-4") + config = Config("sap", "gpt-4", "0.0") model = create_model_instance(config) assert isinstance(model, OpenAI) def test_thirdparty_gpt35_instantiation(self): - config = Config("third_party", "gpt-3.5-turbo") + config = Config("third_party", "gpt-3.5-turbo", "0.0") model = create_model_instance(config) assert isinstance(model, ChatOpenAI) def test_thirdparty_gpt4_instantiation(self): - config = Config("third_party", "gpt-4") + config = Config("third_party", "gpt-4", "0.0") model = 
create_model_instance(config) assert isinstance(model, ChatOpenAI) def test_invoke_fail(self): with pytest.raises(SystemExit): - config = Config("sap", "gpt-35-turbo") + config = Config("sap", "gpt-35-turbo", "0.0") + model = create_model_instance(config) vuln_id = "random" - get_repository_url(llm_config=config, vuln_id=vuln_id) + get_repository_url(model=model, vuln_id=vuln_id) From 2bb83bc9f4682ad254101df1a5a866da4968e72f Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 12 Jun 2024 14:00:34 +0000 Subject: [PATCH 13/83] basic refactoring mvp --- prospector/cli/main.py | 21 +------ prospector/core/prospector.py | 14 ++++- prospector/datamodel/advisory.py | 20 ++++++- prospector/llm/llm_service.py | 58 ++++++++++++++++++ prospector/llm/operations.py | 60 ------------------- .../llm/{test_llm.py => test_llm_service.py} | 5 ++ prospector/util/config_parser.py | 5 +- 7 files changed, 96 insertions(+), 87 deletions(-) create mode 100644 prospector/llm/llm_service.py delete mode 100644 prospector/llm/operations.py rename prospector/llm/{test_llm.py => test_llm_service.py} (86%) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index 95b5ef723..8af48c579 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -7,7 +7,6 @@ from dotenv import load_dotenv -import llm.operations as llm from llm.model_instantiation import create_model_instance from util.http import ping_backend @@ -55,20 +54,6 @@ def main(argv): # noqa: C901 ) return - # instantiate LLM model if set in config.yaml - if config.llm_service: - model = create_model_instance(llm_config=config.llm_service) - - if not config.repository and not config.use_llm_repository_url: - logger.error( - "Either provide the repository URL or allow LLM usage to obtain it." 
- ) - console.print( - "Either provide the repository URL or allow LLM usage to obtain it.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - # if config.ping: # return ping_backend(backend, get_level() < logging.INFO) @@ -81,12 +66,9 @@ def main(argv): # noqa: C901 logger.debug("Vulnerability ID: " + config.vuln_id) - if not config.repository: - config.repository = llm.get_repository_url(model=model, vuln_id=config.vuln_id) - results, advisory_record = prospector( vulnerability_id=config.vuln_id, - repository_url=config.repository, + repository_url=config.repository, # LASCHA: change to None publication_date=config.pub_date, vuln_descr=config.description, version_interval=config.version_interval, @@ -99,6 +81,7 @@ def main(argv): # noqa: C901 git_cache=config.git_cache, limit_candidates=config.max_candidates, # ignore_adv_refs=config.ignore_refs, + llm_service_config=config.llm_service, ) if config.preprocess_only: diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index a95603aae..a62ca5620 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -18,6 +18,7 @@ from git.git import Git from git.raw_commit import RawCommit from git.version_to_tag import get_possible_tags +from llm.llm_service import LLMService from log.logger import get_level, logger, pretty_log from rules.rules import apply_rules from stats.execution import ( @@ -68,6 +69,7 @@ def prospector( # noqa: C901 rules: List[str] = ["ALL"], tag_commits: bool = True, silent: bool = False, + llm_service_config=None, ) -> Tuple[List[Commit], AdvisoryRecord] | Tuple[int, int]: if silent: logger.disabled = True @@ -75,24 +77,30 @@ def prospector( # noqa: C901 logger.debug("begin main commit and CVE processing") + # instantiate LLM model if needed: + if llm_service_config and (llm_service_config.use_llm_repository_url): + llm_service = LLMService(llm_service_config) + # construct an advisory record with ConsoleWriter("Processing advisory") as console: 
advisory_record = build_advisory_record( vulnerability_id, vuln_descr, + repository_url, nvd_rest_endpoint, use_nvd, publication_date, set(advisory_keywords), set(modified_files), + llm_service if llm_service_config.use_llm_repository_url else None, ) if advisory_record is None: return None, -1 - fixing_commit = advisory_record.get_fixing_commit(repository_url) + fixing_commit = advisory_record.get_fixing_commit() # print(advisory_record.references) # obtain a repository object - repository = Git(repository_url, git_cache) + repository = Git(advisory_record.repository_url, git_cache) with ConsoleWriter("Git repository cloning") as console: logger.debug(f"Downloading repository {repository.url} in {repository.path}") @@ -152,7 +160,7 @@ def prospector( # noqa: C901 try: if use_backend != USE_BACKEND_NEVER: missing, preprocessed_commits = retrieve_preprocessed_commits( - repository_url, + advisory_record.repository_url, backend_address, candidates, ) diff --git a/prospector/datamodel/advisory.py b/prospector/datamodel/advisory.py index 15569b35b..4f8ae8e44 100644 --- a/prospector/datamodel/advisory.py +++ b/prospector/datamodel/advisory.py @@ -10,6 +10,7 @@ import validators from dateutil.parser import isoparse +from llm.llm_service import LLMService from log.logger import get_level, logger, pretty_log from util.http import extract_from_webpage, fetch_url, get_urls @@ -69,6 +70,7 @@ def __init__( reserved_timestamp: int = 0, published_timestamp: int = 0, updated_timestamp: int = 0, + repository_url: str = None, references: DefaultDict[str, int] = None, affected_products: List[str] = None, versions: Dict[str, List[str]] = None, @@ -81,6 +83,7 @@ def __init__( self.reserved_timestamp = reserved_timestamp self.published_timestamp = published_timestamp self.updated_timestamp = updated_timestamp + self.repository_url = repository_url self.references = references or defaultdict(lambda: 0) self.affected_products = affected_products or list() self.versions = versions or 
dict() @@ -176,7 +179,7 @@ def parse_advisory(self, data): ] self.versions["fixed"] = [v for v in self.versions["fixed"] if v is not None] - def get_fixing_commit(self, repository) -> List[str]: + def get_fixing_commit(self) -> List[str]: self.references = dict( sorted(self.references.items(), key=lambda item: item[1], reverse=True) ) @@ -315,11 +318,13 @@ def get_from_local(vuln_id: str, nvd_rest_endpoint: str = LOCAL_NVD_REST_ENDPOIN def build_advisory_record( cve_id: str, description: Optional[str] = None, + repository_url: str = None, nvd_rest_endpoint: Optional[str] = None, use_nvd: bool = True, publication_date: Optional[str] = None, advisory_keywords: Set[str] = set(), modified_files: Optional[str] = None, + llm_service: LLMService = None, ) -> AdvisoryRecord: advisory_record = AdvisoryRecord( cve_id=cve_id, @@ -335,6 +340,19 @@ def build_advisory_record( ) return None + # Get repository URL if not given by user + if llm_service: + try: + advisory_record.repository_url = llm_service.get_repository_url( + advisory_record.description, advisory_record.references + ) + except Exception as e: # LASCHA: understand this error + logger.error( + "URL returned by LLM was not valid.", + exc_info=get_level() < logging.INFO, + ) + return None + pretty_log(logger, advisory_record) advisory_record.analyze() diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py new file mode 100644 index 000000000..62f00fa04 --- /dev/null +++ b/prospector/llm/llm_service.py @@ -0,0 +1,58 @@ +import sys +from typing import Dict + +import validators +from langchain_core.language_models.llms import LLM + +from cli.console import ConsoleWriter, MessageStatus +from llm.model_instantiation import create_model_instance +from llm.prompts import best_guess +from log.logger import logger + + +class LLMService: + model: LLM + + def __init__(self, config): + self.model = create_model_instance(config) + + def get_repository_url(self, advisory_description, advisory_references): + 
"""Ask an LLM to obtain the repository URL given the advisory description and references. + + Args: + advisory_description (str): The advisory description + advisory_references (dict): The advisory's references + + Returns: + The repository URL as a string. + + Raises: + ValueError if advisory information cannot be obtained or there is an error in the model invocation. + """ + with ConsoleWriter("Invoking LLM") as console: + + try: + chain = best_guess | self.model + + url = chain.invoke( + { + "description": advisory_description, + "references": advisory_references, + } + ) + if not validators.url(url): + logger.error(f"LLM returned invalid URL: {url}") + console.print( + f"LLM returned invalid URL: {url}", + status=MessageStatus.ERROR, + ) + sys.exit(1) + except Exception as e: + logger.error(f"Prompt-model chain could not be invoked: {e}") + console.print( + "Prompt-model chain could not be invoked.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + + return url diff --git a/prospector/llm/operations.py b/prospector/llm/operations.py deleted file mode 100644 index f157e8590..000000000 --- a/prospector/llm/operations.py +++ /dev/null @@ -1,60 +0,0 @@ -import sys -from typing import Dict - -import validators -from langchain_core.language_models.llms import LLM - -from cli.console import ConsoleWriter, MessageStatus -from datamodel.advisory import get_from_mitre -from llm.prompts import best_guess -from log.logger import logger - - -def get_repository_url(model: LLM, vuln_id: str): - """Ask an LLM to obtain the repository URL given the advisory description and references. - - Args: - model (LLM): The instantiated model (instantiated with create_model_instance()) - vuln_id: The ID of the advisory, e.g. CVE-2020-1925. - - Returns: - The repository URL as a string. - - Raises: - ValueError if advisory information cannot be obtained or there is an error in the model invocation. 
- """ - with ConsoleWriter("Invoking LLM") as console: - details, _ = get_from_mitre(vuln_id) - if details is None: - logger.error("Error when getting advisory information from Mitre.") - console.print( - "Error when getting advisory information from Mitre.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - - try: - chain = best_guess | model - - url = chain.invoke( - { - "description": details["descriptions"][0]["value"], - "references": details["references"], - } - ) - if not validators.url(url): - logger.error(f"LLM returned invalid URL: {url}") - console.print( - f"LLM returned invalid URL: {url}", - status=MessageStatus.ERROR, - ) - sys.exit(1) - except Exception as e: - logger.error(f"Prompt-model chain could not be invoked: {e}") - console.print( - "Prompt-model chain could not be invoked.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - - return url diff --git a/prospector/llm/test_llm.py b/prospector/llm/test_llm_service.py similarity index 86% rename from prospector/llm/test_llm.py rename to prospector/llm/test_llm_service.py index 5d8cb9c1b..3cde0a78d 100644 --- a/prospector/llm/test_llm.py +++ b/prospector/llm/test_llm_service.py @@ -2,9 +2,14 @@ import requests from langchain_openai import ChatOpenAI +<<<<<<< HEAD:prospector/llm/test_llm.py from llm.model_instantiation import create_model_instance from llm.models import OpenAI from llm.operations import get_repository_url +======= +from llm.llm_service import create_model_instance, get_repository_url +from llm.models import Gemini, Mistral, OpenAI +>>>>>>> 376db10 (basic refactoring mvp):prospector/llm/test_llm_service.py # Mock the llm_service configuration object diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index 281b17bb3..d110de1b8 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -181,6 +181,7 @@ class ReportConfig: class LLMServiceConfig: type: str model_name: str + use_llm_repository_url: bool temperature: float = 0.0 
@@ -194,7 +195,6 @@ class ConfigSchema: use_nvd: bool = MISSING use_backend: str = MISSING backend: str = MISSING - use_llm_repository_url: bool = MISSING report: ReportConfig = MISSING log_level: str = MISSING git_cache: str = MISSING @@ -226,7 +226,6 @@ def __init__( fetch_references: bool, use_backend: str, backend: str, - use_llm_repository_url: bool, report: ReportConfig, report_filename: str, ping: bool, @@ -237,7 +236,6 @@ def __init__( ): self.vuln_id = vuln_id self.repository = repository - self.use_llm_repository_url = use_llm_repository_url self.llm_service = llm_service self.preprocess_only = preprocess_only self.pub_date = pub_date @@ -271,7 +269,6 @@ def get_configuration(argv): config = Config( vuln_id=args.vuln_id, repository=args.repository, - use_llm_repository_url=conf.use_llm_repository_url, llm_service=conf.llm_service, preprocess_only=args.preprocess_only or conf.preprocess_only, pub_date=args.pub_date, From 131dc92f6cfd95fd83d933c46dc555297aa35c83 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 13 Jun 2024 14:30:56 +0000 Subject: [PATCH 14/83] I can run the code, but the singleton pattern implemented with metaclasses still doesn't work --- prospector/cli/main.py | 4 +--- prospector/core/prospector.py | 6 +----- prospector/datamodel/advisory.py | 10 ++++++---- prospector/llm/llm_service.py | 17 ++++++++++++++--- 4 files changed, 22 insertions(+), 15 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index 8af48c579..2621de1d7 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -1,5 +1,4 @@ #!/usr/bin/python3 -import logging import os import signal import sys @@ -7,7 +6,6 @@ from dotenv import load_dotenv -from llm.model_instantiation import create_model_instance from util.http import ping_backend path_root = os.getcwd() @@ -68,7 +66,7 @@ def main(argv): # noqa: C901 results, advisory_record = prospector( vulnerability_id=config.vuln_id, - repository_url=config.repository, # LASCHA: change to None + 
repository_url=config.repository, publication_date=config.pub_date, vuln_descr=config.description, version_interval=config.version_interval, diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index a62ca5620..c54024a1a 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -77,10 +77,6 @@ def prospector( # noqa: C901 logger.debug("begin main commit and CVE processing") - # instantiate LLM model if needed: - if llm_service_config and (llm_service_config.use_llm_repository_url): - llm_service = LLMService(llm_service_config) - # construct an advisory record with ConsoleWriter("Processing advisory") as console: advisory_record = build_advisory_record( @@ -92,7 +88,7 @@ def prospector( # noqa: C901 publication_date, set(advisory_keywords), set(modified_files), - llm_service if llm_service_config.use_llm_repository_url else None, + llm_service_config if llm_service_config.use_llm_repository_url else None, ) if advisory_record is None: return None, -1 diff --git a/prospector/datamodel/advisory.py b/prospector/datamodel/advisory.py index 4f8ae8e44..fb22589ad 100644 --- a/prospector/datamodel/advisory.py +++ b/prospector/datamodel/advisory.py @@ -324,7 +324,7 @@ def build_advisory_record( publication_date: Optional[str] = None, advisory_keywords: Set[str] = set(), modified_files: Optional[str] = None, - llm_service: LLMService = None, + llm_service_config=None, ) -> AdvisoryRecord: advisory_record = AdvisoryRecord( cve_id=cve_id, @@ -341,15 +341,17 @@ def build_advisory_record( return None # Get repository URL if not given by user - if llm_service: + if llm_service_config: + # instantiate LLM model if needed + llm_service = LLMService(llm_service_config) try: advisory_record.repository_url = llm_service.get_repository_url( advisory_record.description, advisory_record.references ) - except Exception as e: # LASCHA: understand this error + except Exception as e: logger.error( "URL returned by LLM was not valid.", - 
exc_info=get_level() < logging.INFO, + exc_info=get_level() < logging.INFO, # LASCHA: understand this error ) return None diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 62f00fa04..8e9efc464 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -10,11 +10,22 @@ from log.logger import logger -class LLMService: - model: LLM +class Singleton(type): + _instances = {} + + def __call__(cls, *args, **kwargs): + if cls not in cls._instances: + cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) + return cls._instances[cls] + + +class LLMService(metaclass=Singleton): + _instance = None def __init__(self, config): - self.model = create_model_instance(config) + if not hasattr(self, "_initialized"): + self._model: LLM = create_model_instance(config) + self._initliazed = True def get_repository_url(self, advisory_description, advisory_references): """Ask an LLM to obtain the repository URL given the advisory description and references. 
From bf13f4c1bbddc2c22274aa417b460c57f03e60e5 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 13 Jun 2024 15:11:16 +0000 Subject: [PATCH 15/83] singleton pattern works --- prospector/core/prospector.py | 51 ++++++++++++++++++++++++++++---- prospector/datamodel/advisory.py | 18 +---------- prospector/llm/llm_service.py | 26 ++++++++-------- 3 files changed, 59 insertions(+), 36 deletions(-) diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index c54024a1a..509474ad0 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -5,7 +5,7 @@ import re import sys import time -from typing import Dict, List, Set, Tuple +from typing import DefaultDict, Dict, List, Set, Tuple from urllib.parse import urlparse import requests @@ -52,7 +52,7 @@ @measure_execution_time(execution_statistics, name="core") def prospector( # noqa: C901 vulnerability_id: str, - repository_url: str, + repository_url: str = None, publication_date: str = "", vuln_descr: str = "", version_interval: str = "", @@ -82,21 +82,23 @@ def prospector( # noqa: C901 advisory_record = build_advisory_record( vulnerability_id, vuln_descr, - repository_url, nvd_rest_endpoint, use_nvd, publication_date, set(advisory_keywords), set(modified_files), - llm_service_config if llm_service_config.use_llm_repository_url else None, ) if advisory_record is None: return None, -1 + repository_url = repository_url or set_repository_url( + llm_service_config, advisory_record.description, advisory_record.references + ) + fixing_commit = advisory_record.get_fixing_commit() # print(advisory_record.references) # obtain a repository object - repository = Git(advisory_record.repository_url, git_cache) + repository = Git(repository_url, git_cache) with ConsoleWriter("Git repository cloning") as console: logger.debug(f"Downloading repository {repository.url} in {repository.path}") @@ -156,7 +158,7 @@ def prospector( # noqa: C901 try: if use_backend != USE_BACKEND_NEVER: missing, 
preprocessed_commits = retrieve_preprocessed_commits( - advisory_record.repository_url, + repository_url, backend_address, candidates, ) @@ -455,6 +457,43 @@ def is_correct_backend_url(backend_url: str) -> bool: return True +def set_repository_url( + config, advisory_description: str, advisory_references: DefaultDict[str, int] +): + """Returns the URL obtained through the LLM. + + Args: + config (LLMServiceConfig): The 'llm_service' configuration block in config.yaml + advisory_description (str): The description of the advisory + advisory_references (dict[str, int]): The references of the advisory + + Returns: + The respository URL as a string. + + Raises: + System Exit if no configuration for the llm_service is given or the LLM returns an invalid URL. + """ # LASCHA: check error flow in this method + if not config: + logger.error( + "No configuration given for model in `config.yaml`.", + exc_info=get_level() < logging.INFO, + ) + sys.exit(1) + llm_service = LLMService(config) + url_from_llm = None + try: + url_from_llm = llm_service.get_repository_url( + advisory_description, advisory_references + ) + except Exception as e: + logger.error( + "URL returned by LLM was not valid.", + exc_info=get_level() < logging.INFO, + ) + sys.exit(1) + return url_from_llm + + # def prospector_find_twins( # advisory_record: AdvisoryRecord, # repository: Git, diff --git a/prospector/datamodel/advisory.py b/prospector/datamodel/advisory.py index fb22589ad..d1e9e49a6 100644 --- a/prospector/datamodel/advisory.py +++ b/prospector/datamodel/advisory.py @@ -136,6 +136,7 @@ def parse_references_from_third_party(self): self.references[self.extract_hashes(ref)] += 2 def get_advisory(self): + """Fills the advisory record with information obtained from an advisory API.""" details, metadata = get_from_mitre(self.cve_id) if metadata is None: raise Exception("MITRE API Error") @@ -318,13 +319,11 @@ def get_from_local(vuln_id: str, nvd_rest_endpoint: str = LOCAL_NVD_REST_ENDPOIN def 
build_advisory_record( cve_id: str, description: Optional[str] = None, - repository_url: str = None, nvd_rest_endpoint: Optional[str] = None, use_nvd: bool = True, publication_date: Optional[str] = None, advisory_keywords: Set[str] = set(), modified_files: Optional[str] = None, - llm_service_config=None, ) -> AdvisoryRecord: advisory_record = AdvisoryRecord( cve_id=cve_id, @@ -340,21 +339,6 @@ def build_advisory_record( ) return None - # Get repository URL if not given by user - if llm_service_config: - # instantiate LLM model if needed - llm_service = LLMService(llm_service_config) - try: - advisory_record.repository_url = llm_service.get_repository_url( - advisory_record.description, advisory_record.references - ) - except Exception as e: - logger.error( - "URL returned by LLM was not valid.", - exc_info=get_level() < logging.INFO, # LASCHA: understand this error - ) - return None - pretty_log(logger, advisory_record) advisory_record.analyze() diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 8e9efc464..5e6144a8f 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -10,22 +10,22 @@ from log.logger import logger -class Singleton(type): - _instances = {} +class Singleton(object): + """Singleton class to ensure that any class inheriting from this one can only be instantiated once.""" - def __call__(cls, *args, **kwargs): - if cls not in cls._instances: - cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) - return cls._instances[cls] + def __new__(cls, *args, **kwargs): + # See if the instance is already in existence, and return it if yes + if not hasattr(cls, "_singleton_instance"): + cls._singleton_instance = super(Singleton, cls).__new__(cls) + return cls._singleton_instance -class LLMService(metaclass=Singleton): - _instance = None - +class LLMService(Singleton): def __init__(self, config): - if not hasattr(self, "_initialized"): - self._model: LLM = create_model_instance(config) - 
self._initliazed = True + if hasattr(self, "_instantiated"): + return + self._instantiated = True + self._model: LLM = create_model_instance(config) def get_repository_url(self, advisory_description, advisory_references): """Ask an LLM to obtain the repository URL given the advisory description and references. @@ -43,7 +43,7 @@ def get_repository_url(self, advisory_description, advisory_references): with ConsoleWriter("Invoking LLM") as console: try: - chain = best_guess | self.model + chain = best_guess | self._model url = chain.invoke( { From 240cb467af73829f65828bd6b068bd555408c3bf Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 13 Jun 2024 15:44:57 +0000 Subject: [PATCH 16/83] move models into their own files --- prospector/llm/model_instantiation.py | 2 +- prospector/llm/models.py | 193 -------------------------- prospector/llm/models/gemini.py | 50 +++++++ prospector/llm/models/mistral.py | 37 +++++ prospector/llm/models/openai.py | 40 ++++++ prospector/llm/models/sap_llm.py | 84 +++++++++++ prospector/llm/test_llm_service.py | 6 - 7 files changed, 212 insertions(+), 200 deletions(-) delete mode 100644 prospector/llm/models.py create mode 100644 prospector/llm/models/gemini.py create mode 100644 prospector/llm/models/mistral.py create mode 100644 prospector/llm/models/openai.py create mode 100644 prospector/llm/models/sap_llm.py diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index 2ca1560f1..45dba28c8 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -6,7 +6,7 @@ from langchain_mistralai import ChatMistralAI from langchain_openai import ChatOpenAI -from llm.models import Gemini, Mistral, OpenAI +from llm.models.models import Gemini, Mistral, OpenAI from log.logger import logger diff --git a/prospector/llm/models.py b/prospector/llm/models.py deleted file mode 100644 index cb3fb0d81..000000000 --- a/prospector/llm/models.py +++ /dev/null @@ -1,193 +0,0 @@ -import json 
-from typing import Any, List, Mapping, Optional - -import requests -from dotenv import dotenv_values -from langchain_core.language_models.llms import LLM - -from log.logger import logger - - -class SAPProvider(LLM): - model_name: str - deployment_url: str - temperature: float - - @property - def _llm_type(self) -> str: - return "custom" - - @property - def _identifying_params(self) -> Mapping[str, Any]: - """Get the identifying parameters.""" - return { - "model_name": self.model_name, - } - - def _call( - self, - prompt: str, - stop: Optional[List[str]] = None, - **kwargs: Any, - ) -> str: - """Run the LLM on the given input. - - Override this method to implement the LLM logic. - - Args: - prompt: The prompt to generate from. - stop: Stop words to use when generating. Model output is cut off at the - first occurrence of any of the stop substrings. - If stop tokens are not supported consider raising NotImplementedError. - run_manager: Callback manager for the run. - **kwargs: Arbitrary additional keyword arguments. These are usually passed - to the model provider API call. - - Returns: - The model output as a string. Actual completions SHOULD NOT include the prompt. - """ - if self.deployment_url is None: - raise ValueError( - "Deployment URL not set. Maybe you forgot to create the environment variable." 
- ) - if stop is not None: - raise ValueError("stop kwargs are not permitted.") - return "" - - -class OpenAI(SAPProvider): - def _call( - self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any - ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, stop, **kwargs) - # Model specific request data - endpoint = f"{self.deployment_url}/chat/completions?api-version=2023-05-15" - headers = get_headers() - data = { - "messages": [ - { - "role": "user", - "content": f"{prompt}", - } - ], - "temperature": self.temperature, - } - - response = requests.post(endpoint, headers=headers, json=data) - - if not response.status_code == 200: - logger.error( - f"Invalid response from AI Core API with error code {response.status_code}" - ) - raise Exception("Invalid response from AI Core API.") - - return self.parse(response.json()) - - def parse(self, message) -> str: - """Parse the returned JSON object from OpenAI.""" - return message["choices"][0]["message"]["content"] - - -class Gemini(SAPProvider): - def _call( - self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any - ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, stop, **kwargs) - # Model specific request data - endpoint = f"{self.deployment_url}/models/{self.model_name}:generateContent" - headers = get_headers() - data = { - "generation_config": { - "maxOutputTokens": 1000, - "temperature": self.temperature, - }, - "contents": [{"role": "user", "parts": [{"text": prompt}]}], - "safetySettings": [ - { - "category": "HARM_CATEGORY_DANGEROUS_CONTENT", - "threshold": "BLOCK_NONE", - }, - { - "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", - "threshold": "BLOCK_NONE", - }, - {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, - {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, - ], - } - - response = requests.post(endpoint, headers=headers, json=data) - - if not response.status_code == 200: - logger.error( 
- f"Invalid response from AI Core API with error code {response.status_code}" - ) - raise Exception("Invalid response from AI Core API.") - - return self.parse(response.json()) - - def parse(self, message) -> str: - """Parse the returned JSON object from OpenAI.""" - return message["candidates"][0]["content"]["parts"][0]["text"] - - -class Mistral(SAPProvider): - def _call( - self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any - ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, stop, **kwargs) - # Model specific request data - endpoint = f"{self.deployment_url}/chat/completions" - headers = get_headers() - data = { - "model": "mistralai--mixtral-8x7b-instruct-v01", - "max_tokens": 100, - "temperature": self.temperature, - "messages": [{"role": "user", "content": prompt}], - } - - response = requests.post(endpoint, headers=headers, json=data) - - if not response.status_code == 200: - logger.error( - f"Invalid response from AI Core API with error code {response.status_code}" - ) - raise Exception("Invalid response from AI Core API.") - - return self.parse(response.json()) - - def parse(self, message) -> str: - """Parse the returned JSON object from OpenAI.""" - return message["choices"][0]["message"]["content"] - - -def get_headers(): - """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. - - Returns: - The headers object needed to send requests to the SAP AI Core. 
- """ - with open(dotenv_values()["AI_CORE_KEY_FILEPATH"]) as f: - sk = json.load(f) - - auth_url = f"{sk['url']}/oauth/token" - client_id = sk["clientid"] - client_secret = sk["clientsecret"] - # api_base_url = f"{sk['serviceurls']['AI_API_URL']}/v2" - - response = requests.post( - auth_url, - data={"grant_type": "client_credentials"}, - auth=(client_id, client_secret), - timeout=8000, - ) - - headers = { - "AI-Resource-Group": "default", - "Content-Type": "application/json", - "Authorization": f"Bearer {response.json()['access_token']}", - } - return headers diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py new file mode 100644 index 000000000..fd253d5de --- /dev/null +++ b/prospector/llm/models/gemini.py @@ -0,0 +1,50 @@ +from typing import Any, List, Optional + +import requests + +from llm.models.sap_llm import SAPLLM, get_headers +from log.logger import logger + + +class Gemini(SAPLLM): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + # Model specific request data + endpoint = f"{self.deployment_url}/models/{self.model_name}:generateContent" + headers = get_headers() + data = { + "generation_config": { + "maxOutputTokens": 1000, + "temperature": self.temperature, + }, + "contents": [{"role": "user", "parts": [{"text": prompt}]}], + "safetySettings": [ + { + "category": "HARM_CATEGORY_DANGEROUS_CONTENT", + "threshold": "BLOCK_NONE", + }, + { + "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", + "threshold": "BLOCK_NONE", + }, + {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, + {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, + ], + } + + response = requests.post(endpoint, headers=headers, json=data) + + if not response.status_code == 200: + logger.error( + f"Invalid response from AI Core API with error code {response.status_code}" + ) + raise 
Exception("Invalid response from AI Core API.") + + return self.parse(response.json()) + + def parse(self, message) -> str: + """Parse the returned JSON object from Gemini.""" + return message["candidates"][0]["content"]["parts"][0]["text"] diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py new file mode 100644 index 000000000..22757ea82 --- /dev/null +++ b/prospector/llm/models/mistral.py @@ -0,0 +1,37 @@ +from typing import Any, List, Optional + +import requests + +from llm.models.sap_llm import SAPLLM, get_headers +from log.logger import logger + + +class Mistral(SAPLLM): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + # Model specific request data + endpoint = f"{self.deployment_url}/chat/completions" + headers = get_headers() + data = { + "model": "mistralai--mixtral-8x7b-instruct-v01", + "max_tokens": 100, + "temperature": self.temperature, + "messages": [{"role": "user", "content": prompt}], + } + + response = requests.post(endpoint, headers=headers, json=data) + + if not response.status_code == 200: + logger.error( + f"Invalid response from AI Core API with error code {response.status_code}" + ) + raise Exception("Invalid response from AI Core API.") + + return self.parse(response.json()) + + def parse(self, message) -> str: + """Parse the returned JSON object from Mistral.""" + return message["choices"][0]["message"]["content"] diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py new file mode 100644 index 000000000..e399683f9 --- /dev/null +++ b/prospector/llm/models/openai.py @@ -0,0 +1,40 @@ +from typing import Any, List, Optional + +import requests + +from llm.models.sap_llm import SAPLLM, get_headers +from log.logger import logger + + +class OpenAI(SAPLLM): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call 
super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + # Model specific request data + endpoint = f"{self.deployment_url}/chat/completions?api-version=2023-05-15" + headers = get_headers() + data = { + "messages": [ + { + "role": "user", + "content": f"{prompt}", + } + ], + "temperature": self.temperature, + } + + response = requests.post(endpoint, headers=headers, json=data) + + if not response.status_code == 200: + logger.error( + f"Invalid response from AI Core API with error code {response.status_code}" + ) + raise Exception("Invalid response from AI Core API.") + + return self.parse(response.json()) + + def parse(self, message) -> str: + """Parse the returned JSON object from OpenAI.""" + return message["choices"][0]["message"]["content"] diff --git a/prospector/llm/models/sap_llm.py b/prospector/llm/models/sap_llm.py new file mode 100644 index 000000000..62ddbf28a --- /dev/null +++ b/prospector/llm/models/sap_llm.py @@ -0,0 +1,84 @@ +import json +from typing import Any, List, Mapping, Optional + +import requests +from dotenv import dotenv_values +from langchain_core.language_models.llms import LLM + +from log.logger import logger + + +class SAPLLM(LLM): + model_name: str + deployment_url: str + temperature: float + + @property + def _llm_type(self) -> str: + return "custom" + + @property + def _identifying_params(self) -> Mapping[str, Any]: + """Get the identifying parameters.""" + return { + "model_name": self.model_name, + } + + def _call( + self, + prompt: str, + stop: Optional[List[str]] = None, + **kwargs: Any, + ) -> str: + """Run the LLM on the given input. + + Override this method to implement the LLM logic. + + Args: + prompt: The prompt to generate from. + stop: Stop words to use when generating. Model output is cut off at the + first occurrence of any of the stop substrings. + If stop tokens are not supported consider raising NotImplementedError. + run_manager: Callback manager for the run. 
+ **kwargs: Arbitrary additional keyword arguments. These are usually passed + to the model provider API call. + + Returns: + The model output as a string. Actual completions SHOULD NOT include the prompt. + """ + if self.deployment_url is None: + raise ValueError( + "Deployment URL not set. Maybe you forgot to create the environment variable." + ) + if stop is not None: + raise ValueError("stop kwargs are not permitted.") + return "" + + +def get_headers(): + """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. + + Returns: + The headers object needed to send requests to the SAP AI Core. + """ + with open(dotenv_values()["AI_CORE_KEY_FILEPATH"]) as f: + sk = json.load(f) + + auth_url = f"{sk['url']}/oauth/token" + client_id = sk["clientid"] + client_secret = sk["clientsecret"] + # api_base_url = f"{sk['serviceurls']['AI_API_URL']}/v2" + + response = requests.post( + auth_url, + data={"grant_type": "client_credentials"}, + auth=(client_id, client_secret), + timeout=8000, + ) + + headers = { + "AI-Resource-Group": "default", + "Content-Type": "application/json", + "Authorization": f"Bearer {response.json()['access_token']}", + } + return headers diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index 3cde0a78d..0937f41ab 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -2,14 +2,8 @@ import requests from langchain_openai import ChatOpenAI -<<<<<<< HEAD:prospector/llm/test_llm.py -from llm.model_instantiation import create_model_instance -from llm.models import OpenAI -from llm.operations import get_repository_url -======= from llm.llm_service import create_model_instance, get_repository_url from llm.models import Gemini, Mistral, OpenAI ->>>>>>> 376db10 (basic refactoring mvp):prospector/llm/test_llm_service.py # Mock the llm_service configuration object From 76791d896e3fc66195759194808fa69976364fba Mon Sep 17 
00:00:00 2001 From: I748376 Date: Thu, 13 Jun 2024 15:57:40 +0000 Subject: [PATCH 17/83] updates tests --- prospector/llm/model_instantiation.py | 4 ++- prospector/llm/test_llm_service.py | 41 +++++++-------------------- 2 files changed, 14 insertions(+), 31 deletions(-) diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index 45dba28c8..fc8ffd41e 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -6,7 +6,9 @@ from langchain_mistralai import ChatMistralAI from langchain_openai import ChatOpenAI -from llm.models.models import Gemini, Mistral, OpenAI +from llm.models.gemini import Gemini +from llm.models.mistral import Mistral +from llm.models.openai import OpenAI from log.logger import logger diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index 0937f41ab..ca0a84473 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -1,9 +1,5 @@ -import pytest -import requests -from langchain_openai import ChatOpenAI - -from llm.llm_service import create_model_instance, get_repository_url -from llm.models import Gemini, Mistral, OpenAI +from llm.llm_service import LLMService # this is a singleton +from llm.models.openai import OpenAI # Mock the llm_service configuration object @@ -23,29 +19,14 @@ def __init__(self, type, model_name, temperature): class TestModel: - def test_sap_gpt35_instantiation(self): - config = Config("sap", "gpt-35-turbo", "0.0") - model = create_model_instance(config) - assert isinstance(model, OpenAI) - def test_sap_gpt4_instantiation(self): config = Config("sap", "gpt-4", "0.0") - model = create_model_instance(config) - assert isinstance(model, OpenAI) - - def test_thirdparty_gpt35_instantiation(self): - config = Config("third_party", "gpt-3.5-turbo", "0.0") - model = create_model_instance(config) - assert isinstance(model, ChatOpenAI) - - def test_thirdparty_gpt4_instantiation(self): - config = 
Config("third_party", "gpt-4", "0.0") - model = create_model_instance(config) - assert isinstance(model, ChatOpenAI) - - def test_invoke_fail(self): - with pytest.raises(SystemExit): - config = Config("sap", "gpt-35-turbo", "0.0") - model = create_model_instance(config) - vuln_id = "random" - get_repository_url(model=model, vuln_id=vuln_id) + llm_service = LLMService(config) + assert isinstance(llm_service._model, OpenAI) + + # def test_invoke_fail(self): + # with pytest.raises(SystemExit): + # config = Config("sap", "gpt-35-turbo", "0.0") + # model = create_model_instance(config) + # vuln_id = "random" + # get_repository_url(model=model, vuln_id=vuln_id) From 44ec1892ac5c044bf0c5152326176ecc0e7df287 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 07:30:22 +0000 Subject: [PATCH 18/83] corrects information about missing commits to be logged as info instead of error --- prospector/core/prospector.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index 509474ad0..f11817f72 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -325,7 +325,7 @@ def retrieve_preprocessed_commits( ) ] - logger.error(f"Missing {len(missing)} commits") + logger.info(f"{len(missing)} commits not found in backend") commits = [Commit.parse_obj(rc) for rc in retrieved_commits] # Sets the tags # for commit in commits: From 9a147086a0e4b4429bcef2331310fb1fa6012034 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 08:27:11 +0000 Subject: [PATCH 19/83] moves singleton metaclass into utils --- prospector/llm/llm_service.py | 18 +++++------------- prospector/util/singleton.py | 16 ++++++++++++++++ 2 files changed, 21 insertions(+), 13 deletions(-) create mode 100644 prospector/util/singleton.py diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 5e6144a8f..6fd01872e 100644 --- a/prospector/llm/llm_service.py +++ 
b/prospector/llm/llm_service.py @@ -8,23 +8,15 @@ from llm.model_instantiation import create_model_instance from llm.prompts import best_guess from log.logger import logger +from util.singleton import Singleton -class Singleton(object): - """Singleton class to ensure that any class inheriting from this one can only be instantiated once.""" +class LLMService(metaclass=Singleton): + """A wrapper class for all functions requiring an LLM. This class is also a singleton, as only one model + should be used throughout the program. + """ - def __new__(cls, *args, **kwargs): - # See if the instance is already in existence, and return it if yes - if not hasattr(cls, "_singleton_instance"): - cls._singleton_instance = super(Singleton, cls).__new__(cls) - return cls._singleton_instance - - -class LLMService(Singleton): def __init__(self, config): - if hasattr(self, "_instantiated"): - return - self._instantiated = True self._model: LLM = create_model_instance(config) def get_repository_url(self, advisory_description, advisory_references): diff --git a/prospector/util/singleton.py b/prospector/util/singleton.py new file mode 100644 index 000000000..51f88c53e --- /dev/null +++ b/prospector/util/singleton.py @@ -0,0 +1,16 @@ +from log.logger import logger + + +class Singleton(type): + """Singleton metaclass to ensure that any class using it as its metaclass can only be instantiated once.""" + + _instances = {} + + def __call__(cls, *args, **kwargs): + if cls not in cls._instances: + cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) + else: + logger.error( + f"Cannot instantiate a Singleton twice. Returning already existing instance of class {cls}." 
+ ) + return cls._instances[cls] From 8b351252692cf7be62f02fa5b95789b0fd8ec6fb Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 08:39:57 +0000 Subject: [PATCH 20/83] adds tests --- prospector/llm/test_llm_service.py | 73 ++++++++++++++++++++++++++---- 1 file changed, 63 insertions(+), 10 deletions(-) diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index ca0a84473..4030db275 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -1,5 +1,11 @@ +import pytest +from langchain_core.language_models.llms import LLM + from llm.llm_service import LLMService # this is a singleton +from llm.models.gemini import Gemini +from llm.models.mistral import Mistral from llm.models.openai import OpenAI +from util.singleton import Singleton # Mock the llm_service configuration object @@ -14,19 +20,66 @@ def __init__(self, type, model_name, temperature): self.temperature = temperature -# Vulnerability ID -vuln_id = "CVE-2024-32480" +test_vuln_id = "CVE-2024-32480" + + +@pytest.fixture(autouse=True) +def reset_singletons(): + # Clean up singleton instances after each test + Singleton._instances = {} class TestModel: - def test_sap_gpt4_instantiation(self): - config = Config("sap", "gpt-4", "0.0") + def test_sap_gpt_instantiation(self): + config = Config("sap", "gpt-4", 0.0) llm_service = LLMService(config) assert isinstance(llm_service._model, OpenAI) - # def test_invoke_fail(self): - # with pytest.raises(SystemExit): - # config = Config("sap", "gpt-35-turbo", "0.0") - # model = create_model_instance(config) - # vuln_id = "random" - # get_repository_url(model=model, vuln_id=vuln_id) + def test_sap_gemini_instantiation(self): + config = Config("sap", "gemini-1.0-pro", 0.0) + llm_service = LLMService(config) + assert isinstance(llm_service._model, Gemini) + + def test_sap_mistral_instantiation(self): + config = Config("sap", "mistralai--mixtral-8x7b-instruct-v01", 0.0) + llm_service = LLMService(config) 
+ assert isinstance(llm_service._model, Mistral) + + def test_singleton_instance_creation(self): + """A second instantiation should return the existing instance.""" + config = Config("sap", "gpt-4", 0.0) + llm_service = LLMService(config) + same_service = LLMService(config) + assert ( + llm_service is same_service + ), "LLMService should return the same instance." + + def test_singleton_same_instance(self): + """A second instantiation with different parameters should return the existing instance unchanged.""" + config = Config("sap", "gpt-4", 0.0) + llm_service = LLMService(config) + config = Config( + "sap", "gpt-35-turbo", 0.0 + ) # This instantiation should not work, but instead return the already existing instance + same_service = LLMService(config) + assert llm_service is same_service + assert llm_service._model.model_name == "gpt-4" + + def test_singleton_retains_state(self): + """Reassigning a field variable of the instance should be allowed and reflected + across instantiations.""" + config = Config("sap", "gpt-4", 0.0) + service = LLMService(config) + + service._model = OpenAI( + model_name="gpt-35-turbo", + deployment_url="deployment_url_placeholder", + temperature=0.7, + ) + same_service = LLMService(config) + + assert same_service._model == OpenAI( + model_name="gpt-35-turbo", + deployment_url="deployment_url_placeholder", + temperature=0.7, + ), "LLMService should retain state between instantiations" From 17f42943467a8cbf273014eadb269c6f23e35dc9 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 08:40:40 +0000 Subject: [PATCH 21/83] changes variable name d to model_definition for clarity --- prospector/llm/model_instantiation.py | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index fc8ffd41e..6a8fbc3d9 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -59,14 +59,14 @@ def 
create_model_instance(llm_config) -> LLM: """ def create_sap_provider(model_name: str, temperature: float): - d = SAP_MAPPING.get(model_name, None) + model_definition = SAP_MAPPING.get(model_name, None) - if d is None: + if model_definition is None: raise ValueError(f"Model '{model_name}' is not available.") - model = d._class( + model = model_definition._class( model_name=model_name, - deployment_url=d.access_info, + deployment_url=model_definition.access_info, temperature=temperature, ) @@ -74,15 +74,15 @@ def create_sap_provider(model_name: str, temperature: float): def create_third_party_provider(model_name: str, temperature: float): # obtain definition from main mapping - d = THIRD_PARTY_MAPPING.get(model_name, None) + model_definition = THIRD_PARTY_MAPPING.get(model_name, None) - if d is None: + if model_definition is None: logger.error(f"Model '{model_name}' is not available.") raise ValueError(f"Model '{model_name}' is not available.") - model = d._class( + model = model_definition._class( model=model_name, - api_key=d.access_info, + api_key=model_definition.access_info, temperature=temperature, ) From 2b4ac1cc9b3a80115c08b002f3f05b0f469ca85a Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 09:11:29 +0000 Subject: [PATCH 22/83] moves ai core sk filepath to config.yaml --- prospector/config-sample.yaml | 3 ++- prospector/llm/llm_service.py | 3 +-- prospector/llm/model_instantiation.py | 9 +++++++-- prospector/llm/models/gemini.py | 2 +- prospector/llm/models/mistral.py | 2 +- prospector/llm/models/openai.py | 2 +- prospector/llm/models/sap_llm.py | 6 +++--- prospector/util/config_parser.py | 1 + 8 files changed, 17 insertions(+), 11 deletions(-) diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 7208bc3dd..8813c31ad 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -32,8 +32,9 @@ llm_service: type: sap # use "sap" or "third_party" model_name: gpt-4-turbo # temperature: 0.0 # optional, 
default is 0.0 + # ai_core_sk: # needed for type: sap -use_llm_repository_url: True # whether to use LLM's to obtain the repository URL + use_llm_repository_url: True # whether to use LLM's to obtain the repository URL # Report file format: "html", "json", "console" or "all" # and the file name diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 6fd01872e..ed780e754 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -1,5 +1,4 @@ import sys -from typing import Dict import validators from langchain_core.language_models.llms import LLM @@ -19,7 +18,7 @@ class LLMService(metaclass=Singleton): def __init__(self, config): self._model: LLM = create_model_instance(config) - def get_repository_url(self, advisory_description, advisory_references): + def get_repository_url(self, advisory_description, advisory_references) -> str: """Ask an LLM to obtain the repository URL given the advisory description and references. Args: diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index 6a8fbc3d9..0131e71c5 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -58,7 +58,9 @@ def create_model_instance(llm_config) -> LLM: LLM: An instance of the specified LLM model. 
""" - def create_sap_provider(model_name: str, temperature: float): + def create_sap_provider( + model_name: str, temperature: float, ai_core_sk_file_path: str + ): model_definition = SAP_MAPPING.get(model_name, None) if model_definition is None: @@ -68,6 +70,7 @@ def create_sap_provider(model_name: str, temperature: float): model_name=model_name, deployment_url=model_definition.access_info, temperature=temperature, + ai_core_sk_file_path=ai_core_sk_file_path, ) return model @@ -98,7 +101,9 @@ def create_third_party_provider(model_name: str, temperature: float): match llm_config.type: case "sap": model = create_sap_provider( - llm_config.model_name, llm_config.temperature + llm_config.model_name, + llm_config.temperature, + llm_config.ai_core_sk, ) case "third_party": model = create_third_party_provider( diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py index fd253d5de..a7141ea56 100644 --- a/prospector/llm/models/gemini.py +++ b/prospector/llm/models/gemini.py @@ -14,7 +14,7 @@ def _call( super()._call(prompt, stop, **kwargs) # Model specific request data endpoint = f"{self.deployment_url}/models/{self.model_name}:generateContent" - headers = get_headers() + headers = get_headers(self.ai_core_sk_file_path) data = { "generation_config": { "maxOutputTokens": 1000, diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py index 22757ea82..f4b2e3c6c 100644 --- a/prospector/llm/models/mistral.py +++ b/prospector/llm/models/mistral.py @@ -14,7 +14,7 @@ def _call( super()._call(prompt, stop, **kwargs) # Model specific request data endpoint = f"{self.deployment_url}/chat/completions" - headers = get_headers() + headers = get_headers(self.ai_core_sk_file_path) data = { "model": "mistralai--mixtral-8x7b-instruct-v01", "max_tokens": 100, diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py index e399683f9..9f0c936c5 100644 --- a/prospector/llm/models/openai.py +++ b/prospector/llm/models/openai.py 
@@ -14,7 +14,7 @@ def _call( super()._call(prompt, stop, **kwargs) # Model specific request data endpoint = f"{self.deployment_url}/chat/completions?api-version=2023-05-15" - headers = get_headers() + headers = get_headers(self.ai_core_sk_file_path) data = { "messages": [ { diff --git a/prospector/llm/models/sap_llm.py b/prospector/llm/models/sap_llm.py index 62ddbf28a..d5d29054a 100644 --- a/prospector/llm/models/sap_llm.py +++ b/prospector/llm/models/sap_llm.py @@ -12,6 +12,7 @@ class SAPLLM(LLM): model_name: str deployment_url: str temperature: float + ai_core_sk_file_path: str @property def _llm_type(self) -> str: @@ -55,19 +56,18 @@ def _call( return "" -def get_headers(): +def get_headers(ai_core_sk_file_path: str): """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. Returns: The headers object needed to send requests to the SAP AI Core. """ - with open(dotenv_values()["AI_CORE_KEY_FILEPATH"]) as f: + with open(ai_core_sk_file_path) as f: sk = json.load(f) auth_url = f"{sk['url']}/oauth/token" client_id = sk["clientid"] client_secret = sk["clientsecret"] - # api_base_url = f"{sk['serviceurls']['AI_API_URL']}/v2" response = requests.post( auth_url, diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index d110de1b8..7de68a15c 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -182,6 +182,7 @@ class LLMServiceConfig: type: str model_name: str use_llm_repository_url: bool + ai_core_sk: str temperature: float = 0.0 From fcda410f4d13b08896cd4fa761ecb9d9a9e0fe97 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 12:58:56 +0000 Subject: [PATCH 23/83] adds StrOutputParser to model chain and prints returned URL to console --- prospector/llm/llm_service.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index ed780e754..8b22e68d0 
100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -2,6 +2,7 @@ import validators from langchain_core.language_models.llms import LLM +from langchain_core.output_parsers import StrOutputParser from cli.console import ConsoleWriter, MessageStatus from llm.model_instantiation import create_model_instance @@ -34,7 +35,7 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: with ConsoleWriter("Invoking LLM") as console: try: - chain = best_guess | self._model + chain = best_guess | self._model | StrOutputParser() url = chain.invoke( { @@ -42,10 +43,12 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: "references": advisory_references, } ) + logger.info(f"LLM returned the following URL: {url}") + console.print(f"\n Repository URL: {url}", status=MessageStatus.OK) if not validators.url(url): logger.error(f"LLM returned invalid URL: {url}") console.print( - f"LLM returned invalid URL: {url}", + f"\n LLM returned invalid URL: {url}", status=MessageStatus.ERROR, ) sys.exit(1) From 1f24f07e74cafed730105bb0698f7a4669bc87f5 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 14 Jun 2024 13:54:11 +0000 Subject: [PATCH 24/83] streamlines error handling --- prospector/cli/main.py | 10 ++++++ prospector/core/prospector.py | 48 ++++++++++++++----------- prospector/llm/llm_service.py | 52 ++++++++++++--------------- prospector/llm/model_instantiation.py | 24 +++++-------- prospector/llm/test_llm_service.py | 52 ++++++++++++++++++++++----- 5 files changed, 113 insertions(+), 73 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index 2621de1d7..dc73af881 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -55,6 +55,16 @@ def main(argv): # noqa: C901 # if config.ping: # return ping_backend(backend, get_level() < logging.INFO) + if not config.repository and not config.llm_service.use_llm_repository_url: + logger.error( + "Repository URL was neither specified 
nor allowed to obtain with LLM support. One must be set." + ) + console.print( + "Please set the `--repository` parameter or enable LLM support to infer the repository URL.", + status=MessageStatus.ERROR, + ) + return + config.pub_date = ( config.pub_date + "T00:00:00Z" if config.pub_date is not None else "" ) diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index f11817f72..dfc6d1e7b 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -472,26 +472,34 @@ def set_repository_url( Raises: System Exit if no configuration for the llm_service is given or the LLM returns an invalid URL. - """ # LASCHA: check error flow in this method - if not config: - logger.error( - "No configuration given for model in `config.yaml`.", - exc_info=get_level() < logging.INFO, - ) - sys.exit(1) - llm_service = LLMService(config) - url_from_llm = None - try: - url_from_llm = llm_service.get_repository_url( - advisory_description, advisory_references - ) - except Exception as e: - logger.error( - "URL returned by LLM was not valid.", - exc_info=get_level() < logging.INFO, - ) - sys.exit(1) - return url_from_llm + """ + with ConsoleWriter("LLM Usage (Repo URL)") as console: + if not config: + logger.error( + "No configuration given for model in `config.yaml`.", + exc_info=get_level() < logging.INFO, + ) + console.print( + "No configuration given for model in `config.yaml`.", + status=MessageStatus.ERROR, + ) + sys.exit(1) + + try: + llm_service = LLMService(config) + url_from_llm = llm_service.get_repository_url( + advisory_description, advisory_references + ) + console.print( + f"\n Repository URL: {url_from_llm}", status=MessageStatus.OK + ) + return url_from_llm + + except Exception as e: + # Any error that occurs in either LLMService or get_repository_url should be caught here + logger.error(e) + console.print(e, status=MessageStatus.ERROR) + sys.exit(1) # def prospector_find_twins( diff --git a/prospector/llm/llm_service.py 
b/prospector/llm/llm_service.py index 8b22e68d0..6f14c7314 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -17,7 +17,10 @@ class LLMService(metaclass=Singleton): """ def __init__(self, config): - self._model: LLM = create_model_instance(config) + try: + self._model: LLM = create_model_instance(config) + except Exception: + raise def get_repository_url(self, advisory_description, advisory_references) -> str: """Ask an LLM to obtain the repository URL given the advisory description and references. @@ -32,32 +35,21 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: Raises: ValueError if advisory information cannot be obtained or there is an error in the model invocation. """ - with ConsoleWriter("Invoking LLM") as console: - - try: - chain = best_guess | self._model | StrOutputParser() - - url = chain.invoke( - { - "description": advisory_description, - "references": advisory_references, - } - ) - logger.info(f"LLM returned the following URL: {url}") - console.print(f"\n Repository URL: {url}", status=MessageStatus.OK) - if not validators.url(url): - logger.error(f"LLM returned invalid URL: {url}") - console.print( - f"\n LLM returned invalid URL: {url}", - status=MessageStatus.ERROR, - ) - sys.exit(1) - except Exception as e: - logger.error(f"Prompt-model chain could not be invoked: {e}") - console.print( - "Prompt-model chain could not be invoked.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - - return url + try: + chain = best_guess | self._model | StrOutputParser() + + url = chain.invoke( + { + "description": advisory_description, + "references": advisory_references, + } + ) + logger.info(f"LLM returned the following URL: {url}") + + if not validators.url(url): + raise TypeError(f"LLM returned invalid URL: {url}") + + except Exception as e: + raise RuntimeError(f"Prompt-model chain could not be invoked: {e}") + + return url diff --git a/prospector/llm/model_instantiation.py 
b/prospector/llm/model_instantiation.py index 0131e71c5..8a7315c37 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -9,7 +9,6 @@ from llm.models.gemini import Gemini from llm.models.mistral import Mistral from llm.models.openai import OpenAI -from log.logger import logger class ModelDef: @@ -56,6 +55,7 @@ def create_model_instance(llm_config) -> LLM: Returns: LLM: An instance of the specified LLM model. + Exits """ def create_sap_provider( @@ -66,6 +66,11 @@ def create_sap_provider( if model_definition is None: raise ValueError(f"Model '{model_name}' is not available.") + if ai_core_sk_file_path is None: + raise ValueError( + f"AI Core credentials file couldn't be found: '{ai_core_sk_file_path}'" + ) + model = model_definition._class( model_name=model_name, deployment_url=model_definition.access_info, @@ -76,11 +81,9 @@ def create_sap_provider( return model def create_third_party_provider(model_name: str, temperature: float): - # obtain definition from main mapping model_definition = THIRD_PARTY_MAPPING.get(model_name, None) if model_definition is None: - logger.error(f"Model '{model_name}' is not available.") raise ValueError(f"Model '{model_name}' is not available.") model = model_definition._class( @@ -91,11 +94,6 @@ def create_third_party_provider(model_name: str, temperature: float): return model - if llm_config is None: - raise ValueError( - "When using LLM support, please add necessary parameters to configuration file." - ) - # LLM Instantiation try: match llm_config.type: @@ -110,14 +108,10 @@ def create_third_party_provider(model_name: str, temperature: float): llm_config.model_name, llm_config.temperature ) case _: - logger.error( - f"Invalid LLM type specified, '{llm_config.type}' is not available." - ) raise ValueError( - f"Invalid LLM type specified, '{llm_config.type}' is not available." + f"Invalid LLM type specified (either sap or third_party). '{llm_config.type}' is not available." 
) - except Exception as e: - logger.error(f"Problem when initialising model: {e}") - raise ValueError(f"Problem when initialising model: {e}") + except Exception: + raise # re-raise exceptions from create_[sap|third_party]_provider return model diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index 4030db275..9e0d99fc3 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -1,10 +1,14 @@ +from typing import Any, List + import pytest from langchain_core.language_models.llms import LLM +from requests_cache import Optional from llm.llm_service import LLMService # this is a singleton from llm.models.gemini import Gemini from llm.models.mistral import Mistral from llm.models.openai import OpenAI +from llm.models.sap_llm import SAPLLM from util.singleton import Singleton @@ -13,16 +17,31 @@ class Config: type: str = None model_name: str = None temperature: str = None + ai_core_sk: str = None - def __init__(self, type, model_name, temperature): + def __init__(self, type, model_name, temperature, ai_core_sk): self.type = type self.model_name = model_name self.temperature = temperature + self.ai_core_sk = ai_core_sk test_vuln_id = "CVE-2024-32480" +# Mock a SAP LLM +class MockLLM(SAPLLM): + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + # Call super() to make sure model_name is valid + super()._call(prompt, stop, **kwargs) + + url = "https://www.example.com" + + return url + + @pytest.fixture(autouse=True) def reset_singletons(): # Clean up singleton instances after each test @@ -31,23 +50,23 @@ def reset_singletons(): class TestModel: def test_sap_gpt_instantiation(self): - config = Config("sap", "gpt-4", 0.0) + config = Config("sap", "gpt-4", 0.0, "sk.json") llm_service = LLMService(config) assert isinstance(llm_service._model, OpenAI) def test_sap_gemini_instantiation(self): - config = Config("sap", "gemini-1.0-pro", 0.0) + config = Config("sap", 
"gemini-1.0-pro", 0.0, "sk.json") llm_service = LLMService(config) assert isinstance(llm_service._model, Gemini) def test_sap_mistral_instantiation(self): - config = Config("sap", "mistralai--mixtral-8x7b-instruct-v01", 0.0) + config = Config("sap", "mistralai--mixtral-8x7b-instruct-v01", 0.0, "sk.json") llm_service = LLMService(config) assert isinstance(llm_service._model, Mistral) def test_singleton_instance_creation(self): """A second instantiation should return the exisiting instance.""" - config = Config("sap", "gpt-4", 0.0) + config = Config("sap", "gpt-4", 0.0, "sk.json") llm_service = LLMService(config) same_service = LLMService(config) assert ( @@ -56,10 +75,10 @@ def test_singleton_instance_creation(self): def test_singleton_same_instance(self): """A second instantiation with different parameters should return the existing instance unchanged.""" - config = Config("sap", "gpt-4", 0.0) + config = Config("sap", "gpt-4", 0.0, "sk.json") llm_service = LLMService(config) config = Config( - "sap", "gpt-35-turbo", 0.0 + "sap", "gpt-35-turbo", 0.0, "sk.json" ) # This instantiation should not work, but instead return the already existing instance same_service = LLMService(config) assert llm_service is same_service @@ -68,7 +87,7 @@ def test_singleton_same_instance(self): def test_singleton_retains_state(self): """Reassigning a field variable of the instance should be allowed and reflected across instantiations.""" - config = Config("sap", "gpt-4", 0.0) + config = Config("sap", "gpt-4", 0.0, "sk.json") service = LLMService(config) service._model = OpenAI( @@ -83,3 +102,20 @@ def test_singleton_retains_state(self): deployment_url="deployment_url_placeholder", temperature=0.7, ), "LLMService should retain state between instantiations" + + def test_get_repository_url(self): + config = Config("sap", "gpt-4", 0.0, "sk.json") + service = LLMService(config) + # Reassign the mock model to the service + model = MockLLM( + model_name="gpt-4", + 
deployment_url="deployment_url_placeholder", + temperature=0.7, + ai_core_sk_file_path="sk.json", + ) + service._model = model + + assert ( + service.get_repository_url("advisory description", "advisory_references") + == "https://www.example.com" + ) From 33c83f6bbde8696297f8b1d57f10c91bbd46971c Mon Sep 17 00:00:00 2001 From: I748376 Date: Mon, 17 Jun 2024 08:16:17 +0000 Subject: [PATCH 25/83] updates README with explanation of llm_service parameters in config.yaml before pushing, make sure that this information about sk.json can be pushed! --- prospector/README.md | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/prospector/README.md b/prospector/README.md index 1d0cf72c5..5016c2540 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -57,7 +57,7 @@ To quickly set up Prospector, follow these steps. This will run Prospector in it ### 🤖 LLM Support -To use Prospector with LLM support, set the `use_llm_<...>` parameters in `config.yaml`. Additionally, you must specify required parameters for API access to the LLM. These parameters can vary depending on your choice of provider, please follow what fits your needs: +To use Prospector with LLM support, you must specify required parameters for API access to the LLM. These parameters can vary depending on your choice of provider, please follow what fits your needs:
Use SAP AI CORE SDK @@ -67,14 +67,21 @@ You will need the following parameters in `config.yaml`: llm_service: type: sap model_name: + temperature: 0.0 + ai_core_sk: ``` `` refers to the model names available in the Generative AI Hub in SAP AI Core. [Here](https://github.tools.sap/I343697/generative-ai-hub-readme#1-supported-models) you can find an overview of available models. In `.env`, you must set the deployment URL as an environment variable following this naming convention: ```yaml -_URL +_URL # model name in capitals, and "-" changed to "_" ``` +For example, for gpt-4's deployment URL, set an environment variable called `GPT_4_URL`. + +The `temperature` parameter is optional. The default value is 0.0, but you can change it to something else. + +You also need to point the `ai_core_sk` parameter to a file containing the secret keys. This file is available in Passvault.
@@ -87,6 +94,7 @@ Implemented third party providers are **OpenAI**, **Google** and **Mistral**. llm_service: type: third_party model_name: + temperature: 0.0 ``` `` refers to the model names available, for example `gpt-4o` for OpenAI. You can find a lists of available models here: @@ -94,10 +102,20 @@ Implemented third party providers are **OpenAI**, **Google** and **Mistral**. 2. [Google](https://ai.google.dev/gemini-api/docs/models/gemini) 3. [Mistral](https://docs.mistral.ai/getting-started/models/) + The `temperature` parameter is optional. The default value is 0.0, but you can change it to something else. + 2. Make sure to add your OpenAI API key to your `.env` file as `[OPENAI|GOOGLE|MISTRAL]_API_KEY`. +#### + +You can set the `use_llm_<...>` parameters in `config.yaml` for fine-grained control over LLM support in various aspects of Prospector's phases. Each `use_llm_<...>` parameter allows you to enable or disable LLM support for a specific aspect: + +- **`use_llm_repository_url`**: Choose whether LLMs should be used to obtain the repository URL. When not using this option, please provide `--repository` as a command line argument. +- **`use_llm_commit_rule`**: Choose whether an additional rule should be applied after the other rules, which checks if a commit is security relevant. This rule invokes an LLM-powered service, which takes the diff of a commit and returns whether it is security-relevant or not. The model and temperature specified in `config.yaml` are also used in this rule. + + ## 👩‍💻 Development Setup Following these steps allows you to run Prospector's components individually: [Backend database and worker containers](#starting-the-backend-database-and-the-job-workers), [RESTful Server](#starting-the-restful-server) for API endpoints, [Prospector CLI](#running-the-cli-version) and [Tests](#testing). 
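Note on the `.env` naming convention documented in the README change above (model name in capitals, "-" changed to "_", followed by `_URL`): the rule can be sketched as a small function. This is a minimal illustration, not code from Prospector. `deployment_url_env_var` is a hypothetical helper name, and treating "." like "-" is an assumption inferred from the `GEMINI_1_0_PRO_URL` key used elsewhere in this patch series.

```python
import re


def deployment_url_env_var(model_name: str) -> str:
    """Hypothetical helper: derive the `.env` variable name for a model's deployment URL."""
    # Assumption: every run of non-alphanumeric characters ("-", ".", "--")
    # collapses into a single "_"; the README only mentions "-" explicitly.
    normalized = re.sub(r"[^A-Za-z0-9]+", "_", model_name).upper()
    return f"{normalized}_URL"


print(deployment_url_env_var("gpt-4"))  # GPT_4_URL
print(deployment_url_env_var("gemini-1.0-pro"))  # GEMINI_1_0_PRO_URL
```

A helper along these lines could replace a hand-maintained table of variable names, at the cost of making the mapping implicit. Whether the project wants that is a design choice for the maintainers.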
From af949eadf53f43c4e10bd69349af4e0f649be627 Mon Sep 17 00:00:00 2001 From: I748376 Date: Mon, 17 Jun 2024 08:21:15 +0000 Subject: [PATCH 26/83] removes internal URL --- prospector/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prospector/README.md b/prospector/README.md index 5016c2540..6fd03cc35 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -71,7 +71,7 @@ llm_service: ai_core_sk: ``` -`` refers to the model names available in the Generative AI Hub in SAP AI Core. [Here](https://github.tools.sap/I343697/generative-ai-hub-readme#1-supported-models) you can find an overview of available models. +`` refers to the model names available in the Generative AI Hub in SAP AI Core. You can find an overview of available models on the Generative AI Hub GitHub page. In `.env`, you must set the deployment URL as an environment variable following this naming convention: ```yaml From 96dcadc1f66a5a7329202dabda2ff4112e46a9b0 Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 09:47:27 +0000 Subject: [PATCH 27/83] corrects mistake in singleton logger value --- prospector/util/singleton.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/prospector/util/singleton.py b/prospector/util/singleton.py index 51f88c53e..cbee215f0 100644 --- a/prospector/util/singleton.py +++ b/prospector/util/singleton.py @@ -9,8 +9,8 @@ class Singleton(type): def __call__(cls, *args, **kwargs): if cls not in cls._instances: cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) - if cls not in cls._instances: - logger.error( + else: + logger.info( f"Cannot instantiate a Singleton twice. Returning already existing instance of class {cls}." 
) return cls._instances[cls] From 558c311a0c05d5b5c95f4fcab0bcab9dcab31f0c Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 11:56:56 +0000 Subject: [PATCH 28/83] adds singleton tests --- prospector/llm/test_llm_service.py | 48 ++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 16 deletions(-) diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index 9e0d99fc3..49d7d908b 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -50,23 +50,25 @@ def reset_singletons(): class TestModel: def test_sap_gpt_instantiation(self): - config = Config("sap", "gpt-4", 0.0, "sk.json") + config = Config("sap", "gpt-4", 0.0, "example.json") llm_service = LLMService(config) - assert isinstance(llm_service._model, OpenAI) + assert isinstance(llm_service.model, OpenAI) def test_sap_gemini_instantiation(self): - config = Config("sap", "gemini-1.0-pro", 0.0, "sk.json") + config = Config("sap", "gemini-1.0-pro", 0.0, "example.json") llm_service = LLMService(config) - assert isinstance(llm_service._model, Gemini) + assert isinstance(llm_service.model, Gemini) def test_sap_mistral_instantiation(self): - config = Config("sap", "mistralai--mixtral-8x7b-instruct-v01", 0.0, "sk.json") + config = Config( + "sap", "mistralai--mixtral-8x7b-instruct-v01", 0.0, "example.json" + ) llm_service = LLMService(config) - assert isinstance(llm_service._model, Mistral) + assert isinstance(llm_service.model, Mistral) def test_singleton_instance_creation(self): """A second instantiation should return the exisiting instance.""" - config = Config("sap", "gpt-4", 0.0, "sk.json") + config = Config("sap", "gpt-4", 0.0, "example.json") llm_service = LLMService(config) same_service = LLMService(config) assert ( @@ -75,47 +77,61 @@ def test_singleton_instance_creation(self): def test_singleton_same_instance(self): """A second instantiation with different parameters should return the existing instance unchanged.""" - config = 
Config("sap", "gpt-4", 0.0, "sk.json") + config = Config("sap", "gpt-4", 0.0, "example.json") llm_service = LLMService(config) config = Config( - "sap", "gpt-35-turbo", 0.0, "sk.json" + "sap", "gpt-35-turbo", 0.0, "example.json" ) # This instantiation should not work, but instead return the already existing instance same_service = LLMService(config) assert llm_service is same_service - assert llm_service._model.model_name == "gpt-4" + assert llm_service.model.model_name == "gpt-4" def test_singleton_retains_state(self): """Reassigning a field variable of the instance should be allowed and reflected across instantiations.""" - config = Config("sap", "gpt-4", 0.0, "sk.json") + config = Config("sap", "gpt-4", 0.0, "example.json") service = LLMService(config) - service._model = OpenAI( + service.model = OpenAI( model_name="gpt-35-turbo", deployment_url="deployment_url_placeholder", temperature=0.7, + ai_core_sk_file_path="example.json", ) same_service = LLMService(config) - assert same_service._model == OpenAI( + assert same_service.model == OpenAI( model_name="gpt-35-turbo", deployment_url="deployment_url_placeholder", temperature=0.7, + ai_core_sk_file_path="example.json", ), "LLMService should retain state between instantiations" def test_get_repository_url(self): - config = Config("sap", "gpt-4", 0.0, "sk.json") + config = Config("sap", "gpt-4", 0.0, "example.json") service = LLMService(config) # Reassign the mock model to the service model = MockLLM( model_name="gpt-4", deployment_url="deployment_url_placeholder", temperature=0.7, - ai_core_sk_file_path="sk.json", + ai_core_sk_file_path="example.json", ) - service._model = model + service.model = model assert ( service.get_repository_url("advisory description", "advisory_references") == "https://www.example.com" ) + + def test_reuse_singleton_without_config(self): + config = Config("sap", "gpt-4", 0.0, "example.json") + service = LLMService(config) + + same_service = LLMService() + + assert service is same_service 
+ + def test_fail_first_instantiation_without_config(self): + with pytest.raises(Exception): + LLMService() From 0440f882b4b835739542a32669b7f90f6332265f Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 13:32:16 +0000 Subject: [PATCH 29/83] refactors: LLMService is created in main, when there is access to the config.llm_service. When LLMService is needed again, just access the instance using LLMService() without having to supply the config. --- prospector/cli/main.py | 27 +++++++----- prospector/core/prospector.py | 75 +++++++++++--------------------- prospector/llm/llm_service.py | 7 ++- prospector/llm/models/sap_llm.py | 2 +- 4 files changed, 48 insertions(+), 63 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index dc73af881..d281cbe5c 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -6,6 +6,7 @@ from dotenv import load_dotenv +from llm.llm_service import LLMService from util.http import ping_backend path_root = os.getcwd() @@ -55,15 +56,21 @@ def main(argv): # noqa: C901 # if config.ping: # return ping_backend(backend, get_level() < logging.INFO) - if not config.repository and not config.llm_service.use_llm_repository_url: - logger.error( - "Repository URL was neither specified nor allowed to obtain with LLM support. One must be set." - ) - console.print( - "Please set the `--repository` parameter or enable LLM support to infer the repository URL.", - status=MessageStatus.ERROR, - ) - return + # Whether to use the LLMService + if config.llm_service: + if not config.repository and not config.llm_service.use_llm_repository_url: + logger.error( + "Repository URL was neither specified nor allowed to obtain with LLM support. One must be set." 
+ ) + console.print( + "Please set the `--repository` parameter or enable LLM support to infer the repository URL.", + status=MessageStatus.ERROR, + ) + return + + # If at least one 'use_llm' option is set, then create an LLMService singleton + if any([True for x in dir(config.llm_service) if x.startswith("use_llm")]): + LLMService(config.llm_service) config.pub_date = ( config.pub_date + "T00:00:00Z" if config.pub_date is not None else "" @@ -89,7 +96,7 @@ def main(argv): # noqa: C901 git_cache=config.git_cache, limit_candidates=config.max_candidates, # ignore_adv_refs=config.ignore_refs, - llm_service_config=config.llm_service, + use_llm_repository_url=config.llm_service.use_llm_repository_url, ) if config.preprocess_only: diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index dfc6d1e7b..4356bda7b 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -69,7 +69,7 @@ def prospector( # noqa: C901 rules: List[str] = ["ALL"], tag_commits: bool = True, silent: bool = False, - llm_service_config=None, + use_llm_repository_url: bool = False, ) -> Tuple[List[Commit], AdvisoryRecord] | Tuple[int, int]: if silent: logger.disabled = True @@ -91,9 +91,26 @@ def prospector( # noqa: C901 if advisory_record is None: return None, -1 - repository_url = repository_url or set_repository_url( - llm_service_config, advisory_record.description, advisory_record.references - ) + if use_llm_repository_url: + with ConsoleWriter("LLM Usage (Repo URL)") as console: + try: + repository_url = LLMService().get_repository_url( + advisory_record.description, advisory_record.references + ) + console.print( + f"\n Repository URL: {repository_url}", + status=MessageStatus.OK, + ) + except Exception as e: + logger.error( + e, + exc_info=get_level() < logging.INFO, + ) + console.print( + e, + status=MessageStatus.ERROR, + ) + sys.exit(1) fixing_commit = advisory_record.get_fixing_commit() # print(advisory_record.references) @@ -183,7 +200,10 @@ def 
prospector( # noqa: C901 # preprocessed_commits += preprocess_commits(missing, timer) pbar = tqdm( - missing, desc="Processing commits", unit="commit", disable=silent + missing, + desc="Processing commits", + unit="commit", + disable=silent, ) start_time = time.time() with Counter( @@ -457,51 +477,6 @@ def is_correct_backend_url(backend_url: str) -> bool: return True -def set_repository_url( - config, advisory_description: str, advisory_references: DefaultDict[str, int] -): - """Returns the URL obtained through the LLM. - - Args: - config (LLMServiceConfig): The 'llm_service' configuration block in config.yaml - advisory_description (str): The description of the advisory - advisory_references (dict[str, int]): The references of the advisory - - Returns: - The respository URL as a string. - - Raises: - System Exit if no configuration for the llm_service is given or the LLM returns an invalid URL. - """ - with ConsoleWriter("LLM Usage (Repo URL)") as console: - if not config: - logger.error( - "No configuration given for model in `config.yaml`.", - exc_info=get_level() < logging.INFO, - ) - console.print( - "No configuration given for model in `config.yaml`.", - status=MessageStatus.ERROR, - ) - sys.exit(1) - - try: - llm_service = LLMService(config) - url_from_llm = llm_service.get_repository_url( - advisory_description, advisory_references - ) - console.print( - f"\n Repository URL: {url_from_llm}", status=MessageStatus.OK - ) - return url_from_llm - - except Exception as e: - # Any error that occurs in either LLMService or get_repository_url should be caught here - logger.error(e) - console.print(e, status=MessageStatus.ERROR) - sys.exit(1) - - # def prospector_find_twins( # advisory_record: AdvisoryRecord, # repository: Git, diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 6f14c7314..32ef119cc 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -16,9 +16,12 @@ class LLMService(metaclass=Singleton): 
should be used throughout the program. """ + config = None + def __init__(self, config): + self.config = config try: - self._model: LLM = create_model_instance(config) + self.model: LLM = create_model_instance(config) except Exception: raise @@ -36,7 +39,7 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: ValueError if advisory information cannot be obtained or there is an error in the model invocation. """ try: - chain = best_guess | self._model | StrOutputParser() + chain = best_guess | self.model | StrOutputParser() url = chain.invoke( { diff --git a/prospector/llm/models/sap_llm.py b/prospector/llm/models/sap_llm.py index d5d29054a..c2fc4cac8 100644 --- a/prospector/llm/models/sap_llm.py +++ b/prospector/llm/models/sap_llm.py @@ -12,7 +12,7 @@ class SAPLLM(LLM): model_name: str deployment_url: str temperature: float - ai_core_sk_file_path: str + ai_core_sk_file_path: str = None @property def _llm_type(self) -> str: From 085eea3840b750450947fc52a9c4c72792702cf3 Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 14:26:34 +0000 Subject: [PATCH 30/83] refactors: simplifies model_instantiation, gets rid of two layer inheritance (no more SAPLLM) --- prospector/llm/llm_service.py | 15 ++-- prospector/llm/model_instantiation.py | 108 ++++++++++++++++---------- prospector/llm/models/gemini.py | 41 +++++++--- prospector/llm/models/mistral.py | 31 ++++++-- prospector/llm/models/openai.py | 31 ++++++-- prospector/llm/models/sap_llm.py | 84 -------------------- 6 files changed, 156 insertions(+), 154 deletions(-) delete mode 100644 prospector/llm/models/sap_llm.py diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 32ef119cc..422374d24 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -1,13 +1,11 @@ -import sys - import validators from langchain_core.language_models.llms import LLM from langchain_core.output_parsers import StrOutputParser -from cli.console import 
ConsoleWriter, MessageStatus from llm.model_instantiation import create_model_instance from llm.prompts import best_guess from log.logger import logger +from util.config_parser import LLMServiceConfig from util.singleton import Singleton @@ -16,12 +14,17 @@ class LLMService(metaclass=Singleton): should be used throughout the program. """ - config = None + config: LLMServiceConfig = None - def __init__(self, config): + def __init__(self, config: LLMServiceConfig): self.config = config try: - self.model: LLM = create_model_instance(config) + self.model: LLM = create_model_instance( + config.type, + config.model_name, + config.temperature, + config.ai_core_sk, + ) except Exception: raise diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index 8a7315c37..5e9a56120 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -1,5 +1,7 @@ +import json from typing import Dict +import requests from dotenv import dotenv_values from langchain_core.language_models.llms import LLM from langchain_google_vertexai import ChatVertexAI @@ -10,40 +12,33 @@ from llm.models.mistral import Mistral from llm.models.openai import OpenAI - -class ModelDef: - def __init__(self, access_info: str, _class: LLM): - self.access_info = ( - access_info # either deployment_url (for SAP) or API key (for Third Party) - ) - self._class = _class - - env: Dict[str, str | None] = dotenv_values() + SAP_MAPPING = { - "gpt-35-turbo": ModelDef(env.get("GPT_35_TURBO_URL", None), OpenAI), - "gpt-35-turbo-16k": ModelDef(env.get("GPT_35_TURBO_16K_URL", None), OpenAI), - "gpt-35-turbo-0125": ModelDef(env.get("GPT_35_TURBO_0125_URL", None), OpenAI), - "gpt-4": ModelDef(env.get("GPT_4_URL", None), OpenAI), - "gpt-4-32k": ModelDef(env.get("GPT_4_32K_URL", None), OpenAI), - # "gpt-4-turbo": env.get("GPT_4_TURBO_URL", None), # currently TBD: https://github.tools.sap/I343697/generative-ai-hub-readme - # "gpt-4o": env.get("GPT_4O_URL", None), 
# currently TBD: https://github.tools.sap/I343697/generative-ai-hub-readme - "gemini-1.0-pro": ModelDef(env.get("GEMINI_1_0_PRO_URL", None), Gemini), - "mistralai--mixtral-8x7b-instruct-v01": ModelDef( - env.get("MISTRALAI_MIXTRAL_8X7B_INSTRUCT_V01", None), Mistral - ), + "gpt-35-turbo": OpenAI, + "gpt-35-turbo-16k": OpenAI, + "gpt-35-turbo-0125": OpenAI, + "gpt-4": OpenAI, + "gpt-4-32k": OpenAI, + # "gpt-4-turbo": OpenAI, # currently TBD + # "gpt-4o": OpenAI, # currently TBD + "gemini-1.0-pro": Gemini, + "mistralai--mixtral-8x7b-instruct-v01": Mistral, } + THIRD_PARTY_MAPPING = { - "gpt-4": ModelDef(env.get("OPENAI_API_KEY", None), ChatOpenAI), - "gpt-3.5-turbo": ModelDef(env.get("OPENAI_API_KEY", None), ChatOpenAI), - "gemini-pro": ModelDef(env.get("GOOGLE_API_KEY", None), ChatVertexAI), - "mistral-large-latest": ModelDef(env.get("MISTRAL_API_KEY", None), ChatMistralAI), + "gpt-4": ChatOpenAI, + "gpt-3.5-turbo": ChatOpenAI, + "gemini-pro": ChatVertexAI, + "mistral-large-latest": ChatMistralAI, } -def create_model_instance(llm_config) -> LLM: +def create_model_instance( + model_type: str, model_name: str, temperature, ai_core_sk_filepath +) -> LLM: """Creates and returns the model object given the user's configuration. Args: @@ -57,25 +52,30 @@ def create_model_instance(llm_config) -> LLM: LLM: An instance of the specified LLM model. 
Exits """ + # LASCHA: correct docstring def create_sap_provider( - model_name: str, temperature: float, ai_core_sk_file_path: str + model_name: str, temperature: float, ai_core_sk_filepath: str ): - model_definition = SAP_MAPPING.get(model_name, None) - if model_definition is None: + deployment_url = env.get("GPT_35_TURBO_URL", None) + if deployment_url is None: + raise ValueError(f"Deployment URL for {model_name} is not set.") + + model_class = SAP_MAPPING.get(model_name, None) + if model_class is None: raise ValueError(f"Model '{model_name}' is not available.") - if ai_core_sk_file_path is None: + if ai_core_sk_filepath is None: raise ValueError( - f"AI Core credentials file couldn't be found: '{ai_core_sk_file_path}'" + f"AI Core credentials file couldn't be found: '{ai_core_sk_filepath}'" ) - model = model_definition._class( + model = model_class( model_name=model_name, - deployment_url=model_definition.access_info, + deployment_url=deployment_url, temperature=temperature, - ai_core_sk_file_path=ai_core_sk_file_path, + ai_core_sk_filepath=ai_core_sk_filepath, ) return model @@ -96,22 +96,48 @@ def create_third_party_provider(model_name: str, temperature: float): # LLM Instantiation try: - match llm_config.type: + match model_type: case "sap": model = create_sap_provider( - llm_config.model_name, - llm_config.temperature, - llm_config.ai_core_sk, + model_name, + temperature, + ai_core_sk_filepath, ) case "third_party": - model = create_third_party_provider( - llm_config.model_name, llm_config.temperature - ) + model = create_third_party_provider(model_name, temperature) case _: raise ValueError( - f"Invalid LLM type specified (either sap or third_party). '{llm_config.type}' is not available." + f"Invalid LLM type specified (either sap or third_party). '{model_type}' is not available." 
) except Exception: raise # re-raise exceptions from create_[sap|third_party]_provider return model + + +def get_headers(ai_core_sk_file_path: str): + """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. + + Returns: + The headers object needed to send requests to the SAP AI Core. + """ + with open(ai_core_sk_file_path) as f: + sk = json.load(f) + + auth_url = f"{sk['url']}/oauth/token" + client_id = sk["clientid"] + client_secret = sk["clientsecret"] + + response = requests.post( + auth_url, + data={"grant_type": "client_credentials"}, + auth=(client_id, client_secret), + timeout=8000, + ) + + headers = { + "AI-Resource-Group": "default", + "Content-Type": "application/json", + "Authorization": f"Bearer {response.json()['access_token']}", + } + return headers diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py index a7141ea56..0b00b8d33 100644 --- a/prospector/llm/models/gemini.py +++ b/prospector/llm/models/gemini.py @@ -1,20 +1,37 @@ -from typing import Any, List, Optional +from typing import Any, Dict, List, Optional import requests +from langchain_core.language_models.llms import LLM -from llm.models.sap_llm import SAPLLM, get_headers +import llm.model_instantiation as instantiation from log.logger import logger -class Gemini(SAPLLM): +class Gemini(LLM): + model_name: str + deployment_url: str + temperature: float + ai_core_sk_filepath: str + + @property + def _llm_type(self) -> str: + return "custom" + + @property + def _identifying_params(self) -> Dict[str, Any]: + """Return a dictionary of identifying parameters.""" + return { + "model_name": self.model_name, + "deployment_url": self.deployment_url, + "temperature": self.temperature, + "ai_core_sk_filepath": self.ai_core_sk_filepath, + } + def _call( self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, 
stop, **kwargs) - # Model specific request data endpoint = f"{self.deployment_url}/models/{self.model_name}:generateContent" - headers = get_headers(self.ai_core_sk_file_path) + headers = instantiation.get_headers(self.ai_core_sk_filepath) data = { "generation_config": { "maxOutputTokens": 1000, @@ -30,8 +47,14 @@ def _call( "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE", }, - {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, - {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, + { + "category": "HARM_CATEGORY_HATE_SPEECH", + "threshold": "BLOCK_NONE", + }, + { + "category": "HARM_CATEGORY_HARASSMENT", + "threshold": "BLOCK_NONE", + }, ], } diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py index f4b2e3c6c..2fa6e1bb5 100644 --- a/prospector/llm/models/mistral.py +++ b/prospector/llm/models/mistral.py @@ -1,20 +1,37 @@ -from typing import Any, List, Optional +from typing import Any, Dict, List, Optional import requests +from langchain_core.language_models.llms import LLM -from llm.models.sap_llm import SAPLLM, get_headers +import llm.model_instantiation as instantiation from log.logger import logger -class Mistral(SAPLLM): +class Mistral(LLM): + model_name: str + deployment_url: str + temperature: float + ai_core_sk_filepath: str + + @property + def _llm_type(self) -> str: + return "custom" + + @property + def _identifying_params(self) -> Dict[str, Any]: + """Return a dictionary of identifying parameters.""" + return { + "model_name": self.model_name, + "deployment_url": self.deployment_url, + "temperature": self.temperature, + "ai_core_sk_filepath": self.ai_core_sk_filepath, + } + def _call( self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, stop, **kwargs) - # Model specific request data endpoint = f"{self.deployment_url}/chat/completions" - headers = 
get_headers(self.ai_core_sk_file_path) + headers = instantiation.get_headers(self.ai_core_sk_filepath) data = { "model": "mistralai--mixtral-8x7b-instruct-v01", "max_tokens": 100, diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py index 9f0c936c5..ab466cc2c 100644 --- a/prospector/llm/models/openai.py +++ b/prospector/llm/models/openai.py @@ -1,20 +1,37 @@ -from typing import Any, List, Optional +from typing import Any, Dict, List, Optional import requests +from langchain_core.language_models.llms import LLM -from llm.models.sap_llm import SAPLLM, get_headers +import llm.model_instantiation as instantiation from log.logger import logger -class OpenAI(SAPLLM): +class OpenAI(LLM): + model_name: str + deployment_url: str + temperature: float + ai_core_sk_filepath: str + + @property + def _llm_type(self) -> str: + return "custom" + + @property + def _identifying_params(self) -> Dict[str, Any]: + """Return a dictionary of identifying parameters.""" + return { + "model_name": self.model_name, + "deployment_url": self.deployment_url, + "temperature": self.temperature, + "ai_core_sk_filepath": self.ai_core_sk_filepath, + } + def _call( self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, stop, **kwargs) - # Model specific request data endpoint = f"{self.deployment_url}/chat/completions?api-version=2023-05-15" - headers = get_headers(self.ai_core_sk_file_path) + headers = instantiation.get_headers(self.ai_core_sk_filepath) data = { "messages": [ { diff --git a/prospector/llm/models/sap_llm.py b/prospector/llm/models/sap_llm.py deleted file mode 100644 index c2fc4cac8..000000000 --- a/prospector/llm/models/sap_llm.py +++ /dev/null @@ -1,84 +0,0 @@ -import json -from typing import Any, List, Mapping, Optional - -import requests -from dotenv import dotenv_values -from langchain_core.language_models.llms import LLM - -from log.logger import logger - - 
-class SAPLLM(LLM): - model_name: str - deployment_url: str - temperature: float - ai_core_sk_file_path: str = None - - @property - def _llm_type(self) -> str: - return "custom" - - @property - def _identifying_params(self) -> Mapping[str, Any]: - """Get the identifying parameters.""" - return { - "model_name": self.model_name, - } - - def _call( - self, - prompt: str, - stop: Optional[List[str]] = None, - **kwargs: Any, - ) -> str: - """Run the LLM on the given input. - - Override this method to implement the LLM logic. - - Args: - prompt: The prompt to generate from. - stop: Stop words to use when generating. Model output is cut off at the - first occurrence of any of the stop substrings. - If stop tokens are not supported consider raising NotImplementedError. - run_manager: Callback manager for the run. - **kwargs: Arbitrary additional keyword arguments. These are usually passed - to the model provider API call. - - Returns: - The model output as a string. Actual completions SHOULD NOT include the prompt. - """ - if self.deployment_url is None: - raise ValueError( - "Deployment URL not set. Maybe you forgot to create the environment variable." - ) - if stop is not None: - raise ValueError("stop kwargs are not permitted.") - return "" - - -def get_headers(ai_core_sk_file_path: str): - """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. - - Returns: - The headers object needed to send requests to the SAP AI Core. 
- """ - with open(ai_core_sk_file_path) as f: - sk = json.load(f) - - auth_url = f"{sk['url']}/oauth/token" - client_id = sk["clientid"] - client_secret = sk["clientsecret"] - - response = requests.post( - auth_url, - data={"grant_type": "client_credentials"}, - auth=(client_id, client_secret), - timeout=8000, - ) - - headers = { - "AI-Resource-Group": "default", - "Content-Type": "application/json", - "Authorization": f"Bearer {response.json()['access_token']}", - } - return headers From 4ab147a53096d767f1d3219b58d1a41adcb232c4 Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 14:29:53 +0000 Subject: [PATCH 31/83] updates tests --- prospector/llm/test_llm_service.py | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index 49d7d908b..2030e77aa 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -8,7 +8,6 @@ from llm.models.gemini import Gemini from llm.models.mistral import Mistral from llm.models.openai import OpenAI -from llm.models.sap_llm import SAPLLM from util.singleton import Singleton @@ -30,17 +29,19 @@ def __init__(self, type, model_name, temperature, ai_core_sk): # Mock a SAP LLM -class MockLLM(SAPLLM): +class MockLLM(LLM): def _call( self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any ) -> str: - # Call super() to make sure model_name is valid - super()._call(prompt, stop, **kwargs) url = "https://www.example.com" return url + @property + def _llm_type(self) -> str: + return "custom" + @pytest.fixture(autouse=True) def reset_singletons(): @@ -96,7 +97,7 @@ def test_singleton_retains_state(self): model_name="gpt-35-turbo", deployment_url="deployment_url_placeholder", temperature=0.7, - ai_core_sk_file_path="example.json", + ai_core_sk_filepath="example.json", ) same_service = LLMService(config) @@ -104,7 +105,7 @@ def test_singleton_retains_state(self): model_name="gpt-35-turbo", 
deployment_url="deployment_url_placeholder", temperature=0.7, - ai_core_sk_file_path="example.json", + ai_core_sk_filepath="example.json", ), "LLMService should retain state between instantiations" def test_get_repository_url(self): @@ -115,7 +116,7 @@ def test_get_repository_url(self): model_name="gpt-4", deployment_url="deployment_url_placeholder", temperature=0.7, - ai_core_sk_file_path="example.json", + ai_core_sk_filepath="example.json", ) service.model = model From dc0ada7ac6fb1c6e429c65324fb224cb8bffea7a Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 14:30:32 +0000 Subject: [PATCH 32/83] adds correct strings to return from _llm_type() --- prospector/llm/models/gemini.py | 2 +- prospector/llm/models/mistral.py | 2 +- prospector/llm/models/openai.py | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py index 0b00b8d33..1b7a5c29b 100644 --- a/prospector/llm/models/gemini.py +++ b/prospector/llm/models/gemini.py @@ -15,7 +15,7 @@ class Gemini(LLM): @property def _llm_type(self) -> str: - return "custom" + return "SAP Gemini" @property def _identifying_params(self) -> Dict[str, Any]: diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py index 2fa6e1bb5..71ecd6061 100644 --- a/prospector/llm/models/mistral.py +++ b/prospector/llm/models/mistral.py @@ -15,7 +15,7 @@ class Mistral(LLM): @property def _llm_type(self) -> str: - return "custom" + return "SAP Mistral" @property def _identifying_params(self) -> Dict[str, Any]: diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py index ab466cc2c..062a79a11 100644 --- a/prospector/llm/models/openai.py +++ b/prospector/llm/models/openai.py @@ -15,7 +15,7 @@ class OpenAI(LLM): @property def _llm_type(self) -> str: - return "custom" + return "SAP OpenAI" @property def _identifying_params(self) -> Dict[str, Any]: From f593f7660082ded26599e4801ee39f8ff741214f Mon Sep 17 00:00:00 
2001 From: I748376 Date: Wed, 19 Jun 2024 14:35:10 +0000 Subject: [PATCH 33/83] corrects docstring --- prospector/llm/llm_service.py | 2 +- prospector/llm/model_instantiation.py | 17 +++++++++++------ 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 422374d24..d12d4bc50 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -22,8 +22,8 @@ def __init__(self, config: LLMServiceConfig): self.model: LLM = create_model_instance( config.type, config.model_name, - config.temperature, config.ai_core_sk, + config.temperature, ) except Exception: raise diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/model_instantiation.py index 5e9a56120..6653b6e9d 100644 --- a/prospector/llm/model_instantiation.py +++ b/prospector/llm/model_instantiation.py @@ -37,20 +37,25 @@ def create_model_instance( - model_type: str, model_name: str, temperature, ai_core_sk_filepath + model_type: str, + model_name: str, + ai_core_sk_filepath: str, + temperature: float = 0.0, ) -> LLM: """Creates and returns the model object given the user's configuration. Args: - llm_config (dict): A dictionary containing the configuration for the LLM. Expected keys are: - - 'type' (str): Method for accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for + model_type: the way of accessing the LLM API ('sap' for SAP's AI Core, 'third_party' for external providers). - - 'model_name' (str): Which model to use, e.g. gpt-4. - - 'temperature' (Optional(float)): The temperature for the model, default 0.0. + model_name: which model to use, e.g. gpt-4. + temperature: the temperature for the model, default 0.0. + ai_core_sk_filepath: The path to the file containing AI Core credentials Returns: LLM: An instance of the specified LLM model. 
- Exits + + Raises: + ValueError: if there is a problem with deploymenturl, model_name or AI Core credentials """ # LASCHA: correct docstring From b69ecc843a3368525b89e071804729ea9c58a84f Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 19 Jun 2024 14:37:32 +0000 Subject: [PATCH 34/83] renames file --- prospector/llm/{model_instantiation.py => instantiation.py} | 0 prospector/llm/llm_service.py | 2 +- prospector/llm/models/gemini.py | 2 +- prospector/llm/models/mistral.py | 2 +- prospector/llm/models/openai.py | 2 +- 5 files changed, 4 insertions(+), 4 deletions(-) rename prospector/llm/{model_instantiation.py => instantiation.py} (100%) diff --git a/prospector/llm/model_instantiation.py b/prospector/llm/instantiation.py similarity index 100% rename from prospector/llm/model_instantiation.py rename to prospector/llm/instantiation.py diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index d12d4bc50..5ee631427 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -2,7 +2,7 @@ from langchain_core.language_models.llms import LLM from langchain_core.output_parsers import StrOutputParser -from llm.model_instantiation import create_model_instance +from llm.instantiation import create_model_instance from llm.prompts import best_guess from log.logger import logger from util.config_parser import LLMServiceConfig diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py index 1b7a5c29b..4ab85c652 100644 --- a/prospector/llm/models/gemini.py +++ b/prospector/llm/models/gemini.py @@ -3,7 +3,7 @@ import requests from langchain_core.language_models.llms import LLM -import llm.model_instantiation as instantiation +import llm.instantiation as instantiation from log.logger import logger diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py index 71ecd6061..a413fc316 100644 --- a/prospector/llm/models/mistral.py +++ b/prospector/llm/models/mistral.py @@ -3,7 +3,7 @@ import requests 
from langchain_core.language_models.llms import LLM -import llm.model_instantiation as instantiation +import llm.instantiation as instantiation from log.logger import logger diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py index 062a79a11..f3e132dfe 100644 --- a/prospector/llm/models/openai.py +++ b/prospector/llm/models/openai.py @@ -3,7 +3,7 @@ import requests from langchain_core.language_models.llms import LLM -import llm.model_instantiation as instantiation +import llm.instantiation as instantiation from log.logger import logger From 6f652625a78b1dd4945ceb843436313e1297d932 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 20 Jun 2024 12:55:19 +0000 Subject: [PATCH 35/83] final touch-up --- prospector/cli/main.py | 10 +++++++- prospector/llm/instantiation.py | 39 +++++++++++++++++------------- prospector/llm/llm_service.py | 27 +++++++++++++-------- prospector/llm/prompts.py | 14 +++++------ prospector/llm/test_llm_service.py | 23 +++++++++++++++--- 5 files changed, 75 insertions(+), 38 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index d281cbe5c..fd1c8b138 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -70,7 +70,15 @@ def main(argv): # noqa: C901 # If at least one 'use_llm' option is set, then create an LLMService singleton if any([True for x in dir(config.llm_service) if x.startswith("use_llm")]): - LLMService(config.llm_service) + try: + LLMService(config.llm_service) + except Exception as e: + logger.error(f"Problem with LLMService instantiation: {e}") + console.print( + "LLMService could not be created. 
Check logs.", + status=MessageStatus.ERROR, + ) + return config.pub_date = ( config.pub_date + "T00:00:00Z" if config.pub_date is not None else "" diff --git a/prospector/llm/instantiation.py b/prospector/llm/instantiation.py index 6653b6e9d..1794f9cb8 100644 --- a/prospector/llm/instantiation.py +++ b/prospector/llm/instantiation.py @@ -24,15 +24,15 @@ # "gpt-4-turbo": OpenAI, # currently TBD # "gpt-4o": OpenAI, # currently TBD "gemini-1.0-pro": Gemini, - "mistralai--mixtral-8x7b-instruct-v01": Mistral, + "mistral-large": Mistral, } THIRD_PARTY_MAPPING = { - "gpt-4": ChatOpenAI, - "gpt-3.5-turbo": ChatOpenAI, - "gemini-pro": ChatVertexAI, - "mistral-large-latest": ChatMistralAI, + "gpt-4": (ChatOpenAI, "OPENAI_API_KEY"), + "gpt-3.5-turbo": (ChatOpenAI, "OPENAI_API_KEY"), + "gemini-pro": (ChatVertexAI, "GOOGLE_API_KEY"), + "mistral-large-latest": (ChatMistralAI, "MISTRAL_API_KEY"), } @@ -55,17 +55,18 @@ def create_model_instance( LLM: An instance of the specified LLM model. Raises: - ValueError: if there is a problem with deploymenturl, model_name or AI Core credentials + ValueError: if there is a problem with deployment_url, model_name or AI Core credentials """ - # LASCHA: correct docstring def create_sap_provider( model_name: str, temperature: float, ai_core_sk_filepath: str - ): + ) -> LLM: - deployment_url = env.get("GPT_35_TURBO_URL", None) + deployment_url = env.get(model_name.upper().replace("-", "_") + "_URL", None) if deployment_url is None: - raise ValueError(f"Deployment URL for {model_name} is not set.") + raise ValueError( + f"Deployment URL ({model_name.upper().replace('-', '_')}_URL) for {model_name} is not set." 
+ ) model_class = SAP_MAPPING.get(model_name, None) if model_class is None: @@ -85,21 +86,22 @@ def create_sap_provider( return model - def create_third_party_provider(model_name: str, temperature: float): - model_definition = THIRD_PARTY_MAPPING.get(model_name, None) + def create_third_party_provider(model_name: str, temperature: float) -> LLM: + model_class = THIRD_PARTY_MAPPING.get(model_name, None)[0] - if model_definition is None: + if model_class is None: raise ValueError(f"Model '{model_name}' is not available.") - model = model_definition._class( + api_key_variable = THIRD_PARTY_MAPPING.get(model_name, None)[1] + + model = model_class( model=model_name, - api_key=model_definition.access_info, + api_key=api_key_variable, temperature=temperature, ) return model - # LLM Instantiation try: match model_type: case "sap": @@ -120,9 +122,12 @@ def create_third_party_provider(model_name: str, temperature: float): return model -def get_headers(ai_core_sk_file_path: str): +def get_headers(ai_core_sk_file_path: str) -> Dict[str, str]: """Generate the request headers to use SAP AI Core. This method generates the authentication token and returns a Dict with headers. + Params: + ai_core_sk_file_path (str): the path to the file containing the SAP AI Core credentials. + Returns: The headers object needed to send requests to the SAP AI Core. """ diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 5ee631427..3804a1012 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -3,27 +3,34 @@ from langchain_core.output_parsers import StrOutputParser from llm.instantiation import create_model_instance -from llm.prompts import best_guess +from llm.prompts import prompt_best_guess from log.logger import logger from util.config_parser import LLMServiceConfig from util.singleton import Singleton class LLMService(metaclass=Singleton): - """A wrapper class for all functions requiring an LLM. 
This class is also a singleton, as only one model - should be used throughout the program. + """A wrapper class for all functions requiring an LLM. This class is also a singleton, as only a + single model should be used throughout the program. """ config: LLMServiceConfig = None - def __init__(self, config: LLMServiceConfig): - self.config = config + def __init__(self, config: LLMServiceConfig = None): + + if self.config is None and config is not None: + self.config = config + elif self.config is None and config is None: + raise ValueError( + "On the first instantiation, a configuration object must be passed." + ) + try: self.model: LLM = create_model_instance( - config.type, - config.model_name, - config.ai_core_sk, - config.temperature, + self.config.type, + self.config.model_name, + self.config.ai_core_sk, + self.config.temperature, ) except Exception: raise @@ -42,7 +49,7 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: ValueError if advisory information cannot be obtained or there is an error in the model invocation. """ try: - chain = best_guess | self.model | StrOutputParser() + chain = prompt_best_guess | self.model | StrOutputParser() url = chain.invoke( { diff --git a/prospector/llm/prompts.py b/prospector/llm/prompts.py index 57fd2444a..f9c083599 100644 --- a/prospector/llm/prompts.py +++ b/prospector/llm/prompts.py @@ -1,7 +1,7 @@ from langchain.prompts import FewShotPromptTemplate, PromptTemplate -# example output for few-shot prompting -examples_without_num = [ +# Get Repository URL, few-shot prompting examples +examples_data = [ { "cve_description": "Apache Olingo versions 4.0.0 to 4.7.0 provide the AsyncRequestWrapperImpl class which reads a URL from the Location header, and then sends a GET or DELETE request to this URL. It may allow to implement a SSRF attack. 
If an attacker tricks a client to connect to a malicious server, the server can make the client call any URL including internal resources which are not directly accessible by the attacker.", "cve_references": "https://www.zerodayinitiative.com/advisories/ZDI-24-196/", @@ -20,7 +20,7 @@ ] # Formatter for the few-shot examples without CVE numbers -examples_prompt_without_num = PromptTemplate( +examples_formatted = PromptTemplate( input_variables=["cve_references", "result"], template=""" {cve_description} {cve_references} @@ -28,12 +28,12 @@ {result} """, ) -best_guess = FewShotPromptTemplate( +prompt_best_guess = FewShotPromptTemplate( prefix="""You will be provided with the ID, description and references of a vulnerability advisory (CVE). Return nothing but the URL of the repository the given CVE is concerned with.'. Here are a few examples delimited with XML tags:""", - examples=examples_without_num, - example_prompt=examples_prompt_without_num, + examples=examples_data, + example_prompt=examples_formatted, suffix="""Here is the CVE information: {description} {references} @@ -41,5 +41,5 @@ If you cannot find the URL, return your best guess of what the repository URL could be. Use any hints (eg. the mention of GitHub or GitLab) in the CVE description and references. Return nothing but the URL. 
""", input_variables=["description", "references"], - metadata={"name": "best_guess"}, + metadata={"name": "prompt_best_guess"}, ) diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/test_llm_service.py index 2030e77aa..fdd82f208 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/test_llm_service.py @@ -2,6 +2,9 @@ import pytest from langchain_core.language_models.llms import LLM +from langchain_google_vertexai import ChatVertexAI +from langchain_mistralai import ChatMistralAI +from langchain_openai import ChatOpenAI from requests_cache import Optional from llm.llm_service import LLMService # this is a singleton @@ -61,12 +64,26 @@ def test_sap_gemini_instantiation(self): assert isinstance(llm_service.model, Gemini) def test_sap_mistral_instantiation(self): - config = Config( - "sap", "mistralai--mixtral-8x7b-instruct-v01", 0.0, "example.json" - ) + config = Config("sap", "mistral-large", 0.0, "example.json") llm_service = LLMService(config) assert isinstance(llm_service.model, Mistral) + def test_gpt_instantiation(self): + config = Config("third_party", "gpt-4", 0.0, "example.json") + llm_service = LLMService(config) + assert isinstance(llm_service.model, ChatOpenAI) + + # Google throws an error on creation, when no account is found + # def test_gemini_instantiation(self): + # config = Config("third_party", "gemini-pro", 0.0, "example.json") + # llm_service = LLMService(config) + # assert isinstance(llm_service.model, ChatVertexAI) + + def test_mistral_instantiation(self): + config = Config("third_party", "mistral-large-latest", 0.0, "example.json") + llm_service = LLMService(config) + assert isinstance(llm_service.model, ChatMistralAI) + def test_singleton_instance_creation(self): """A second instantiation should return the exisiting instance.""" config = Config("sap", "gpt-4", 0.0, "example.json") From 914fc8b263450c36af9db3275506676609a710f2 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 20 Jun 2024 14:53:00 +0000 Subject: 
[PATCH 36/83] fixes bug where the "LLMService could not be created. Check logs." error was raised even though the use of LLMs was set to false. This happened because the LLMService singleton was initialised even though it would not be needed later on, owing to an error in an if statement --- prospector/cli/main.py | 2 +- prospector/config-sample.yaml | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index fd1c8b138..d8760ccb7 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -69,7 +69,7 @@ def main(argv): # noqa: C901 return # If at least one 'use_llm' option is set, then create an LLMService singleton - if any([True for x in dir(config.llm_service) if x.startswith("use_llm")]): + if any([config.llm_service.use_llm_repository_url]): try: LLMService(config.llm_service) except Exception as e: diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 8813c31ad..105aa23f6 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -31,10 +31,10 @@ redis_url: redis://redis:6379/0 llm_service: type: sap # use "sap" or "third_party" model_name: gpt-4-turbo - # temperature: 0.0 # optional, default is 0.0 - # ai_core_sk: # needed for type: sap + temperature: 0.0 # optional, default is 0.0 + ai_core_sk: # needed for type: sap - use_llm_repository_url: True # whether to use LLM's to obtain the repository URL + use_llm_repository_url: False # whether to use LLM's to obtain the repository URL # Report file format: "html", "json", "console" or "all" # and the file name From c9a7163636ea99488afad4200fb40e612dbee98c Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 21 Jun 2024 07:25:56 +0000 Subject: [PATCH 37/83] fixes tests fixes the environment variables in the LLM tests and fixes some test files unrelated to LLMs --- prospector/core/report_test.py | 2 +- prospector/git/git_test.py | 3 ++- prospector/git/raw_commit_test.py | 2 +-
prospector/llm/instantiation.py | 13 ++++++++----- .../{test_llm_service.py => llm_service_test.py} | 7 +++++++ 5 files changed, 19 insertions(+), 8 deletions(-) rename prospector/llm/{test_llm_service.py => llm_service_test.py} (96%) diff --git a/prospector/core/report_test.py b/prospector/core/report_test.py index 7658a0209..387dc2304 100644 --- a/prospector/core/report_test.py +++ b/prospector/core/report_test.py @@ -2,7 +2,7 @@ import os.path from random import randint -import prospector.core.report as report +import core.report as report from datamodel.advisory import build_advisory_record from datamodel.commit import Commit from util.sample_data_generation import ( # random_list_of_url, diff --git a/prospector/git/git_test.py b/prospector/git/git_test.py index 97bf640b5..4116a6e97 100644 --- a/prospector/git/git_test.py +++ b/prospector/git/git_test.py @@ -42,7 +42,8 @@ def test_get_tags_for_commit(repository: Git): commit = commits.get(OPENCAST_COMMIT) if commit is not None: tags = commit.find_tags() - assert len(tags) == 75 + print(tags) + assert len(tags) == 106 assert "10.2" in tags and "11.3" in tags and "9.4" in tags diff --git a/prospector/git/raw_commit_test.py b/prospector/git/raw_commit_test.py index 4eb28dc95..c48f454f1 100644 --- a/prospector/git/raw_commit_test.py +++ b/prospector/git/raw_commit_test.py @@ -26,7 +26,7 @@ def commit(): def test_find_tags(commit: RawCommit): tags = commit.find_tags() - assert len(tags) == 75 + assert len(tags) == 106 assert "10.2" in tags and "11.3" in tags and "9.4" in tags diff --git a/prospector/llm/instantiation.py b/prospector/llm/instantiation.py index 1794f9cb8..045caa4e8 100644 --- a/prospector/llm/instantiation.py +++ b/prospector/llm/instantiation.py @@ -1,8 +1,9 @@ import json +import os from typing import Dict import requests -from dotenv import dotenv_values +from dotenv import load_dotenv from langchain_core.language_models.llms import LLM from langchain_google_vertexai import ChatVertexAI from 
langchain_mistralai import ChatMistralAI @@ -12,7 +13,7 @@ from llm.models.mistral import Mistral from llm.models.openai import OpenAI -env: Dict[str, str | None] = dotenv_values() +load_dotenv() SAP_MAPPING = { @@ -62,7 +63,7 @@ def create_sap_provider( model_name: str, temperature: float, ai_core_sk_filepath: str ) -> LLM: - deployment_url = env.get(model_name.upper().replace("-", "_") + "_URL", None) + deployment_url = os.getenv(model_name.upper().replace("-", "_") + "_URL", None) if deployment_url is None: raise ValueError( f"Deployment URL ({model_name.upper().replace('-', '_')}_URL) for {model_name} is not set." @@ -88,15 +89,17 @@ def create_sap_provider( def create_third_party_provider(model_name: str, temperature: float) -> LLM: model_class = THIRD_PARTY_MAPPING.get(model_name, None)[0] - if model_class is None: raise ValueError(f"Model '{model_name}' is not available.") api_key_variable = THIRD_PARTY_MAPPING.get(model_name, None)[1] + api_key = os.getenv(api_key_variable) + if api_key is None: + raise ValueError(f"API key for {model_name} is not set.") model = model_class( model=model_name, - api_key=api_key_variable, + api_key=api_key, temperature=temperature, ) diff --git a/prospector/llm/test_llm_service.py b/prospector/llm/llm_service_test.py similarity index 96% rename from prospector/llm/test_llm_service.py rename to prospector/llm/llm_service_test.py index fdd82f208..768f9ecad 100644 --- a/prospector/llm/test_llm_service.py +++ b/prospector/llm/llm_service_test.py @@ -52,6 +52,13 @@ def reset_singletons(): Singleton._instances = {} +@pytest.fixture(autouse=True) +def mock_environment_variables(): + mp = pytest.MonkeyPatch() + mp.setenv("GPT_4_URL", "https://deployment.url.com") + mp.setenv("GEMINI_1.0_PRO_URL", "https://deployment.url.com") + + class TestModel: def test_sap_gpt_instantiation(self): config = Config("sap", "gpt-4", 0.0, "example.json") From 4903a85cd19928fa91fd48ba8caa6129e7f1eae4 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 21 
Jun 2024 09:26:21 +0000 Subject: [PATCH 38/83] removes extra index urls --- prospector/requirements.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/prospector/requirements.txt b/prospector/requirements.txt index dc864b5d7..21075aed6 100644 --- a/prospector/requirements.txt +++ b/prospector/requirements.txt @@ -4,9 +4,9 @@ # # pip-compile --no-annotate --strip-extras # ---extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/deploy-releases-pypi/simple ---extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/proxy-deploy-releases-hyperspace-pypi/simple ---trusted-host int.repositories.cloud.sap +# --extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/deploy-releases-pypi/simple +# --extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/proxy-deploy-releases-hyperspace-pypi/simple +# --trusted-host int.repositories.cloud.sap aiohttp==3.9.5 aiosignal==1.3.1 From 7ec6afb744e48cb580e7138f4a11c688c62e84d2 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 21 Jun 2024 12:25:09 +0000 Subject: [PATCH 39/83] more env variables mocking for tests --- prospector/core/prospector.py | 4 +--- prospector/llm/llm_service_test.py | 21 ++++----------------- 2 files changed, 5 insertions(+), 20 deletions(-) diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index 4356bda7b..0fd22f199 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -163,9 +163,7 @@ def prospector( # noqa: C901 if len(candidates) > limit_candidates: logger.error(f"Number of candidates exceeds {limit_candidates}, aborting.") - ConsoleWriter.print( - f"Candidates limit exceeded: {len(candidates)}.", - ) + ConsoleWriter.print(f"Candidates limit exceeded: {len(candidates)}.") return None, len(candidates) with ExecutionTimer( diff --git a/prospector/llm/llm_service_test.py b/prospector/llm/llm_service_test.py index 768f9ecad..e3aeb61c8 100644 ---
a/prospector/llm/llm_service_test.py +++ b/prospector/llm/llm_service_test.py @@ -56,7 +56,11 @@ def reset_singletons(): def mock_environment_variables(): mp = pytest.MonkeyPatch() mp.setenv("GPT_4_URL", "https://deployment.url.com") + mp.setenv("MISTRAL_LARGE_URL", "https://deployment.url.com") mp.setenv("GEMINI_1.0_PRO_URL", "https://deployment.url.com") + mp.setenv("OPENAI_API_KEY", "https://deployment.url.com") + mp.setenv("GOOGLE_API_KEY", "https://deployment.url.com") + mp.setenv("MISTRAL_API_KEY", "https://deployment.url.com") class TestModel: @@ -132,23 +136,6 @@ def test_singleton_retains_state(self): ai_core_sk_filepath="example.json", ), "LLMService should retain state between instantiations" - def test_get_repository_url(self): - config = Config("sap", "gpt-4", 0.0, "example.json") - service = LLMService(config) - # Reassign the mock model to the service - model = MockLLM( - model_name="gpt-4", - deployment_url="deployment_url_placeholder", - temperature=0.7, - ai_core_sk_filepath="example.json", - ) - service.model = model - - assert ( - service.get_repository_url("advisory description", "advisory_references") - == "https://www.example.com" - ) - def test_reuse_singleton_without_config(self): config = Config("sap", "gpt-4", 0.0, "example.json") service = LLMService(config) From 2c78f81fdd25537df3d3140876daf93ade6b3c12 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 21 Jun 2024 14:58:18 +0000 Subject: [PATCH 40/83] small modification to prompt --- prospector/llm/prompts.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prospector/llm/prompts.py b/prospector/llm/prompts.py index f9c083599..dd1c9c663 100644 --- a/prospector/llm/prompts.py +++ b/prospector/llm/prompts.py @@ -38,7 +38,7 @@ {description} {references} -If you cannot find the URL, return your best guess of what the repository URL could be. Use any hints (eg. the mention of GitHub or GitLab) in the CVE description and references. Return nothing but the URL. 
+If you cannot find the URL, return your best guess of what the repository URL could be. Use any hints (eg. the mention of GitHub or GitLab) in the CVE description and references. Do not return the delimiters. Do not return delimiters. Return nothing but the URL. """, input_variables=["description", "references"], metadata={"name": "prompt_best_guess"}, From 509c80fdd63eb72a42d9cf68ea25cc229c2367d1 Mon Sep 17 00:00:00 2001 From: I748376 Date: Fri, 21 Jun 2024 15:24:14 +0000 Subject: [PATCH 41/83] removes pip options --- prospector/requirements.txt | 3 --- 1 file changed, 3 deletions(-) diff --git a/prospector/requirements.txt b/prospector/requirements.txt index 21075aed6..0ca435446 100644 --- a/prospector/requirements.txt +++ b/prospector/requirements.txt @@ -4,9 +4,6 @@ # # pip-compile --no-annotate --strip-extras # -# --extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/deploy-releases-pypi/simple -# --extra-index-url https://int.repositories.cloud.sap/artifactory/api/pypi/proxy-deploy-releases-hyperspace-pypi/simple -# --trusted-host int.repositories.cloud.sap aiohttp==3.9.5 aiosignal==1.3.1 From e7de349054a6bb4bf6f424de071a8804de33fd00 Mon Sep 17 00:00:00 2001 From: I748376 Date: Mon, 24 Jun 2024 08:19:38 +0000 Subject: [PATCH 42/83] adds more OpenAI models for third party providers --- prospector/llm/instantiation.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/prospector/llm/instantiation.py b/prospector/llm/instantiation.py index 045caa4e8..db924e722 100644 --- a/prospector/llm/instantiation.py +++ b/prospector/llm/instantiation.py @@ -30,7 +30,10 @@ THIRD_PARTY_MAPPING = { + "gpt-4-turbo": (ChatOpenAI, "OPENAI_API_KEY"), + "gpt-4o": (ChatOpenAI, "OPENAI_API_KEY"), "gpt-4": (ChatOpenAI, "OPENAI_API_KEY"), + "gpt-3.5-turbo-0125": (ChatOpenAI, "OPENAI_API_KEY"), "gpt-3.5-turbo": (ChatOpenAI, "OPENAI_API_KEY"), "gemini-pro": (ChatVertexAI, "GOOGLE_API_KEY"), "mistral-large-latest": (ChatMistralAI, "MISTRAL_API_KEY"), From 
1566f8202126892cdb10d30e2c4f898c16f3207b Mon Sep 17 00:00:00 2001 From: I748376 Date: Mon, 24 Jun 2024 08:19:57 +0000 Subject: [PATCH 43/83] removes unused mock class --- prospector/llm/llm_service_test.py | 18 ------------------ 1 file changed, 18 deletions(-) diff --git a/prospector/llm/llm_service_test.py b/prospector/llm/llm_service_test.py index e3aeb61c8..0b43af860 100644 --- a/prospector/llm/llm_service_test.py +++ b/prospector/llm/llm_service_test.py @@ -28,24 +28,6 @@ def __init__(self, type, model_name, temperature, ai_core_sk): self.ai_core_sk = ai_core_sk -test_vuln_id = "CVE-2024-32480" - - -# Mock a SAP LLM -class MockLLM(LLM): - def _call( - self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any - ) -> str: - - url = "https://www.example.com" - - return url - - @property - def _llm_type(self) -> str: - return "custom" - - @pytest.fixture(autouse=True) def reset_singletons(): # Clean up singleton instances after each test From 568030dbe54a973f02dcc158cb16516e9527f4cb Mon Sep 17 00:00:00 2001 From: I748376 Date: Mon, 24 Jun 2024 09:13:03 +0000 Subject: [PATCH 44/83] makes sure that --repository flag overrides config.yaml file --- prospector/util/config_parser.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index 7de68a15c..aace08cbf 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -266,6 +266,9 @@ def get_configuration(argv): sys.exit( "No configuration file found, or error in configuration file. Check logs." 
) + # --repository in CL overrides config.yaml settings for LLM usage + if args.repository: + conf.llm_service.use_llm_repository_url = False try: config = Config( vuln_id=args.vuln_id, From 4ea23b15133a7924eb8fae010e05fe85d916f176 Mon Sep 17 00:00:00 2001 From: I748376 Date: Mon, 24 Jun 2024 09:42:19 +0000 Subject: [PATCH 45/83] conceptional change: when llm_service is set in config.yaml, prospector assumes that LLM support is wanted. To not use LLM support, comment out the llm_service block in config.yaml --- prospector/README.md | 15 +++++++-------- prospector/cli/main.py | 21 ++++++++++----------- prospector/config-sample.yaml | 12 ++++++------ 3 files changed, 23 insertions(+), 25 deletions(-) diff --git a/prospector/README.md b/prospector/README.md index 6fd03cc35..968e9be13 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -57,11 +57,11 @@ To quickly set up Prospector, follow these steps. This will run Prospector in it ### 🤖 LLM Support -To use Prospector with LLM support, you must specify required parameters for API access to the LLM. These parameters can vary depending on your choice of provider, please follow what fits your needs: +To use Prospector with LLM support, you simply set required parameters for the API access to the LLM in *config.yaml*. These parameters can vary depending on your choice of provider, please follow what fits your needs (drop-downs below). If you do not want to use LLM support, keep the `llm_service` block in your *config.yaml* file commented out.
Use SAP AI CORE SDK -You will need the following parameters in `config.yaml`: +You will need the following parameters in *config.yaml*: ```yaml llm_service: @@ -81,7 +81,7 @@ For example, for gpt-4's deployment URL, set an environment variable called `GPT The `temperature` parameter is optional. The default value is 0.0, but you can change it to something else. -You also need to point the `ai_core_sk` parameter to a file containing the secret keys. This file is available in Passvault. +You also need to point the `ai_core_sk` parameter to a file containing the secret keys.
@@ -89,7 +89,7 @@ You also need to point the `ai_core_sk` parameter to a file contianing the secre Implemented third party providers are **OpenAI**, **Google** and **Mistral**. -1. You will need the following parameters in `config.yaml`: +1. You will need the following parameters in *config.yaml*: ```yaml llm_service: type: third_party @@ -110,10 +110,9 @@ Implemented third party providers are **OpenAI**, **Google** and **Mistral**. #### -You can set the `use_llm_<...>` parameters in `config.yaml` for fine-grained control over LLM support in various aspects of Prospector's phases. Each `use_llm_<...>` parameter allows you to enable or disable LLM support for a specific aspect: +You can set the `use_llm_<...>` parameters in *config.yaml* for fine-grained control over LLM support in various aspects of Prospector's phases. Each `use_llm_<...>` parameter allows you to enable or disable LLM support for a specific aspect: -- **`use_llm_repository_url`**: Choose whether LLMs should be used to obtain the repository URL. When not using this option, please provide `--repository` as a command line argument. -- **`use_llm_commit_rule`**: Choose whether an additional rule should be applied after the other rules, which checks if a commit is security relevant. This rule invokes an LLM-powered service, which takes the diff of a commit and returns whether it is security-relevant or not. Whichever model and temperature is specified in `config.yaml`, will also be used in this rule. +- **`use_llm_repository_url`**: Choose whether LLMs should be used to obtain the repository URL. When using this option, you can omit the `--repository` flag as a command line argument and run prospector with `./run_prospector.sh CVE-2020-1925`. ## 👩‍💻 Development Setup @@ -143,7 +142,7 @@ Afterwards, you will just have to set the environment variables using the `.env` set -a; source .env; set +a ``` -You can configure prospector from CLI or from the `config.yaml` file. 
The (recommended) API Keys for Github and the NVD can be configured from the `.env` file (which must then be sourced with `set -a; source .env; set +a`) +You can configure prospector from CLI or from the *config.yaml* file. The (recommended) API Keys for Github and the NVD can be configured from the `.env` file (which must then be sourced with `set -a; source .env; set +a`) If at any time you wish to use a different version of the python interpreter, beware that the `requirements.txt` file contains the exact versioning for `python 3.10.6`. diff --git a/prospector/cli/main.py b/prospector/cli/main.py index d8760ccb7..696a51e06 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -68,17 +68,16 @@ def main(argv): # noqa: C901 ) return - # If at least one 'use_llm' option is set, then create an LLMService singleton - if any([config.llm_service.use_llm_repository_url]): - try: - LLMService(config.llm_service) - except Exception as e: - logger.error(f"Problem with LLMService instantiation: {e}") - console.print( - "LLMService could not be created. Check logs.", - status=MessageStatus.ERROR, - ) - return + # Create the LLMService singleton for later use + try: + LLMService(config.llm_service) + except Exception as e: + logger.error(f"Problem with LLMService instantiation: {e}") + console.print( + "LLMService could not be created. 
Check logs.", + status=MessageStatus.ERROR, + ) + return config.pub_date = ( config.pub_date + "T00:00:00Z" if config.pub_date is not None else "" diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 105aa23f6..4faa61c8a 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -28,13 +28,13 @@ database: redis_url: redis://redis:6379/0 # LLM Usage (check README for help) -llm_service: - type: sap # use "sap" or "third_party" - model_name: gpt-4-turbo - temperature: 0.0 # optional, default is 0.0 - ai_core_sk: # needed for type: sap +# llm_service: +# type: sap # use "sap" or "third_party" +# model_name: gpt-4-turbo +# temperature: 0.0 # optional, default is 0.0 +# ai_core_sk: # needed for type: sap - use_llm_repository_url: False # whether to use LLM's to obtain the repository URL +# use_llm_repository_url: False # whether to use LLM's to obtain the repository URL # Report file format: "html", "json", "console" or "all" # and the file name From d218d06540de4ae9ff2152d95391a86ce99a6bcf Mon Sep 17 00:00:00 2001 From: I748376 Date: Tue, 2 Jul 2024 12:27:23 +0000 Subject: [PATCH 46/83] adds more fine-grained request error raising --- prospector/llm/models/gemini.py | 28 +++++++++++++++++++++------- prospector/llm/models/mistral.py | 28 +++++++++++++++++++++------- prospector/llm/models/openai.py | 28 +++++++++++++++++++++------- 3 files changed, 63 insertions(+), 21 deletions(-) diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py index 4ab85c652..147086254 100644 --- a/prospector/llm/models/gemini.py +++ b/prospector/llm/models/gemini.py @@ -58,15 +58,29 @@ def _call( ], } - response = requests.post(endpoint, headers=headers, json=data) - - if not response.status_code == 200: + try: + response = requests.post(endpoint, headers=headers, json=data) + return self.parse(response.json()) + except requests.exceptions.HTTPError as http_error: logger.error( - f"Invalid response from AI Core API with 
error code {response.status_code}" + f"HTTP error occurred when sending a request through AI Core: {http_error}" ) - raise Exception("Invalid response from AI Core API.") - - return self.parse(response.json()) + raise + except requests.exceptions.Timeout as timeout_err: + logger.error( + f"Timeout error occured when sending a request through AI Core: {timeout_err}" + ) + raise + except requests.exceptions.ConnectionError as conn_err: + logger.error( + f"Connection error occurred when sending a request through AI Core: {conn_err}" + ) + raise + except requests.exceptions.RequestException as req_err: + logger.error( + f"A request error occured when sending a request through AI Core: {req_err}" + ) + raise def parse(self, message) -> str: """Parse the returned JSON object from OpenAI.""" diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py index a413fc316..9708d8e31 100644 --- a/prospector/llm/models/mistral.py +++ b/prospector/llm/models/mistral.py @@ -39,15 +39,29 @@ def _call( "messages": [{"role": "user", "content": prompt}], } - response = requests.post(endpoint, headers=headers, json=data) - - if not response.status_code == 200: + try: + response = requests.post(endpoint, headers=headers, json=data) + return self.parse(response.json()) + except requests.exceptions.HTTPError as http_error: logger.error( - f"Invalid response from AI Core API with error code {response.status_code}" + f"HTTP error occurred when sending a request through AI Core: {http_error}" ) - raise Exception("Invalid response from AI Core API.") - - return self.parse(response.json()) + raise + except requests.exceptions.Timeout as timeout_err: + logger.error( + f"Timeout error occured when sending a request through AI Core: {timeout_err}" + ) + raise + except requests.exceptions.ConnectionError as conn_err: + logger.error( + f"Connection error occurred when sending a request through AI Core: {conn_err}" + ) + raise + except requests.exceptions.RequestException as 
req_err: + logger.error( + f"A request error occured when sending a request through AI Core: {req_err}" + ) + raise def parse(self, message) -> str: """Parse the returned JSON object from OpenAI.""" diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py index f3e132dfe..76d95ef5b 100644 --- a/prospector/llm/models/openai.py +++ b/prospector/llm/models/openai.py @@ -42,15 +42,29 @@ def _call( "temperature": self.temperature, } - response = requests.post(endpoint, headers=headers, json=data) - - if not response.status_code == 200: + try: + response = requests.post(endpoint, headers=headers, json=data) + return self.parse(response.json()) + except requests.exceptions.HTTPError as http_error: logger.error( - f"Invalid response from AI Core API with error code {response.status_code}" + f"HTTP error occurred when sending a request through AI Core: {http_error}" ) - raise Exception("Invalid response from AI Core API.") - - return self.parse(response.json()) + raise + except requests.exceptions.Timeout as timeout_err: + logger.error( + f"Timeout error occured when sending a request through AI Core: {timeout_err}" + ) + raise + except requests.exceptions.ConnectionError as conn_err: + logger.error( + f"Connection error occurred when sending a request through AI Core: {conn_err}" + ) + raise + except requests.exceptions.RequestException as req_err: + logger.error( + f"A request error occured when sending a request through AI Core: {req_err}" + ) + raise def parse(self, message) -> str: """Parse the returned JSON object from OpenAI.""" From 5a806df057bbaec190bbdae2cec4b532e05361ff Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 4 Jul 2024 12:43:01 +0000 Subject: [PATCH 47/83] adds a check using regex to remove delimiters if they are returned by the LLM --- prospector/llm/llm_service.py | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index 3804a1012..cbc4e69e4 100644 --- 
a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -1,3 +1,5 @@ +import re + import validators from langchain_core.language_models.llms import LLM from langchain_core.output_parsers import StrOutputParser @@ -59,6 +61,12 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: ) logger.info(f"LLM returned the following URL: {url}") + # delimiters are often returned by the LLM, remove them, if the case + pattern = r"\s*(https?://[^\s]+)\s*" + match = re.search(pattern, url) + if match: + return match.group(1) + if not validators.url(url): raise TypeError(f"LLM returned invalid URL: {url}") From 26336957795447abdb52ef8bc17e907a3e65b6a6 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 4 Jul 2024 13:00:25 +0000 Subject: [PATCH 48/83] updates tests to take new version tags into account --- prospector/git/git_test.py | 2 +- prospector/git/raw_commit_test.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/prospector/git/git_test.py b/prospector/git/git_test.py index 4116a6e97..fbfdbcc86 100644 --- a/prospector/git/git_test.py +++ b/prospector/git/git_test.py @@ -43,7 +43,7 @@ def test_get_tags_for_commit(repository: Git): if commit is not None: tags = commit.find_tags() print(tags) - assert len(tags) == 106 + assert len(tags) >= 106 assert "10.2" in tags and "11.3" in tags and "9.4" in tags diff --git a/prospector/git/raw_commit_test.py b/prospector/git/raw_commit_test.py index c48f454f1..534431e94 100644 --- a/prospector/git/raw_commit_test.py +++ b/prospector/git/raw_commit_test.py @@ -26,7 +26,7 @@ def commit(): def test_find_tags(commit: RawCommit): tags = commit.find_tags() - assert len(tags) == 106 + assert len(tags) >= 106 assert "10.2" in tags and "11.3" in tags and "9.4" in tags From eb8e552a5b162d6f9e919a6da2e4c4b289848f5e Mon Sep 17 00:00:00 2001 From: Laura Schauer Date: Thu, 11 Jul 2024 14:49:59 +0200 Subject: [PATCH 49/83] Implements Rule Phases (#395) 1. 
A set of rules to apply can now be selected in config.yaml. Initially, it is set to all rules except the ones requiring llm_service (Phase 2 rules). 2. Rules are now applied in phases. All original Prospector rules are applied in "Phase 1" to all commits. Phase 2 applies its rules only to a subset of the ranked commits from Phase 1. --- prospector/cli/main.py | 3 +- prospector/config-sample.yaml | 21 ++++ prospector/core/prospector.py | 32 ++++-- prospector/datamodel/commit.py | 1 + prospector/llm/models/gemini.py | 1 + prospector/llm/models/mistral.py | 1 + prospector/llm/models/openai.py | 1 + prospector/requirements.in | 1 + prospector/requirements.txt | 2 +- prospector/rules/rules.py | 87 +++++++-------- prospector/rules/rules_test.py | 180 +++++++++++++++---------------- prospector/stats/collection.py | 6 ++ prospector/util/config_parser.py | 6 +- 13 files changed, 188 insertions(+), 154 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index 696a51e06..2cdeac7d4 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -68,7 +68,7 @@ def main(argv): # noqa: C901 ) return - # Create the LLMService singleton for later use + # Create the LLMService Singleton for later use try: LLMService(config.llm_service) except Exception as e: @@ -104,6 +104,7 @@ def main(argv): # noqa: C901 limit_candidates=config.max_candidates, # ignore_adv_refs=config.ignore_refs, use_llm_repository_url=config.llm_service.use_llm_repository_url, + enabled_rules=config.enabled_rules, ) if config.preprocess_only: diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 4faa61c8a..f89f4b369 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -36,6 +36,27 @@ redis_url: redis://redis:6379/0 # use_llm_repository_url: False # whether to use LLM's to obtain the repository URL +enabled_rules: + # Phase 1 Rules + - VULN_ID_IN_MESSAGE + - XREF_BUG + - XREF_GH + - COMMIT_IN_REFERENCE + - VULN_ID_IN_LINKED_ISSUE + 
- CHANGES_RELEVANT_FILES + - CHANGES_RELEVANT_CODE + - RELEVANT_WORDS_IN_MESSAGE + - ADV_KEYWORDS_IN_FILES + - ADV_KEYWORDS_IN_MSG + - SEC_KEYWORDS_IN_MESSAGE + - SEC_KEYWORDS_IN_LINKED_GH + - SEC_KEYWORDS_IN_LINKED_BUG + - GITHUB_ISSUE_IN_MESSAGE + - BUG_IN_MESSAGE + - COMMIT_HAS_TWINS + # Phase 2 Rules (llm_service required!): + # - COMMIT_IS_SECURITY_RELEVANT + # Report file format: "html", "json", "console" or "all" # and the file name report: diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index 0fd22f199..3576fb749 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -2,10 +2,9 @@ import logging import os -import re import sys import time -from typing import DefaultDict, Dict, List, Set, Tuple +from typing import Dict, List, Set, Tuple from urllib.parse import urlparse import requests @@ -13,14 +12,14 @@ from cli.console import ConsoleWriter, MessageStatus from datamodel.advisory import AdvisoryRecord, build_advisory_record -from datamodel.commit import Commit, apply_ranking, make_from_raw_commit +from datamodel.commit import Commit, make_from_raw_commit from filtering.filter import filter_commits from git.git import Git from git.raw_commit import RawCommit from git.version_to_tag import get_possible_tags from llm.llm_service import LLMService from log.logger import get_level, logger, pretty_log -from rules.rules import apply_rules +from rules.rules import RULES_PHASE_1, apply_rules from stats.execution import ( Counter, ExecutionTimer, @@ -66,7 +65,7 @@ def prospector( # noqa: C901 use_backend: str = USE_BACKEND_ALWAYS, git_cache: str = "/tmp/git_cache", limit_candidates: int = MAX_CANDIDATES, - rules: List[str] = ["ALL"], + enabled_rules: List[str] = [rule.id for rule in RULES_PHASE_1], tag_commits: bool = True, silent: bool = False, use_llm_repository_url: bool = False, @@ -231,7 +230,9 @@ def prospector( # noqa: C901 else: logger.warning("Preprocessed commits are not being sent to backend") - 
ranked_candidates = evaluate_commits(preprocessed_commits, advisory_record, rules) + ranked_candidates = evaluate_commits( + preprocessed_commits, advisory_record, enabled_rules + ) # ConsoleWriter.print("Commit ranking and aggregation...") ranked_candidates = remove_twins(ranked_candidates) @@ -267,11 +268,26 @@ def filter(commits: Dict[str, RawCommit]) -> Dict[str, RawCommit]: def evaluate_commits( - commits: List[Commit], advisory: AdvisoryRecord, rules: List[str] + commits: List[Commit], advisory: AdvisoryRecord, enabled_rules: List[str] ) -> List[Commit]: + """This method applies the rule phases. Each phase is associated with a set of rules: + - Phase 1: Original rules + - Phase 2: Rules using the LLMService + + Args: + commits: the list of candidate commits that rules hsould be applied to + advisory: the object contianing all information about the advisory + enabled_rules: a (sub)set of rules to run (to set in config.yaml) + + Returns: + a list of commits ranked according to their relevance score + + Raises: + MissingMandatoryValue: if there is an error in the LLM configuration object + """ with ExecutionTimer(core_statistics.sub_collection("candidates analysis")): with ConsoleWriter("Candidate analysis") as _: - ranked_commits = apply_ranking(apply_rules(commits, advisory, rules=rules)) + ranked_commits = apply_rules(commits, advisory, enabled_rules=enabled_rules) return ranked_commits diff --git a/prospector/datamodel/commit.py b/prospector/datamodel/commit.py index 070a31921..0f1fd1fe8 100644 --- a/prospector/datamodel/commit.py +++ b/prospector/datamodel/commit.py @@ -52,6 +52,7 @@ def __eq__(self, other: "Commit") -> bool: return self.relevance == other.relevance def add_match(self, rule: Dict[str, Any]): + """Adds a rule to the commit's matched rules. 
Makes sure that the rule is added in order of relevance.""" for i, r in enumerate(self.matched_rules): if rule["relevance"] == r["relevance"]: self.matched_rules.insert(i, rule) diff --git a/prospector/llm/models/gemini.py b/prospector/llm/models/gemini.py index 147086254..ab8135729 100644 --- a/prospector/llm/models/gemini.py +++ b/prospector/llm/models/gemini.py @@ -60,6 +60,7 @@ def _call( try: response = requests.post(endpoint, headers=headers, json=data) + response.raise_for_status() return self.parse(response.json()) except requests.exceptions.HTTPError as http_error: logger.error( diff --git a/prospector/llm/models/mistral.py b/prospector/llm/models/mistral.py index 9708d8e31..42a90dcc3 100644 --- a/prospector/llm/models/mistral.py +++ b/prospector/llm/models/mistral.py @@ -41,6 +41,7 @@ def _call( try: response = requests.post(endpoint, headers=headers, json=data) + response.raise_for_status() return self.parse(response.json()) except requests.exceptions.HTTPError as http_error: logger.error( diff --git a/prospector/llm/models/openai.py b/prospector/llm/models/openai.py index 76d95ef5b..ae78fbc28 100644 --- a/prospector/llm/models/openai.py +++ b/prospector/llm/models/openai.py @@ -44,6 +44,7 @@ def _call( try: response = requests.post(endpoint, headers=headers, json=data) + response.raise_for_status() return self.parse(response.json()) except requests.exceptions.HTTPError as http_error: logger.error( diff --git a/prospector/requirements.in b/prospector/requirements.in index 6d7d7f4b3..720c5295e 100644 --- a/prospector/requirements.in +++ b/prospector/requirements.in @@ -3,6 +3,7 @@ beautifulsoup4 colorama datasketch fastapi +google-cloud-aiplatform==1.49.0 Jinja2 langchain langchain_openai diff --git a/prospector/requirements.txt b/prospector/requirements.txt index 0ca435446..1a5a11dee 100644 --- a/prospector/requirements.txt +++ b/prospector/requirements.txt @@ -39,7 +39,7 @@ frozenlist==1.4.1 fsspec==2024.6.0 google-api-core==2.19.0 google-auth==2.29.0 
-google-cloud-aiplatform==1.53.0 +google-cloud-aiplatform==1.49.0 google-cloud-bigquery==3.24.0 google-cloud-core==2.4.1 google-cloud-resource-manager==1.12.3 diff --git a/prospector/rules/rules.py b/prospector/rules/rules.py index fb190c9ff..80496c812 100644 --- a/prospector/rules/rules.py +++ b/prospector/rules/rules.py @@ -2,23 +2,31 @@ from abc import abstractmethod from typing import List, Tuple +import requests + from datamodel.advisory import AdvisoryRecord -from datamodel.commit import Commit -from datamodel.nlp import clean_string, find_similar_words +from datamodel.commit import Commit, apply_ranking +from llm.llm_service import LLMService from rules.helpers import extract_security_keywords from stats.execution import Counter, execution_statistics from util.lsh import build_lsh_index, decode_minhash +NUM_COMMITS_PHASE_2 = ( + 10 # Determines how many candidates the second rule phase is applied to +) + + rule_statistics = execution_statistics.sub_collection("rules") class Rule: lsh_index = None + llm_service: LLMService = None def __init__(self, id: str, relevance: int): self.id = id - self.message = "" self.relevance = relevance + self.message = "" @abstractmethod def apply(self, candidate: Commit, advisory_record: AdvisoryRecord) -> bool: @@ -37,54 +45,50 @@ def as_dict(self): def get_rule_as_tuple(self) -> Tuple[str, str, int]: return (self.id, self.message, self.relevance) + def get_id(self): + return self.id + def apply_rules( candidates: List[Commit], advisory_record: AdvisoryRecord, - rules=["ALL"], + enabled_rules: List[str] = [], ) -> List[Commit]: - enabled_rules = get_enabled_rules(rules) + """Applies the selected set of rules and returns the ranked list of commits.""" - rule_statistics.collect("active", len(enabled_rules), unit="rules") + phase_1_rules = [rule for rule in RULES_PHASE_1 if rule.get_id() in enabled_rules] + phase_2_rules = [rule for rule in RULES_PHASE_2 if rule.get_id() in enabled_rules] - Rule.lsh_index = build_lsh_index() + if 
phase_2_rules: + Rule.llm_service = LLMService() + rule_statistics.collect( + "active", len(phase_1_rules) + len(phase_2_rules), unit="rules" + ) + + Rule.lsh_index = build_lsh_index() for candidate in candidates: Rule.lsh_index.insert(candidate.commit_id, decode_minhash(candidate.minhash)) with Counter(rule_statistics) as counter: counter.initialize("matches", unit="matches") for candidate in candidates: - for rule in enabled_rules: + for rule in phase_1_rules: if rule.apply(candidate, advisory_record): counter.increment("matches") candidate.add_match(rule.as_dict()) candidate.compute_relevance() - # for candidate in candidates: - # if candidate.has_twin(): - # for twin in candidate.twins: - # for other_candidate in candidates: - # if ( - # other_candidate.commit_id == twin[1] - # and other_candidate.relevance > candidate.relevance - # ): - # candidate.relevance = other_candidate.relevance - # # Add a reason on why we are doing this. + candidates = apply_ranking(candidates) - return candidates - - -def get_enabled_rules(rules: List[str]) -> List[Rule]: - if "ALL" in rules: - return RULES - - enabled_rules = [] - for r in RULES: - if r.id in rules: - enabled_rules.append(r) + for candidate in candidates[:NUM_COMMITS_PHASE_2]: + for rule in phase_2_rules: + if rule.apply(candidate): + counter.increment("matches") + candidate.add_match(rule.as_dict()) + candidate.compute_relevance() - return enabled_rules + return apply_ranking(candidates) # TODO: This could include issues, PRs, etc. 
@@ -409,7 +413,7 @@ def apply(self, candidate: Commit, advisory_record: AdvisoryRecord): return False -RULES: List[Rule] = [ +RULES_PHASE_1: List[Rule] = [ VulnIdInMessage("VULN_ID_IN_MESSAGE", 64), # CommitMentionedInAdv("COMMIT_IN_ADVISORY", 64), CrossReferencedBug("XREF_BUG", 32), @@ -429,23 +433,4 @@ def apply(self, candidate: Commit, advisory_record: AdvisoryRecord): CommitHasTwins("COMMIT_HAS_TWINS", 2), ] -rules_list = [ - "COMMIT_IN_REFERENCE", - "VULN_ID_IN_MESSAGE", - "VULN_ID_IN_LINKED_ISSUE", - "XREF_BUG", - "XREF_GH", - "CHANGES_RELEVANT_FILES", - "CHANGES_RELEVANT_CODE", - "RELEVANT_WORDS_IN_MESSAGE", - "ADV_KEYWORDS_IN_FILES", - "ADV_KEYWORDS_IN_MSG", - "SEC_KEYWORDS_IN_MESSAGE", - "SEC_KEYWORDS_IN_LINKED_GH", - "SEC_KEYWORDS_IN_LINKED_BUG", - "GITHUB_ISSUE_IN_MESSAGE", - "BUG_IN_MESSAGE", - "COMMIT_HAS_TWINS", -] - -# print(" & ".join([f"\\rot{{{x}}}" for x in rules_list])) +RULES_PHASE_2: List[Rule] = [] diff --git a/prospector/rules/rules_test.py b/prospector/rules/rules_test.py index 63ff21e16..230c351e0 100644 --- a/prospector/rules/rules_test.py +++ b/prospector/rules/rules_test.py @@ -1,38 +1,87 @@ from typing import List import pytest +from typing import Optional from datamodel.advisory import AdvisoryRecord from datamodel.commit import Commit -from rules.rules import apply_rules +from rules.rules import RULES_PHASE_1, apply_rules +from util.lsh import get_encoded_minhash # from datamodel.commit_features import CommitWithFeatures +MOCK_CVE_ID = "CVE-2020-26258" + +enabled_rules_from_config = [ + "VULN_ID_IN_MESSAGE", + "XREF_BUG", + "XREF_GH", + "COMMIT_IN_REFERENCE", + "VULN_ID_IN_LINKED_ISSUE", + "CHANGES_RELEVANT_FILES", + "CHANGES_RELEVANT_CODE", + "RELEVANT_WORDS_IN_MESSAGE", + "ADV_KEYWORDS_IN_FILES", + "ADV_KEYWORDS_IN_MSG", + "SEC_KEYWORDS_IN_MESSAGE", + "SEC_KEYWORDS_IN_LINKED_GH", + "SEC_KEYWORDS_IN_LINKED_BUG", + "GITHUB_ISSUE_IN_MESSAGE", + "BUG_IN_MESSAGE", + "COMMIT_HAS_TWINS", +] + + +def get_msg(text, limit_length: 
Optional[int] = None) -> str: + return text[:limit_length] if limit_length else text + @pytest.fixture def candidates(): return [ + # Should match: VulnIdInMessage, ReferencesGhIssue Commit( repository="repo1", - commit_id="1", - message="Blah blah blah fixes CVE-2020-26258 and a few other issues", + commit_id="1234567890", + message=f"Blah blah blah fixes {MOCK_CVE_ID} and a few other issues", ghissue_refs={"example": ""}, changed_files={"foo/bar/otherthing.xml", "pom.xml"}, - cve_refs=["CVE-2020-26258"], + cve_refs=[f"{MOCK_CVE_ID}"], + minhash=get_encoded_minhash( + get_msg( + f"Blah blah blah fixes {MOCK_CVE_ID} and a few other issues", + 50, + ) + ), ), - Commit(repository="repo2", commit_id="2", cve_refs=["CVE-2020-26258"]), + Commit( + repository="repo2", + commit_id="2234567890", + message="", + minhash=get_encoded_minhash(get_msg("")), + ), + # Should match: VulnIdInMessage, ReferencesGhIssue Commit( repository="repo3", - commit_id="3", - message="Another commit that fixes CVE-2020-26258", + commit_id="3234567890", + message=f"Another commit that fixes {MOCK_CVE_ID}", ghissue_refs={"example": ""}, + cve_refs=[f"{MOCK_CVE_ID}"], + minhash=get_encoded_minhash( + get_msg(f"Another commit that fixes {MOCK_CVE_ID}", 50) + ), ), + # Should match: SecurityKeywordsInMsg Commit( repository="repo4", - commit_id="4", + commit_id="4234567890", message="Endless loop causes DoS vulnerability", changed_files={"foo/bar/otherthing.xml", "pom.xml"}, + minhash=get_encoded_minhash( + get_msg("Endless loop causes DoS vulnerability", 50) + ), ), + # Should match: AdvKeywordsInFiles, SecurityKeywordsInMsg, CommitMentionedInReference Commit( repository="repo5", commit_id="7532d2fb0d6081a12c2a48ec854a81a8b718be62", @@ -40,111 +89,58 @@ def candidates(): changed_files={ "core/src/main/java/org/apache/cxf/workqueue/AutomaticWorkQueueImpl.java" }, + minhash=get_encoded_minhash(get_msg("Insecure deserialization", 50)), ), + # TODO: Not matched by existing tests: 
GHSecurityAdvInMessage, ReferencesBug, ChangesRelevantCode, TwinMentionedInAdv, VulnIdInLinkedIssue, SecurityKeywordInLinkedGhIssue, SecurityKeywordInLinkedBug, CrossReferencedBug, CrossReferencedGh, CommitHasTwins, ChangesRelevantFiles, CommitMentionedInAdv, RelevantWordsInMessage ] @pytest.fixture def advisory_record(): return AdvisoryRecord( - vulnerability_id="CVE-2020-26258", + cve_id=f"{MOCK_CVE_ID}", repository_url="https://github.com/apache/struts", published_timestamp=1607532756, - references=["https://reference.to/some/commit/7532d2fb0d60"], + references={ + "https://reference.to/some/commit/7532d2fb0d60": 1, + }, keywords=["AutomaticWorkQueueImpl"], - paths=["pom.xml"], + # paths=["pom.xml"], ) -def test_apply_rules_all(candidates: List[Commit], advisory_record: AdvisoryRecord): - annotated_candidates = apply_rules(candidates, advisory_record) - - assert len(annotated_candidates[0].matched_rules) == 4 - assert annotated_candidates[0].matched_rules[0][0] == "CVE_ID_IN_MESSAGE" - assert "CVE-2020-26258" in annotated_candidates[0].matched_rules[0][1] - - # assert len(annotated_candidates[0].annotations) > 0 - # assert "REF_ADV_VULN_ID" in annotated_candidates[0].annotations - # assert "REF_GH_ISSUE" in annotated_candidates[0].annotations - # assert "CH_REL_PATH" in annotated_candidates[0].annotations - - # assert len(annotated_candidates[1].annotations) > 0 - # assert "REF_ADV_VULN_ID" in annotated_candidates[1].annotations - # assert "REF_GH_ISSUE" not in annotated_candidates[1].annotations - # assert "CH_REL_PATH" not in annotated_candidates[1].annotations - - # assert len(annotated_candidates[2].annotations) > 0 - # assert "REF_ADV_VULN_ID" not in annotated_candidates[2].annotations - # assert "REF_GH_ISSUE" in annotated_candidates[2].annotations - # assert "CH_REL_PATH" not in annotated_candidates[2].annotations - - # assert len(annotated_candidates[3].annotations) > 0 - # assert "REF_ADV_VULN_ID" not in annotated_candidates[3].annotations - # assert 
"REF_GH_ISSUE" not in annotated_candidates[3].annotations - # assert "CH_REL_PATH" in annotated_candidates[3].annotations - # assert "SEC_KEYWORD_IN_COMMIT_MSG" in annotated_candidates[3].annotations - - # assert "SEC_KEYWORD_IN_COMMIT_MSG" in annotated_candidates[4].annotations - # assert "TOKENS_IN_MODIFIED_PATHS" in annotated_candidates[4].annotations - # assert "COMMIT_MENTIONED_IN_ADV" in annotated_candidates[4].annotations - - -def test_apply_rules_selected( - candidates: List[Commit], advisory_record: AdvisoryRecord -): +def test_apply_phase_1_rules(candidates: List[Commit], advisory_record: AdvisoryRecord): annotated_candidates = apply_rules( - candidates=candidates, - advisory_record=advisory_record, - rules=[ - "REF_ADV_VULN_ID", - "REF_GH_ISSUE", - "CH_REL_PATH", - "SEC_KEYWORD_IN_COMMIT_MSG", - "TOKENS_IN_MODIFIED_PATHS", - "COMMIT_MENTIONED_IN_ADV", - ], + candidates, advisory_record, enabled_rules=enabled_rules_from_config ) - assert len(annotated_candidates[0].annotations) > 0 - assert "REF_ADV_VULN_ID" in annotated_candidates[0].annotations - assert "REF_GH_ISSUE" in annotated_candidates[0].annotations - assert "CH_REL_PATH" in annotated_candidates[0].annotations - - assert len(annotated_candidates[1].annotations) > 0 - assert "REF_ADV_VULN_ID" in annotated_candidates[1].annotations - assert "REF_GH_ISSUE" not in annotated_candidates[1].annotations - assert "CH_REL_PATH" not in annotated_candidates[1].annotations + # Repo 5: Should match: AdvKeywordsInFiles, SecurityKeywordsInMsg, CommitMentionedInReference + assert len(annotated_candidates[0].matched_rules) == 3 - assert len(annotated_candidates[2].annotations) > 0 - assert "REF_ADV_VULN_ID" not in annotated_candidates[2].annotations - assert "REF_GH_ISSUE" in annotated_candidates[2].annotations - assert "CH_REL_PATH" not in annotated_candidates[2].annotations + matched_rules_names = [item["id"] for item in annotated_candidates[0].matched_rules] + assert "ADV_KEYWORDS_IN_FILES" in 
matched_rules_names + assert "COMMIT_IN_REFERENCE" in matched_rules_names + assert "SEC_KEYWORDS_IN_MESSAGE" in matched_rules_names - assert len(annotated_candidates[3].annotations) > 0 - assert "REF_ADV_VULN_ID" not in annotated_candidates[3].annotations - assert "REF_GH_ISSUE" not in annotated_candidates[3].annotations - assert "CH_REL_PATH" in annotated_candidates[3].annotations - assert "SEC_KEYWORD_IN_COMMIT_MSG" in annotated_candidates[3].annotations + # Repo 1: Should match: VulnIdInMessage, ReferencesGhIssue + assert len(annotated_candidates[1].matched_rules) == 2 - assert "SEC_KEYWORD_IN_COMMIT_MSG" in annotated_candidates[4].annotations - assert "TOKENS_IN_MODIFIED_PATHS" in annotated_candidates[4].annotations - assert "COMMIT_MENTIONED_IN_ADV" in annotated_candidates[4].annotations + matched_rules_names = [item["id"] for item in annotated_candidates[1].matched_rules] + assert "VULN_ID_IN_MESSAGE" in matched_rules_names + assert "GITHUB_ISSUE_IN_MESSAGE" in matched_rules_names + # Repo 3: Should match: VulnIdInMessage, ReferencesGhIssue + assert len(annotated_candidates[2].matched_rules) == 2 -def test_apply_rules_selected_rules( - candidates: List[Commit], advisory_record: AdvisoryRecord -): - annotated_candidates = apply_rules( - candidates=candidates, - advisory_record=advisory_record, - rules=["ALL", "-REF_ADV_VULN_ID"], - ) + matched_rules_names = [item["id"] for item in annotated_candidates[2].matched_rules] + assert "VULN_ID_IN_MESSAGE" in matched_rules_names + assert "GITHUB_ISSUE_IN_MESSAGE" in matched_rules_names - assert len(annotated_candidates[0].annotations) > 0 - assert "REF_ADV_VULN_ID" not in annotated_candidates[0].annotations - assert "REF_GH_ISSUE" in annotated_candidates[0].annotations - assert "CH_REL_PATH" in annotated_candidates[0].annotations + # Repo 4: Should match: SecurityKeywordsInMsg + assert len(annotated_candidates[3].matched_rules) == 1 + matched_rules_names = [item["id"] for item in annotated_candidates[3].matched_rules] 
+ assert "SEC_KEYWORDS_IN_MESSAGE" in matched_rules_names -def test_sec_keywords_in_linked_issue(): - print("TODO") + # Repo 2: Matches nothing + assert len(annotated_candidates[4].matched_rules) == 0 diff --git a/prospector/stats/collection.py b/prospector/stats/collection.py index 2c4a8f81e..642490f95 100644 --- a/prospector/stats/collection.py +++ b/prospector/stats/collection.py @@ -8,6 +8,8 @@ class ForbiddenDuplication(ValueError): + """Custom Error for Collections""" + ... @@ -54,6 +56,10 @@ def _summarize_list(collection, unit: Optional[str] = None): class StatisticCollection(dict): + """The StatisticCollection can contain nested sub-collections, and each entry in the + collection can hold a list of values along with an optional unit. + """ + def __init__(self): super().__init__() self.units = {} diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index aace08cbf..a53a109b0 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -2,7 +2,7 @@ import os import sys from dataclasses import MISSING, dataclass -from typing import Optional +from typing import List, Optional from omegaconf import OmegaConf from omegaconf.errors import ( @@ -199,6 +199,7 @@ class ConfigSchema: report: ReportConfig = MISSING log_level: str = MISSING git_cache: str = MISSING + enabled_rules: List[str] = MISSING nvd_token: Optional[str] = None database: DatabaseConfig = DatabaseConfig( user="postgres", password="example", host="db", port=5432, dbname="postgres" @@ -232,6 +233,7 @@ def __init__( ping: bool, log_level: str, git_cache: str, + enabled_rules: List[str], ignore_refs: bool, llm_service: LLMServiceConfig, ): @@ -256,6 +258,7 @@ def __init__( self.ping = ping self.log_level = log_level self.git_cache = git_cache + self.enabled_rules = enabled_rules self.ignore_refs = ignore_refs @@ -291,6 +294,7 @@ def get_configuration(argv): report_filename=args.report_filename or conf.report.name, ping=args.ping, 
git_cache=conf.git_cache, + enabled_rules=conf.enabled_rules, log_level=args.log_level or conf.log_level, ignore_refs=args.ignore_refs, ) From 53446b0b7d4fd5d77e98bc1855015ab093bac62f Mon Sep 17 00:00:00 2001 From: Laura Schauer Date: Thu, 11 Jul 2024 14:50:37 +0200 Subject: [PATCH 50/83] Adds anthropic model (#396) Adds claude 3 opus for both sap and third party providers. --- prospector/README.md | 8 ++-- prospector/llm/instantiation.py | 4 ++ prospector/llm/models/anthropic.py | 74 ++++++++++++++++++++++++++++++ prospector/requirements.in | 1 + prospector/requirements.txt | 4 ++ 5 files changed, 88 insertions(+), 3 deletions(-) create mode 100644 prospector/llm/models/anthropic.py diff --git a/prospector/README.md b/prospector/README.md index 968e9be13..d1230216b 100644 --- a/prospector/README.md +++ b/prospector/README.md @@ -55,6 +55,7 @@ To quickly set up Prospector, follow these steps. This will run Prospector in it By default, Prospector saves the results in a HTML file named *prospector-report.html*. Open this file in a web browser to view what Prospector was able to find! + ### 🤖 LLM Support To use Prospector with LLM support, you simply set required parameters for the API access to the LLM in *config.yaml*. These parameters can vary depending on your choice of provider, please follow what fits your needs (drop-downs below). If you do not want to use LLM support, keep the `llm_service` block in your *config.yaml* file commented out. @@ -87,7 +88,7 @@ You also need to point the `ai_core_sk` parameter to a file contianing the secre
Use personal third party provider -Implemented third party providers are **OpenAI**, **Google** and **Mistral**. +Implemented third party providers are **OpenAI**, **Google**, **Mistral**, and **Anthropic**. 1. You will need the following parameters in *config.yaml*: ```yaml @@ -101,14 +102,15 @@ Implemented third party providers are **OpenAI**, **Google** and **Mistral**. 1. [OpenAI](https://platform.openai.com/docs/models) 2. [Google](https://ai.google.dev/gemini-api/docs/models/gemini) 3. [Mistral](https://docs.mistral.ai/getting-started/models/) + 4. [Anthropic](https://docs.anthropic.com/en/docs/about-claude/models) The `temperature` parameter is optional. The default value is 0.0, but you can change it to something else. -2. Make sure to add your OpenAI API key to your `.env` file as `[OPENAI|GOOGLE|MISTRAL]_API_KEY`. +2. Make sure to add the API key of your chosen provider to your `.env` file as `[OPENAI|GOOGLE|MISTRAL|ANTHROPIC]_API_KEY`. 
-#### +#### How to use LLM Support for different things You can set the `use_llm_<...>` parameters in *config.yaml* for fine-grained control over LLM support in various aspects of Prospector's phases. Each `use_llm_<...>` parameter allows you to enable or disable LLM support for a specific aspect: diff --git a/prospector/llm/instantiation.py b/prospector/llm/instantiation.py index db924e722..6331e439c 100644 --- a/prospector/llm/instantiation.py +++ b/prospector/llm/instantiation.py @@ -4,11 +4,13 @@ import requests from dotenv import load_dotenv +from langchain_anthropic import ChatAnthropic from langchain_core.language_models.llms import LLM from langchain_google_vertexai import ChatVertexAI from langchain_mistralai import ChatMistralAI from langchain_openai import ChatOpenAI +from llm.models.anthropic import Anthropic from llm.models.gemini import Gemini from llm.models.mistral import Mistral from llm.models.openai import OpenAI @@ -26,6 +28,7 @@ # "gpt-4o": OpenAI, # currently TBD "gemini-1.0-pro": Gemini, "mistral-large": Mistral, + "claude-3-opus": Anthropic, } @@ -37,6 +40,7 @@ "gpt-3.5-turbo": (ChatOpenAI, "OPENAI_API_KEY"), "gemini-pro": (ChatVertexAI, "GOOGLE_API_KEY"), "mistral-large-latest": (ChatMistralAI, "MISTRAL_API_KEY"), + "claude-3-opus-20240229": (ChatAnthropic, "ANTHROPIC_API_KEY"), } diff --git a/prospector/llm/models/anthropic.py b/prospector/llm/models/anthropic.py new file mode 100644 index 000000000..3fdee294d --- /dev/null +++ b/prospector/llm/models/anthropic.py @@ -0,0 +1,74 @@ +from typing import Any, Dict, List, Optional + +import requests +from langchain_core.language_models.llms import LLM + +import llm.instantiation as instantiation +from log.logger import logger + + +class Anthropic(LLM): + model_name: str + deployment_url: str + temperature: float + ai_core_sk_filepath: str + + @property + def _llm_type(self) -> str: + return "SAP Anthropic" + + @property + def _identifying_params(self) -> Dict[str, Any]: + """Return a dictionary 
of identifying parameters.""" + return { + "model_name": self.model_name, + "deployment_url": self.deployment_url, + "temperature": self.temperature, + "ai_core_sk_filepath": self.ai_core_sk_filepath, + } + + def _call( + self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any + ) -> str: + endpoint = f"{self.deployment_url}/invoke" + headers = instantiation.get_headers(self.ai_core_sk_filepath) + data = { + "anthropic_version": "bedrock-2023-05-31", + "max_tokens": 100, + "messages": [ + { + "role": "user", + "content": f"{prompt}", + } + ], + "temperature": self.temperature, + } + + try: + response = requests.post(endpoint, headers=headers, json=data) + response.raise_for_status() + return self.parse(response.json()) + except requests.exceptions.HTTPError as http_error: + logger.error( + f"HTTP error occurred when sending a request through AI Core: {http_error}" + ) + raise + except requests.exceptions.Timeout as timeout_err: + logger.error( + f"Timeout error occurred when sending a request through AI Core: {timeout_err}" + ) + raise + except requests.exceptions.ConnectionError as conn_err: + logger.error( + f"Connection error occurred when sending a request through AI Core: {conn_err}" + ) + raise + except requests.exceptions.RequestException as req_err: + logger.error( + f"A request error occurred when sending a request through AI Core: {req_err}" + ) + raise + + def parse(self, message) -> str: + """Parse the returned JSON object from Anthropic.""" + return message["content"][0]["text"] diff --git a/prospector/requirements.in b/prospector/requirements.in index 720c5295e..23febfc2b 100644 --- a/prospector/requirements.in +++ b/prospector/requirements.in @@ -6,6 +6,7 @@ fastapi google-cloud-aiplatform==1.49.0 Jinja2 langchain +langchain_anthropic langchain_openai langchain_google_vertexai langchain_mistralai diff --git a/prospector/requirements.txt b/prospector/requirements.txt index 1a5a11dee..b965949e1 100644 --- a/prospector/requirements.txt +++ 
b/prospector/requirements.txt @@ -8,6 +8,7 @@ aiohttp==3.9.5 aiosignal==1.3.1 annotated-types==0.7.0 +anthropic==0.30.1 antlr4-python3-runtime==4.9.3 anyio==4.4.0 appdirs==1.4.4 @@ -27,6 +28,7 @@ confection==0.1.5 cymem==2.0.8 dataclasses-json==0.6.6 datasketch==1.6.5 +defusedxml==0.7.1 distro==1.9.0 dnspython==2.6.1 docstring-parser==0.16 @@ -60,9 +62,11 @@ huggingface-hub==0.23.3 idna==3.7 iniconfig==2.0.0 jinja2==3.1.4 +jiter==0.5.0 jsonpatch==1.33 jsonpointer==2.4 langchain==0.2.2 +langchain-anthropic==0.1.15 langchain-community==0.2.3 langchain-core==0.2.4 langchain-google-vertexai==1.0.5 From b8f600f1db3e0455ef8d821587cadb846abfd179 Mon Sep 17 00:00:00 2001 From: Laura Schauer Date: Wed, 17 Jul 2024 10:44:27 +0200 Subject: [PATCH 51/83] Adds commit classification rule (#397) This PR adds a new rule using the `LLMService`. It sends the diff of a commit to the LLM and asks if this commit is security relevant or not. Relevance of the rule is set to 32 for now, but this value can be adjusted after evaluation. 
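For illustration, the mapping from the model's textual answer to a boolean can be sketched as below. This is a minimal sketch under stated assumptions, not the code added by this patch: the helper name `parse_relevance_answer` is hypothetical, and unlike the patch (which compares the answer against a fixed list of accepted strings) it also tolerates surrounding whitespace.

```python
def parse_relevance_answer(answer: str) -> bool:
    """Map the raw text returned by the model to a boolean (hypothetical helper)."""
    # Drop surrounding whitespace and any Markdown backticks around the answer.
    normalized = answer.strip().strip("`").strip()
    # Drop an optional "ANSWER:" prefix.
    if normalized.startswith("ANSWER:"):
        normalized = normalized[len("ANSWER:"):].strip()
    if normalized == "True":
        return True
    if normalized == "False":
        return False
    # Anything else is treated as an invalid response.
    raise RuntimeError(f"The model returned an invalid response: {answer}")
```

Rejecting anything other than the two expected literals, rather than guessing, keeps an unexpected model reply from silently counting as a match for the rule.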
Thanks to @tommasoaiello --- prospector/llm/llm_service.py | 54 ++++++++++++++++++- prospector/llm/prompts/classify_commit.py | 16 ++++++ .../get_repository_url.py} | 0 prospector/rules/rules.py | 16 +++++- prospector/rules/rules_test.py | 24 ++++++--- 5 files changed, 102 insertions(+), 8 deletions(-) create mode 100644 prospector/llm/prompts/classify_commit.py rename prospector/llm/{prompts.py => prompts/get_repository_url.py} (100%) diff --git a/prospector/llm/llm_service.py b/prospector/llm/llm_service.py index cbc4e69e4..685bf79a0 100644 --- a/prospector/llm/llm_service.py +++ b/prospector/llm/llm_service.py @@ -3,9 +3,11 @@ import validators from langchain_core.language_models.llms import LLM from langchain_core.output_parsers import StrOutputParser +from requests import HTTPError from llm.instantiation import create_model_instance -from llm.prompts import prompt_best_guess +from llm.prompts.classify_commit import zero_shot as cc_zero_shot +from llm.prompts.get_repository_url import prompt_best_guess from log.logger import logger from util.config_parser import LLMServiceConfig from util.singleton import Singleton @@ -74,3 +76,53 @@ def get_repository_url(self, advisory_description, advisory_references) -> str: raise RuntimeError(f"Prompt-model chain could not be invoked: {e}") return url + + def classify_commit( + self, diff: str, repository_name: str, commit_message: str + ) -> bool: + """Ask an LLM whether a commit is security relevant or not. The response will be either True or False. + + Args: + diff (str), repository_name (str), commit_message (str): The diff, repository name and message of the commit to input into the LLM + + Returns: + True if the commit is deemed security relevant, False if not. + + Raises: + RuntimeError if there is an error in the model invocation or the response was not valid. 
+ """ + try: + chain = cc_zero_shot | self.model | StrOutputParser() + + is_relevant = chain.invoke( + { + "diff": diff, + "repository_name": repository_name, + "commit_message": commit_message, + } + ) + logger.info(f"LLM returned is_relevant={is_relevant}") + + except HTTPError as e: + # if the diff is too big, a 400 error is returned -> silently ignore by returning False for this commit + status_code = e.response.status_code + if status_code == 400: + return False + raise RuntimeError(f"Prompt-model chain could not be invoked: {e}") + except Exception as e: + raise RuntimeError(f"Prompt-model chain could not be invoked: {e}") + + if is_relevant in [ + "True", + "ANSWER:True", + "```ANSWER:True```", + ]: + return True + elif is_relevant in [ + "False", + "ANSWER:False", + "```ANSWER:False```", + ]: + return False + else: + raise RuntimeError(f"The model returned an invalid response: {is_relevant}") diff --git a/prospector/llm/prompts/classify_commit.py b/prospector/llm/prompts/classify_commit.py new file mode 100644 index 000000000..80a99afe9 --- /dev/null +++ b/prospector/llm/prompts/classify_commit.py @@ -0,0 +1,16 @@ +from langchain.prompts import PromptTemplate + +zero_shot = PromptTemplate.from_template( + """Is the following commit security relevant or not? +Please provide the output as a boolean value, either True or False. +If it is security relevant just answer True otherwise answer False. Do not return anything else. + +To provide you with some context, the name of the repository is: {repository_name}, and the +commit message is: {commit_message}. 
+ +Finally, here is the diff of the commit: +{diff}\n + + +Your answer:\n""" +) diff --git a/prospector/llm/prompts.py b/prospector/llm/prompts/get_repository_url.py similarity index 100% rename from prospector/llm/prompts.py rename to prospector/llm/prompts/get_repository_url.py diff --git a/prospector/rules/rules.py b/prospector/rules/rules.py index 80496c812..2ba5a16e9 100644 --- a/prospector/rules/rules.py +++ b/prospector/rules/rules.py @@ -413,6 +413,18 @@ def apply(self, candidate: Commit, advisory_record: AdvisoryRecord): return False +class CommitIsSecurityRelevant(Rule): + """Matches commits that are deemed security relevant by the commit classification service.""" + + def apply( + self, + candidate: Commit, + ) -> bool: + return LLMService().classify_commit( + candidate.diff, candidate.repository, candidate.message + ) + + RULES_PHASE_1: List[Rule] = [ VulnIdInMessage("VULN_ID_IN_MESSAGE", 64), # CommitMentionedInAdv("COMMIT_IN_ADVISORY", 64), @@ -433,4 +445,6 @@ def apply(self, candidate: Commit, advisory_record: AdvisoryRecord): CommitHasTwins("COMMIT_HAS_TWINS", 2), ] -RULES_PHASE_2: List[Rule] = [] +RULES_PHASE_2: List[Rule] = [ + CommitIsSecurityRelevant("COMMIT_IS_SECURITY_RELEVANT", 32) +] diff --git a/prospector/rules/rules_test.py b/prospector/rules/rules_test.py index 230c351e0..93c246ef4 100644 --- a/prospector/rules/rules_test.py +++ b/prospector/rules/rules_test.py @@ -89,7 +89,9 @@ def candidates(): changed_files={ "core/src/main/java/org/apache/cxf/workqueue/AutomaticWorkQueueImpl.java" }, - minhash=get_encoded_minhash(get_msg("Insecure deserialization", 50)), + minhash=get_encoded_minhash( + get_msg("Insecure deserialization", 50) + ), ), # TODO: Not matched by existing tests: GHSecurityAdvInMessage, ReferencesBug, ChangesRelevantCode, TwinMentionedInAdv, VulnIdInLinkedIssue, SecurityKeywordInLinkedGhIssue, SecurityKeywordInLinkedBug, CrossReferencedBug, CrossReferencedGh, CommitHasTwins, ChangesRelevantFiles, CommitMentionedInAdv, 
RelevantWordsInMessage ] @@ -109,7 +111,9 @@ def advisory_record(): ) -def test_apply_phase_1_rules(candidates: List[Commit], advisory_record: AdvisoryRecord): +def test_apply_phase_1_rules( + candidates: List[Commit], advisory_record: AdvisoryRecord +): annotated_candidates = apply_rules( candidates, advisory_record, enabled_rules=enabled_rules_from_config ) @@ -117,7 +121,9 @@ def test_apply_phase_1_rules(candidates: List[Commit], advisory_record: Advisory # Repo 5: Should match: AdvKeywordsInFiles, SecurityKeywordsInMsg, CommitMentionedInReference assert len(annotated_candidates[0].matched_rules) == 3 - matched_rules_names = [item["id"] for item in annotated_candidates[0].matched_rules] + matched_rules_names = [ + item["id"] for item in annotated_candidates[0].matched_rules + ] assert "ADV_KEYWORDS_IN_FILES" in matched_rules_names assert "COMMIT_IN_REFERENCE" in matched_rules_names assert "SEC_KEYWORDS_IN_MESSAGE" in matched_rules_names @@ -125,21 +131,27 @@ def test_apply_phase_1_rules(candidates: List[Commit], advisory_record: Advisory # Repo 1: Should match: VulnIdInMessage, ReferencesGhIssue assert len(annotated_candidates[1].matched_rules) == 2 - matched_rules_names = [item["id"] for item in annotated_candidates[1].matched_rules] + matched_rules_names = [ + item["id"] for item in annotated_candidates[1].matched_rules + ] assert "VULN_ID_IN_MESSAGE" in matched_rules_names assert "GITHUB_ISSUE_IN_MESSAGE" in matched_rules_names # Repo 3: Should match: VulnIdInMessage, ReferencesGhIssue assert len(annotated_candidates[2].matched_rules) == 2 - matched_rules_names = [item["id"] for item in annotated_candidates[2].matched_rules] + matched_rules_names = [ + item["id"] for item in annotated_candidates[2].matched_rules + ] assert "VULN_ID_IN_MESSAGE" in matched_rules_names assert "GITHUB_ISSUE_IN_MESSAGE" in matched_rules_names # Repo 4: Should match: SecurityKeywordsInMsg assert len(annotated_candidates[3].matched_rules) == 1 - matched_rules_names = [item["id"] for 
item in annotated_candidates[3].matched_rules] + matched_rules_names = [ + item["id"] for item in annotated_candidates[3].matched_rules + ] assert "SEC_KEYWORDS_IN_MESSAGE" in matched_rules_names # Repo 2: Matches nothing From 194c90a379d621e3fa67c712ab4f04c326c17293 Mon Sep 17 00:00:00 2001 From: matteogreek Date: Mon, 12 Jun 2023 09:37:18 +0200 Subject: [PATCH 52/83] Implemented new db tables and improved pipeline functions --- prospector/commitdb/postgres.py | 206 ++++++++++++++++++ prospector/data/project_metadata.json | 26 ++- prospector/data_sources/nvd/filter_entries.py | 135 +++++++++--- prospector/data_sources/nvd/job_creation.py | 123 +++++++++-- prospector/data_sources/nvd/nvd_test.py | 35 ++- .../nvd/version_extraction_test.py | 147 +++++++++++++ .../data_sources/nvd/versions_extraction.py | 108 ++++++--- prospector/datamodel/nlp.py | 34 ++- prospector/datamodel/nlp_test.py | 6 + prospector/ddl/30_vulnerability.sql | 16 ++ prospector/ddl/40_processed_vuln.sql | 13 ++ prospector/ddl/50_job.sql | 19 ++ prospector/service/api/routers/endpoints.py | 21 +- .../service/static/job_configuration.css | 5 +- .../service/static/job_configuration.html | 18 +- 15 files changed, 779 insertions(+), 133 deletions(-) create mode 100644 prospector/data_sources/nvd/version_extraction_test.py create mode 100644 prospector/ddl/30_vulnerability.sql create mode 100644 prospector/ddl/40_processed_vuln.sql create mode 100644 prospector/ddl/50_job.sql diff --git a/prospector/commitdb/postgres.py b/prospector/commitdb/postgres.py index 3a58dd106..7ea831b0c 100644 --- a/prospector/commitdb/postgres.py +++ b/prospector/commitdb/postgres.py @@ -44,6 +44,7 @@ def connect(self): host=self.host, port=self.port, ) + print("Connected to the database") except Exception: self.host = "localhost" self.connection = psycopg2.connect( @@ -54,6 +55,14 @@ def connect(self): port=self.port, ) + def disconnect(self): + if self.connection: + self.connection.close() + print("Disconnected from the 
database") + self.connection = None + else: + print("No active database connection") + def lookup(self, repository: str, commit_id: str = None) -> List[Dict[str, Any]]: if not self.connection: raise Exception("Invalid connection") @@ -115,6 +124,203 @@ def run_sql_script(self, script_file): cursor.close() + def lookup_vuln_id(self, vuln_id: str, last_modified_date): + if not self.connection: + raise Exception("Invalid connection") + results = None + try: + cur = self.connection.cursor() + cur.execute( + "SELECT COUNT(*) FROM vulnerability WHERE vuln_id = %s AND last_modified_date = %s", + (vuln_id, last_modified_date), + ) + results = cur.fetchone() + self.connection.commit() + except Exception: + self.connection.rollback() + logger.error("Could not lookup vulnerability in database", exc_info=True) + finally: + cur.close() + return results + + def save_vuln( + self, vuln_id, published_date, last_modified_date, raw_record, source, url + ): + if not self.connection: + raise Exception("Invalid connection") + + try: + cur = self.connection.cursor() + cur.execute( + "INSERT INTO vulnerability (vuln_id, published_date, last_modified_date, raw_record, source, url) VALUES (%s,%s,%s,%s,%s,%s)", + (vuln_id, published_date, last_modified_date, raw_record, source, url), + ) + self.connection.commit() + cur.close() + except Exception: + logger.error("Could not save vulnerability to database", exc_info=True) + cur.close() + + def update_vuln(self, vuln_id, descr, published_date, last_modified_date): + if not self.connection: + raise Exception("Invalid connection") + + try: + cur = self.connection.cursor() + cur.execute( + "UPDATE entries SET descr = %s, published_date = %s, last_modified_date = %s WHERE vuln_id = %s", + (descr, published_date, last_modified_date, vuln_id), + ) + self.connection.commit() + cur.close() + except Exception: + logger.error("Could not update vulnerability in database", exc_info=True) + cur.close() + + def save_job( + self, + _id, + pv_id, + params, + 
enqueued_at, + started_at, + finished_at, + results, + created_by, + created_from, + status, + ): + if not self.connection: + raise Exception("Invalid connection") + try: + cur = self.connection.cursor() + cur.execute( + "INSERT INTO job (_id, pv_id, params, enqueued_at, started_at, finished_at, results, created_by, created_from, status) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", + ( + _id, + pv_id, + params, + enqueued_at, + started_at, + finished_at, + results, + created_by, + created_from, + status, + ), + ) + self.connection.commit() + cur.close() + except Exception: + logger.error("Could not save job entry to database", exc_info=True) + cur.close() + + def lookup_job(self): + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=DictCursor) + cur.execute("SELECT * FROM job") + results = cur.fetchall() + except Exception: + logger.error("Could not retrieve jobs from database", exc_info=True) + finally: + cur.close() + return results + + def lookup_processed_no_job(self): + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=DictCursor) + cur.execute( + "SELECT _id FROM processed_vuln WHERE _id NOT IN ( SELECT pv_id FROM job)" + ) + results = cur.fetchall() + except Exception: + logger.error("Could not retrieve jobs from database", exc_info=True) + finally: + cur.close() + return results + + def get_processed_vulns(self): # every entry + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=DictCursor) + cur.execute( + "SELECT pv.*, v.vuln_id FROM processed_vuln pv JOIN vulnerability v ON v._id = pv.fk_vulnerability" + ) + results = cur.fetchall() + except Exception: + logger.error( + "Could not retrieve processed vulnerabilities from database", + exc_info=True, + ) + finally: + cur.close() + return results + + def 
get_processed_vulns_not_in_job( + self, + ): # entries in processed vuln excluding the ones already in the job table + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=DictCursor) + cur.execute( + "SELECT pv._id, pv.repository, pv.versions, v.vuln_id FROM processed_vuln pv JOIN vulnerability v ON v._id = pv.fk_vulnerability WHERE pv._id NOT IN (SELECT pv_id FROM job)" + ) + results = cur.fetchall() + except Exception: + logger.error( + "Could not retrieve processed vulnerabilities from database", + exc_info=True, + ) + finally: + cur.close() + return results + + def get_unprocessed_vulns(self): + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=DictCursor) + cur.execute( + "SELECT _id, raw_record FROM vulnerability WHERE _id NOT IN ( SELECT fk_vulnerability FROM processed_vuln)" + ) + results = cur.fetchall() + except Exception: + logger.error( + "Could not retrieve unprocessed vulnerabilities from database", + exc_info=True, + ) + finally: + cur.close() + return results + + def save_processed_vuln(self, fk_vuln, repository, versions): + if not self.connection: + raise Exception("Invalid connection") + try: + cur = self.connection.cursor() + cur.execute( + "INSERT INTO processed_vuln (fk_vulnerability, repository,versions) VALUES (%s,%s,%s)", + (fk_vuln, repository, versions), + ) + self.connection.commit() + cur.close() + except Exception: + logger.error( + "Could not save processed vulnerability to database", exc_info=True + ) + cur.close() + def parse_connect_string(connect_string): try: diff --git a/prospector/data/project_metadata.json b/prospector/data/project_metadata.json index 5349f5180..170b416eb 100644 --- a/prospector/data/project_metadata.json +++ b/prospector/data/project_metadata.json @@ -390,9 +390,7 @@ "spring framework", "spring-framework", "Spring Framework", - "Spring-Framework", - 
"Spring", - "spring" + "Spring-Framework" ] }, "tomcat": { @@ -489,5 +487,27 @@ "Flatpak", "flatpak" ] + }, + "Tuleap": { + "mailing_list_archives": "", + "jira_prefix": "", + "security_advisory_page": "", + "jira_link_template": "", + "git": "https://github.com/Enalean/tuleap", + "search keywords": [ + "Tuleap Community", + "Tuleap" + ] + }, + "spring-boot": { + "mailing_list_archives": "", + "jira_prefix": "", + "security_advisory_page": "", + "jira_link_template": "", + "git": "https://github.com/spring-projects/spring-boot", + "search keywords": [ + "spring-boot", + "spring boot" + ] } } diff --git a/prospector/data_sources/nvd/filter_entries.py b/prospector/data_sources/nvd/filter_entries.py index 08647ac19..4a02258ce 100644 --- a/prospector/data_sources/nvd/filter_entries.py +++ b/prospector/data_sources/nvd/filter_entries.py @@ -2,22 +2,47 @@ import datetime import json +import psycopg2 import requests -from versions_extraction import extract_version_ranges_cpe, process_ranges +from psycopg2.extensions import parse_dsn +from psycopg2.extras import DictCursor, DictRow, Json +from versions_extraction import ( + extract_version_range, + extract_version_ranges_cpe, + process_versions, +) + +from commitdb.postgres import PostgresCommitDB +from datamodel.nlp import extract_products +from util.config_parser import parse_config_file + +config = parse_config_file() + + +def connect_to_db(): + db = PostgresCommitDB( + config.database.user, + config.database.password, + config.database.host, + config.database.port, + config.database.dbname, + ) + db.connect() + return db + + +def disconnect_from_database(db): + db.disconnect() + +def retrieve_vulns(d_time): + + start_date, end_date = get_time_range(d_time) -def get_cves(d_time): data = "" # Set up the URL to retrieve the latest CVE entries from NVD nvd_url = "https://services.nvd.nist.gov/rest/json/cves/2.0?" 
- # calculate the date to retrieve new entries (%Y-%m-%dT%H:%M:%S.%f%2B01:00) - date_now = datetime.datetime.now() - start_date = (date_now - datetime.timedelta(days=d_time)).strftime( - "%Y-%m-%dT%H:%M:%S" - ) - end_date = date_now.strftime("%Y-%m-%dT%H:%M:%S") - nvd_url += f"lastModStartDate={start_date}&lastModEndDate={end_date}" # Retrieve the data from NVD @@ -33,9 +58,29 @@ def get_cves(d_time): else: print("Error while trying to retrieve entries") + # save to db + save_vuln_to_db(data) + return data +def save_vuln_to_db(vulns): + db = connect_to_db() + for vuln in vulns["vulnerabilities"]: + vuln_id = vuln["cve"]["id"] + pub_date = vuln["cve"]["published"] + mod_date = vuln["cve"]["lastModified"] + raw_record = json.dumps(vuln) + source = "NVD" + url = f"https://services.nvd.nist.gov/rest/json/cves/2.0?cveID={vuln_id}" + + res = db.lookup_vuln_id(vuln_id, mod_date) + if res[0] == 0: + print(f"Saving vuln: {vuln_id} in database") + db.save_vuln(vuln_id, pub_date, mod_date, raw_record, source, url) + db.disconnect() + + def get_cve_by_id(id): nvd_url = f"https://services.nvd.nist.gov/rest/json/cves/2.0?cveID={id}" @@ -77,27 +122,59 @@ def csv_to_json(csv_file_path): return json_data -def find_matching_entries_test(data): +def get_time_range(d_time): + # calculate the date to retrieve new entries (%Y-%m-%dT%H:%M:%S.%f%2B01:00) + date_now = datetime.datetime.now() + start_date = (date_now - datetime.timedelta(days=d_time)).strftime( + "%Y-%m-%dT%H:%M:%S" + ) + end_date = date_now.strftime("%Y-%m-%dT%H:%M:%S") + return start_date, end_date + + +def process_entries(): + # start_date,end_date=get_time_range(d_time) + db = connect_to_db() + + # Retrieve unprocessed entries from the vulnerability table + unprocessed_vulns = db.get_unprocessed_vulns() + + # Process each entry + processed_vulns = [] + for unprocessed_vuln in unprocessed_vulns: + entry_id = unprocessed_vuln[0] + raw_record = unprocessed_vuln[1] + + processed_vuln = map_entry(raw_record) + if 
processed_vuln is not None: + processed_vulns.append(processed_vuln) + db.save_processed_vuln( + entry_id, processed_vuln["repo_url"], processed_vuln["version_interval"] + ) + db.disconnect() + return processed_vulns + + +def map_entry(vuln): + # TODO: improve mapping technique with open("./data/project_metadata.json", "r") as f: match_list = json.load(f) - filtered_cves = [] - - for vuln in data["vulnerabilities"]: + project_names = extract_products(vuln["cve"]["descriptions"][0]["value"]) + # print(project_names) + for project_name in project_names: for data in match_list.values(): - keywords = data["search keywords"] - for keyword in keywords: - if keyword in vuln["cve"]["descriptions"][0]["value"]: - lst_version_ranges = extract_version_ranges_cpe(vuln["cve"]) - version = process_ranges(lst_version_ranges) - filtered_cves.append( - { - "nvd_info": vuln, - "repo_url": data["git"], - "version_interval": version, - } - ) - print(vuln["cve"]["id"]) - break - - return filtered_cves + keywords = [kw.lower() for kw in data["search keywords"]] + if project_name.lower() in keywords: + version = extract_version_range( + vuln["cve"], vuln["cve"]["descriptions"][0]["value"] + ) + filtered_vuln = { + "nvd_info": vuln, + "repo_url": data["git"], + "version_interval": version, + } + print(vuln["cve"]["id"]) + return filtered_vuln + + return None diff --git a/prospector/data_sources/nvd/job_creation.py b/prospector/data_sources/nvd/job_creation.py index bf80d3149..935afbbbd 100644 --- a/prospector/data_sources/nvd/job_creation.py +++ b/prospector/data_sources/nvd/job_creation.py @@ -5,6 +5,7 @@ from rq import Connection, Queue from rq.job import Job +from commitdb.postgres import PostgresCommitDB from core.prospector import prospector from core.report import generate_report from util.config_parser import parse_config_file @@ -36,34 +37,114 @@ def run_prospector(vuln_id, repo_url, v_int): return results, advisory_record -def create_prospector_job(entry): - # data = 
json.loads(entry) +# def create_prospector_job(entry): +# # data = json.loads(entry) +# +# id = entry["nvd_info"]["cve"]["id"] +# repo = entry["repo_url"] +# version = entry["version_interval"] +# +# with Connection(redis.from_url(redis_url)): +# queue = Queue(default_timeout=300) +# +# job = Job.create( +# run_prospector, +# args=(id, repo, version), +# description="prospector job", +# id=id, +# ) +# queue.enqueue_job(job) +# +# #response_object = { +# # "job_data": { +# # "job_id": job.get_id(), +# # "job_status": job.get_status(), +# # "job_queue_position": job.get_position(), +# # "job_description": job.description, +# # "job_enqueued_at": job.created_at, +# # "job_started_at": job.started_at, +# # "job_finished_at": job.ended_at, +# # "job_result": job.result, +# # "job_args": job.args +# # } +# #} +# return job +# - id = entry["nvd_info"]["cve"]["id"] - repo = entry["repo_url"] - version = entry["version_interval"] + +def create_prospector_job(vuln_id, repo, version): with Connection(redis.from_url(redis_url)): - queue = Queue() + queue = Queue(default_timeout=500) job = Job.create( run_prospector, - args=(id, repo, version), + args=(vuln_id, repo, version), description="prospector job", - id=id, + id=vuln_id, ) queue.enqueue_job(job) + return job + + +def connect_to_db(): + db = PostgresCommitDB( + config.database.user, + config.database.password, + config.database.host, + config.database.port, + config.database.dbname, + ) + db.connect() + return db + + +def disconnect_from_database(db): + db.disconnect() + + +# def save_job_to_db(job): +# db = connect_to_db() +# results="" +# created_from="Auto" +# processed_vulns = db.lookup_processed_no_job() +# pv_id=processed_vulns[0] +# +# +# +# +# db.save_job(job.get_id(),pv_id,job.args,job.created_at,job.started_at,job.ended_at,job.result,job.origin,created_from, job.get_status(refresh=True)) +# +# db.disconnect() + + +# separate job creation task +# retrieve processed vulns and cve_id, +# save_job using id from 
retrieved processed vulns +def enqueue_jobs(): + db = connect_to_db() + processed_vulns = db.get_processed_vulns_not_in_job() + print(processed_vulns) + created_from = "Auto" + for processed_vuln in processed_vulns: + pv_id = processed_vuln[0] + pv_repository = processed_vuln[1] + pv_versions = processed_vuln[2] + v_vuln_id = processed_vuln[3] + + job = create_prospector_job(v_vuln_id, pv_repository, pv_versions) + + db.save_job( + job.get_id(), + pv_id, + job.args, + job.created_at, + job.started_at, + job.ended_at, + job.result, + job.origin, + created_from, + job.get_status(refresh=True), + ) - response_object = { - "job_data": { - "job_id": job.get_id(), - "job_status": job.get_status(), - "job_queue_position": job.get_position(), - "job_description": job.description, - "job_created_at": job.created_at, - "job_started_at": job.started_at, - "job_ended_at": job.ended_at, - "job_result": job.result, - } - } - return response_object + db.disconnect() diff --git a/prospector/data_sources/nvd/nvd_test.py b/prospector/data_sources/nvd/nvd_test.py index 1d516701c..cecb6d21c 100644 --- a/prospector/data_sources/nvd/nvd_test.py +++ b/prospector/data_sources/nvd/nvd_test.py @@ -1,28 +1,23 @@ -from filter_entries import find_matching_entries_test, get_cves -from job_creation import create_prospector_job +from filter_entries import process_entries, retrieve_vulns +from job_creation import enqueue_jobs -# request new cves entries through NVD API -cves = get_cves(5) - -# filter out undesired cves based on mathcing rules -filtered_cves = find_matching_entries_test(cves) +# request new cves entries through NVD API and save to db +cves = retrieve_vulns(7) """with open("filtered_cves.json", "w") as outfile: json.dump(filtered_cves, outfile)""" -print("matched cves") -print(filtered_cves) +print("retrieved cves") +# print(cves) +# get entry from db and process +processed_vulns = process_entries() +print("ready to be enqueued: ") +print(processed_vulns) -# test entry for job 
creation -# entry = """ -# { -# "id": "CVE-2014-0050", -# "repository": "https://github.com/apache/commons-fileupload", -# "version": "1.3:1.3.1" -# } -# """ +# if processed_vulns: +# for entry in processed_vulns: +# job_info = create_prospector_job(entry) +# save_job_to_db(job_info) -if filtered_cves: - for entry in filtered_cves: - create_prospector_job(entry) +enqueue_jobs() diff --git a/prospector/data_sources/nvd/version_extraction_test.py b/prospector/data_sources/nvd/version_extraction_test.py new file mode 100644 index 000000000..2e71ac10a --- /dev/null +++ b/prospector/data_sources/nvd/version_extraction_test.py @@ -0,0 +1,147 @@ +import pytest + +from data_sources.nvd.versions_extraction import ( + extract_version_ranges_cpe, + extract_version_ranges_description, + process_versions, +) + +ADVISORY_TEXT_1 = "In Eclipse Jetty versions 9.4.21.v20190926, 9.4.22.v20191022, and 9.4.23.v20191118, the generation of default unhandled Error response content (in text/html and text/json Content-Type) does not escape Exception messages in stacktraces included in error output." +ADVISORY_TEXT_2 = "Apache Olingo versions 4.0.0 to 4.7.0 provide the AsyncRequestWrapperImpl class which reads a URL from the Location header, and then sends a GET or DELETE request to this URL. It may allow to implement a SSRF attack. If an attacker tricks a client to connect to a malicious server, the server can make the client call any URL including internal resources which are not directly accessible by the attacker." +ADVISORY_TEXT_3 = "Pivotal Spring Framework through 5.3.16 suffers from a potential remote code execution (RCE) issue if used for Java deserialization of untrusted data. Depending on how the library is implemented within a product, this issue may or not occur, and authentication may be required. NOTE: the vendor's position is that untrusted data is not an intended use case. The product's behavior will not be changed because some users rely on deserialization of trusted data." 
+ADVISORY_TEXT_4 = "Integer overflow in java/org/apache/tomcat/util/buf/Ascii.java in Apache Tomcat before 6.0.40, 7.x before 7.0.53, and 8.x before 8.0.4, when operated behind a reverse proxy, allows remote attackers to conduct HTTP request smuggling attacks via a crafted Content-Length HTTP header." +ADVISORY_TEXT_5 = "FasterXML jackson-databind through 2.8.10 and 2.9.x through 2.9.3 allows unauthenticated remote code execution because of an incomplete fix for the CVE-2017-7525 deserialization flaw. This is exploitable by sending maliciously crafted JSON input to the readValue method of the ObjectMapper, bypassing a blacklist that is ineffective if the Spring libraries are available in the classpath." +JSON_DATA_1 = { + "configurations": [ + { + "nodes": [ + { + "cpeMatch": [ + {"versionStartIncluding": "1.0", "versionEndIncluding": "2.0"}, + {"versionStartExcluding": "2.0", "versionEndExcluding": "3.0"}, + {"versionStartIncluding": "4.0", "versionEndIncluding": "5.0"}, + ] + }, + ] + } + ] +} + +JSON_DATA_2 = { + "configurations": [ + { + "nodes": [ + { + "cpeMatch": [ + { + "criteria": "cpe:2.3:a:fasterxml:jackson-databind:*:*:*:*:*:*:*:*", + "versionEndExcluding": "2.6.7.3", + "matchCriteriaId": "1DF0B092-75D2-4A01-9CDC-B3AB2F4CF2C3", + }, + { + "criteria": "cpe:2.3:a:fasterxml:jackson-databind:*:*:*:*:*:*:*:*", + "versionStartIncluding": "2.7.0", + "versionEndExcluding": "2.7.9.2", + "matchCriteriaId": "5BBA4A48-37C7-4165-B422-652EFD99B05B", + }, + { + "criteria": "cpe:2.3:a:fasterxml:jackson-databind:*:*:*:*:*:*:*:*", + "versionStartIncluding": "2.8.0", + "versionEndExcluding": "2.8.11", + "matchCriteriaId": "2D1029A9-A17E-43FE-BE78-DF2DEEBFBAAF", + }, + { + "criteria": "cpe:2.3:a:fasterxml:jackson-databind:*:*:*:*:*:*:*:*", + "versionStartIncluding": "2.9.0", + "versionEndExcluding": "2.9.4", + "matchCriteriaId": "603345A2-FA66-4B4C-9143-AE710EF6626F", + }, + ] + } + ] + }, + { + "nodes": [ + { + "cpeMatch": [ + { + "criteria": 
"cpe:2.3:o:debian:debian_linux:8.0:*:*:*:*:*:*:*", + "matchCriteriaId": "C11E6FB0-C8C0-4527-9AA0-CB9B316F8F43", + }, + { + "criteria": "cpe:2.3:o:debian:debian_linux:9.0:*:*:*:*:*:*:*", + "matchCriteriaId": "DEECE5FC-CACF-4496-A3E7-164736409252", + }, + ] + } + ] + }, + ] +} + +JSON_DATA_3 = { + "configurations": [ + { + "nodes": [ + { + "cpeMatch": [ + { + "criteria": "cpe:2.3:a:jenkins:pipeline_utility_steps:*:*:*:*:*:jenkins:*:*", + "versionEndIncluding": "2.15.2", + "matchCriteriaId": "C6754B3C-6C9D-4EE8-A27F-7EA327B90CB6", + } + ] + } + ] + } + ] +} + +VERSION_RANGES = ["[1.0:2.0]", "(2.0:3.0)", "[2.1:None)", "[4.0:5.0]"] + + +def test_extract_version_ranges_description(): + affected_version, fixed_version = extract_version_ranges_description( + ADVISORY_TEXT_1 + ) + assert affected_version == "9.4.23" + assert fixed_version is None + + affected_version, fixed_version = extract_version_ranges_description( + ADVISORY_TEXT_2 + ) + assert affected_version == "4.7.0" + assert fixed_version is None + + affected_version, fixed_version = extract_version_ranges_description( + ADVISORY_TEXT_3 + ) + assert affected_version == "5.3.16" + assert fixed_version is None + + affected_version, fixed_version = extract_version_ranges_description( + ADVISORY_TEXT_4 + ) + assert affected_version is None + assert fixed_version == "8.0.4" + + +def test_extract_version_ranges_cpe(): + version_ranges = extract_version_ranges_cpe(JSON_DATA_1) + assert version_ranges == ["[1.0:2.0]", "(2.0:3.0)", "[4.0:5.0]"] + + version_ranges = extract_version_ranges_cpe(JSON_DATA_2) + assert version_ranges == [ + "(None:2.6.7.3)", + "[2.7.0:2.7.9.2)", + "[2.8.0:2.8.11)", + "[2.9.0:2.9.4)", + ] + + version_ranges = extract_version_ranges_cpe(JSON_DATA_3) + assert version_ranges == ["(None:2.15.2]"] + + +def test_process_ranges(): + version_ranges = process_versions(VERSION_RANGES) + assert version_ranges == "4.0:5.1" diff --git a/prospector/data_sources/nvd/versions_extraction.py 
b/prospector/data_sources/nvd/versions_extraction.py index c82bcc954..fa8f01c77 100644 --- a/prospector/data_sources/nvd/versions_extraction.py +++ b/prospector/data_sources/nvd/versions_extraction.py @@ -24,42 +24,57 @@ def extract_version_ranges_cpe(json_data): # json_data = json.loads(json_data) version_ranges = [] if "configurations" in json_data: - for configuration in json_data["configurations"]: - for node in configuration["nodes"]: - for cpe_match in node["cpeMatch"]: - if "versionStartIncluding" in cpe_match: - version_range = "[" + cpe_match["versionStartIncluding"] + ":" - elif "versionStartExcluding" in cpe_match: - version_range = "(" + cpe_match["versionStartExcluding"] + ":" - elif "criteria" in cpe_match: - if re.match( - r"\d+\.(?:\d+\.*)*\d", cpe_match["criteria"].split(":")[5] - ): - version_range = ( - "[" + cpe_match["criteria"].split(":")[5] + ":" - ) - else: - version_range = "None:" + configuration = json_data["configurations"][0] + for node in configuration["nodes"]: + for cpe_match in node["cpeMatch"]: + if "versionStartIncluding" in cpe_match: + version_range = "[" + cpe_match["versionStartIncluding"] + ":" + elif "versionStartExcluding" in cpe_match: + version_range = "(" + cpe_match["versionStartExcluding"] + ":" + elif "criteria" in cpe_match: + if re.match( + r"\d+\.(?:\d+\.*)*\d", cpe_match["criteria"].split(":")[5] + ): + version_range = "[" + cpe_match["criteria"].split(":")[5] + ":" else: version_range = "(None:" + else: + version_range = "(None:" + if "versionEndIncluding" in cpe_match: + version_range += cpe_match["versionEndIncluding"] + "]" + elif "versionEndExcluding" in cpe_match: + version_range += cpe_match["versionEndExcluding"] + ")" + else: + version_range += "None)" + version_ranges.append(version_range) + return version_ranges - if "versionEndIncluding" in cpe_match: - version_range += cpe_match["versionEndIncluding"] + "]" - elif "versionEndExcluding" in cpe_match: - version_range += cpe_match["versionEndExcluding"] 
+ ")" - else: - version_range += "None)" - version_ranges.append(version_range) - return version_ranges +# def process_ranges(ranges_list): +# if ranges_list: +# last_entry = ranges_list[-1] +# version_range = last_entry.strip("[]()") +# else: +# version_range = "None:None" +# return version_range -def process_ranges(ranges_list): +def process_versions(ranges_list): + version_range = "None:None" if ranges_list: - last_entry = ranges_list[-1] - version_range = last_entry.strip("[]()") - else: - version_range = "None:None" + last_range = ranges_list[-1] # take the last range of the list + start, end = last_range[1:].split(":") + if "]" in end: + end_components = end[:-1].split(".") + end_components[-1] = str( + int(end_components[-1]) + 1 + ) # Increment the last component + end = ".".join(end_components) + else: + end = end.strip(")") + + version_range = f"{start}:{end}" + return version_range @@ -97,7 +112,10 @@ def extract_version_ranges_desc(doc): # New method. Need validation -def extract_version_ranges_description(doc): +def extract_version_ranges_description(description): + nlp = spacy.load("en_core_web_sm") + doc = nlp(description) + fixed_version = None affected_version = None for sent in doc.sents: @@ -125,3 +143,33 @@ def extract_version_ranges_description(doc): affected_version = version return affected_version, fixed_version + + +def extract_version_range(json_data, description): + version_range = extract_version_ranges_cpe(json_data) + if not version_range: + # try using the description + version_range = extract_version_ranges_description(description) + else: + version_range = process_versions(version_range) + return version_range + + +def retrieve_repository(project_name): + """ + Retrieve the GitHub repository URL for a given project name + """ + # GitHub API endpoint for searching repositories + url = "https://api.github.com/search/repositories" + + query_params = {"q": project_name, "sort": "stars", "order": "desc"} + + response = requests.get(url, 
params=query_params) + + if response.status_code == 200: + data = response.json() + if data["total_count"] > 0: + repository_url = data["items"][0]["html_url"] + return repository_url + + return None diff --git a/prospector/datamodel/nlp.py b/prospector/datamodel/nlp.py index 150f4203f..573165494 100644 --- a/prospector/datamodel/nlp.py +++ b/prospector/datamodel/nlp.py @@ -1,4 +1,5 @@ import re +from collections import OrderedDict from typing import Dict, List, Set, Tuple from spacy import load @@ -70,17 +71,28 @@ def extract_products(text: str) -> List[str]: """ Extract product names from advisory text """ - return list( - set( - [ - token.text - for token in nlp(text) - if token.pos_ in ("PROPN") - and token.text.isalpha() - and len(token.text) > 2 - ] - ) # "NOUN", - ) + # return list( + # set( + # [ + # token.text + # for token in nlp(text) + # if token.pos_ in ("PROPN") + # and token.text.isalpha() + # and len(token.text) > 2 + # ] + # ) # "NOUN", + # ) + products = [ + token.text + for token in nlp(text) + if token.pos_ in ("PROPN") + and token.text.isalpha() + and len(token.text) > 2 + and token.text + != "Apache" # exclude Apache string (depends on how we perform matching) + ] + + return list(OrderedDict.fromkeys(products)) # TODO: add list of non-relevant or relevant extensions diff --git a/prospector/datamodel/nlp_test.py b/prospector/datamodel/nlp_test.py index 9b3602336..2c3c04685 100644 --- a/prospector/datamodel/nlp_test.py +++ b/prospector/datamodel/nlp_test.py @@ -2,6 +2,7 @@ extract_affected_filenames, extract_ghissue_references, extract_jira_references, + extract_products, find_similar_words, ) @@ -57,3 +58,8 @@ def test_extract_gh_issues(): def test_extract_filenames_single(): fn, ext = extract_affected_filenames(ADVISORY_TEXT_6) assert "Content-Length" in fn + + +def test_extract_products(): + result = extract_products(ADVISORY_TEXT_5) + assert ["JsonMapObjectReaderWriter", "CXF"] == result diff --git a/prospector/ddl/30_vulnerability.sql 
b/prospector/ddl/30_vulnerability.sql new file mode 100644 index 000000000..7d2da9bd0 --- /dev/null +++ b/prospector/ddl/30_vulnerability.sql @@ -0,0 +1,16 @@ +-- public.vulnerability definition + +-- Drop table + +DROP TABLE IF EXISTS public.vulnerability; + +CREATE TABLE public.vulnerability ( + _id SERIAL PRIMARY KEY, + vuln_id varchar NOT NULL, + published_date DATE NOT NULL, + last_modified_date DATE NOT NULL, + raw_record JSON, + source varchar, + url varchar, + UNIQUE (vuln_id,last_modified_date) +); diff --git a/prospector/ddl/40_processed_vuln.sql b/prospector/ddl/40_processed_vuln.sql new file mode 100644 index 000000000..6022a352c --- /dev/null +++ b/prospector/ddl/40_processed_vuln.sql @@ -0,0 +1,13 @@ +-- public.processed_vuln definition + +-- Drop table + +DROP TABLE IF EXISTS public.processed_vuln; + +CREATE TABLE public.processed_vuln ( + _id SERIAL PRIMARY KEY, + fk_vulnerability INT NOT NULL UNIQUE, + repository varchar NOT NULL, + versions varchar NOT NULL, + FOREIGN KEY (fk_vulnerability) REFERENCES public.vulnerability (_id) +); diff --git a/prospector/ddl/50_job.sql b/prospector/ddl/50_job.sql new file mode 100644 index 000000000..e1a675740 --- /dev/null +++ b/prospector/ddl/50_job.sql @@ -0,0 +1,19 @@ +-- public.job definition + +-- Drop table + +DROP TABLE IF EXISTS public.job; + +CREATE TABLE public.job ( + _id varchar NOT null PRIMARY KEY, + pv_id INT NOT NULL, + params varchar NOT NULL, + enqueued_at timestamp, + started_at timestamp, + finished_at timestamp, + results varchar, + created_by varchar, + created_from varchar, + status varchar, + FOREIGN KEY (pv_id) REFERENCES public.processed_vuln (_id) +); diff --git a/prospector/service/api/routers/endpoints.py b/prospector/service/api/routers/endpoints.py index 9f446f86c..88bf8357f 100644 --- a/prospector/service/api/routers/endpoints.py +++ b/prospector/service/api/routers/endpoints.py @@ -3,7 +3,7 @@ from datetime import datetime import redis -from fastapi import APIRouter, FastAPI, 
Request +from fastapi import APIRouter, FastAPI, HTTPException, Request from fastapi.responses import HTMLResponse from fastapi.templating import Jinja2Templates from rq import Connection, Queue @@ -60,16 +60,21 @@ async def get_report(job_id): # queue = Queue() # job = queue.fetch_job(job_id) # get and redirect to the html page of the generated report - with open( - f"/app/data_sources/reports/{job_id}.html", - "r", - ) as f: - html_report = f.read() - return HTMLResponse(content=html_report, status_code=200) + report_path = f"/app/data_sources/reports/{job_id}.html" + if os.path.exists(report_path): + with open( + report_path, + "r", + ) as f: + html_report = f.read() + return HTMLResponse(content=html_report, status_code=200) + return {"message": "report not found"} # endpoint for opening the settings page of the selected job -@router.get("/get_settings/{job_id}", tags=["jobs"], response_class=HTMLResponse) +@router.get( + "/get_settings/{job_id}", tags=["jobs"], response_class=HTMLResponse +) async def get_settings(job_id, request: Request): with Connection(redis.from_url(redis_url)): queue = Queue() diff --git a/prospector/service/static/job_configuration.css b/prospector/service/static/job_configuration.css index 5e4cec38e..a50eaa324 100644 --- a/prospector/service/static/job_configuration.css +++ b/prospector/service/static/job_configuration.css @@ -29,6 +29,7 @@ label { font-weight: bold; display: inline-block; align-items: center; + justify-content: center; width: 100%; max-width: 400px; margin-right: auto; @@ -43,6 +44,8 @@ input[type="text"] { width: 100%; max-width: 400px; box-sizing: border-box; + margin-left: auto; + margin-right: auto; } form input[value]:not([value=""]) { @@ -62,4 +65,4 @@ button { button:hover { background-color: #3e8e41; -} \ No newline at end of file +} diff --git a/prospector/service/static/job_configuration.html b/prospector/service/static/job_configuration.html index 110ea5f3e..f92f9ef4d 100644 --- 
a/prospector/service/static/job_configuration.html +++ b/prospector/service/static/job_configuration.html @@ -24,21 +24,19 @@

{{job.args[0]}}

{{job.get_status()}}
- + + - + + + + + -
- \ No newline at end of file + From 8703b079fb7ec878386706e158ab4e5a2d4ff091 Mon Sep 17 00:00:00 2001 From: matteogreek Date: Fri, 23 Jun 2023 16:10:27 +0200 Subject: [PATCH 53/83] Implemented backend with new db tables and endpoints. Developed simple frontend --- prospector/{commitdb => backenddb}/README.md | 0 .../{commitdb => backenddb}/__init__.py | 2 +- .../{commitdb => backenddb}/commitdb_test.py | 8 +- .../{commitdb => backenddb}/postgres.py | 253 ++++++++++++++++-- prospector/data_sources/nvd/filter_entries.py | 72 +++-- prospector/data_sources/nvd/job_creation.py | 182 +++++-------- .../nvd/version_extraction_test.py | 47 ++-- .../data_sources/nvd/versions_extraction.py | 29 +- prospector/ddl/30_vulnerability.sql | 7 +- prospector/ddl/50_job.sql | 5 +- prospector/docker-compose.yml | 2 +- prospector/docker/Dockerfile | 5 +- prospector/docker/cli/Dockerfile | 14 +- prospector/docker/service/Dockerfile | 6 +- prospector/docker/worker/Dockerfile | 6 +- .../etc_supervisor_confd_rqworker.conf.j2 | 4 +- prospector/requirements.txt | 19 +- prospector/run_prospector.sh | 20 +- prospector/service/api/routers/endpoints.py | 2 +- prospector/service/api/routers/feeds.py | 121 +++++++++ prospector/service/api/routers/home.py | 2 +- prospector/service/api/routers/jobs.py | 176 +++++++----- .../service/api/routers/preprocessed.py | 6 +- prospector/service/api/rq_utils.py | 45 ---- prospector/service/main.py | 19 +- prospector/service/static/feed.html | 47 ++++ prospector/service/static/feed.js | 78 ++++++ prospector/service/static/index.css | 3 +- prospector/service/static/index.html | 24 +- prospector/service/static/index.js | 59 ++++ .../service/static/job_configuration.css | 10 +- .../service/static/job_configuration.html | 20 +- .../service/static/job_configuration.js | 57 +++- prospector/service/static/report_list.html | 4 +- 34 files changed, 947 insertions(+), 407 deletions(-) rename prospector/{commitdb => backenddb}/README.md (100%) rename prospector/{commitdb 
=> backenddb}/__init__.py (84%) rename prospector/{commitdb => backenddb}/commitdb_test.py (79%) rename prospector/{commitdb => backenddb}/postgres.py (59%) create mode 100644 prospector/service/api/routers/feeds.py delete mode 100644 prospector/service/api/rq_utils.py create mode 100644 prospector/service/static/feed.html create mode 100644 prospector/service/static/feed.js create mode 100644 prospector/service/static/index.js diff --git a/prospector/commitdb/README.md b/prospector/backenddb/README.md similarity index 100% rename from prospector/commitdb/README.md rename to prospector/backenddb/README.md diff --git a/prospector/commitdb/__init__.py b/prospector/backenddb/__init__.py similarity index 84% rename from prospector/commitdb/__init__.py rename to prospector/backenddb/__init__.py index 9189849fa..7460b3efc 100644 --- a/prospector/commitdb/__init__.py +++ b/prospector/backenddb/__init__.py @@ -1,3 +1,3 @@ -class CommitDB: +class BackendDB: def connect(self, connect_string): raise NotImplementedError("Unimplemented") diff --git a/prospector/commitdb/commitdb_test.py b/prospector/backenddb/commitdb_test.py similarity index 79% rename from prospector/commitdb/commitdb_test.py rename to prospector/backenddb/commitdb_test.py index 9b3a07f05..32ef221fe 100644 --- a/prospector/commitdb/commitdb_test.py +++ b/prospector/backenddb/commitdb_test.py @@ -1,18 +1,18 @@ import pytest -from commitdb.postgres import PostgresCommitDB, parse_connect_string +from backenddb.postgres import PostgresBackendDB, parse_connect_string from datamodel.commit import Commit @pytest.fixture def setupdb(): - db = PostgresCommitDB("postgres", "example", "localhost", "5432", "postgres") + db = PostgresBackendDB("postgres", "example", "localhost", "5432", "postgres") db.connect() # db.reset() return db -def test_save_lookup(setupdb: PostgresCommitDB): +def test_save_lookup(setupdb: PostgresBackendDB): commit = Commit( commit_id="42423b2423", 
repository="https://fasfasdfasfasd.com/rewrwe/rwer", @@ -37,7 +37,7 @@ def test_save_lookup(setupdb: PostgresCommitDB): assert commit.commit_id == retrieved_commit.commit_id -def test_lookup_nonexisting(setupdb: PostgresCommitDB): +def test_lookup_nonexisting(setupdb: PostgresBackendDB): result = setupdb.lookup( "https://fasfasdfasfasd.com/rewrwe/rwer", "42423b242342423b2423", diff --git a/prospector/commitdb/postgres.py b/prospector/backenddb/postgres.py similarity index 59% rename from prospector/commitdb/postgres.py rename to prospector/backenddb/postgres.py index 7ea831b0c..b2f0a029d 100644 --- a/prospector/commitdb/postgres.py +++ b/prospector/backenddb/postgres.py @@ -7,9 +7,9 @@ import psycopg2 from psycopg2.extensions import parse_dsn -from psycopg2.extras import DictCursor, DictRow, Json +from psycopg2.extras import DictCursor, DictRow, Json, RealDictCursor -from commitdb import CommitDB +from backenddb import BackendDB from log.logger import logger # DB_CONNECT_STRING = "postgresql://{}:{}@{}:{}/{}".format( @@ -21,7 +21,7 @@ # ).lower() -class PostgresCommitDB(CommitDB): +class PostgresBackendDB(BackendDB): """ This class implements the database abstraction layer for PostgreSQL @@ -129,11 +129,13 @@ def lookup_vuln_id(self, vuln_id: str, last_modified_date): raise Exception("Invalid connection") results = None try: - cur = self.connection.cursor() + cur = self.connection.cursor(cursor_factory=DictCursor) cur.execute( "SELECT COUNT(*) FROM vulnerability WHERE vuln_id = %s AND last_modified_date = %s", (vuln_id, last_modified_date), ) + # cur = self.connection.cursor() + # cur.execute("SELECT * FROM vulnerability WHERE vuln_id = %s", (vuln_id,)) results = cur.fetchone() self.connection.commit() except Exception: @@ -143,8 +145,47 @@ def lookup_vuln_id(self, vuln_id: str, last_modified_date): cur.close() return results + def lookup_vuln(self, vuln_id: str): + if not self.connection: + raise Exception("Invalid connection") + results = None + try: + cur = 
self.connection.cursor(cursor_factory=RealDictCursor) + cur.execute("SELECT * FROM vulnerability WHERE vuln_id = %s", (vuln_id,)) + results = cur.fetchone() + self.connection.commit() + except Exception: + self.connection.rollback() + logger.error("Could not lookup vulnerability in database", exc_info=True) + finally: + cur.close() + return results + + def lookup_vulnList(self): + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=RealDictCursor) + cur.execute("SELECT * FROM vulnerability") + results = cur.fetchall() + except Exception: + logger.error( + "Could not retrieve vulns from database", + exc_info=True, + ) + finally: + cur.close() + return results + def save_vuln( - self, vuln_id, published_date, last_modified_date, raw_record, source, url + self, + vuln_id: str, + published_date: str, + last_modified_date: str, + raw_record: Json, + source: str, + url: str, ): if not self.connection: raise Exception("Invalid connection") @@ -161,44 +202,141 @@ def save_vuln( logger.error("Could not save vulnerability to database", exc_info=True) cur.close() - def update_vuln(self, vuln_id, descr, published_date, last_modified_date): + def save_job( + self, + _id: str, + pv_id: int, + params: str, + enqueued_at: str, + started_at: str, + finished_at: str, + results: str, + created_by: str, + status: str, + ): if not self.connection: raise Exception("Invalid connection") + try: + cur = self.connection.cursor() + cur.execute( + "INSERT INTO job (_id, pv_id, params, enqueued_at, started_at, finished_at, results, created_by, status)" + "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)", + ( + _id, + pv_id, + params, + enqueued_at, + started_at, + finished_at, + results, + created_by, + status, + ), + ) + self.connection.commit() + cur.close() + except Exception: + logger.error("Could not save job entry to database", exc_info=True) + cur.close() + def save_manual_job( + self, + _id: str, + params: str, + 
enqueued_at: str, + started_at: str, + finished_at: str, + results: str, + created_by: str, + status: str, + ): + if not self.connection: + raise Exception("Invalid connection") + params try: cur = self.connection.cursor() + + # Divide job.args into separate elements + vuln_id, repo, versions = params + + # Insert into vulnerability table + cur.execute( + "INSERT INTO vulnerability (vuln_id, last_modified_date, source) " + "VALUES (%s, %s, %s)", + (vuln_id, enqueued_at, created_by), + ) + + # Retrieve the newly inserted vulnerability ID + cur.execute("SELECT _id FROM vulnerability WHERE vuln_id = %s", (vuln_id,)) + vulnerability_id = cur.fetchone()[0] + + # Insert into processed_vuln table + cur.execute( + "INSERT INTO processed_vuln (fk_vulnerability, repository, versions) " + "VALUES (%s, %s, %s)", + (vulnerability_id, repo, versions), + ) + + # Retrieve the newly inserted processed_vuln ID cur.execute( - "UPDATE entries SET descr = %s, published_date = %s, last_modified_date = %s WHERE vuln_id = %s", - (descr, published_date, last_modified_date, vuln_id), + "SELECT _id FROM processed_vuln WHERE fk_vulnerability = %s", + (vulnerability_id,), ) + processed_vuln_id = cur.fetchone()[0] + + # Insert into job table + cur.execute( + "INSERT INTO job (_id, pv_id, params, enqueued_at, started_at, finished_at, results, created_by, status) " + "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", + ( + _id, + processed_vuln_id, + params, + enqueued_at, + started_at, + finished_at, + results, + created_by, + status, + ), + ) + self.connection.commit() cur.close() except Exception: - logger.error("Could not update vulnerability in database", exc_info=True) + logger.error("Could not save job entry to database", exc_info=True) cur.close() - def save_job( + def save_dependent_job( self, - _id, - pv_id, - params, - enqueued_at, - started_at, - finished_at, - results, - created_by, - created_from, - status, + parent_id: str, + _id: str, + params: str, + enqueued_at: str, + started_at: 
str, + finished_at: str, + results: str, + created_by: str, + status: str, ): if not self.connection: raise Exception("Invalid connection") + params try: cur = self.connection.cursor() + + # retrieve parent job + parent_job = self.lookup_job_id(parent_id) + created_from = parent_job["_id"] + parent_job_pv_id = parent_job["pv_id"] + + # Insert child job into job table cur.execute( - "INSERT INTO job (_id, pv_id, params, enqueued_at, started_at, finished_at, results, created_by, created_from, status) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", + "INSERT INTO job (_id, pv_id, params, enqueued_at, started_at, finished_at, results, created_by, created_from, status) " + "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", ( _id, - pv_id, + parent_job_pv_id, params, enqueued_at, started_at, @@ -209,6 +347,7 @@ def save_job( status, ), ) + self.connection.commit() cur.close() except Exception: @@ -245,7 +384,7 @@ def lookup_processed_no_job(self): cur.close() return results - def get_processed_vulns(self): # every entry + def get_processed_vulns(self): if not self.connection: raise Exception("Invalid connection") results = [] @@ -273,7 +412,7 @@ def get_processed_vulns_not_in_job( try: cur = self.connection.cursor(cursor_factory=DictCursor) cur.execute( - "SELECT pv._id, pv.repository, pv.versions, v.vuln_id FROM processed_vuln pv JOIN vulnerability v ON v._id = pv.fk_vulnerability WHERE pv._id NOT IN (SELECT pv_id FROM job)" + "SELECT pv._id, pv.repository, pv.versions, v.vuln_id FROM processed_vuln pv JOIN vulnerability v ON v._id = pv.fk_vulnerability LEFT JOIN job j ON pv._id = j.pv_id WHERE j.pv_id IS NULL" ) results = cur.fetchall() except Exception: @@ -304,7 +443,7 @@ def get_unprocessed_vulns(self): cur.close() return results - def save_processed_vuln(self, fk_vuln, repository, versions): + def save_processed_vuln(self, fk_vuln: int, repository: str, versions: str): if not self.connection: raise Exception("Invalid connection") try: @@ -321,6 +460,70 @@ def 
save_processed_vuln(self, fk_vuln, repository, versions): ) cur.close() + def get_all_jobs(self): + if not self.connection: + raise Exception("Invalid connection") + results = [] + try: + cur = self.connection.cursor(cursor_factory=RealDictCursor) + cur.execute("SELECT * FROM job") + results = cur.fetchall() + except Exception: + logger.error( + "Could not retrieve jobs from database", + exc_info=True, + ) + finally: + cur.close() + return results + + def lookup_job_id(self, job_id: str): + if not self.connection: + raise Exception("Invalid connection") + results = None + try: + cur = self.connection.cursor(cursor_factory=DictCursor) + cur.execute("SELECT * FROM job WHERE _id = %s", (job_id,)) + results = cur.fetchone() + self.connection.commit() + logger.error(f"Job {job_id} retrieved correctly") + except Exception: + self.connection.rollback() + logger.error("Could not lookup job in database", exc_info=True) + finally: + cur.close() + return results + + def update_job( + self, + job_id: str, + status: str, + started_at: str = None, + ended_at: str = None, + results: str = None, + ): + if not self.connection: + raise Exception("Invalid connection") + + try: + cur = self.connection.cursor() + + if ended_at is None: + cur.execute( + "UPDATE job SET status = %s, started_at = %s WHERE _id = %s", + (status, started_at, job_id), + ) + else: + cur.execute( + "UPDATE job SET status = %s, finished_at = %s, results = %s WHERE _id = %s", + (status, ended_at, results, job_id), + ) + self.connection.commit() + cur.close() + except Exception: + logger.error("Could not update job status in database", exc_info=True) + cur.close() + def parse_connect_string(connect_string): try: diff --git a/prospector/data_sources/nvd/filter_entries.py b/prospector/data_sources/nvd/filter_entries.py index 4a02258ce..dc6898f06 100644 --- a/prospector/data_sources/nvd/filter_entries.py +++ b/prospector/data_sources/nvd/filter_entries.py @@ -1,26 +1,33 @@ +import asyncio import csv import datetime 
import json +import aiohttp import psycopg2 import requests from psycopg2.extensions import parse_dsn from psycopg2.extras import DictCursor, DictRow, Json -from versions_extraction import ( + +from backenddb.postgres import PostgresBackendDB +from data_sources.nvd.versions_extraction import ( extract_version_range, extract_version_ranges_cpe, process_versions, ) - -from commitdb.postgres import PostgresCommitDB from datamodel.nlp import extract_products +from log.logger import logger from util.config_parser import parse_config_file config = parse_config_file() +with open("./data/project_metadata.json", "r") as f: + global match_list + match_list = json.load(f) + def connect_to_db(): - db = PostgresCommitDB( + db = PostgresBackendDB( config.database.user, config.database.password, config.database.host, @@ -35,7 +42,7 @@ def disconnect_from_database(db): db.disconnect() -def retrieve_vulns(d_time): +async def retrieve_vulns(d_time): start_date, end_date = get_time_range(d_time) @@ -45,18 +52,18 @@ def retrieve_vulns(d_time): nvd_url += f"lastModStartDate={start_date}&lastModEndDate={end_date}" - # Retrieve the data from NVD - try: - print(nvd_url) - response = requests.get(nvd_url) - except Exception as e: - print(str(e)) - - if response.status_code == 200: - data = json.loads(response.text) - - else: - print("Error while trying to retrieve entries") + async with aiohttp.ClientSession() as session: + try: + async with session.get(nvd_url) as response: + if response.status == 200: + data = await response.json() + else: + print("Error while trying to retrieve entries") + except aiohttp.ClientError as e: + print(str(e)) + logger.error( + "Error while retrieving vulnerabilities from NVD", exc_info=True + ) # save to db save_vuln_to_db(data) @@ -132,7 +139,7 @@ def get_time_range(d_time): return start_date, end_date -def process_entries(): +async def process_entries(): # start_date,end_date=get_time_range(d_time) db = connect_to_db() @@ -145,7 +152,7 @@ def 
process_entries(): entry_id = unprocessed_vuln[0] raw_record = unprocessed_vuln[1] - processed_vuln = map_entry(raw_record) + processed_vuln = await map_entry(raw_record) if processed_vuln is not None: processed_vulns.append(processed_vuln) db.save_processed_vuln( @@ -155,10 +162,10 @@ def process_entries(): return processed_vulns -def map_entry(vuln): +async def map_entry(vuln): # TODO: improve mapping technique - with open("./data/project_metadata.json", "r") as f: - match_list = json.load(f) + # async with aiofiles.open("./data/project_metadata.json", "r") as f: + # match_list = json.loads(await f.read()) project_names = extract_products(vuln["cve"]["descriptions"][0]["value"]) # print(project_names) @@ -178,3 +185,24 @@ def map_entry(vuln): return filtered_vuln return None + + +# if no map is possible search project name using GitHub API +def retrieve_repository(project_name): + """ + Retrieve the GitHub repository URL for a given project name + """ + # GitHub API endpoint for searching repositories + url = "https://api.github.com/search/repositories" + + query_params = {"q": project_name, "sort": "stars", "order": "desc"} + + response = requests.get(url, params=query_params) + + if response.status_code == 200: + data = response.json() + if data["total_count"] > 0: + repository_url = data["items"][0]["html_url"] + return repository_url + + return None diff --git a/prospector/data_sources/nvd/job_creation.py b/prospector/data_sources/nvd/job_creation.py index 935afbbbd..85bd066fd 100644 --- a/prospector/data_sources/nvd/job_creation.py +++ b/prospector/data_sources/nvd/job_creation.py @@ -1,94 +1,77 @@ import json import sys +import time +from datetime import datetime, timedelta import redis -from rq import Connection, Queue -from rq.job import Job +from rq import Connection, Queue, get_current_job -from commitdb.postgres import PostgresCommitDB +from backenddb.postgres import PostgresBackendDB from core.prospector import prospector from core.report import 
generate_report +from log.logger import logger from util.config_parser import parse_config_file # get the redis server url config = parse_config_file() -# redis_url = config.redis_url +redis_url = config.redis_url backend = config.backend -redis_url = "redis://localhost:6379/0" -print("redis url: ", redis_url) -print("redis url: ", backend) - def run_prospector(vuln_id, repo_url, v_int): - results, advisory_record = prospector( - vulnerability_id=vuln_id, - repository_url=repo_url, - version_interval=v_int, - backend_address=backend, - ) - generate_report( - results, - advisory_record, - "html", - f"data_sources/reports/{vuln_id}", - ) - return results, advisory_record - - -# def create_prospector_job(entry): -# # data = json.loads(entry) -# -# id = entry["nvd_info"]["cve"]["id"] -# repo = entry["repo_url"] -# version = entry["version_interval"] -# -# with Connection(redis.from_url(redis_url)): -# queue = Queue(default_timeout=300) -# -# job = Job.create( -# run_prospector, -# args=(id, repo, version), -# description="prospector job", -# id=id, -# ) -# queue.enqueue_job(job) -# -# #response_object = { -# # "job_data": { -# # "job_id": job.get_id(), -# # "job_status": job.get_status(), -# # "job_queue_position": job.get_position(), -# # "job_description": job.description, -# # "job_enqueued_at": job.created_at, -# # "job_started_at": job.started_at, -# # "job_finished_at": job.ended_at, -# # "job_result": job.result, -# # "job_args": job.args -# # } -# #} -# return job -# + start_time = time.time() + job = get_current_job() + db = connect_to_db() + db.update_job(job.get_id(), job.get_status(), job.started_at) + + try: + results, advisory_record = prospector( + vulnerability_id=vuln_id, + repository_url=repo_url, + version_interval=v_int, + backend_address=backend, + ) + generate_report( + results, + advisory_record, + "html", + f"data_sources/reports/{vuln_id}", + ) + except Exception: + end_time = time.time() + elapsed_time = end_time - start_time + ended_at = 
job.started_at + timedelta(seconds=int(elapsed_time)) + logger.error("job failed during execution") + print(job.get_id(), "failed", ended_at) + db.update_job(job.get_id(), "failed", ended_at=ended_at) + db.disconnect() + else: + end_time = time.time() + elapsed_time = end_time - start_time + ended_at = job.started_at + timedelta(seconds=int(elapsed_time)) + print(job.get_id(), "finished", ended_at, f"data_sources/reports/{vuln_id}") + db.update_job( + job.get_id(), + "finished", + ended_at=ended_at, + results=f"data_sources/reports/{vuln_id}", + ) + db.disconnect() + return f"data_sources/reports/{vuln_id}" -def create_prospector_job(vuln_id, repo, version): +def create_prospector_job(vuln_id, repo, version): with Connection(redis.from_url(redis_url)): queue = Queue(default_timeout=500) + job = queue.enqueue(run_prospector, args=(vuln_id, repo, version)) - job = Job.create( - run_prospector, - args=(vuln_id, repo, version), - description="prospector job", - id=vuln_id, - ) - queue.enqueue_job(job) return job def connect_to_db(): - db = PostgresCommitDB( + db = PostgresBackendDB( config.database.user, config.database.password, config.database.host, @@ -99,52 +82,35 @@ def connect_to_db(): return db -def disconnect_from_database(db): - db.disconnect() - - -# def save_job_to_db(job): -# db = connect_to_db() -# results="" -# created_from="Auto" -# processed_vulns = db.lookup_processed_no_job() -# pv_id=processed_vulns[0] -# -# -# -# -# db.save_job(job.get_id(),pv_id,job.args,job.created_at,job.started_at,job.ended_at,job.result,job.origin,created_from, job.get_status(refresh=True)) -# -# db.disconnect() - - -# separate job creation task -# retrieve processed vulns and cve_id, -# save_job using id from retrieved processed vulns -def enqueue_jobs(): +async def enqueue_jobs(): db = connect_to_db() processed_vulns = db.get_processed_vulns_not_in_job() print(processed_vulns) - created_from = "Auto" + created_by = "Auto" for processed_vuln in processed_vulns: - pv_id = 
processed_vuln[0] - pv_repository = processed_vuln[1] - pv_versions = processed_vuln[2] - v_vuln_id = processed_vuln[3] - - job = create_prospector_job(v_vuln_id, pv_repository, pv_versions) - - db.save_job( - job.get_id(), - pv_id, - job.args, - job.created_at, - job.started_at, - job.ended_at, - job.result, - job.origin, - created_from, - job.get_status(refresh=True), - ) + pv_id = processed_vuln["_id"] + pv_repository = processed_vuln["repository"] + pv_versions = processed_vuln["versions"] + v_vuln_id = processed_vuln["vuln_id"] + + try: + job = create_prospector_job(v_vuln_id, pv_repository, pv_versions) + except Exception: + logger.error("error while creating automatically the jobs", exc_info=True) + + try: + db.save_job( + job.get_id(), + pv_id, + job.args, + job.created_at, + job.started_at, + job.ended_at, + job.result, + created_by, + job.get_status(refresh=True), + ) + except Exception: + logger.error("error while saving automatically the jobs", exc_info=True) db.disconnect() diff --git a/prospector/data_sources/nvd/version_extraction_test.py b/prospector/data_sources/nvd/version_extraction_test.py index 2e71ac10a..0770dd498 100644 --- a/prospector/data_sources/nvd/version_extraction_test.py +++ b/prospector/data_sources/nvd/version_extraction_test.py @@ -1,6 +1,7 @@ import pytest from data_sources.nvd.versions_extraction import ( + extract_version_range, extract_version_ranges_cpe, extract_version_ranges_description, process_versions, @@ -11,6 +12,7 @@ ADVISORY_TEXT_3 = "Pivotal Spring Framework through 5.3.16 suffers from a potential remote code execution (RCE) issue if used for Java deserialization of untrusted data. Depending on how the library is implemented within a product, this issue may or not occur, and authentication may be required. NOTE: the vendor's position is that untrusted data is not an intended use case. The product's behavior will not be changed because some users rely on deserialization of trusted data." 
ADVISORY_TEXT_4 = "Integer overflow in java/org/apache/tomcat/util/buf/Ascii.java in Apache Tomcat before 6.0.40, 7.x before 7.0.53, and 8.x before 8.0.4, when operated behind a reverse proxy, allows remote attackers to conduct HTTP request smuggling attacks via a crafted Content-Length HTTP header." ADVISORY_TEXT_5 = "FasterXML jackson-databind through 2.8.10 and 2.9.x through 2.9.3 allows unauthenticated remote code execution because of an incomplete fix for the CVE-2017-7525 deserialization flaw. This is exploitable by sending maliciously crafted JSON input to the readValue method of the ObjectMapper, bypassing a blacklist that is ineffective if the Spring libraries are available in the classpath." +ADVISORY_TEXT_6 = "Allocation of Resources Without Limits or Throttling vulnerability in Apache Software Foundation Apache Struts.This issue affects Apache Struts: through 2.5.30, through 6.1.2.\n\nUpgrade to Struts 2.5.31 or 6.1.2.1 or greater." JSON_DATA_1 = { "configurations": [ { @@ -97,33 +99,27 @@ ] } +JSON_DATA_4 = {} + + VERSION_RANGES = ["[1.0:2.0]", "(2.0:3.0)", "[2.1:None)", "[4.0:5.0]"] def test_extract_version_ranges_description(): - affected_version, fixed_version = extract_version_ranges_description( - ADVISORY_TEXT_1 - ) - assert affected_version == "9.4.23" - assert fixed_version is None - - affected_version, fixed_version = extract_version_ranges_description( - ADVISORY_TEXT_2 - ) - assert affected_version == "4.7.0" - assert fixed_version is None - - affected_version, fixed_version = extract_version_ranges_description( - ADVISORY_TEXT_3 - ) - assert affected_version == "5.3.16" - assert fixed_version is None - - affected_version, fixed_version = extract_version_ranges_description( - ADVISORY_TEXT_4 - ) - assert affected_version is None - assert fixed_version == "8.0.4" + version_range = extract_version_ranges_description(ADVISORY_TEXT_1) + assert version_range == "9.4.23:None" + + version_range = extract_version_ranges_description(ADVISORY_TEXT_2) 
+ assert version_range == "4.7.0:None" + + version_range = extract_version_ranges_description(ADVISORY_TEXT_3) + assert version_range == "5.3.16:None" + + version_range = extract_version_ranges_description(ADVISORY_TEXT_4) + assert version_range == "None:8.0.4" + + version_range = extract_version_ranges_description(ADVISORY_TEXT_6) + assert version_range == "6.1.2:None" def test_extract_version_ranges_cpe(): @@ -145,3 +141,8 @@ def test_extract_version_ranges_cpe(): def test_process_ranges(): version_ranges = process_versions(VERSION_RANGES) assert version_ranges == "4.0:5.1" + + +def test_extract_version_ranges(): + version_range = extract_version_range(JSON_DATA_4, ADVISORY_TEXT_6) + assert version_range == "6.1.2:None" diff --git a/prospector/data_sources/nvd/versions_extraction.py b/prospector/data_sources/nvd/versions_extraction.py index fa8f01c77..39bf4e910 100644 --- a/prospector/data_sources/nvd/versions_extraction.py +++ b/prospector/data_sources/nvd/versions_extraction.py @@ -142,34 +142,13 @@ def extract_version_ranges_description(description): min_dist_vuln = dist affected_version = version - return affected_version, fixed_version + return f"{affected_version}:{fixed_version}" def extract_version_range(json_data, description): version_range = extract_version_ranges_cpe(json_data) - if not version_range: - # try using the description - version_range = extract_version_ranges_description(description) - else: + if version_range: version_range = process_versions(version_range) + else: + version_range = extract_version_ranges_description(description) return version_range - - -def retrieve_repository(project_name): - """ - Retrieve the GitHub repository URL for a given project name - """ - # GitHub API endpoint for searching repositories - url = "https://api.github.com/search/repositories" - - query_params = {"q": project_name, "sort": "stars", "order": "desc"} - - response = requests.get(url, params=query_params) - - if response.status_code == 200: - data = 
response.json() - if data["total_count"] > 0: - repository_url = data["items"][0]["html_url"] - return repository_url - - return None diff --git a/prospector/ddl/30_vulnerability.sql b/prospector/ddl/30_vulnerability.sql index 7d2da9bd0..f608de5b8 100644 --- a/prospector/ddl/30_vulnerability.sql +++ b/prospector/ddl/30_vulnerability.sql @@ -7,10 +7,11 @@ DROP TABLE IF EXISTS public.vulnerability; CREATE TABLE public.vulnerability ( _id SERIAL PRIMARY KEY, vuln_id varchar NOT NULL, - published_date DATE NOT NULL, - last_modified_date DATE NOT NULL, + published_date timestamp, + last_modified_date timestamp, raw_record JSON, source varchar, url varchar, - UNIQUE (vuln_id,last_modified_date) + alias varchar[], + UNIQUE (vuln_id,last_modified_date,source) ); diff --git a/prospector/ddl/50_job.sql b/prospector/ddl/50_job.sql index e1a675740..248226a3a 100644 --- a/prospector/ddl/50_job.sql +++ b/prospector/ddl/50_job.sql @@ -6,7 +6,7 @@ DROP TABLE IF EXISTS public.job; CREATE TABLE public.job ( _id varchar NOT null PRIMARY KEY, - pv_id INT NOT NULL, + pv_id INT, params varchar NOT NULL, enqueued_at timestamp, started_at timestamp, @@ -15,5 +15,6 @@ CREATE TABLE public.job ( created_by varchar, created_from varchar, status varchar, - FOREIGN KEY (pv_id) REFERENCES public.processed_vuln (_id) + FOREIGN KEY (pv_id) REFERENCES public.processed_vuln (_id), + FOREIGN KEY (created_from) REFERENCES public.job (_id) ); diff --git a/prospector/docker-compose.yml b/prospector/docker-compose.yml index df7fdd36a..8de0ae35d 100644 --- a/prospector/docker-compose.yml +++ b/prospector/docker-compose.yml @@ -28,7 +28,7 @@ services: context: . 
dockerfile: docker/worker/Dockerfile volumes: - - .:/pythonimports + - ./data_sources/reports:/app/data_sources/reports depends_on: - redis environment: diff --git a/prospector/docker/Dockerfile b/prospector/docker/Dockerfile index ce7e232f4..05f47adda 100644 --- a/prospector/docker/Dockerfile +++ b/prospector/docker/Dockerfile @@ -1,8 +1,11 @@ FROM python:3.10-slim +COPY . /app +WORKDIR /app RUN pip install --upgrade pip RUN apt update && apt install -y --no-install-recommends gcc g++ libffi-dev python3-dev libpq-dev git curl -COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt RUN python -m spacy download en_core_web_sm RUN apt autoremove -y gcc g++ libffi-dev python3-dev && apt clean && rm -rf /var/lib/apt/lists/* + +ENV PYTHONPATH "${PYTHONPATH}:/app" diff --git a/prospector/docker/cli/Dockerfile b/prospector/docker/cli/Dockerfile index 058e4bcdf..befb7cf3d 100644 --- a/prospector/docker/cli/Dockerfile +++ b/prospector/docker/cli/Dockerfile @@ -16,12 +16,12 @@ FROM prospector-base:1.0 -WORKDIR /clirun - -VOLUME ["/clirun"] -ENV PYTHONPATH "${PYTHONPATH}:/clirun" - -# check if Prospector is running containerised -ENV IN_CONTAINER=1 +#WORKDIR /clirun +# +#VOLUME ["/clirun"] +#ENV PYTHONPATH "${PYTHONPATH}:/clirun" +#WORKDIR /app +#ENV PYTHONPATH "${PYTHONPATH}:/app" +VOLUME [ "/results" ] ENTRYPOINT [ "python","cli/main.py" ] diff --git a/prospector/docker/service/Dockerfile b/prospector/docker/service/Dockerfile index d9d0c1a01..921091f72 100644 --- a/prospector/docker/service/Dockerfile +++ b/prospector/docker/service/Dockerfile @@ -19,8 +19,8 @@ FROM prospector-base:1.0 -VOLUME ["/app"] -ENV PYTHONPATH "${PYTHONPATH}:/app" -WORKDIR /app +#VOLUME ["/app"] +#ENV PYTHONPATH "${PYTHONPATH}:/app" +#WORKDIR /app CMD ["python","./service/main.py"] diff --git a/prospector/docker/worker/Dockerfile b/prospector/docker/worker/Dockerfile index 2848666cb..d6605995e 100644 --- a/prospector/docker/worker/Dockerfile +++ 
b/prospector/docker/worker/Dockerfile @@ -66,8 +66,10 @@ RUN apt update && apt install -y --no-install-recommends supervisor COPY docker/worker/start_rq_worker.sh /usr/local/bin/start_rq_worker.sh COPY docker/worker/etc_supervisor_confd_rqworker.conf.j2 /etc/supervisor.d/rqworker.ini.j2 -VOLUME ["/pythonimports"] -ENV PYTHONPATH "${PYTHONPATH}:/pythonimports" +#VOLUME ["/pythonimports"] +#ENV PYTHONPATH "${PYTHONPATH}:/pythonimports" + +VOLUME ["data_sources/nvd/reports"] RUN chmod +x /usr/local/bin/start_rq_worker.sh #CMD tail -f /dev/null diff --git a/prospector/docker/worker/etc_supervisor_confd_rqworker.conf.j2 b/prospector/docker/worker/etc_supervisor_confd_rqworker.conf.j2 index 6003db806..30375d14a 100644 --- a/prospector/docker/worker/etc_supervisor_confd_rqworker.conf.j2 +++ b/prospector/docker/worker/etc_supervisor_confd_rqworker.conf.j2 @@ -4,7 +4,7 @@ ; (possibly with minor modifications) [program:rqworker] -command=/usr/local/bin/python3 /usr/local/bin/rq worker {{env['RQ_QUEUE']}} -u redis://{{env['REDIS_HOST']}}:{{env['REDIS_PORT']}}/{{env['REDIS_DB']}} --logging_level {{env['LOG_LEVEL']}} --path /pythonimports/data_sources/nvd +command=/usr/local/bin/python3 /usr/local/bin/rq worker {{env['RQ_QUEUE']}} -u redis://{{env['REDIS_HOST']}}:{{env['REDIS_PORT']}}/{{env['REDIS_DB']}} --logging_level {{env['LOG_LEVEL']}} --path /app/data_sources/nvd --path /app/service/api/routers process_name=%(program_name)s%(process_num)01d ; If you want to run more than one worker instance, increase this @@ -14,7 +14,7 @@ redirect_stderr=true ; This is the directory from which RQ is ran. Be sure to point this to the ; directory where your source code is importable from. rq-scheduler depends ; on this directory to correctly import functions. -directory=/pythonimports +directory=/app ; RQ requires the TERM signal to perform a warm shutdown. 
If RQ does not die ; within 10 seconds, supervisor will forcefully kill it diff --git a/prospector/requirements.txt b/prospector/requirements.txt index b965949e1..7ad4544e4 100644 --- a/prospector/requirements.txt +++ b/prospector/requirements.txt @@ -139,16 +139,15 @@ typing-inspect==0.9.0 tzdata==2024.1 ujson==5.10.0 url-normalize==1.4.3 -urllib3==2.2.1 -uvicorn==0.30.1 -uvloop==0.19.0 -validators==0.28.3 -wasabi==1.1.3 -watchfiles==0.22.0 -weasel==0.4.1 -websockets==12.0 -wrapt==1.16.0 -yarl==1.9.4 +urllib3==1.26.12 +uvicorn==0.19.0 +validators==0.20.0 +wasabi==0.10.1 +wrapt==1.14.1 +python-multipart==0.0.5 +omegaconf==2.2.3 +aiohttp==3.8.4 +aiofiles==23.1.0 # The following packages are considered to be unsafe in a requirements file: # setuptools diff --git a/prospector/run_prospector.sh b/prospector/run_prospector.sh index bd8baf4f3..c8215497a 100755 --- a/prospector/run_prospector.sh +++ b/prospector/run_prospector.sh @@ -14,5 +14,23 @@ if [[ "$(docker images -q $IMAGE_NAME 2> /dev/null)" == "" ]]; then docker build -t $IMAGE_NAME -f docker/cli/Dockerfile . fi +# Function to extract the value of a specific option +get_option_value() { + while [[ $# -gt 0 ]]; do + if [[ $1 == "--report-filename" ]]; then + echo "$2" + return + fi + shift + done +} + +REPORT_FILENAME=$(get_option_value "$@") +if [[ -z $REPORT_FILENAME ]]; then + OUTPUT_DIR="." 
+else + OUTPUT_DIR=$(dirname "$REPORT_FILENAME") +fi + # run the docker container -docker run --network=prospector_default --rm -t -v $(pwd):/clirun $IMAGE_NAME "$@" +docker run --network=prospector_default --rm -t -v $(pwd)/$OUTPUT_DIR:/app/results $IMAGE_NAME "$@" diff --git a/prospector/service/api/routers/endpoints.py b/prospector/service/api/routers/endpoints.py index 88bf8357f..52e035960 100644 --- a/prospector/service/api/routers/endpoints.py +++ b/prospector/service/api/routers/endpoints.py @@ -1,4 +1,5 @@ import os +import queue import sys from datetime import datetime @@ -11,7 +12,6 @@ from starlette.responses import RedirectResponse from data_sources.nvd.job_creation import run_prospector -from service.api.rq_utils import get_all_jobs, queue from util.config_parser import parse_config_file # from core.report import generate_report diff --git a/prospector/service/api/routers/feeds.py b/prospector/service/api/routers/feeds.py new file mode 100644 index 000000000..64f9c0c79 --- /dev/null +++ b/prospector/service/api/routers/feeds.py @@ -0,0 +1,121 @@ +import os +from datetime import datetime + +from fastapi import APIRouter, FastAPI, Request +from fastapi.responses import HTMLResponse +from fastapi.templating import Jinja2Templates +from starlette.responses import RedirectResponse + +from backenddb.postgres import PostgresBackendDB +from data_sources.nvd import filter_entries, job_creation +from log.logger import logger +from util.config_parser import parse_config_file + +router = APIRouter( + prefix="/feeds", + tags=["feeds"], + responses={404: {"description": "Not found"}}, +) + +templates = Jinja2Templates(directory="service/static") + +config = parse_config_file() +redis_url = config.redis_url + + +def connect_to_db(): + db = PostgresBackendDB( + config.database.user, + config.database.password, + config.database.host, + config.database.port, + config.database.dbname, + ) + db.connect() + return db + + +@router.get("/{vuln_id}") +async def 
get_vuln(vuln_id: str): + + db = connect_to_db() + vuln = db.lookup_vuln(vuln_id) + db.disconnect() + + if vuln: + response_object = { + "vuln_data": { + "vuln_id": vuln["vuln_id"], + "vuln_pubdate": vuln["published_date"], + "vuln_lastmoddate": vuln["last_modified_date"], + "vuln_raw_record": vuln["raw_record"], + "vuln_source": vuln["source"], + "vuln_url": vuln["url"], + "vuln_alias": vuln["alias"], + } + } + else: + response_object = {"message": "vulnerability not found"} + return response_object + + +@router.get("/") +async def get_vulnList(): + try: + db = connect_to_db() + vulnlist = db.lookup_vulnList() + db.disconnect() + except Exception: + logger.error("error updating the vuln list", exc_info=True) + return vulnlist + + +@router.get("/reports") +async def get_reports(request: Request): + report_list = [] + for filename in os.listdir("/app/data_sources/reports"): + if filename.endswith(".html"): + file_path = os.path.join("/app/data_sources/reports", filename) + mtime = os.path.getmtime(file_path) + mtime_dt = datetime.fromtimestamp(mtime) + report_list.append((os.path.splitext(filename)[0], mtime_dt)) + report_list.sort(key=lambda x: x[1], reverse=True) + return templates.TemplateResponse( + "report_list.html", {"request": request, "report_list": report_list} + ) + + +@router.get("/fetch_vulns/{d_time}") +async def get_vulns(d_time: int): + # Retrieve and save new vulns using NVD APIs + try: + await filter_entries.retrieve_vulns(d_time) + response_object = {"message": "success"} + except Exception: + response_object = {"message": "error while retrieving new vulnerabilities"} + return response_object + + +@router.post("/process_vulns") +async def get_process_vulns(): + # Process entries and save into db + try: + await filter_entries.process_entries() + response_object = {"message": "success"} + except Exception: + logger.error("error while processing vulnerabilities", exc_info=True) + response_object = {"message": "error while processing vulnerabilities"} 
+ return response_object + + +@router.post("/create_jobs") +async def create_jobs(): + # Create and enqueue the jobs. Save them into db + try: + await job_creation.enqueue_jobs() + response_object = {"message": "success"} + except Exception: + logger.error("Error while creating jobs", exc_info=True) + logger.error(f"{redis_url}") + response_object = {"message": "error while creating jobs"} + return response_object diff --git a/prospector/service/api/routers/home.py b/prospector/service/api/routers/home.py index 020ea6469..c74ef8168 100644 --- a/prospector/service/api/routers/home.py +++ b/prospector/service/api/routers/home.py @@ -1,5 +1,6 @@ import time +from api.rq_utils import queue, get_all_jobs import redis from fastapi import APIRouter, FastAPI, Request from fastapi.responses import HTMLResponse @@ -8,7 +9,6 @@ from rq.job import Job from starlette.responses import RedirectResponse -from service.api.rq_utils import get_all_jobs, queue from util.config_parser import parse_config_file # from core.report import generate_report diff --git a/prospector/service/api/routers/jobs.py b/prospector/service/api/routers/jobs.py index 0f759e485..61f371b8c 100644 --- a/prospector/service/api/routers/jobs.py +++ b/prospector/service/api/routers/jobs.py @@ -1,18 +1,18 @@ -import os - import redis -from fastapi import APIRouter +from fastapi import APIRouter, FastAPI, HTTPException, Request +from fastapi.responses import HTMLResponse +from fastapi.templating import Jinja2Templates +from pydantic import BaseModel from rq import Connection, Queue from rq.job import Job -from git.git import do_clone +from backenddb.postgres import PostgresBackendDB +from data_sources.nvd.job_creation import run_prospector from log.logger import logger from service.api.routers.nvd_feed_update import main from util.config_parser import parse_config_file config = parse_config_file() - - redis_url = config.redis_url router = APIRouter( @@ -21,83 +21,131 @@ responses={404: {"description": "Not 
found"}}, ) +templates = Jinja2Templates(directory="service/static") -# ----------------------------------------------------------------------------- -@router.post("/clone", tags=["jobs"]) -async def create_clone_job(repository): - with Connection(redis.from_url(redis_url)): - queue = Queue() - job = Job.create( - do_clone, - ( - repository, - "/tmp", - ), - description="clone job " + repository, - result_ttl=1000, - ) - queue.enqueue_job(job) - response_object = { - "job_data": { - "job_id": job.get_id(), - "job_status": job.get_status(), - "job_queue_position": job.get_position(), - "job_description": job.description, - "job_created_at": job.created_at, - "job_started_at": job.started_at, - "job_ended_at": job.ended_at, - "job_result": job.result, - } - } - return response_object +def connect_to_db(): + db = PostgresBackendDB( + config.database.user, + config.database.password, + config.database.host, + config.database.port, + config.database.dbname, + ) + db.connect() + return db -@router.get("/{job_id}", tags=["jobs"]) -async def get_job(job_id): - with Connection(redis.from_url(redis_url)): - queue = Queue() - job = queue.fetch_job(job_id) +class DbJob(BaseModel): + vuln_id: str | None = None + repo: str | None = None + version: str | None = None + status: str | None = None + started_at: str | None = None + finished_at: str | None = None + created_from: str | None = None + + +@router.get("/{job_id}") +async def get_job(job_id: str): + + db = connect_to_db() + job = db.lookup_job_id(job_id) + db.disconnect() + if job: - logger.info("job {} result: {}".format(job.get_id(), job.result)) response_object = { "job_data": { - "job_id": job.get_id(), - "job_status": job.get_status(), - "job_queue_position": job.get_position(), - "job_description": job.description, - "job_created_at": job.created_at, - "job_started_at": job.started_at, - "job_ended_at": job.ended_at, - "job_result": job.result, + "job_id": job["_id"], + "job_params": job["params"], + "job_enqueued_at": 
job["enqueued_at"], + "job_started_at": job["started_at"], + "job_finished_at": job["finished_at"], + "job_results": job["results"], + "job_created_by": job["created_by"], + "job_created_from": job["created_from"], + "job_status": job["status"], } } else: - response_object = {"status": "error"} + response_object = {"message": "job not found"} return response_object -@router.post("/update_feed", tags=["jobs"]) -async def create_update_feed_job(): +@router.get("/") +async def get_jobList(): + try: + db = connect_to_db() + joblist = db.get_all_jobs() + db.disconnect() + except Exception: + logger.error(f"error updating the job list {joblist}", exc_info=True) + return joblist + + +@router.post("/") +async def enqueue(job: DbJob): with Connection(redis.from_url(redis_url)): queue = Queue() - job = Job.create( - main, - description="update nvd feed", - result_ttl=1000, + rq_job = queue.enqueue( + run_prospector, + args=(job.vuln_id, job.repo, job.version), + at_front=True, + ) + + db = connect_to_db() + if job.created_from is None: + logger.info("saving manual job in db", exc_info=True) + db.save_manual_job( + rq_job.get_id(), + rq_job.args, + rq_job.created_at, + rq_job.started_at, + rq_job.ended_at, + rq_job.result, + "Manual", + rq_job.get_status(refresh=True), + ) + else: + logger.info("saving dependent job in db", exc_info=True) + db.save_dependent_job( + job.created_from, + rq_job.get_id(), + rq_job.args, + rq_job.created_at, + rq_job.started_at, + rq_job.ended_at, + rq_job.result, + "Manual", + rq_job.get_status(refresh=True), ) - queue.enqueue_job(job) + + db.disconnect() response_object = { "job_data": { - "job_id": job.get_id(), - "job_status": job.get_status(), - "job_queue_position": job.get_position(), - "job_description": job.description, - "job_created_at": job.created_at, - "job_started_at": job.started_at, - "job_ended_at": job.ended_at, - "job_result": job.result, + "job_id": rq_job.get_id(), + "job_status": rq_job.get_status(), + "job_queue_position": 
rq_job.get_position(), + "job_description": rq_job.description, + "job_created_at": rq_job.created_at, + "job_started_at": rq_job.started_at, + "job_ended_at": rq_job.ended_at, + "job_result": rq_job.result, } } + + return response_object + + +@router.put("/{job_id}/") +async def set_status(job: DbJob): + try: + db = connect_to_db() + db.update_job(job.status, job.vuln_id) + db.disconnect() + response_object = {"message": "job status updated"} + except Exception: + response_object = {"message": "job status not updated correctly"} + return response_object diff --git a/prospector/service/api/routers/preprocessed.py b/prospector/service/api/routers/preprocessed.py index ca90f0095..54d5ee2f2 100644 --- a/prospector/service/api/routers/preprocessed.py +++ b/prospector/service/api/routers/preprocessed.py @@ -3,7 +3,7 @@ from fastapi import APIRouter, HTTPException from fastapi.responses import JSONResponse -from commitdb.postgres import PostgresCommitDB +from backenddb.postgres import PostgresBackendDB from util.config_parser import parse_config_file config = parse_config_file() @@ -22,7 +22,7 @@ async def get_commits( repository_url: str, commit_id: Optional[str] = None, ): - db = PostgresCommitDB( + db = PostgresBackendDB( config.database.user, config.database.password, config.database.host, @@ -42,7 +42,7 @@ async def get_commits( @router.post("/") async def upload_preprocessed_commit(payload: List[Dict[str, Any]]): - db = PostgresCommitDB( + db = PostgresBackendDB( config.database.user, config.database.password, config.database.host, diff --git a/prospector/service/api/rq_utils.py b/prospector/service/api/rq_utils.py deleted file mode 100644 index 7b83905cc..000000000 --- a/prospector/service/api/rq_utils.py +++ /dev/null @@ -1,45 +0,0 @@ -import redis -from rq import Queue -from util.config_parser import parse_config_file - -config = parse_config_file() -redis_url = config.redis_url - - -redis_connection = redis.from_url(redis_url) - -queue = 
Queue(connection=redis_connection) - - -# get job ids -def get_job_ids(job_registry): - return job_registry.get_job_ids() - - -def get_all_job_ids(): - all_job_ids = [] - all_job_ids.extend(get_job_ids(queue.started_job_registry)) - all_job_ids.extend(queue.job_ids) - all_job_ids.extend(get_job_ids(queue.failed_job_registry)) - all_job_ids.extend(get_job_ids(queue.deferred_job_registry)) - all_job_ids.extend(get_job_ids(queue.finished_job_registry)) - all_job_ids.extend(get_job_ids(queue.scheduled_job_registry)) - return all_job_ids - - -# get job given its id -def get_job_from_id(job_id): - job = queue.fetch_job(job_id) - return job - - -# get all jobs -def get_all_jobs(): - # init all_jobs list - all_jobs = [] - # get all job ids - all_job_ids = get_all_job_ids() - # iterate over job ids list and fetch jobs - for job_id in all_job_ids: - all_jobs.append(get_job_from_id(job_id)) - return all_jobs diff --git a/prospector/service/main.py b/prospector/service/main.py index c33f41d1f..41ee6a795 100644 --- a/prospector/service/main.py +++ b/prospector/service/main.py @@ -1,17 +1,20 @@ import uvicorn + +# from .dependencies import oauth2_scheme +from api.routers import feeds, jobs, nvd, preprocessed, users from fastapi import FastAPI from fastapi.middleware.cors import CORSMiddleware from fastapi.responses import HTMLResponse, RedirectResponse from fastapi.staticfiles import StaticFiles from log.logger import logger - -# from .dependencies import oauth2_scheme -from service.api.routers import endpoints, home, jobs, nvd, preprocessed, users from util.config_parser import parse_config_file api_metadata = [ - {"name": "data", "description": "Operations with data used to train ML models."}, + { + "name": "data", + "description": "Operations with data used to train ML models.", + }, { "name": "jobs", "description": "Manage jobs.", @@ -35,16 +38,16 @@ app.include_router(users.router) app.include_router(nvd.router) app.include_router(preprocessed.router) 
-app.include_router(endpoints.router) -app.include_router(home.router) +app.include_router(feeds.router) +app.include_router(jobs.router) app.mount("/static", StaticFiles(directory="service/static"), name="static") # ----------------------------------------------------------------------------- @app.get("/", response_class=HTMLResponse) -async def read_items(): - response = RedirectResponse(url="/docs") +async def read_index(): + response = RedirectResponse(url="static/feed.html") return response diff --git a/prospector/service/static/feed.html b/prospector/service/static/feed.html new file mode 100644 index 000000000..96a97350b --- /dev/null +++ b/prospector/service/static/feed.html @@ -0,0 +1,47 @@ + + + + + Vulnerabilities + + + + + + + + + + +
+
+
+

Vulnerabilities

+
+ + + + + +
+ Reports + Job list + + + + + + + + + + + + +
Vuln IdPublished DateLast Mod DateSource
+
+
+
+ diff --git a/prospector/service/static/feed.js b/prospector/service/static/feed.js new file mode 100644 index 000000000..19bb260e6 --- /dev/null +++ b/prospector/service/static/feed.js @@ -0,0 +1,78 @@ +// Function to call the /process_vulns API +async function processVulns() { + fetch('/feeds/process_vulns', { method: 'POST' }) + .then(response => response.json()) + .then(data => { + console.log(data); + alert(JSON.stringify(data)); + }) + .catch(error => { + console.error(error); + }); +} + +// Function to call the /create_jobs API +async function createJobs() { + await fetch('/feeds/create_jobs', { method: 'POST' }) + .then(response => response.json()) + .then(data => { + console.log(data); + alert(JSON.stringify(data)); + }) + .catch(error => { + console.error(error); + }); +} + + +// Function to call the /fetch_vulns API +async function fetchVulns() { + const timeRange = document.getElementById("time_range").value; + + fetch('/feeds/fetch_vulns/' + timeRange, { method: 'GET' }) + .then(response => response.json()) + .then(data => { + console.log(data); + fetchVulnData() + }) + .catch(error => { + console.error(error); + }); +} + + + +// Function to update the vulnerability table with new data +async function updatefeedTable(vulnList) { + const tableBody = $('#vuln-table tbody'); + tableBody.empty(); + + for (const vuln of vulnList) { + const row = $('<tr>').addClass('highlight'); + + const vulnIdCell = $('<td>').text(vuln.vuln_id); + row.append(vulnIdCell); + const pubDateCell = $('<td>').text(vuln.published_date); + row.append(pubDateCell); + const modDateCell = $('<td>').text(vuln.last_modified_date); + row.append(modDateCell); + const sourceCell = $('<td>').text(vuln.source); + row.append(sourceCell); + + tableBody.append(row); + } +} + +// Function to fetch vulnerability data from the /feeds endpoint and update the table +async function fetchVulnData() { + fetch('/feeds') + .then(response => response.json()) + .then(data => { + updatefeedTable(data); + }) + .catch(error => { +
console.error(error); + }); +} + +fetchVulnData() diff --git a/prospector/service/static/index.css b/prospector/service/static/index.css index ce5d839a9..68bf30ea2 100644 --- a/prospector/service/static/index.css +++ b/prospector/service/static/index.css @@ -1,10 +1,9 @@ tr.highlight:hover { background-color: #E6F0FF; - cursor: pointer; } #reports { position: absolute; top: 10px; right: 10px; -} \ No newline at end of file +} diff --git a/prospector/service/static/index.html b/prospector/service/static/index.html index 10b80a73c..a40fcfdaf 100644 --- a/prospector/service/static/index.html +++ b/prospector/service/static/index.html @@ -2,13 +2,13 @@ - Job List - + Jobs + - + @@ -19,28 +19,18 @@

Job list

-

List of all the jobs. In list until result_ttl has expired

- Reports - + Feed +
- + - {% for job in joblist %} - - - - - - - {% endfor %} +
State Job Id ResultActionsSettings
- {{job.get_status()}}{{ job.id }}ResultDelete
diff --git a/prospector/service/static/index.js b/prospector/service/static/index.js new file mode 100644 index 000000000..93cded14d --- /dev/null +++ b/prospector/service/static/index.js @@ -0,0 +1,59 @@ +// Function to update the job table with new data +async function updateJobTable(jobList) { + const tableBody = $('#job-table tbody'); + tableBody.empty(); + + for (const job of jobList) { + const row = $('').addClass('highlight'); + + const statusBadge = $('').html(`${job.status}`); + row.append(statusBadge); + + const jobIdCell = $('').text(job._id); + row.append(jobIdCell); + + const resultCell = $(''); + const resultLink = $('').attr('href', `/jobs/${job._id}`).text('Result'); + resultCell.append(resultLink); + row.append(resultCell); + + const settingsCell = $(''); + const configureBtn = $('
Reports - Job list + Job list diff --git a/prospector/service/static/feed.js b/prospector/service/static/feed.js index 19bb260e6..c9472d2fc 100644 --- a/prospector/service/static/feed.js +++ b/prospector/service/static/feed.js @@ -74,5 +74,3 @@ async function fetchVulnData() { console.error(error); }); } - -fetchVulnData() diff --git a/prospector/service/static/index.css b/prospector/service/static/index.css index 68bf30ea2..3b10ec57b 100644 --- a/prospector/service/static/index.css +++ b/prospector/service/static/index.css @@ -4,6 +4,12 @@ tr.highlight:hover { #reports { position: absolute; - top: 10px; - right: 10px; + top: 40px; + right: 150px; +} + +#joblist { + position: absolute; + top: 40px; + right: 20px; } diff --git a/prospector/service/static/index.html b/prospector/service/static/index.html index a40fcfdaf..b656d6bf1 100644 --- a/prospector/service/static/index.html +++ b/prospector/service/static/index.html @@ -14,19 +14,19 @@ --> - +

Job list

- Feed + Feed
- - - + + + diff --git a/prospector/service/static/index.js b/prospector/service/static/index.js index 93cded14d..c130273b9 100644 --- a/prospector/service/static/index.js +++ b/prospector/service/static/index.js @@ -17,8 +17,11 @@ async function updateJobTable(jobList) { row.append(jobIdCell); const resultCell = $(' {% for report in report_list %} - + {% endfor %} diff --git a/prospector/util/http.py b/prospector/util/http.py index 5f3678594..443100686 100644 --- a/prospector/util/http.py +++ b/prospector/util/http.py @@ -76,7 +76,6 @@ def get_urls(url: str) -> List[str]: # TODO: properly scrape github issues def extract_from_webpage(url: str, attr_name: str, attr_value: List[str]) -> str: - content = fetch_url(url, None, False) if not content: return "" From c2752e7a3b966445d289cfee5ea8b63a4695873c Mon Sep 17 00:00:00 2001 From: I748376 Date: Wed, 17 Jul 2024 15:38:38 +0000 Subject: [PATCH 55/83] Clean up (removes unused code, updates dependencies, checking for status code 200, adjusts database hostname) --- prospector/backenddb/postgres.py | 205 ++++++++++++++++++++++++++-- prospector/cli/main.py | 14 +- prospector/config-sample.yaml | 2 +- prospector/core/prospector.py | 78 ++++++++--- prospector/docker/worker/Dockerfile | 2 +- prospector/requirements.in | 3 +- prospector/requirements.txt | 103 +++++++------- prospector/run_prospector.sh | 6 +- 8 files changed, 318 insertions(+), 95 deletions(-) diff --git a/prospector/backenddb/postgres.py b/prospector/backenddb/postgres.py index b2f0a029d..1d87b34f9 100644 --- a/prospector/backenddb/postgres.py +++ b/prospector/backenddb/postgres.py @@ -2,6 +2,7 @@ This module implements an abstraction layer on top of the underlying database where pre-processed commits are stored """ + import os from typing import Any, Dict, List @@ -28,6 +29,15 @@ class PostgresBackendDB(BackendDB): """ def __init__(self, user, password, host, port, dbname): + """Initialize a PostgresBackendDB instance with database connection details. 
+ + Args: + user (str): The username for the database. + password (str): The password for the database. + host (str): The hostname or IP address of the database server. + port (str): The port number of the database server. + dbname (str): The name of the database to connect to. + """ self.user = user self.password = password self.host = host @@ -44,7 +54,7 @@ def connect(self): host=self.host, port=self.port, ) - print("Connected to the database") + # print("Connected to the database") # Sanity Check except Exception: self.host = "localhost" self.connection = psycopg2.connect( @@ -63,7 +73,23 @@ def disconnect(self): else: print("No active database connection") - def lookup(self, repository: str, commit_id: str = None) -> List[Dict[str, Any]]: + def lookup( + self, repository: str, commit_id: str = None + ) -> List[Dict[str, Any]]: + """Look up commits in the database based on repository and commit ID. + + Args: + repository (str): The repository name. + commit_id (str, optional): A comma-separated list of commit IDs. If not + provided, all commits for the repository are returned. + + Returns: + A list of dictionaries, where each dictionary represents a commit and its + metadata. + + Raises: + Exception: If there is no active database connection. + """ if not self.connection: raise Exception("Invalid connection") @@ -87,12 +113,22 @@ def lookup(self, repository: str, commit_id: str = None) -> List[Dict[str, Any]] results.append(cur.fetchone()) return [dict(row) for row in results] # parse_commit_from_db except Exception: - logger.error("Could not lookup commit vector in database", exc_info=True) + logger.error( + "Could not lookup commit vector in database", exc_info=True + ) return [] finally: cur.close() def save(self, commit: Dict[str, Any]): + """Save a commit to the database. + + Args: + commit (dict): A dictionary representing the commit and its metadata. + + Raises: + Exception: If there is no active database connection. 
+ """ if not self.connection: raise Exception("Invalid connection") @@ -104,7 +140,9 @@ def save(self, commit: Dict[str, Any]): self.connection.commit() cur.close() except Exception: - logger.error("Could not save commit vector to database", exc_info=True) + logger.error( + "Could not save commit vector to database", exc_info=True + ) cur.close() def reset(self): @@ -112,6 +150,14 @@ def reset(self): self.run_sql_script("ddl/20_users.sql") def run_sql_script(self, script_file): + """Run an SQL script file on the database. + + Args: + script_file (str): The path to the SQL script file. + + Raises: + Exception: If there is no active database connection. + """ if not self.connection: raise Exception("Invalid connection") @@ -125,6 +171,18 @@ def run_sql_script(self, script_file): cursor.close() def lookup_vuln_id(self, vuln_id: str, last_modified_date): + """Look up the vulnerability count for a given vulnerability ID and last modified date. + + Args: + vuln_id (str): The vulnerability ID. + last_modified_date: The last modified date of the vulnerability. + + Returns: + The dict of vulnerabilities matching the given ID and last modified date. + + Raises: + Exception: If there is no active database connection. + """ if not self.connection: raise Exception("Invalid connection") results = None @@ -140,28 +198,53 @@ def lookup_vuln_id(self, vuln_id: str, last_modified_date): self.connection.commit() except Exception: self.connection.rollback() - logger.error("Could not lookup vulnerability in database", exc_info=True) + logger.error( + "Could not lookup vulnerability in database", exc_info=True + ) finally: cur.close() return results def lookup_vuln(self, vuln_id: str): + """Look up a vulnerability by its ID. + + Args: + vuln_id (str): The vulnerability ID. + + Returns: + A dictionary representing the vulnerability data, or None if not found. + + Raises: + Exception: If there is no active database connection. 
+ """ if not self.connection: raise Exception("Invalid connection") results = None try: cur = self.connection.cursor(cursor_factory=RealDictCursor) - cur.execute("SELECT * FROM vulnerability WHERE vuln_id = %s", (vuln_id,)) + cur.execute( + "SELECT * FROM vulnerability WHERE vuln_id = %s", (vuln_id,) + ) results = cur.fetchone() self.connection.commit() except Exception: self.connection.rollback() - logger.error("Could not lookup vulnerability in database", exc_info=True) + logger.error( + "Could not lookup vulnerability in database", exc_info=True + ) finally: cur.close() return results def lookup_vulnList(self): + """Retrieve a list of all vulnerabilities from the database. + + Returns: + A list of dictionaries, where each dictionary represents a vulnerability. + + Raises: + Exception: If there is no active database connection. + """ if not self.connection: raise Exception("Invalid connection") results = [] @@ -187,6 +270,19 @@ def save_vuln( source: str, url: str, ): + """Save a vulnerability to the database. + + Args: + vuln_id (str): The vulnerability ID. + published_date (str): The published date of the vulnerability. + last_modified_date (str): The last modified date of the vulnerability. + raw_record (Json): The raw vulnerability record data. + source (str): The source of the vulnerability data. + url (str): The URL associated with the vulnerability. + + Raises: + Exception: If there is no active database connection. 
+ """ if not self.connection: raise Exception("Invalid connection") @@ -194,12 +290,21 @@ def save_vuln( cur = self.connection.cursor() cur.execute( "INSERT INTO vulnerability (vuln_id, published_date, last_modified_date, raw_record, source, url) VALUES (%s,%s,%s,%s,%s,%s)", - (vuln_id, published_date, last_modified_date, raw_record, source, url), + ( + vuln_id, + published_date, + last_modified_date, + raw_record, + source, + url, + ), ) self.connection.commit() cur.close() except Exception: - logger.error("Could not save vulnerability to database", exc_info=True) + logger.error( + "Could not save vulnerability to database", exc_info=True + ) cur.close() def save_job( @@ -214,6 +319,22 @@ def save_job( created_by: str, status: str, ): + """Save a job to the database. + + Args: + _id (str): The job ID. + pv_id (int): The processed vulnerability ID associated with the job. + params (str): The job parameters. + enqueued_at (str): The enqueued timestamp of the job. + started_at (str): The started timestamp of the job. + finished_at (str): The finished timestamp of the job. + results (str): The job results. + created_by (str): The user who created the job. + status (str): The status of the job. + + Raises: + Exception: If there is no active database connection. 
+ """ if not self.connection: raise Exception("Invalid connection") try: @@ -267,7 +388,9 @@ def save_manual_job( ) # Retrieve the newly inserted vulnerability ID - cur.execute("SELECT _id FROM vulnerability WHERE vuln_id = %s", (vuln_id,)) + cur.execute( + "SELECT _id FROM vulnerability WHERE vuln_id = %s", (vuln_id,) + ) vulnerability_id = cur.fetchone()[0] # Insert into processed_vuln table @@ -355,6 +478,7 @@ def save_dependent_job( cur.close() def lookup_job(self): + """Retrieve all job entries from the database.""" if not self.connection: raise Exception("Invalid connection") results = [] @@ -369,6 +493,7 @@ def lookup_job(self): return results def lookup_processed_no_job(self): + """Retrieve processed vulnerability IDs that do not have a corresponding job entry.""" if not self.connection: raise Exception("Invalid connection") results = [] @@ -385,6 +510,7 @@ def lookup_processed_no_job(self): return results def get_processed_vulns(self): + """Retrieve all processed vulnerabilities from the database.""" if not self.connection: raise Exception("Invalid connection") results = [] @@ -406,6 +532,7 @@ def get_processed_vulns(self): def get_processed_vulns_not_in_job( self, ): # entries in processed vuln excluding the ones already in the job table + """Retrieve processed vulnerabilities that do not have a corresponding job entry.""" if not self.connection: raise Exception("Invalid connection") results = [] @@ -425,6 +552,7 @@ def get_processed_vulns_not_in_job( return results def get_unprocessed_vulns(self): + """retrieve unprocessed vulnerabilities from the database.""" if not self.connection: raise Exception("Invalid connection") results = [] @@ -444,6 +572,13 @@ def get_unprocessed_vulns(self): return results def save_processed_vuln(self, fk_vuln: int, repository: str, versions: str): + """Save a processed vulnerability to the database. + + Args: + fk_vuln (int): The foreign key of the vulnerability. + repository (str): The repository of the vulnerability. 
+ versions (str): The versions affected by the vulnerability. + """ if not self.connection: raise Exception("Invalid connection") try: @@ -456,11 +591,13 @@ def save_processed_vuln(self, fk_vuln: int, repository: str, versions: str): cur.close() except Exception: logger.error( - "Could not save processed vulnerability to database", exc_info=True + "Could not save processed vulnerability to database", + exc_info=True, ) cur.close() def get_all_jobs(self): + """Retrieve all job entries from the database.""" if not self.connection: raise Exception("Invalid connection") results = [] @@ -478,6 +615,14 @@ def get_all_jobs(self): return results def lookup_job_id(self, job_id: str): + """Retrieve a job entry by its ID. + + Args: + job_id (str): The ID of the job to retrieve. + + Returns: + The job entry as a dictionary, or None if not found. + """ if not self.connection: raise Exception("Invalid connection") results = None @@ -502,6 +647,15 @@ def update_job( ended_at: str = None, results: str = None, ): + """Update the status and other fields of a job entry. + + Args: + job_id (str): The ID of the job to update. + status (str): The new status of the job. + started_at (str, optional): The new started timestamp of the job. + ended_at (str, optional): The new finished timestamp of the job. + results (str, optional): The new results of the job. + """ if not self.connection: raise Exception("Invalid connection") @@ -521,11 +675,24 @@ def update_job( self.connection.commit() cur.close() except Exception: - logger.error("Could not update job status in database", exc_info=True) + logger.error( + "Could not update job status in database", exc_info=True + ) cur.close() def parse_connect_string(connect_string): + """Parse a connection string and return a dictionary with the connection parameters. + + Args: + connect_string (str): The connection string to parse. + + Returns: + A dictionary containing the connection parameters. 
+ + Raises: + Exception: If the connection string is invalid. + """ try: return parse_dsn(connect_string) except Exception: @@ -533,16 +700,28 @@ def parse_connect_string(connect_string): def build_statement(data: Dict[str, Any]): + """Build an SQL statement to insert or update a row in the commits table. + + Args: + data (Dict[str, Any]): A dictionary containing the column names and values. + + Returns: + The SQL statement as a string. + """ columns = ",".join(data.keys()) on_conflict = ",".join([f"EXCLUDED.{key}" for key in data.keys()]) return f"INSERT INTO commits ({columns}) VALUES ({','.join(['%s'] * len(data))}) ON CONFLICT ON CONSTRAINT commits_pkey DO UPDATE SET ({columns}) = ({on_conflict})" def get_args(data: Dict[str, Any]): - return tuple([Json(val) if isinstance(val, dict) else val for val in data.values()]) + """Returns a tuple containing the values in 'data'.""" + return tuple( + [Json(val) if isinstance(val, dict) else val for val in data.values()] + ) def parse_commit_from_db(raw_data: DictRow) -> Dict[str, Any]: + """Parses a commit entry from the database and returns a dictionary.""" out = dict(raw_data) out["hunks"] = [(int(x[1]), int(x[3])) for x in raw_data["hunks"]] return out diff --git a/prospector/cli/main.py b/prospector/cli/main.py index 2cdeac7d4..eae7d01ae 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -2,9 +2,6 @@ import os import signal import sys -from typing import Any, Dict - -from dotenv import load_dotenv from llm.llm_service import LLMService from util.http import ping_backend @@ -16,8 +13,6 @@ import core.report as report # noqa: E402 from cli.console import ConsoleWriter, MessageStatus # noqa: E402 -from core.prospector import TIME_LIMIT_AFTER # noqa: E402 -from core.prospector import TIME_LIMIT_BEFORE # noqa: E402 from core.prospector import prospector # noqa: E402; noqa: E402 # Load logger before doing anything else @@ -58,7 +53,10 @@ def main(argv): # noqa: C901 # Whether to use the LLMService if 
config.llm_service: - if not config.repository and not config.llm_service.use_llm_repository_url: + if ( + not config.repository + and not config.llm_service.use_llm_repository_url + ): logger.error( "Repository URL was neither specified nor allowed to obtain with LLM support. One must be set." ) @@ -80,7 +78,9 @@ def main(argv): # noqa: C901 return config.pub_date = ( - config.pub_date + "T00:00:00Z" if config.pub_date is not None else "" + config.pub_date + "T00:00:00Z" + if config.pub_date is not None + else "" ) logger.debug("Using the following configuration:") diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index f89f4b369..86e4ecad4 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -21,7 +21,7 @@ backend: http://localhost:8000 database: user: postgres password: example - host: db + host: localhost # Database address; when in containerised version, use 'db', otherwise 'localhost' port: 5432 dbname: postgres diff --git a/prospector/core/prospector.py b/prospector/core/prospector.py index 3576fb749..ca016bc81 100644 --- a/prospector/core/prospector.py +++ b/prospector/core/prospector.py @@ -117,7 +117,9 @@ def prospector( # noqa: C901 repository = Git(repository_url, git_cache) with ConsoleWriter("Git repository cloning") as console: - logger.debug(f"Downloading repository {repository.url} in {repository.path}") + logger.debug( + f"Downloading repository {repository.url} in {repository.path}" + ) repository.clone() tags = repository.get_tags() @@ -129,7 +131,9 @@ def prospector( # noqa: C901 if len(fixing_commit) > 0: candidates = get_commits_no_tags(repository, fixing_commit) - if len(candidates) > 0 and any([c for c in candidates if c in fixing_commit]): + if len(candidates) > 0 and any( + [c for c in candidates if c in fixing_commit] + ): console.print("Fixing commit found in the advisory references\n") advisory_record.has_fixing_commit = True @@ -160,9 +164,13 @@ def prospector( # noqa: C901 
candidates = filter(candidates) if len(candidates) > limit_candidates: - logger.error(f"Number of candidates exceeds {limit_candidates}, aborting.") + logger.error( + f"Number of candidates exceeds {limit_candidates}, aborting." + ) - ConsoleWriter.print(f"Candidates limitlimit exceeded: {len(candidates)}.") + ConsoleWriter.print( + f"Candidates limitlimit exceeded: {len(candidates)}." + ) return None, len(candidates) with ExecutionTimer( @@ -171,10 +179,12 @@ def prospector( # noqa: C901 with ConsoleWriter("\nProcessing commits") as writer: try: if use_backend != USE_BACKEND_NEVER: - missing, preprocessed_commits = retrieve_preprocessed_commits( - repository_url, - backend_address, - candidates, + missing, preprocessed_commits = ( + retrieve_preprocessed_commits( + repository_url, + backend_address, + candidates, + ) ) except requests.exceptions.ConnectionError: logger.error( @@ -225,7 +235,11 @@ def prospector( # noqa: C901 payload = [c.to_dict() for c in preprocessed_commits] - if len(payload) > 0 and use_backend != USE_BACKEND_NEVER and len(missing) > 0: + if ( + len(payload) > 0 + and use_backend != USE_BACKEND_NEVER + and len(missing) > 0 + ): save_preprocessed_commits(backend_address, payload) else: logger.warning("Preprocessed commits are not being sent to backend") @@ -242,9 +256,13 @@ def prospector( # noqa: C901 return ranked_candidates, advisory_record -def preprocess_commits(commits: List[RawCommit], timer: ExecutionTimer) -> List[Commit]: +def preprocess_commits( + commits: List[RawCommit], timer: ExecutionTimer +) -> List[Commit]: preprocessed_commits: List[Commit] = list() - with Counter(timer.collection.sub_collection("commit preprocessing")) as counter: + with Counter( + timer.collection.sub_collection("commit preprocessing") + ) as counter: counter.initialize("preprocessed commits", unit="commit") for raw_commit in tqdm( commits, @@ -252,7 +270,9 @@ def preprocess_commits(commits: List[RawCommit], timer: ExecutionTimer) -> List[ unit=" commit", 
): counter.increment("preprocessed commits") - counter_val = counter.__dict__["collection"]["preprocessed commits"][0] + counter_val = counter.__dict__["collection"][ + "preprocessed commits" + ][0] if counter_val % 100 == 0 and counter_val * 2 > time.time(): pass preprocessed_commits.append(make_from_raw_commit(raw_commit)) @@ -287,7 +307,9 @@ def evaluate_commits( """ with ExecutionTimer(core_statistics.sub_collection("candidates analysis")): with ConsoleWriter("Candidate analysis") as _: - ranked_commits = apply_rules(commits, advisory, enabled_rules=enabled_rules) + ranked_commits = apply_rules( + commits, advisory, enabled_rules=enabled_rules + ) return ranked_commits @@ -305,7 +327,9 @@ def remove_twins(commits: List[Commit]) -> List[Commit]: return output -def tag_and_aggregate_commits(commits: List[Commit], next_tag: str) -> List[Commit]: +def tag_and_aggregate_commits( + commits: List[Commit], next_tag: str +) -> List[Commit]: return commits if next_tag is None or next_tag == "": return commits @@ -347,7 +371,9 @@ def retrieve_preprocessed_commits( break # return list(candidates.values()), list() responses.append(r.json()) - retrieved_commits = [commit for response in responses for commit in response] + retrieved_commits = [ + commit for response in responses for commit in response + ] logger.info(f"Found {len(retrieved_commits)} preprocessed commits") @@ -368,7 +394,9 @@ def retrieve_preprocessed_commits( def save_preprocessed_commits(backend_address, payload): - with ExecutionTimer(core_statistics.sub_collection(name="save commits to backend")): + with ExecutionTimer( + core_statistics.sub_collection(name="save commits to backend") + ): with ConsoleWriter("Saving processed commits to backend") as writer: logger.debug("Sending processing commits to backend...") try: @@ -377,6 +405,7 @@ def save_preprocessed_commits(backend_address, payload): json=payload, headers={"Content-type": "application/json"}, ) + r.raise_for_status() # Throw exception if not 
status 200 logger.debug( f"Saving to backend completed (status code: {r.status_code})" ) @@ -391,6 +420,14 @@ def save_preprocessed_commits(backend_address, payload): "Could not save preprocessed commits to backend", status=MessageStatus.WARNING, ) + except requests.exceptions.HTTPError as e: + logger.error( + f"Could not reach backend, request returned with: {e}." + ) + writer.print( + "Could not save preprocessed commits to backend", + status=MessageStatus.WARNING, + ) # tries to be dynamic @@ -428,7 +465,11 @@ def get_commits_from_tags( with ConsoleWriter("Candidate commit retrieval") as writer: since = None until = None - if advisory_record.published_timestamp and not next_tag and not prev_tag: + if ( + advisory_record.published_timestamp + and not next_tag + and not prev_tag + ): since = advisory_record.reserved_timestamp - time_limit_before until = advisory_record.reserved_timestamp + time_limit_after @@ -442,7 +483,8 @@ def get_commits_from_tags( if len(candidates) == 0: candidates = repository.create_commits( - since=advisory_record.reserved_timestamp - time_limit_before, + since=advisory_record.reserved_timestamp + - time_limit_before, until=advisory_record.reserved_timestamp + time_limit_after, next_tag=None, prev_tag=None, diff --git a/prospector/docker/worker/Dockerfile b/prospector/docker/worker/Dockerfile index d6605995e..e5421220c 100644 --- a/prospector/docker/worker/Dockerfile +++ b/prospector/docker/worker/Dockerfile @@ -69,7 +69,7 @@ COPY docker/worker/etc_supervisor_confd_rqworker.conf.j2 /etc/supervisor.d/rqwor #VOLUME ["/pythonimports"] #ENV PYTHONPATH "${PYTHONPATH}:/pythonimports" -VOLUME ["data_sources/nvd/reports"] +VOLUME [ "/data_sources/reports" ] RUN chmod +x /usr/local/bin/start_rq_worker.sh #CMD tail -f /dev/null diff --git a/prospector/requirements.in b/prospector/requirements.in index 23febfc2b..31e67db6c 100644 --- a/prospector/requirements.in +++ b/prospector/requirements.in @@ -1,4 +1,5 @@ - +aiohttp +aiofiles beautifulsoup4 
colorama datasketch diff --git a/prospector/requirements.txt b/prospector/requirements.txt index 7ad4544e4..9cd75b617 100644 --- a/prospector/requirements.txt +++ b/prospector/requirements.txt @@ -5,10 +5,11 @@ # pip-compile --no-annotate --strip-extras # +aiofiles==24.1.0 aiohttp==3.9.5 aiosignal==1.3.1 annotated-types==0.7.0 -anthropic==0.30.1 +anthropic==0.31.2 antlr4-python3-runtime==4.9.3 anyio==4.4.0 appdirs==1.4.4 @@ -16,65 +17,65 @@ async-timeout==4.0.3 attrs==23.2.0 beautifulsoup4==4.12.3 blis==0.7.11 -cachetools==5.3.3 +cachetools==5.4.0 catalogue==2.0.10 cattrs==23.2.3 -certifi==2024.6.2 +certifi==2024.7.4 charset-normalizer==3.3.2 click==8.1.7 cloudpathlib==0.18.1 colorama==0.4.6 confection==0.1.5 cymem==2.0.8 -dataclasses-json==0.6.6 +dataclasses-json==0.6.7 datasketch==1.6.5 defusedxml==0.7.1 distro==1.9.0 dnspython==2.6.1 docstring-parser==0.16 -email-validator==2.1.1 -exceptiongroup==1.2.1 -fastapi==0.111.0 +email-validator==2.2.0 +exceptiongroup==1.2.2 +fastapi==0.111.1 fastapi-cli==0.0.4 -filelock==3.14.0 +filelock==3.15.4 frozenlist==1.4.1 -fsspec==2024.6.0 -google-api-core==2.19.0 -google-auth==2.29.0 +fsspec==2024.6.1 +google-api-core==2.19.1 +google-auth==2.32.0 google-cloud-aiplatform==1.49.0 -google-cloud-bigquery==3.24.0 +google-cloud-bigquery==3.25.0 google-cloud-core==2.4.1 -google-cloud-resource-manager==1.12.3 -google-cloud-storage==2.16.0 +google-cloud-resource-manager==1.12.4 +google-cloud-storage==2.17.0 google-crc32c==1.5.0 -google-resumable-media==2.7.0 -googleapis-common-protos==1.63.1 +google-resumable-media==2.7.1 +googleapis-common-protos==1.63.2 greenlet==3.0.3 -grpc-google-iam-v1==0.13.0 -grpcio==1.64.1 +grpc-google-iam-v1==0.13.1 +grpcio==1.65.1 grpcio-status==1.62.2 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.0 httpx-sse==0.4.0 -huggingface-hub==0.23.3 +huggingface-hub==0.23.5 idna==3.7 iniconfig==2.0.0 jinja2==3.1.4 jiter==0.5.0 jsonpatch==1.33 -jsonpointer==2.4 -langchain==0.2.2 -langchain-anthropic==0.1.15 
-langchain-community==0.2.3 -langchain-core==0.2.4 +jsonpointer==3.0.0 +langchain==0.2.9 +langchain-anthropic==0.1.20 +langchain-community==0.2.7 +langchain-core==0.2.21 langchain-google-vertexai==1.0.5 -langchain-mistralai==0.1.8 -langchain-openai==0.1.8 -langchain-text-splitters==0.2.1 +langchain-mistralai==0.1.10 +langchain-openai==0.1.17 +langchain-text-splitters==0.2.2 langcodes==3.4.0 -langsmith==0.1.74 +langsmith==0.1.90 language-data==1.2.0 marisa-trie==1.2.0 markdown-it-py==3.0.0 @@ -86,20 +87,20 @@ murmurhash==1.0.10 mypy-extensions==1.0.0 numpy==1.26.4 omegaconf==2.3.0 -openai==1.31.1 -orjson==3.10.3 -packaging==23.2 +openai==1.35.14 +orjson==3.10.6 +packaging==24.1 pandas==2.2.2 plac==1.4.3 pluggy==1.5.0 preshed==3.0.9 -proto-plus==1.23.0 +proto-plus==1.24.0 protobuf==4.25.3 psycopg2==2.9.9 pyasn1==0.6.0 pyasn1-modules==0.4.0 -pydantic==2.7.3 -pydantic-core==2.18.4 +pydantic==2.8.2 +pydantic-core==2.20.1 pygments==2.18.0 pytest==8.2.2 python-dateutil==2.9.0.post0 @@ -107,15 +108,15 @@ python-dotenv==1.0.1 python-multipart==0.0.9 pytz==2024.1 pyyaml==6.0.1 -redis==5.0.5 +redis==5.0.7 regex==2024.5.15 requests==2.32.3 requests-cache==0.9.6 rich==13.7.1 rq==1.16.2 rsa==4.9 -scipy==1.13.1 -shapely==2.0.4 +scipy==1.14.0 +shapely==2.0.5 shellingham==1.5.4 six==1.16.0 smart-open==7.0.4 @@ -124,30 +125,30 @@ soupsieve==2.5 spacy==3.7.5 spacy-legacy==3.0.12 spacy-loggers==1.0.5 -sqlalchemy==2.0.30 +sqlalchemy==2.0.31 srsly==2.4.8 starlette==0.37.2 -tenacity==8.3.0 -thinc==8.2.4 +tenacity==8.5.0 +thinc==8.2.5 tiktoken==0.7.0 tokenizers==0.19.1 tomli==2.0.1 tqdm==4.66.4 typer==0.12.3 -typing-extensions==4.12.1 +typing-extensions==4.12.2 typing-inspect==0.9.0 tzdata==2024.1 -ujson==5.10.0 url-normalize==1.4.3 -urllib3==1.26.12 -uvicorn==0.19.0 -validators==0.20.0 -wasabi==0.10.1 -wrapt==1.14.1 -python-multipart==0.0.5 -omegaconf==2.2.3 -aiohttp==3.8.4 -aiofiles==23.1.0 +urllib3==2.2.2 +uvicorn==0.30.1 +uvloop==0.19.0 +validators==0.33.0 +wasabi==1.1.3 
+watchfiles==0.22.0 +weasel==0.4.1 +websockets==12.0 +wrapt==1.16.0 +yarl==1.9.4 # The following packages are considered to be unsafe in a requirements file: # setuptools diff --git a/prospector/run_prospector.sh b/prospector/run_prospector.sh index 188b62367..91288fbd5 100755 --- a/prospector/run_prospector.sh +++ b/prospector/run_prospector.sh @@ -26,14 +26,14 @@ get_option_value() { } REPORT_FILENAME=$(get_option_value "$@") -echo $REPORT_FILENAME +# echo $REPORT_FILENAME # Sanity Check if [[ -z $REPORT_FILENAME ]]; then OUTPUT_DIR="" else OUTPUT_DIR=$(dirname "$REPORT_FILENAME") fi -echo $OUTPUT_DIR -echo $(pwd)/$OUTPUT_DIR +# echo $OUTPUT_DIR +# echo $(pwd)/$OUTPUT_DIR # Sanity Check # run the docker container docker run --network=prospector_default --rm -t -v $(pwd)/$OUTPUT_DIR:/app/$OUTPUT_DIR $IMAGE_NAME "$@" From 4753a3e8a7a3f5417e36aa40a85bb1c9588dc2b3 Mon Sep 17 00:00:00 2001 From: I748376 Date: Thu, 18 Jul 2024 14:01:07 +0000 Subject: [PATCH 56/83] updates tests --- prospector/core/prospector_test.py | 33 +- prospector/data_sources/nvd/nvd_test.py | 4 +- .../nvd/version_extraction_test.py | 19 +- prospector/rules/helpers_test.py | 730 +++++++++--------- prospector/service/api/api_test.py | 2 + 5 files changed, 422 insertions(+), 366 deletions(-) diff --git a/prospector/core/prospector_test.py b/prospector/core/prospector_test.py index f754c16b1..673c56de3 100644 --- a/prospector/core/prospector_test.py +++ b/prospector/core/prospector_test.py @@ -2,19 +2,50 @@ import pytest +from llm.llm_service import LLMService + from .prospector import prospector OPENCAST_CVE = "CVE-2021-21318" OPENCAST_REPO = "https://github.com/opencast/opencast" +# Mock the llm_service configuration object +class Config: + type: str = None + model_name: str = None + temperature: str = None + ai_core_sk: str = None + + def __init__(self, type, model_name, temperature, ai_core_sk): + self.type = type + self.model_name = model_name + self.temperature = temperature + self.ai_core_sk = 
ai_core_sk + + +config = Config("sap", "gpt-4", 0.0, "sk.json") + + def test_prospector_client(): results, _ = prospector( vulnerability_id=OPENCAST_CVE, repository_url=OPENCAST_REPO, version_interval="9.1:9.2", - fetch_references=False, git_cache="/tmp/gitcache", limit_candidates=5000, ) assert results[0].commit_id == "b18c6a7f81f08ed14884592a6c14c9ab611ad450" + + +def test_prospector_llm_repo_url(): + LLMService(config) + + results, _ = prospector( + vulnerability_id=OPENCAST_CVE, + version_interval="9.1:9.2", + git_cache="/tmp/gitcache", + limit_candidates=5000, + use_llm_repository_url=True, + ) + assert results[0].commit_id == "b18c6a7f81f08ed14884592a6c14c9ab611ad450" diff --git a/prospector/data_sources/nvd/nvd_test.py b/prospector/data_sources/nvd/nvd_test.py index cecb6d21c..a606d03da 100644 --- a/prospector/data_sources/nvd/nvd_test.py +++ b/prospector/data_sources/nvd/nvd_test.py @@ -1,5 +1,5 @@ -from filter_entries import process_entries, retrieve_vulns -from job_creation import enqueue_jobs +from data_sources.nvd.filter_entries import process_entries, retrieve_vulns +from data_sources.nvd.job_creation import enqueue_jobs # request new cves entries through NVD API and save to db cves = retrieve_vulns(7) diff --git a/prospector/data_sources/nvd/version_extraction_test.py b/prospector/data_sources/nvd/version_extraction_test.py index 0770dd498..38124f67c 100644 --- a/prospector/data_sources/nvd/version_extraction_test.py +++ b/prospector/data_sources/nvd/version_extraction_test.py @@ -19,9 +19,18 @@ "nodes": [ { "cpeMatch": [ - {"versionStartIncluding": "1.0", "versionEndIncluding": "2.0"}, - {"versionStartExcluding": "2.0", "versionEndExcluding": "3.0"}, - {"versionStartIncluding": "4.0", "versionEndIncluding": "5.0"}, + { + "versionStartIncluding": "1.0", + "versionEndIncluding": "2.0", + }, + { + "versionStartExcluding": "2.0", + "versionEndExcluding": "3.0", + }, + { + "versionStartIncluding": "4.0", + "versionEndIncluding": "5.0", + }, ] }, ] @@ 
-119,7 +128,7 @@ def test_extract_version_ranges_description(): assert version_range == "None:8.0.4" version_range = extract_version_ranges_description(ADVISORY_TEXT_6) - assert version_range == "6.1.2:None" + assert version_range == "6.1.2.1:None" def test_extract_version_ranges_cpe(): @@ -145,4 +154,4 @@ def test_process_ranges(): def test_extract_version_ranges(): version_range = extract_version_range(JSON_DATA_4, ADVISORY_TEXT_6) - assert version_range == "6.1.2:None" + assert version_range == "6.1.2.1:None" diff --git a/prospector/rules/helpers_test.py b/prospector/rules/helpers_test.py index 7dc803d3d..b69a0247c 100644 --- a/prospector/rules/helpers_test.py +++ b/prospector/rules/helpers_test.py @@ -8,15 +8,7 @@ from util.sample_data_generation import random_list_of_cve from .helpers import ( # extract_features, - extract_changed_relevant_paths, - extract_commit_mentioned_in_linked_pages, - extract_other_CVE_in_message, - extract_path_similarities, - extract_references_vuln_id, extract_referred_to_by_nvd, - extract_time_between_commit_and_advisory_record, - is_commit_in_given_interval, - is_commit_reachable_from_given_tag, ) @@ -69,226 +61,246 @@ def repository(): # ] -def test_extract_references_vuln_id(): - commit = Commit( - commit_id="test_commit", - repository="test_repository", - cve_refs=[ - "test_advisory_record", - "another_advisory_record", - "yet_another_advisory_record", - ], - ) - advisory_record = AdvisoryRecord(vulnerability_id="test_advisory_record") - result = extract_references_vuln_id(commit, advisory_record) - assert result is True - - -def test_time_between_commit_and_advisory_record(): - commit = Commit( - commit_id="test_commit", repository="test_repository", timestamp=142 - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", published_timestamp=100 - ) - assert ( - extract_time_between_commit_and_advisory_record(commit, advisory_record) == 42 - ) +# def test_extract_references_vuln_id(): +# commit = Commit( 
+# commit_id="test_commit", +# repository="test_repository", +# cve_refs=[ +# "test_advisory_record", +# "another_advisory_record", +# "yet_another_advisory_record", +# ], +# ) +# advisory_record = AdvisoryRecord(vulnerability_id="test_advisory_record") +# result = extract_references_vuln_id(commit, advisory_record) +# assert result is True -@pytest.fixture -def paths(): - return [ - "fire-nation/zuko/lightning.png", - "water-bending/katara/necklace.gif", - "air-nomad/aang/littlefoot.jpg", - "earth-kingdom/toph/metal.png", - ] +# def test_time_between_commit_and_advisory_record(): +# commit = Commit( +# commit_id="test_commit", repository="test_repository", timestamp=142 +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", published_timestamp=100 +# ) +# assert ( +# extract_time_between_commit_and_advisory_record(commit, advisory_record) +# == 42 +# ) -@pytest.fixture -def sub_paths(): - return [ - "lightning.png", - "zuko/lightning.png", - "fire-nation/zuko", - "water-bending", - ] - - -class TestExtractChangedRelevantPaths: - @staticmethod - def test_sub_path_matching(paths, sub_paths): - commit = Commit( - commit_id="test_commit", repository="test_repository", changed_files=paths - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", paths=sub_paths - ) - - matched_paths = { - "fire-nation/zuko/lightning.png", - "water-bending/katara/necklace.gif", - } - - assert extract_changed_relevant_paths(commit, advisory_record) == matched_paths - - @staticmethod - def test_same_path_only(paths): - commit = Commit( - commit_id="test_commit", repository="test_repository", changed_files=paths - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", paths=paths[:2] - ) - assert extract_changed_relevant_paths(commit, advisory_record) == set(paths[:2]) - - @staticmethod - def test_same_path_and_others(paths): - commit = Commit( - commit_id="test_commit", - repository="test_repository", - 
changed_files=[paths[0]], - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", paths=paths[:2] - ) - assert extract_changed_relevant_paths(commit, advisory_record) == { - paths[0], - } - - @staticmethod - def test_no_match(paths): - commit = Commit( - commit_id="test_commit", - repository="test_repository", - changed_files=paths[:1], - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", paths=paths[2:] - ) - assert extract_changed_relevant_paths(commit, advisory_record) == set() - - @staticmethod - def test_empty_list(paths): - commit = Commit( - commit_id="test_commit", repository="test_repository", changed_files=[] - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", paths=paths - ) - assert extract_changed_relevant_paths(commit, advisory_record) == set() - - commit = Commit( - commit_id="test_commit", - repository="test_repository", - changed_files=paths, - ) - advisory_record = AdvisoryRecord( - vulnerability_id="test_advisory_record", paths=[] - ) - assert extract_changed_relevant_paths(commit, advisory_record) == set() - - -def test_extract_other_CVE_in_message(): - commit = Commit( - commit_id="test_commit", - repository="test_repository", - cve_refs=["CVE-2021-29425", "CVE-2021-21251"], - ) - advisory_record = AdvisoryRecord(vulnerability_id="CVE-2020-31284") - assert extract_other_CVE_in_message(commit, advisory_record) == { - "CVE-2021-29425": "", - "CVE-2021-21251": "", - } - advisory_record = AdvisoryRecord(vulnerability_id="CVE-2021-29425") - result = extract_other_CVE_in_message(commit, advisory_record) - assert result == {"CVE-2021-21251": ""} - - -def test_is_commit_in_given_interval(): - assert is_commit_in_given_interval(1359961896, 1359961896, 0) - assert is_commit_in_given_interval(1359961896, 1360047896, 1) - assert is_commit_in_given_interval(1359961896, 1359875896, -1) - assert not is_commit_in_given_interval(1359961896, 1359871896, -1) - assert not 
is_commit_in_given_interval(1359961896, 1360051896, 1) - - -def test_extract_referred_to_by_nvd(repository): - advisory_record = AdvisoryRecord( - vulnerability_id="CVE-2020-26258", - references=[ - "https://lists.apache.org/thread.html/r97993e3d78e1f5389b7b172ba9f308440830ce5f051ee62714a0aa34@%3Ccommits.struts.apache.org%3E", - "https://other.com", - ], - ) - - commit = Commit( - commit_id="r97993e3d78e1f5389b7b172ba9f308440830ce5", - repository="test_repository", - ) - assert extract_referred_to_by_nvd(commit, advisory_record) == { - "https://lists.apache.org/thread.html/r97993e3d78e1f5389b7b172ba9f308440830ce5f051ee62714a0aa34@%3Ccommits.struts.apache.org%3E", - } - - commit = Commit( - commit_id="f4d2eabd921cbd8808b9d923ee63d44538b4154f", - repository="test_repository", - ) - assert extract_referred_to_by_nvd(commit, advisory_record) == set() - - -def test_is_commit_reachable_from_given_tag(repository): - - repo = repository - raw_commit = repo.get_commit("7532d2fb0d6081a12c2a48ec854a81a8b718be62") - commit = make_from_raw_commit(raw_commit) - - advisory_record = AdvisoryRecord( - vulnerability_id="CVE-2020-26258", - repository_url="https://github.com/apache/struts", - paths=["pom.xml"], - published_timestamp=1000000, - versions=["STRUTS_2_1_3", "STRUTS_2_3_9"], - ) - - assert not is_commit_reachable_from_given_tag( - commit, advisory_record, advisory_record.versions[0] - ) - - assert is_commit_reachable_from_given_tag( - make_from_raw_commit( - repo.get_commit("2e19fc6670a70c13c08a3ed0927abc7366308bb1") - ), - advisory_record, - advisory_record.versions[1], - ) - - -def test_extract_extract_commit_mentioned_in_linked_pages(repository, requests_mock): - requests_mock.get( - "https://for.testing.purposes/containing_commit_id_in_text_2", - text="some text r97993e3d78e1f5389b7b172ba9f308440830ce5 blah", - ) - - advisory_record = AdvisoryRecord( - vulnerability_id="CVE-2020-26258", - references=["https://for.testing.purposes/containing_commit_id_in_text_2"], - ) - 
- advisory_record.analyze(fetch_references=True) - - commit = Commit( - commit_id="r97993e3d78e1f5389b7b172ba9f308440830ce5", - repository="test_repository", - ) - assert extract_commit_mentioned_in_linked_pages(commit, advisory_record) == 1 - - commit = Commit( - commit_id="f4d2eabd921cbd8808b9d923ee63d44538b4154f", - repository="test_repository", - ) - assert extract_commit_mentioned_in_linked_pages(commit, advisory_record) == 0 +# @pytest.fixture +# def paths(): +# return [ +# "fire-nation/zuko/lightning.png", +# "water-bending/katara/necklace.gif", +# "air-nomad/aang/littlefoot.jpg", +# "earth-kingdom/toph/metal.png", +# ] + + +# @pytest.fixture +# def sub_paths(): +# return [ +# "lightning.png", +# "zuko/lightning.png", +# "fire-nation/zuko", +# "water-bending", +# ] + + +# class TestExtractChangedRelevantPaths: +# @staticmethod +# def test_sub_path_matching(paths, sub_paths): +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# changed_files=paths, +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", paths=sub_paths +# ) + +# matched_paths = { +# "fire-nation/zuko/lightning.png", +# "water-bending/katara/necklace.gif", +# } + +# assert ( +# extract_changed_relevant_paths(commit, advisory_record) +# == matched_paths +# ) + +# @staticmethod +# def test_same_path_only(paths): +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# changed_files=paths, +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", paths=paths[:2] +# ) +# assert extract_changed_relevant_paths(commit, advisory_record) == set( +# paths[:2] +# ) + +# @staticmethod +# def test_same_path_and_others(paths): +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# changed_files=[paths[0]], +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", paths=paths[:2] +# ) +# assert extract_changed_relevant_paths(commit, 
advisory_record) == { +# paths[0], +# } + +# @staticmethod +# def test_no_match(paths): +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# changed_files=paths[:1], +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", paths=paths[2:] +# ) +# assert extract_changed_relevant_paths(commit, advisory_record) == set() + +# @staticmethod +# def test_empty_list(paths): +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# changed_files=[], +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", paths=paths +# ) +# assert extract_changed_relevant_paths(commit, advisory_record) == set() + +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# changed_files=paths, +# ) +# advisory_record = AdvisoryRecord( +# vulnerability_id="test_advisory_record", paths=[] +# ) +# assert extract_changed_relevant_paths(commit, advisory_record) == set() + + +# def test_extract_other_CVE_in_message(): +# commit = Commit( +# commit_id="test_commit", +# repository="test_repository", +# cve_refs=["CVE-2021-29425", "CVE-2021-21251"], +# ) +# advisory_record = AdvisoryRecord(vulnerability_id="CVE-2020-31284") +# assert extract_other_CVE_in_message(commit, advisory_record) == { +# "CVE-2021-29425": "", +# "CVE-2021-21251": "", +# } +# advisory_record = AdvisoryRecord(vulnerability_id="CVE-2021-29425") +# result = extract_other_CVE_in_message(commit, advisory_record) +# assert result == {"CVE-2021-21251": ""} + + +# def test_is_commit_in_given_interval(): +# assert is_commit_in_given_interval(1359961896, 1359961896, 0) +# assert is_commit_in_given_interval(1359961896, 1360047896, 1) +# assert is_commit_in_given_interval(1359961896, 1359875896, -1) +# assert not is_commit_in_given_interval(1359961896, 1359871896, -1) +# assert not is_commit_in_given_interval(1359961896, 1360051896, 1) + + +# def test_extract_referred_to_by_nvd(repository): +# 
advisory_record = AdvisoryRecord( +# vulnerability_id="CVE-2020-26258", +# references=[ +# "https://lists.apache.org/thread.html/r97993e3d78e1f5389b7b172ba9f308440830ce5f051ee62714a0aa34@%3Ccommits.struts.apache.org%3E", +# "https://other.com", +# ], +# ) + +# commit = Commit( +# commit_id="r97993e3d78e1f5389b7b172ba9f308440830ce5", +# repository="test_repository", +# ) +# assert extract_referred_to_by_nvd(commit, advisory_record) == { +# "https://lists.apache.org/thread.html/r97993e3d78e1f5389b7b172ba9f308440830ce5f051ee62714a0aa34@%3Ccommits.struts.apache.org%3E", +# } + +# commit = Commit( +# commit_id="f4d2eabd921cbd8808b9d923ee63d44538b4154f", +# repository="test_repository", +# ) +# assert extract_referred_to_by_nvd(commit, advisory_record) == set() + + +# def test_is_commit_reachable_from_given_tag(repository): + +# repo = repository +# raw_commit = repo.get_commit("7532d2fb0d6081a12c2a48ec854a81a8b718be62") +# commit = make_from_raw_commit(raw_commit) + +# advisory_record = AdvisoryRecord( +# vulnerability_id="CVE-2020-26258", +# repository_url="https://github.com/apache/struts", +# paths=["pom.xml"], +# published_timestamp=1000000, +# versions=["STRUTS_2_1_3", "STRUTS_2_3_9"], +# ) + +# assert not is_commit_reachable_from_given_tag( +# commit, advisory_record, advisory_record.versions[0] +# ) + +# assert is_commit_reachable_from_given_tag( +# make_from_raw_commit( +# repo.get_commit("2e19fc6670a70c13c08a3ed0927abc7366308bb1") +# ), +# advisory_record, +# advisory_record.versions[1], +# ) + + +# def test_extract_extract_commit_mentioned_in_linked_pages( +# repository, requests_mock +# ): +# requests_mock.get( +# "https://for.testing.purposes/containing_commit_id_in_text_2", +# text="some text r97993e3d78e1f5389b7b172ba9f308440830ce5 blah", +# ) + +# advisory_record = AdvisoryRecord( +# vulnerability_id="CVE-2020-26258", +# references=[ +# "https://for.testing.purposes/containing_commit_id_in_text_2" +# ], +# ) + +# 
advisory_record.analyze(fetch_references=True) + +# commit = Commit( +# commit_id="r97993e3d78e1f5389b7b172ba9f308440830ce5", +# repository="test_repository", +# ) +# assert ( +# extract_commit_mentioned_in_linked_pages(commit, advisory_record) == 1 +# ) + +# commit = Commit( +# commit_id="f4d2eabd921cbd8808b9d923ee63d44538b4154f", +# repository="test_repository", +# ) +# assert ( +# extract_commit_mentioned_in_linked_pages(commit, advisory_record) == 0 +# ) # def test_extract_referred_to_by_pages_linked_from_advisories_wrong_url(repository): @@ -306,137 +318,139 @@ def test_extract_extract_commit_mentioned_in_linked_pages(repository, requests_m # ) -def test_extract_path_similarities(): - keywords = [ - "TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato", - "Bolin+Bumi+Ozai+Katara", - "Jinora.Appa.Unalaq.Zaheer", - "Naga.LinBeifong", - "Sokka.Kya", - "Bumi=Momo=Naga=Iroh", - "Sokka_Unalaq", - "Sokka.Iroh.Pabu", - "LinBeifong=Zuko", - "TenzinBolinSokka", - "Korra-AsamiSato-Pabu-Iroh", - "Mako.Naga", - "Jinora=Bumi", - "BolinAppaKuvira", - "TophBeifongIroh", - "Amon+Zuko+Unalaq", - ] - paths = [ - "Unalaq/Aang/Suyin Beifong", - "Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer", - "Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko", - "Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi", - "Momo", - "Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq", - ] - commit = Commit(changed_files=paths) - advisory = AdvisoryRecord( - vulnerability_id=list(random_list_of_cve(max_count=1, min_count=1).keys())[0], - keywords=keywords, - ) - similarities: pandas.DataFrame = extract_path_similarities(commit, advisory) - expected = ( - ",changed file,code token,jaccard,sorensen-dice,otsuka-ochiai,levenshtein,damerau-levenshtein,length diff,inverted normalized levenshtein,inverted normalized damerau-levenshtein\n" - "0,Unalaq/Aang/Suyin 
Beifong,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.09090909090909091,0.16666666666666666,0.17677669529663687,8,8,4,0.19999999999999996,0.19999999999999996\n" - "1,Unalaq/Aang/Suyin Beifong,Bolin+Bumi+Ozai+Katara,0.0,0.0,0.0,4,4,0,0.6,0.6\n" - "2,Unalaq/Aang/Suyin Beifong,Jinora.Appa.Unalaq.Zaheer,0.14285714285714285,0.25,0.25,4,4,0,0.6,0.6\n" - "3,Unalaq/Aang/Suyin Beifong,Naga.LinBeifong,0.16666666666666666,0.2857142857142857,0.2886751345948129,3,3,1,0.7,0.7\n" - "4,Unalaq/Aang/Suyin Beifong,Sokka.Kya,0.0,0.0,0.0,4,4,2,0.6,0.6\n" - "5,Unalaq/Aang/Suyin Beifong,Bumi=Momo=Naga=Iroh,0.0,0.0,0.0,4,4,0,0.6,0.6\n" - "6,Unalaq/Aang/Suyin Beifong,Sokka_Unalaq,0.2,0.3333333333333333,0.35355339059327373,4,4,2,0.6,0.6\n" - "7,Unalaq/Aang/Suyin Beifong,Sokka.Iroh.Pabu,0.0,0.0,0.0,4,4,1,0.6,0.6\n" - "8,Unalaq/Aang/Suyin Beifong,LinBeifong=Zuko,0.16666666666666666,0.2857142857142857,0.2886751345948129,4,4,1,0.6,0.6\n" - "9,Unalaq/Aang/Suyin Beifong,TenzinBolinSokka,0.0,0.0,0.0,4,4,1,0.6,0.6\n" - "10,Unalaq/Aang/Suyin Beifong,Korra-AsamiSato-Pabu-Iroh,0.0,0.0,0.0,5,5,1,0.5,0.5\n" - "11,Unalaq/Aang/Suyin Beifong,Mako.Naga,0.0,0.0,0.0,4,4,2,0.6,0.6\n" - "12,Unalaq/Aang/Suyin Beifong,Jinora=Bumi,0.0,0.0,0.0,4,4,2,0.6,0.6\n" - "13,Unalaq/Aang/Suyin Beifong,BolinAppaKuvira,0.0,0.0,0.0,4,4,1,0.6,0.6\n" - "14,Unalaq/Aang/Suyin Beifong,TophBeifongIroh,0.16666666666666666,0.2857142857142857,0.2886751345948129,4,4,1,0.6,0.6\n" - "15,Unalaq/Aang/Suyin Beifong,Amon+Zuko+Unalaq,0.16666666666666666,0.2857142857142857,0.2886751345948129,4,4,1,0.6,0.6\n" - "16,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.25,0.4,0.4008918628686366,8,8,0,0.19999999999999996,0.19999999999999996\n" - "17,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Bolin+Bumi+Ozai+Katara,0.1,0.18181818181818182,0.1889822365046136,8,8,4,0.19999999999999996,0.19999999999999996\n" - "18,Tenzin/Asami Sato/Suyin 
Beifong/Tenzin/Bumi/Zaheer,Jinora.Appa.Unalaq.Zaheer,0.1,0.18181818181818182,0.1889822365046136,7,7,4,0.30000000000000004,0.30000000000000004\n" - "19,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Naga.LinBeifong,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" - "20,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Sokka.Kya,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" - "21,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Bumi=Momo=Naga=Iroh,0.1,0.18181818181818182,0.1889822365046136,8,8,4,0.19999999999999996,0.19999999999999996\n" - "22,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Sokka_Unalaq,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" - "23,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Sokka.Iroh.Pabu,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" - "24,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,LinBeifong=Zuko,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" - "25,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,TenzinBolinSokka,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" - "26,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Korra-AsamiSato-Pabu-Iroh,0.2,0.3333333333333333,0.3380617018914066,6,6,3,0.4,0.4\n" - "27,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Mako.Naga,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" - "28,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Jinora=Bumi,0.125,0.2222222222222222,0.2672612419124244,7,7,6,0.30000000000000004,0.30000000000000004\n" - "29,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,BolinAppaKuvira,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" - "30,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,TophBeifongIroh,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" - "31,Tenzin/Asami Sato/Suyin 
Beifong/Tenzin/Bumi/Zaheer,Amon+Zuko+Unalaq,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" - "32,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.23076923076923078,0.375,0.375,8,8,0,0.19999999999999996,0.19999999999999996\n" - "33,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Bolin+Bumi+Ozai+Katara,0.09090909090909091,0.16666666666666666,0.17677669529663687,7,7,4,0.30000000000000004,0.30000000000000004\n" - "34,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Jinora.Appa.Unalaq.Zaheer,0.0,0.0,0.0,8,8,4,0.19999999999999996,0.19999999999999996\n" - "35,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Naga.LinBeifong,0.1,0.18181818181818182,0.20412414523193154,8,8,5,0.19999999999999996,0.19999999999999996\n" - "36,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Sokka.Kya,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" - "37,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Bumi=Momo=Naga=Iroh,0.09090909090909091,0.16666666666666666,0.17677669529663687,7,7,4,0.30000000000000004,0.30000000000000004\n" - "38,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Sokka_Unalaq,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" - "39,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Sokka.Iroh.Pabu,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" - "40,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,LinBeifong=Zuko,0.1,0.18181818181818182,0.20412414523193154,7,7,5,0.30000000000000004,0.30000000000000004\n" - "41,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,TenzinBolinSokka,0.1,0.18181818181818182,0.20412414523193154,7,7,5,0.30000000000000004,0.30000000000000004\n" - "42,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Korra-AsamiSato-Pabu-Iroh,0.18181818181818182,0.3076923076923077,0.31622776601683794,7,7,3,0.30000000000000004,0.30000000000000004\n" - "43,Asami 
Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Mako.Naga,0.1111111111111111,0.2,0.25,7,7,6,0.30000000000000004,0.30000000000000004\n" - "44,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Jinora=Bumi,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" - "45,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,BolinAppaKuvira,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" - "46,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,TophBeifongIroh,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" - "47,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Amon+Zuko+Unalaq,0.1,0.18181818181818182,0.20412414523193154,8,8,5,0.19999999999999996,0.19999999999999996\n" - "48,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.3333333333333333,0.5,0.5,9,9,1,0.09999999999999998,0.09999999999999998\n" - "49,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Bolin+Bumi+Ozai+Katara,0.2,0.3333333333333333,0.35355339059327373,8,8,5,0.19999999999999996,0.19999999999999996\n" - "50,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Jinora.Appa.Unalaq.Zaheer,0.0,0.0,0.0,9,9,5,0.09999999999999998,0.09999999999999998\n" - "51,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Naga.LinBeifong,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" - "52,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Sokka.Kya,0.0,0.0,0.0,9,9,7,0.09999999999999998,0.09999999999999998\n" - "53,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Bumi=Momo=Naga=Iroh,0.09090909090909091,0.16666666666666666,0.17677669529663687,8,8,5,0.19999999999999996,0.19999999999999996\n" - "54,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Sokka_Unalaq,0.0,0.0,0.0,9,9,7,0.09999999999999998,0.09999999999999998\n" - "55,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Sokka.Iroh.Pabu,0.0,0.0,0.0,9,9,6,0.09999999999999998,0.09999999999999998\n" - "56,Amon/Asami Sato/Bumi/Kuvira/Toph 
Beifong/Bolin/Bumi,LinBeifong=Zuko,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" - "57,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,TenzinBolinSokka,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" - "58,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Korra-AsamiSato-Pabu-Iroh,0.18181818181818182,0.3076923076923077,0.31622776601683794,7,7,4,0.30000000000000004,0.30000000000000004\n" - "59,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Mako.Naga,0.0,0.0,0.0,9,9,7,0.09999999999999998,0.09999999999999998\n" - "60,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Jinora=Bumi,0.1111111111111111,0.2,0.25,8,8,7,0.19999999999999996,0.19999999999999996\n" - "61,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,BolinAppaKuvira,0.2222222222222222,0.36363636363636365,0.4082482904638631,8,8,6,0.19999999999999996,0.19999999999999996\n" - "62,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,TophBeifongIroh,0.2222222222222222,0.36363636363636365,0.4082482904638631,7,7,6,0.30000000000000004,0.30000000000000004\n" - "63,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Amon+Zuko+Unalaq,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" - "64,Momo,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.0,0.0,0.0,8,8,7,0.19999999999999996,0.19999999999999996\n" - "65,Momo,Bolin+Bumi+Ozai+Katara,0.0,0.0,0.0,4,4,3,0.6,0.6\n" - "66,Momo,Jinora.Appa.Unalaq.Zaheer,0.0,0.0,0.0,4,4,3,0.6,0.6\n" - "67,Momo,Naga.LinBeifong,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "68,Momo,Sokka.Kya,0.0,0.0,0.0,2,2,1,0.8,0.8\n" - "69,Momo,Bumi=Momo=Naga=Iroh,0.25,0.4,0.5,3,3,3,0.7,0.7\n" - "70,Momo,Sokka_Unalaq,0.0,0.0,0.0,2,2,1,0.8,0.8\n" - "71,Momo,Sokka.Iroh.Pabu,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "72,Momo,LinBeifong=Zuko,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "73,Momo,TenzinBolinSokka,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "74,Momo,Korra-AsamiSato-Pabu-Iroh,0.0,0.0,0.0,5,5,4,0.5,0.5\n" 
- "75,Momo,Mako.Naga,0.0,0.0,0.0,2,2,1,0.8,0.8\n" - "76,Momo,Jinora=Bumi,0.0,0.0,0.0,2,2,1,0.8,0.8\n" - "77,Momo,BolinAppaKuvira,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "78,Momo,TophBeifongIroh,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "79,Momo,Amon+Zuko+Unalaq,0.0,0.0,0.0,3,3,2,0.7,0.7\n" - "80,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.13333333333333333,0.23529411764705882,0.23570226039551587,9,9,2,0.09999999999999998,0.09999999999999998\n" - "81,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Bolin+Bumi+Ozai+Katara,0.08333333333333333,0.15384615384615385,0.16666666666666666,9,9,6,0.09999999999999998,0.09999999999999998\n" - "82,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Jinora.Appa.Unalaq.Zaheer,0.08333333333333333,0.15384615384615385,0.16666666666666666,10,10,6,0.0,0.0\n" - "83,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Naga.LinBeifong,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" - "84,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Sokka.Kya,0.1,0.18181818181818182,0.23570226039551587,9,9,8,0.09999999999999998,0.09999999999999998\n" - "85,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Bumi=Momo=Naga=Iroh,0.0,0.0,0.0,10,10,6,0.0,0.0\n" - "86,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Sokka_Unalaq,0.2222222222222222,0.36363636363636365,0.47140452079103173,8,8,8,0.19999999999999996,0.19999999999999996\n" - "87,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Sokka.Iroh.Pabu,0.09090909090909091,0.16666666666666666,0.19245008972987526,9,9,7,0.09999999999999998,0.09999999999999998\n" - "88,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,LinBeifong=Zuko,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" - "89,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph 
Beifong/Unalaq,TenzinBolinSokka,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" - "90,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Korra-AsamiSato-Pabu-Iroh,0.07692307692307693,0.14285714285714285,0.14907119849998599,10,10,5,0.0,0.0\n" - "91,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Mako.Naga,0.1,0.18181818181818182,0.23570226039551587,9,9,8,0.09999999999999998,0.09999999999999998\n" - "92,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Jinora=Bumi,0.0,0.0,0.0,10,10,8,0.0,0.0\n" - "93,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,BolinAppaKuvira,0.2,0.3333333333333333,0.3849001794597505,9,9,7,0.09999999999999998,0.09999999999999998\n" - "94,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,TophBeifongIroh,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" - "95,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Amon+Zuko+Unalaq,0.09090909090909091,0.16666666666666666,0.19245008972987526,9,9,7,0.09999999999999998,0.09999999999999998\n" - ) - - assert similarities.to_csv() == expected +# def test_extract_path_similarities(): +# keywords = [ +# "TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato", +# "Bolin+Bumi+Ozai+Katara", +# "Jinora.Appa.Unalaq.Zaheer", +# "Naga.LinBeifong", +# "Sokka.Kya", +# "Bumi=Momo=Naga=Iroh", +# "Sokka_Unalaq", +# "Sokka.Iroh.Pabu", +# "LinBeifong=Zuko", +# "TenzinBolinSokka", +# "Korra-AsamiSato-Pabu-Iroh", +# "Mako.Naga", +# "Jinora=Bumi", +# "BolinAppaKuvira", +# "TophBeifongIroh", +# "Amon+Zuko+Unalaq", +# ] +# paths = [ +# "Unalaq/Aang/Suyin Beifong", +# "Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer", +# "Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko", +# "Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi", +# "Momo", +# "Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq", +# ] +# commit = Commit(changed_files=paths) +# advisory = 
AdvisoryRecord( +# vulnerability_id=list( +# random_list_of_cve(max_count=1, min_count=1).keys() +# )[0], +# keywords=keywords, +# ) +# similarities: pandas.DataFrame = extract_path_similarities(commit, advisory) +# expected = ( +# ",changed file,code token,jaccard,sorensen-dice,otsuka-ochiai,levenshtein,damerau-levenshtein,length diff,inverted normalized levenshtein,inverted normalized damerau-levenshtein\n" +# "0,Unalaq/Aang/Suyin Beifong,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.09090909090909091,0.16666666666666666,0.17677669529663687,8,8,4,0.19999999999999996,0.19999999999999996\n" +# "1,Unalaq/Aang/Suyin Beifong,Bolin+Bumi+Ozai+Katara,0.0,0.0,0.0,4,4,0,0.6,0.6\n" +# "2,Unalaq/Aang/Suyin Beifong,Jinora.Appa.Unalaq.Zaheer,0.14285714285714285,0.25,0.25,4,4,0,0.6,0.6\n" +# "3,Unalaq/Aang/Suyin Beifong,Naga.LinBeifong,0.16666666666666666,0.2857142857142857,0.2886751345948129,3,3,1,0.7,0.7\n" +# "4,Unalaq/Aang/Suyin Beifong,Sokka.Kya,0.0,0.0,0.0,4,4,2,0.6,0.6\n" +# "5,Unalaq/Aang/Suyin Beifong,Bumi=Momo=Naga=Iroh,0.0,0.0,0.0,4,4,0,0.6,0.6\n" +# "6,Unalaq/Aang/Suyin Beifong,Sokka_Unalaq,0.2,0.3333333333333333,0.35355339059327373,4,4,2,0.6,0.6\n" +# "7,Unalaq/Aang/Suyin Beifong,Sokka.Iroh.Pabu,0.0,0.0,0.0,4,4,1,0.6,0.6\n" +# "8,Unalaq/Aang/Suyin Beifong,LinBeifong=Zuko,0.16666666666666666,0.2857142857142857,0.2886751345948129,4,4,1,0.6,0.6\n" +# "9,Unalaq/Aang/Suyin Beifong,TenzinBolinSokka,0.0,0.0,0.0,4,4,1,0.6,0.6\n" +# "10,Unalaq/Aang/Suyin Beifong,Korra-AsamiSato-Pabu-Iroh,0.0,0.0,0.0,5,5,1,0.5,0.5\n" +# "11,Unalaq/Aang/Suyin Beifong,Mako.Naga,0.0,0.0,0.0,4,4,2,0.6,0.6\n" +# "12,Unalaq/Aang/Suyin Beifong,Jinora=Bumi,0.0,0.0,0.0,4,4,2,0.6,0.6\n" +# "13,Unalaq/Aang/Suyin Beifong,BolinAppaKuvira,0.0,0.0,0.0,4,4,1,0.6,0.6\n" +# "14,Unalaq/Aang/Suyin Beifong,TophBeifongIroh,0.16666666666666666,0.2857142857142857,0.2886751345948129,4,4,1,0.6,0.6\n" +# "15,Unalaq/Aang/Suyin 
Beifong,Amon+Zuko+Unalaq,0.16666666666666666,0.2857142857142857,0.2886751345948129,4,4,1,0.6,0.6\n" +# "16,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.25,0.4,0.4008918628686366,8,8,0,0.19999999999999996,0.19999999999999996\n" +# "17,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Bolin+Bumi+Ozai+Katara,0.1,0.18181818181818182,0.1889822365046136,8,8,4,0.19999999999999996,0.19999999999999996\n" +# "18,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Jinora.Appa.Unalaq.Zaheer,0.1,0.18181818181818182,0.1889822365046136,7,7,4,0.30000000000000004,0.30000000000000004\n" +# "19,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Naga.LinBeifong,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" +# "20,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Sokka.Kya,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "21,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Bumi=Momo=Naga=Iroh,0.1,0.18181818181818182,0.1889822365046136,8,8,4,0.19999999999999996,0.19999999999999996\n" +# "22,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Sokka_Unalaq,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "23,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Sokka.Iroh.Pabu,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "24,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,LinBeifong=Zuko,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" +# "25,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,TenzinBolinSokka,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" +# "26,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Korra-AsamiSato-Pabu-Iroh,0.2,0.3333333333333333,0.3380617018914066,6,6,3,0.4,0.4\n" +# "27,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Mako.Naga,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "28,Tenzin/Asami Sato/Suyin 
Beifong/Tenzin/Bumi/Zaheer,Jinora=Bumi,0.125,0.2222222222222222,0.2672612419124244,7,7,6,0.30000000000000004,0.30000000000000004\n" +# "29,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,BolinAppaKuvira,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "30,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,TophBeifongIroh,0.1111111111111111,0.2,0.2182178902359924,7,7,5,0.30000000000000004,0.30000000000000004\n" +# "31,Tenzin/Asami Sato/Suyin Beifong/Tenzin/Bumi/Zaheer,Amon+Zuko+Unalaq,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "32,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.23076923076923078,0.375,0.375,8,8,0,0.19999999999999996,0.19999999999999996\n" +# "33,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Bolin+Bumi+Ozai+Katara,0.09090909090909091,0.16666666666666666,0.17677669529663687,7,7,4,0.30000000000000004,0.30000000000000004\n" +# "34,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Jinora.Appa.Unalaq.Zaheer,0.0,0.0,0.0,8,8,4,0.19999999999999996,0.19999999999999996\n" +# "35,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Naga.LinBeifong,0.1,0.18181818181818182,0.20412414523193154,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "36,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Sokka.Kya,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "37,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Bumi=Momo=Naga=Iroh,0.09090909090909091,0.16666666666666666,0.17677669529663687,7,7,4,0.30000000000000004,0.30000000000000004\n" +# "38,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Sokka_Unalaq,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "39,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Sokka.Iroh.Pabu,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "40,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,LinBeifong=Zuko,0.1,0.18181818181818182,0.20412414523193154,7,7,5,0.30000000000000004,0.30000000000000004\n" 
+# "41,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,TenzinBolinSokka,0.1,0.18181818181818182,0.20412414523193154,7,7,5,0.30000000000000004,0.30000000000000004\n" +# "42,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Korra-AsamiSato-Pabu-Iroh,0.18181818181818182,0.3076923076923077,0.31622776601683794,7,7,3,0.30000000000000004,0.30000000000000004\n" +# "43,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Mako.Naga,0.1111111111111111,0.2,0.25,7,7,6,0.30000000000000004,0.30000000000000004\n" +# "44,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Jinora=Bumi,0.0,0.0,0.0,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "45,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,BolinAppaKuvira,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "46,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,TophBeifongIroh,0.0,0.0,0.0,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "47,Asami Sato/Tenzin/Tonraq/Katara/Tarrlok/Naga/Zuko,Amon+Zuko+Unalaq,0.1,0.18181818181818182,0.20412414523193154,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "48,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.3333333333333333,0.5,0.5,9,9,1,0.09999999999999998,0.09999999999999998\n" +# "49,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Bolin+Bumi+Ozai+Katara,0.2,0.3333333333333333,0.35355339059327373,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "50,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Jinora.Appa.Unalaq.Zaheer,0.0,0.0,0.0,9,9,5,0.09999999999999998,0.09999999999999998\n" +# "51,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Naga.LinBeifong,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "52,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Sokka.Kya,0.0,0.0,0.0,9,9,7,0.09999999999999998,0.09999999999999998\n" +# "53,Amon/Asami Sato/Bumi/Kuvira/Toph 
Beifong/Bolin/Bumi,Bumi=Momo=Naga=Iroh,0.09090909090909091,0.16666666666666666,0.17677669529663687,8,8,5,0.19999999999999996,0.19999999999999996\n" +# "54,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Sokka_Unalaq,0.0,0.0,0.0,9,9,7,0.09999999999999998,0.09999999999999998\n" +# "55,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Sokka.Iroh.Pabu,0.0,0.0,0.0,9,9,6,0.09999999999999998,0.09999999999999998\n" +# "56,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,LinBeifong=Zuko,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "57,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,TenzinBolinSokka,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "58,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Korra-AsamiSato-Pabu-Iroh,0.18181818181818182,0.3076923076923077,0.31622776601683794,7,7,4,0.30000000000000004,0.30000000000000004\n" +# "59,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Mako.Naga,0.0,0.0,0.0,9,9,7,0.09999999999999998,0.09999999999999998\n" +# "60,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Jinora=Bumi,0.1111111111111111,0.2,0.25,8,8,7,0.19999999999999996,0.19999999999999996\n" +# "61,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,BolinAppaKuvira,0.2222222222222222,0.36363636363636365,0.4082482904638631,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "62,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,TophBeifongIroh,0.2222222222222222,0.36363636363636365,0.4082482904638631,7,7,6,0.30000000000000004,0.30000000000000004\n" +# "63,Amon/Asami Sato/Bumi/Kuvira/Toph Beifong/Bolin/Bumi,Amon+Zuko+Unalaq,0.1,0.18181818181818182,0.20412414523193154,8,8,6,0.19999999999999996,0.19999999999999996\n" +# "64,Momo,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.0,0.0,0.0,8,8,7,0.19999999999999996,0.19999999999999996\n" +# "65,Momo,Bolin+Bumi+Ozai+Katara,0.0,0.0,0.0,4,4,3,0.6,0.6\n" +# 
"66,Momo,Jinora.Appa.Unalaq.Zaheer,0.0,0.0,0.0,4,4,3,0.6,0.6\n" +# "67,Momo,Naga.LinBeifong,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "68,Momo,Sokka.Kya,0.0,0.0,0.0,2,2,1,0.8,0.8\n" +# "69,Momo,Bumi=Momo=Naga=Iroh,0.25,0.4,0.5,3,3,3,0.7,0.7\n" +# "70,Momo,Sokka_Unalaq,0.0,0.0,0.0,2,2,1,0.8,0.8\n" +# "71,Momo,Sokka.Iroh.Pabu,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "72,Momo,LinBeifong=Zuko,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "73,Momo,TenzinBolinSokka,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "74,Momo,Korra-AsamiSato-Pabu-Iroh,0.0,0.0,0.0,5,5,4,0.5,0.5\n" +# "75,Momo,Mako.Naga,0.0,0.0,0.0,2,2,1,0.8,0.8\n" +# "76,Momo,Jinora=Bumi,0.0,0.0,0.0,2,2,1,0.8,0.8\n" +# "77,Momo,BolinAppaKuvira,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "78,Momo,TophBeifongIroh,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "79,Momo,Amon+Zuko+Unalaq,0.0,0.0,0.0,3,3,2,0.7,0.7\n" +# "80,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,TophBeifong_Zuko_IknikBlackstoneVarrick_AsamiSato,0.13333333333333333,0.23529411764705882,0.23570226039551587,9,9,2,0.09999999999999998,0.09999999999999998\n" +# "81,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Bolin+Bumi+Ozai+Katara,0.08333333333333333,0.15384615384615385,0.16666666666666666,9,9,6,0.09999999999999998,0.09999999999999998\n" +# "82,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Jinora.Appa.Unalaq.Zaheer,0.08333333333333333,0.15384615384615385,0.16666666666666666,10,10,6,0.0,0.0\n" +# "83,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Naga.LinBeifong,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" +# "84,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Sokka.Kya,0.1,0.18181818181818182,0.23570226039551587,9,9,8,0.09999999999999998,0.09999999999999998\n" +# "85,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Bumi=Momo=Naga=Iroh,0.0,0.0,0.0,10,10,6,0.0,0.0\n" +# "86,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph 
Beifong/Unalaq,Sokka_Unalaq,0.2222222222222222,0.36363636363636365,0.47140452079103173,8,8,8,0.19999999999999996,0.19999999999999996\n" +# "87,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Sokka.Iroh.Pabu,0.09090909090909091,0.16666666666666666,0.19245008972987526,9,9,7,0.09999999999999998,0.09999999999999998\n" +# "88,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,LinBeifong=Zuko,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" +# "89,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,TenzinBolinSokka,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" +# "90,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Korra-AsamiSato-Pabu-Iroh,0.07692307692307693,0.14285714285714285,0.14907119849998599,10,10,5,0.0,0.0\n" +# "91,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Mako.Naga,0.1,0.18181818181818182,0.23570226039551587,9,9,8,0.09999999999999998,0.09999999999999998\n" +# "92,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Jinora=Bumi,0.0,0.0,0.0,10,10,8,0.0,0.0\n" +# "93,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,BolinAppaKuvira,0.2,0.3333333333333333,0.3849001794597505,9,9,7,0.09999999999999998,0.09999999999999998\n" +# "94,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,TophBeifongIroh,0.2,0.3333333333333333,0.3849001794597505,8,8,7,0.19999999999999996,0.19999999999999996\n" +# "95,Kuvira/Bolin/Lin Beifong/Sokka/Mako/Korra/Toph Beifong/Unalaq,Amon+Zuko+Unalaq,0.09090909090909091,0.16666666666666666,0.19245008972987526,9,9,7,0.09999999999999998,0.09999999999999998\n" +# ) + +# assert similarities.to_csv() == expected diff --git a/prospector/service/api/api_test.py b/prospector/service/api/api_test.py index 1d1fbe6db..8a2b01913 100644 --- a/prospector/service/api/api_test.py +++ b/prospector/service/api/api_test.py @@ -31,7 +31,9 @@ def test_post_preprocessed_commits(): def 
test_get_specific_commit(): repository = "https://github.com/apache/dubbo" commit_id = "yyy" + print(client) response = client.get("/commits/" + repository + "?commit_id=" + commit_id) + print(f"Response: {response}, {response.reason_phrase}") assert response.status_code == 200 assert response.json()[0]["commit_id"] == commit_id From 399d98299cdef67cc13c9988e1d10ac519359356 Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Fri, 12 Jul 2024 16:10:33 +0200 Subject: [PATCH 57/83] Added a toc --- README.md | 41 ++++++++++++++++++++++++++++------------- 1 file changed, 28 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index 0bfb210ba..530469d66 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,22 @@ [![REUSE status](https://api.reuse.software/badge/github.com/sap/project-kb)](https://api.reuse.software/info/github.com/sap/project-kb) [![Pytest](https://github.com/SAP/project-kb/actions/workflows/python.yml/badge.svg)](https://github.com/SAP/project-kb/actions/workflows/python.yml) -## Description +# Table of contents +1. [Description](#desc) +2. [Motivations](#motiv) +3. [Kaybee](#kaybee) +4. [Prospector](#prosp) +5. [Vulnerability data](#vuldata) +6. [Publications](#publi) +7. [Star history](#starhist) +8. [Credits](#credit) +9. [EU funded research projects](#eu_funded) +10. [Vulnerability data sources](#vul_data) +11. [Limitations and known issues](#limit) +12. [Support](#support) +13. [Contributing](#contrib) + +## Description The goal of `Project KB` is to enable the creation, management and aggregation of a distributed, collaborative knowledge base of vulnerabilities affecting @@ -19,7 +34,7 @@ open-source software. as well as set of tools to support the mining, curation and management of such data. 
-### Motivations +### Motivations In order to feed [Eclipse Steady](https://github.com/eclipse/steady/) with fresh data, we have spent a considerable amount of time, in the past few years, mining @@ -45,7 +60,7 @@ of the data they produce and of how they aggregate and consume data from the other sources. -## Kaybee +## Kaybee Kaybee is a vulnerability data management tool, it makes possible to fetch the vulnerability statements from this repository (or from any other repository) and export them to a number of @@ -54,18 +69,18 @@ backend](https://github.com/eclipse/steady). For details and usage instructions check out the [kaybee README](https://github.com/SAP/project-kb/tree/main/kaybee). -## Prospector +## Prospector Prospector is a vulnerability data mining tool that aims at reducing the effort needed to find security fixes for known vulnerabilities in open source software repositories. The tool takes a vulnerability description (in natural language) as input and produces a ranked list of commits, in decreasing order of relevance. For details and usage instructions check out the [prospector README](https://github.com/SAP/project-kb/tree/main/prospector). -## Vulnerability data +## Vulnerability data The vulnerability data of Project KB are stored in textual form as a set of YAML files, in the [vulnerability-data branch](https://github.com/SAP/project-kb/tree/vulnerability-data). -## Publications +## Publications In early 2019, a snapshot of the knowlege base from project "KB" was described in: @@ -91,13 +106,13 @@ scripts described in that paper](MSR2019) > If you wrote a paper that uses the data or the tools from this repository, please let us know (through an issue) and we'll add it to this list. 
-## Star History +## Star History [![Star History Chart](https://api.star-history.com/svg?repos=sap/project-kb&type=Date)](https://star-history.com/#sap/project-kb&Date) -## Credits +## Credits -### EU-funded research projects +### EU-funded research projects The development of Project KB is partially supported by the following projects: @@ -105,22 +120,22 @@ The development of Project KB is partially supported by the following projects: * [AssureMOSS](https://assuremoss.eu) (Grant No. 952647). * [Sparta](https://www.sparta.eu/) (Grant No. 830892). -### Vulnerability data sources +### Vulnerability data sources Vulnerability information from NVD and MITRE might have been used as input for building parts of this knowledge base. See MITRE's [CVE Usage license](http://cve.mitre.org/about/termsofuse.html) for more information. -## Limitations and Known Issues +## Limitations and Known Issues This project is **work-in-progress**, you can find the list of known issues [here](https://github.com/SAP/project-kb/issues). Currently the vulnerability knowledge base only contains information about vulnerabilities in Java and Python open source components. -## Support +## Support For the time being, please use [GitHub issues](https://github.com/SAP/project-kb/issues) to report bugs, request new features and ask for support. -## Contributing +## Contributing See [How to contribute](CONTRIBUTING.md). 
From 30db83774bf033f8d1a345de1cdc6aa96339e4a1 Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Wed, 17 Jul 2024 15:26:49 +0200 Subject: [PATCH 58/83] Added papers citing our work and our own related papers --- README.md | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 105 insertions(+) diff --git a/README.md b/README.md index 530469d66..33abef135 100644 --- a/README.md +++ b/README.md @@ -106,6 +106,111 @@ scripts described in that paper](MSR2019) > If you wrote a paper that uses the data or the tools from this repository, please let us know (through an issue) and we'll add it to this list. +___ + + + +**Papers citing our work** +* Bui, Q-C. et al. (May 2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) +* Galvão, P.L. (October 2022). [Analysis and Aggregation of Vulnerability Databases with Code-Level Data](https://repositorio-aberto.up.pt/bitstream/10216/144796/2/588886.pdf) +* Aladics, T. et al. (2022). [A Vulnerability Introducing Commit Dataset for Java: an Improved SZZ Based Approach](https://real.mtak.hu/149061/1/ICSOFT_2022_41_CR-1.pdf) +* Sharma, T. et al. (October 2021). [A Survey on Machine Learning Techniques for Source Code Analysis](https://arxiv.org/abs/2110.09610) +* Hommersom, D. et al. (June 2024). [Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories](https://dl.acm.org/doi/abs/10.1145/3649590) +* Marchand-Melsom, A. et al. (June 2020). [Automatic repair of OWASP Top 10 security vulnerabilities: A survey](https://dl.acm.org/doi/abs/10.1145/3387940.3392200) +* Sawadogo, A. D. et al. (December 2021). [Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We?](https://arxiv.org/abs/2112.10123) +* Sun, S. et al. (July 2023). 
[Exploring Security Commits in Python](https://arxiv.org/abs/2307.11853) +* Reis, S. et al. (June 2021). [Fixing Vulnerabilities Potentially Hinders Maintainability](https://arxiv.org/abs/2106.03271) +* Andrade, R., & Santos, V. (September 2021). [Investigating vulnerability datasets](https://sol.sbc.org.br/index.php/vem/article/view/17213) +* Nguyen, T. G. et al. (May 2023). [Multi-Granularity Detector for Vulnerability Fixes](https://arxiv.org/abs/2305.13884) +* Siddiq, M. L., & Santos, J. C. S. (November 2022). [SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques](https://dl.acm.org/doi/abs/10.1145/3549035.3561184) +* Sawadogo, A. D. et al. (August 2022). [SSPCatcher: Learning to catch security patches](https://link.springer.com/article/10.1007/s10664-022-10168-9) +* Dunlap, T. et al. (July 2024). [VFCFinder: Pairing Security Advisories and Patches](http://enck.org/pubs/dunlap-asiaccs24.pdf) +* Dunlap, T. et al. (November 2023). [VFCFinder: Seamlessly Pairing Security Advisories and Patches](https://arxiv.org/abs/2311.01532) +* Bao, L. et al. (July 2022). [V-SZZ: automatic identification of version ranges affected by CVE vulnerabilities](https://dl.acm.org/doi/abs/10.1145/3510003.3510113) +* Fan, J. et al. (September 2020). [A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries](https://dl.acm.org/doi/abs/10.1145/3379597.3387501) +* Zhang, J. et al. (January 2023). [A Survey of Learning-based Automated Program Repair](https://arxiv.org/abs/2301.03270) +* Alzubaidi, L. et al. (April 2023). [A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications](https://link.springer.com/article/10.1186/s40537-023-00727-2) +* Sharma, T. et al. (December 2023). [A survey on machine learning techniques applied to source code](https://www.sciencedirect.com/science/article/pii/S0164121223003291) +* Elder, S. et al. (April 2024). 
[A Survey on Software Vulnerability Exploitability Assessment](https://dl.acm.org/doi/abs/10.1145/3648610) +* Aladics, T. et al. (March 2023). [An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction](https://arxiv.org/abs/2303.16591) +* Singhal, A., & Goel, P.K. (2023). [Analysis and Identification of Malicious Mobile Applications](https://ieeexplore.ieee.org/abstract/document/10428519) +* Senanayake, J. et al. (July 2021). [Android Mobile Malware Detection Using Machine Learning: A Systematic Review](https://www.mdpi.com/2079-9292/10/13/1606) +* Bui, Q-C. et al. (December 2023). [APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities](https://link.springer.com/article/10.1007/s10664-023-10415-7) +* Senanayake, J. et al. (January 2023). [Android Source Code Vulnerability Detection: A Systematic Literature Review](https://dl.acm.org/doi/full/10.1145/3556974) +* Reis, S. et al. (June 2023). [Are security commit messages informative? Not enough!](https://dl.acm.org/doi/abs/10.1145/3593434.3593481) +* Anonymous authors. (2022). [Beyond syntax trees: learning embeddings of code edits by combining multiple source representations](https://openreview.net/pdf?id=H8qETo_W1-9) +* Challande, A. et al. (April 2022). [Building a Commit-level Dataset of Real-world Vulnerabilities](https://dl.acm.org/doi/abs/10.1145/3508398.3511495) +* Wang, S., & Nagappan, N. (July 2019). [Characterizing and Understanding Software Developer Networks in Security Development](https://arxiv.org/abs/1907.12141) +* Harzevili, N. S. et al. (March 2022). [Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries](https://arxiv.org/abs/2203.06502) +* Tate, S. R. et al. (2020). [Characterizing Vulnerabilities in a Major Linux Distribution](https://home.uncg.edu/cmp/faculty/srtate/pubs/vulnerabilities/Vulnerabilities-SEKE2020.pdf) +* Zhang, L. et al. (January 2023). 
[Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects](https://arxiv.org/abs/2301.08434) +* Lee, J.Y.D., & Chieu, H.L. (November 2021). [Co-training for Commit Classification](https://aclanthology.org/2021.wnut-1.43/) +* Nikitopoulos, G. et al. (August 2021). [CrossVul: a cross-language vulnerability dataset with commit data](https://dl.acm.org/doi/10.1145/3468264.3473122) +* Bhandari, G.P. (July 2021). [CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software](https://arxiv.org/abs/2107.08760) +* Sonnekalb, T. et al. (October 2021). [Deep security analysis of program code](https://link.springer.com/article/10.1007/s10664-021-10029-x) +* Triet, H.M. et al. (August 2021). [DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning](https://arxiv.org/abs/2108.08041) +* Senanayake, J. et al. (May 2024). [Defendroid: Real-time Android code vulnerability detection via blockchain federated neural network with XAI](https://www.sciencedirect.com/science/article/pii/S2214212624000449) +* Stefanoni, A. et al. (2022). [Detecting Security Patches in Java Projects Using NLP Technology](https://aclanthology.org/2022.icnlsp-1.6.pdf) +* Okutan, A. et al. (May 2023). [Empirical Validation of Automated Vulnerability Curation and Characterization](https://s2e-lab.github.io/preprints/tse23-preprint.pdf) +* Wang, J. et al. (October 2023). [Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation](https://arxiv.org/abs/2310.16263) +* Bottner, L. et al. (December 2023). [Evaluation of Free and Open Source Tools for Automated Software Composition Analysis](https://dl.acm.org/doi/abs/10.1145/3631204.3631862) +* Ganz, T. et al. (November 2021). [Explaining Graph Neural Networks for Vulnerability Discovery](https://dl.acm.org/doi/abs/10.1145/3474369.3486866) +* Ram, A. et al. (November 2019). 
[Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits](https://arxiv.org/abs/1911.07620) +* Md. Mostafizer Rahman, et al. (July 2023). [Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey](https://arxiv.org/abs/2307.08705) +* Zhang, Y. et al. (October 2023). [How well does LLM generate security tests?](https://arxiv.org/abs/2310.00710) +* Jing, D. (2022). [Improvement of Vulnerable Code Dataset Based on Program Equivalence Transformation](https://iopscience.iop.org/article/10.1088/1742-6596/2363/1/012010) +* Wu, Y. et al. (May 2023). [How Effective Are Neural Networks for Fixing Security Vulnerabilities](https://arxiv.org/abs/2305.18607) +* Yang, G. et al. (August 2021). [Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models](https://arxiv.org/abs/2108.06590) +* Zhou, J. et al. (2021). [Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes](https://ieeexplore.ieee.org/abstract/document/9678720) +* Dunlap, T. et al. (2023). [Finding Fixed Vulnerabilities with Off-the-Shelf Static Analysis](https://ieeexplore.ieee.org/document/10190493) +* Shestov, A. et al. (January 2024). [Finetuning Large Language Models for Vulnerability Detection](https://arxiv.org/abs/2401.17010) +* Scalco, S. et al. (July 2024). [Hash4Patch: A Lightweight Low False Positive Tool for Finding Vulnerability Patch Commits](https://dl.acm.org/doi/10.1145/3643991.3644871) +* Nguyen-Truong, G. et al. (July 2022). [HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits](https://ieeexplore.ieee.org/abstract/document/9825835) +* Wang, J. et al. (July 2024). [Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval](https://arxiv.org/abs/2407.02395) +* Sawadogo, A.D. et al. (January 2020). 
[Learning to Catch Security Patches](https://arxiv.org/abs/2001.09148) +* Tony, C. et al. (March 2023). [LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations](https://arxiv.org/abs/2303.09384) +* Wang, S., & Naggapan, N. (July 2019). [Characterizing and Understanding Software Developer Networks in Security Development](https://www.researchgate.net/publication/334760102_Characterizing_and_Understanding_Software_Developer_Networks_in_Security_Development) +* Chen, Z. et al. (April 2021). [Neural Transfer Learning for Repairing Security Vulnerabilities in C Code](https://arxiv.org/abs/2104.08308) +* Papotti, A. et al. (September 2022). [On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools](https://arxiv.org/abs/2209.07211) +* Mir, A.M. et al. (February 2024). [On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study](https://arxiv.org/abs/2402.07294) +* Dietrich, J. et al. (June 2023). [On the Security Blind Spots of Software Composition Analysis](https://arxiv.org/abs/2306.05534) +* Triet H. M. Le., & Babar, A.M. (March 2022). [On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models](https://arxiv.org/abs/2203.08417) +* Chapman, J., & Venugopalan, H. (January 2023). [Open Source Software Computed Risk Framework](https://ieeexplore.ieee.org/abstract/document/10000561) +* Canfora, G. et al. (February 2022). [Patchworking: Exploring the code changes induced by vulnerability fixing activities](https://www.researchgate.net/publication/355561561_Patchworking_Exploring_the_code_changes_induced_by_vulnerability_fixing_activities) +* Garg, S. et al. (June 2021). [PerfLens: a data-driven performance bug detection and fix platform](https://dl.acm.org/doi/abs/10.1145/3460946.3464318) +* Coskun, T. et al. (November 2022). 
[Profiling developers to predict vulnerable code changes](https://dl.acm.org/doi/abs/10.1145/3558489.3559069) +* Bhuiyan, M.H.M. et al. (July 2023). [SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript](https://ieeexplore.ieee.org/abstract/document/10172577) +* Reis, S. et al. (October 2022). [SECOM: towards a convention for security commit messages](https://dl.acm.org/doi/abs/10.1145/3524842.3528513) +* Bennett, G. et al. (June 2024). [Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools](https://dl.acm.org/doi/abs/10.1145/3661167.3661262) +* Chi, J. et al. (October 2020). [SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning](https://arxiv.org/abs/2010.10805) +* Ahmed, A. et al. (May 2023). [Sequential Graph Neural Networks for Source Code Vulnerability Identification](https://arxiv.org/abs/2306.05375) +* Sun, J. et al. (February 2023). [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation](https://arxiv.org/abs/2302.07445) +* Zhao, L. et al. (November 2023). [Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects](https://dl.acm.org/doi/10.1145/3611643.3616299) +* Zhan, Q. et al. (January 2024). [Survey on Vulnerability Awareness of Open Source Software](https://www.jos.org.cn/josen/article/abstract/6935) +* Li, X. et al. (March 2023). [The anatomy of a vulnerability database: A systematic mapping study](https://www.sciencedirect.com/science/article/pii/S0164121223000742) +* Al Debeyan, F. et al. (February 2024). [The impact of hard and easy negative training data on vulnerability prediction performance☆](https://www.sciencedirect.com/science/article/pii/S0164121224000463) +* Xu, C. et al. (December 2021). [Tracking Patches for Open Source Software Vulnerabilities](https://arxiv.org/abs/2112.02240) +* Risse, N., & Böhme, M. (June 2023). 
[Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection](https://arxiv.org/abs/2306.17193) +* Xu, N. et al. (July 2023). [Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)](https://dl.acm.org/doi/abs/10.1145/3597926.3598037) +* Wu, Y. et al. (July 2023). [Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem](https://ieeexplore.ieee.org/abstract/document/10172868) +* Esposito, M., & Falessi, D. (March 2024). [VALIDATE: A deep dive into vulnerability prediction datasets](https://www.sciencedirect.com/science/article/pii/S0950584924000533) +* Wang, S. et al. (July 2022). [VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities](https://ieeexplore.ieee.org/abstract/document/9825908) +* Sun, Q. et al. (December 2022). [VERJava: Vulnerable Version Identification for Java OSS with a Two-Stage Analysis](https://ieeexplore.ieee.org/abstract/document/9978189) +* Nguyen, S. et al. (September 2023). [VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification](https://arxiv.org/abs/2309.01971) +* Piran, A. et al. (March 2022). [Vulnerability Analysis of Similar Code](https://ieeexplore.ieee.org/abstract/document/9724745) +* Keller, P. et al. (February 2020). [What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning](https://arxiv.org/abs/2002.02650) + +___ + +**Our related papers** +* Cabrera Lozoya, R. et al. (March 2021). [Commit2Vec: Learning Distributed Representations of Code Changes](https://link.springer.com/article/10.1007/s42979-021-00566-z) +* Fehrer, T. et al. (May 2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers](https://dl.acm.org/doi/pdf/10.1145/3661167.3661217) +* Ponta, S.E. et al. (June 2020). 
[Detection, assessment and mitigation of vulnerabilities in open source dependencies](https://www.semanticscholar.org/paper/Detection%2C-assessment-and-mitigation-of-in-open-Ponta-Plate/728eab7ac5ae7dd624d306ae5e1887f7b10447cc) +* Dann, A. et al. (September 2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite](https://www.computer.org/csdl/journal/ts/2022/09/09506931/1vNfNyyKDOo) +* Ponta, S.E. et al. (August 2021). [The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application](https://arxiv.org/abs/2108.05115) +* Iannone, E. et al. (June 2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries](https://ieeexplore.ieee.org/abstract/document/9462983) + + ## Star History [![Star History Chart](https://api.star-history.com/svg?repos=sap/project-kb&type=Date)](https://star-history.com/#sap/project-kb&Date) From db5bd180732e94c744f50820e2fc65a2b71d9e5d Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Fri, 19 Jul 2024 13:56:28 +0200 Subject: [PATCH 59/83] Changed order (our papers first, the others') --- README.md | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 33abef135..abac3a20b 100644 --- a/README.md +++ b/README.md @@ -108,6 +108,16 @@ scripts described in that paper](MSR2019) ___ +**Our papers related to Project KB** +* Cabrera Lozoya, R. et al. (March 2021). [Commit2Vec: Learning Distributed Representations of Code Changes](https://link.springer.com/article/10.1007/s42979-021-00566-z) +* Fehrer, T. et al. (May 2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers](https://dl.acm.org/doi/pdf/10.1145/3661167.3661217) +* Ponta, S.E. et al. (June 2020). 
[Detection, assessment and mitigation of vulnerabilities in open source dependencies](https://www.semanticscholar.org/paper/Detection%2C-assessment-and-mitigation-of-in-open-Ponta-Plate/728eab7ac5ae7dd624d306ae5e1887f7b10447cc) +* Dann, A. et al. (September 2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite](https://www.computer.org/csdl/journal/ts/2022/09/09506931/1vNfNyyKDOo) +* Ponta, S.E. et al. (August 2021). [The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application](https://arxiv.org/abs/2108.05115) +* Iannone, E. et al. (June 2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries](https://ieeexplore.ieee.org/abstract/document/9462983) + +___ + **Papers citing our work** @@ -200,17 +210,6 @@ ___ * Piran, A. et al. (March 2022). [Vulnerability Analysis of Similar Code](https://ieeexplore.ieee.org/abstract/document/9724745) * Keller, P. et al. (February 2020). [What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning](https://arxiv.org/abs/2002.02650) -___ - -**Our related papers** -* Cabrera Lozoya, R. et al. (March 2021). [Commit2Vec: Learning Distributed Representations of Code Changes](https://link.springer.com/article/10.1007/s42979-021-00566-z) -* Fehrer, T. et al. (May 2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers](https://dl.acm.org/doi/pdf/10.1145/3661167.3661217) -* Ponta, S.E. et al. (June 2020). [Detection, assessment and mitigation of vulnerabilities in open source dependencies](https://www.semanticscholar.org/paper/Detection%2C-assessment-and-mitigation-of-in-open-Ponta-Plate/728eab7ac5ae7dd624d306ae5e1887f7b10447cc) -* Dann, A. et al. (September 2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite](https://www.computer.org/csdl/journal/ts/2022/09/09506931/1vNfNyyKDOo) -* Ponta, S.E. et al. 
(August 2021). [The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application](https://arxiv.org/abs/2108.05115) -* Iannone, E. et al. (June 2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries](https://ieeexplore.ieee.org/abstract/document/9462983) - - ## Star History [![Star History Chart](https://api.star-history.com/svg?repos=sap/project-kb&type=Date)](https://star-history.com/#sap/project-kb&Date) From 6bb696ab046434d0901301b61aa10f75747fa49d Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Fri, 19 Jul 2024 14:28:39 +0200 Subject: [PATCH 60/83] Changed format for references of papers APA one --- README.md | 187 +++++++++++++++++++++++++++--------------------------- 1 file changed, 92 insertions(+), 95 deletions(-) diff --git a/README.md b/README.md index abac3a20b..19f491e20 100644 --- a/README.md +++ b/README.md @@ -109,106 +109,103 @@ scripts described in that paper](MSR2019) ___ **Our papers related to Project KB** -* Cabrera Lozoya, R. et al. (March 2021). [Commit2Vec: Learning Distributed Representations of Code Changes](https://link.springer.com/article/10.1007/s42979-021-00566-z) -* Fehrer, T. et al. (May 2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers](https://dl.acm.org/doi/pdf/10.1145/3661167.3661217) -* Ponta, S.E. et al. (June 2020). [Detection, assessment and mitigation of vulnerabilities in open source dependencies](https://www.semanticscholar.org/paper/Detection%2C-assessment-and-mitigation-of-in-open-Ponta-Plate/728eab7ac5ae7dd624d306ae5e1887f7b10447cc) -* Dann, A. et al. (September 2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite](https://www.computer.org/csdl/journal/ts/2022/09/09506931/1vNfNyyKDOo) -* Ponta, S.E. et al. (August 2021). 
[The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application](https://arxiv.org/abs/2108.05115) -* Iannone, E. et al. (June 2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries](https://ieeexplore.ieee.org/abstract/document/9462983) +* Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625. +* Cabrera Lozoya, R., Baumann, A., Sabetta, A., & Bezzi, M. (2021). [Commit2Vec: Learning Distributed Representations of Code Changes.](https://link.springer.com/article/10.1007/s42979-021-00566-z) SN Computer Science, 2(3). +* Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering +* Ponta, S. E., Fischer, W., Plate, H., & Sabetta, A. (2021). [The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application.](https://www.computer.org/csdl/proceedings-article/icsme/2021/288200a555/1yNhfKb2TBe) 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME) +* Iannone, E., Nucci, D. D., Sabetta, A., & De Lucia, A. (2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries.](https://ieeexplore.ieee.org/document/9462983) 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), 396–400. +* Ponta, S. E., Plate, H., & Sabetta, A. (2020). [Detection, assessment and mitigation of vulnerabilities in open source dependencies.](https://api.semanticscholar.org/CorpusID:220259876) Empirical Software Engineering, 25, 3175–3215. 
___ - + **Papers citing our work** -* Bui, Q-C. et al. (May 2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) -* Galvão, P.L. (October 2022). [Analysis and Aggregation of Vulnerability Databases with Code-Level Data](https://repositorio-aberto.up.pt/bitstream/10216/144796/2/588886.pdf) -* Aladics, T. et al. (2022). [A Vulnerability Introducing Commit Dataset for Java: an Improved SZZ Based Approach](https://real.mtak.hu/149061/1/ICSOFT_2022_41_CR-1.pdf) -* Sharma, T. et al. (October 2021). [A Survey on Machine Learning Techniques for Source Code Analysis](https://arxiv.org/abs/2110.09610) -* Hommersom, D. et al. (June 2024). [Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories](https://dl.acm.org/doi/abs/10.1145/3649590) -* Marchand-Melsom, A. et al. (June 2020). [Automatic repair of OWASP Top 10 security vulnerabilities: A survey](https://dl.acm.org/doi/abs/10.1145/3387940.3392200) -* Sawadogo, A. D. et al. (Dec 2021). [Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We?](https://arxiv.org/abs/2112.10123) -* Sun, S. et al. (Jul 2023). [Exploring Security Commits in Python](https://arxiv.org/abs/2307.11853) -* Reis, S. et al. (June 2021). [Fixing Vulnerabilities Potentially Hinders Maintainability](https://arxiv.org/abs/2106.03271) -* Andrade, R., & Santos, V. (September 2021). [Investigating vulnerability datasets](https://sol.sbc.org.br/index.php/vem/article/view/17213) -* Nguyen, T. G. et al. (May 2023). [Multi-Granularity Detector for Vulnerability Fixesv](https://arxiv.org/abs/2305.13884) -* Siddiq, M. L., & Santos, J. C. S. (November 2022). [SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques](https://dl.acm.org/doi/abs/10.1145/3549035.3561184) -* Sawadogo, A. D. et al. (August 2022). 
[SSPCatcher: Learning to catch security patches](https://link.springer.com/article/10.1007/s10664-022-10168-9) -* Dunlap, T. et al. (July 2024). [VFCFinder: Pairing Security Advisories and Patches](http://enck.org/pubs/dunlap-asiaccs24.pdf) -* Dunlap, T. et al. (November 2023). [VFCFinder: Seamlessly Pairing Security Advisories and Patches](https://arxiv.org/abs/2311.01532) -* Bao, L. et al. (July 2022). [V-SZZ: automatic identification of version ranges affected by CVE vulnerabilities](https://dl.acm.org/doi/abs/10.1145/3510003.3510113) -* Fan, J. et al. (September 2020). [A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries](https://dl.acm.org/doi/abs/10.1145/3379597.3387501) -* Zhang, J. et al. (January 2023). [A Survey of Learning-based Automated Program Repair](https://arxiv.org/abs/2301.03270) -* Alzubaidi, L. et al. (April 2023). [A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications](https://link.springer.com/article/10.1186/s40537-023-00727-2) -* Sharma, T. et al. (December 2023). [A survey on machine learning techniques applied to source code](https://www.sciencedirect.com/science/article/pii/S0164121223003291) -* Elder, S. et al. (April 2024). [A Survey on Software Vulnerability Exploitability Assessment](https://dl.acm.org/doi/abs/10.1145/3648610) -* Aladics, T. et al. (March 2023). [An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction](https://arxiv.org/abs/2303.16591) -* Singhal, A., & Goel, P.K. (2023). [Analysis and Identification of Malicious Mobile Applications](https://ieeexplore.ieee.org/abstract/document/10428519) -* Senanayake, J. et al. (July 2021). [Android Mobile Malware Detection Using Machine Learning: A Systematic Review](https://www.mdpi.com/2079-9292/10/13/1606) -* Bui, Q-C. et al. (December 2023). 
[APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities](https://link.springer.com/article/10.1007/s10664-023-10415-7) -* Senanayake, J. et al. (January 2023). [Android Source Code Vulnerability Detection: A Systematic Literature Review](https://dl.acm.org/doi/full/10.1145/3556974) -* Reis, S. et al. (June 2023). [Are security commit messages informative? Not enough!](https://dl.acm.org/doi/abs/10.1145/3593434.3593481) -* Anonymous authors. (2022). [Beyond syntax trees: learning embeddings of code edits by combining multiple source representations](https://openreview.net/pdf?id=H8qETo_W1-9) -* Challande, A. et al. (April 2022). [Building a Commit-level Dataset of Real-world Vulnerabilities](https://dl.acm.org/doi/abs/10.1145/3508398.3511495) -* Wang, S., & Nagappan, N. (July 2019). [Characterizing and Understanding Software Developer Networks in Security Development](https://arxiv.org/abs/1907.12141) -* Harzevili, N. S. et al. (March 2022). [Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries](https://arxiv.org/abs/2203.06502) -* Tate, S. R. et al. (2020). [Characterizing Vulnerabilities in a Major Linux Distribution](https://home.uncg.edu/cmp/faculty/srtate/pubs/vulnerabilities/Vulnerabilities-SEKE2020.pdf) -* Zhang, L. et al. (January 2023). [Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects](https://arxiv.org/abs/2301.08434) -* Lee, J.Y.D., & Chieu, H.L. (November 2021). [Co-training for Commit Classification](https://aclanthology.org/2021.wnut-1.43/) -* Nikitopoulos, G. et al. (August 2021). [CrossVul: a cross-language vulnerability dataset with commit data](https://dl.acm.org/doi/10.1145/3468264.3473122) -* Bhandari, G.P. (July 2021). [CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software](https://arxiv.org/abs/2107.08760) -* Sonnekalb, T. et al. (October 2021). 
[Deep security analysis of program code](https://link.springer.com/article/10.1007/s10664-021-10029-x) -* Triet, H.M. et al. (August 2021). [DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning](https://arxiv.org/abs/2108.08041) -* Senanayake, J. et al. (May 2024). [Defendroid: Real-time Android code vulnerability detection via blockchain federated neural network with XAI](https://www.sciencedirect.com/science/article/pii/S2214212624000449) -* Stefanoni, A. et al. (2022). [Detecting Security Patches in Java Projects Using NLP Technology](https://aclanthology.org/2022.icnlsp-1.6.pdf) -* Okutan, A. et al. (May 2023). [Empirical Validation of Automated Vulnerability Curation and Characterization](https://s2e-lab.github.io/preprints/tse23-preprint.pdf) -* Wang, J. et al. (October 2023). [Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation](https://arxiv.org/abs/2310.16263) -* Bottner, L. et al. (December 2023). [Evaluation of Free and Open Source Tools for Automated Software Composition Analysis](https://dl.acm.org/doi/abs/10.1145/3631204.3631862) -* Ganz, T. et al. (November 2021). [Explaining Graph Neural Networks for Vulnerability Discovery](https://dl.acm.org/doi/abs/10.1145/3474369.3486866) -* Ram, A. et al. (November 2019). [Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits](https://arxiv.org/abs/1911.07620) -* Md. Mostafizer Rahman, et al. (July 2023). [Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey](https://arxiv.org/abs/2307.08705) -* Zhang, Y. et al. (October 2023). [How well does LLM generate security tests?](https://arxiv.org/abs/2310.00710) -* Jing, D. (2022). [Improvement of Vulnerable Code Dataset Based on Program Equivalence Transformation](https://iopscience.iop.org/article/10.1088/1742-6596/2363/1/012010) -* Wu, Y. et al. (May 2023). 
[How Effective Are Neural Networks for Fixing Security Vulnerabilities](https://arxiv.org/abs/2305.18607) -* Yang, G. et al. (August 2021). [Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models](https://arxiv.org/abs/2108.06590) -* Zhou, J. et al. (2021). [Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes](https://ieeexplore.ieee.org/abstract/document/9678720) -* Dunlap, T. et al. (2023). [Finding Fixed Vulnerabilities with Off-the-Shelf Static Analysis](https://ieeexplore.ieee.org/document/10190493) -* Shestov, A. et al. (January 2024). [Finetuning Large Language Models for Vulnerability Detection](https://arxiv.org/abs/2401.17010) -* Scalco, S. et al. (July 2024). [Hash4Patch: A Lightweight Low False Positive Tool for Finding Vulnerability Patch Commits](https://dl.acm.org/doi/10.1145/3643991.3644871) -* Nguyen-Truong, G. et al. (July 2022). [HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits](https://ieeexplore.ieee.org/abstract/document/9825835) -* Wang, J. et al. (July 2024). [Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval](https://arxiv.org/abs/2407.02395) -* Sawadogo, A.D. et al. (January 2020). [Learning to Catch Security Patches](https://arxiv.org/abs/2001.09148) -* Tony, C. et al. (March 2023). [LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations](https://arxiv.org/abs/2303.09384) -* Wang, S., & Naggapan, N. (July 2019). [Characterizing and Understanding Software Developer Networks in Security Development](https://www.researchgate.net/publication/334760102_Characterizing_and_Understanding_Software_Developer_Networks_in_Security_Development) -* Chen, Z. et al. (April 2021). [Neural Transfer Learning for Repairing Security Vulnerabilities in C Code](https://arxiv.org/abs/2104.08308) -* Papotti, A. et al. (September 2022). 
[On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools](https://arxiv.org/abs/2209.07211) -* Mir, A.M. et al. (February 2024). [On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study](https://arxiv.org/abs/2402.07294) -* Dietrich, J. et al. (June 2023). [On the Security Blind Spots of Software Composition Analysis](https://arxiv.org/abs/2306.05534) -* Triet H. M. Le., & Babar, A.M. (March 2022). [On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models](https://arxiv.org/abs/2203.08417) -* Chapman, J., & Venugopalan, H. (January 2023). [Open Source Software Computed Risk Framework](https://ieeexplore.ieee.org/abstract/document/10000561) -* Canfora, G. et al. (February 2022). [Patchworking: Exploring the code changes induced by vulnerability fixing activities](https://www.researchgate.net/publication/355561561_Patchworking_Exploring_the_code_changes_induced_by_vulnerability_fixing_activities) -* Garg, S. et al. (June 2021). [PerfLens: a data-driven performance bug detection and fix platform](https://dl.acm.org/doi/abs/10.1145/3460946.3464318) -* Coskun, T. et al. (November 2022). [Profiling developers to predict vulnerable code changes](https://dl.acm.org/doi/abs/10.1145/3558489.3559069) -* Bhuiyan, M.H.M. et al. (July 2023). [SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript](https://ieeexplore.ieee.org/abstract/document/10172577) -* Reis, S. et al. (October 2022). [SECOM: towards a convention for security commit messages](https://dl.acm.org/doi/abs/10.1145/3524842.3528513) -* Bennett, G. et al. (June 2024). [Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools](https://dl.acm.org/doi/abs/10.1145/3661167.3661262) -* Chi, J. et al. (October 2020). [SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning](https://arxiv.org/abs/2010.10805) -* Ahmed, A. 
et al. (May 2023). [Sequential Graph Neural Networks for Source Code Vulnerability Identification](https://arxiv.org/abs/2306.05375) -* Sun, J. et al. (February 2023). [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation](https://arxiv.org/abs/2302.07445) -* Zhao, L. et al. (November 2023). [Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects](https://dl.acm.org/doi/10.1145/3611643.3616299) -* Zhan, Q. et al. (January 2024). [Survey on Vulnerability Awareness of Open Source Software](https://www.jos.org.cn/josen/article/abstract/6935) -* Li, X. et al. (March 2023). [The anatomy of a vulnerability database: A systematic mapping study](https://www.sciencedirect.com/science/article/pii/S0164121223000742) -* Al Debeyan, F. et al. (February 2024). [The impact of hard and easy negative training data on vulnerability prediction performance☆](https://www.sciencedirect.com/science/article/pii/S0164121224000463) -* Xu, C. et al. (December 2021). [Tracking Patches for Open Source Software Vulnerabilities](https://arxiv.org/abs/2112.02240) -* Risse, N., & Böhme, M. (June 2023). [Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection](https://arxiv.org/abs/2306.17193) -* Xu, N. et al. (July 2023). [Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)](https://dl.acm.org/doi/abs/10.1145/3597926.3598037) -* Wu, Y. et al. (July 2023). [Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem](https://ieeexplore.ieee.org/abstract/document/10172868) -* Esposito, M., & Falessi, D. (March 2024). [VALIDATE: A deep dive into vulnerability prediction datasets](https://www.sciencedirect.com/science/article/pii/S0950584924000533) -* Wang, S. et al. (July 2022). 
[VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities](https://ieeexplore.ieee.org/abstract/document/9825908) -* Sun, Q. et al. (December 2022). [VERJava: Vulnerable Version Identification for Java OSS with a Two-Stage Analysis](https://ieeexplore.ieee.org/abstract/document/9978189) -* Nguyen, S. et al. (September 2023). [VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification](https://arxiv.org/abs/2309.01971) -* Piran, A. et al. (March 2022). [Vulnerability Analysis of Similar Code](https://ieeexplore.ieee.org/abstract/document/9724745) -* Keller, P. et al. (February 2020). [What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning](https://arxiv.org/abs/2002.02650) +* Aladics, T., Hegedüs, P., & Ferenc, R. (2022). [A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach.](https://api.semanticscholar.org/CorpusID:250566828) International Conference on Software and Data Technologies +* Bui, Q.-C., Scandariato, R., & Ferreyra, N. E. D. (2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques.](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) Proceedings of the 19th International Conference on Mining Software Repositories, 464–468. +* Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2022). [A Survey on Machine Learning Techniques for Source Code Analysis.](http://arxiv.org/abs/2110.09610) +* Hommersom, D., Sabetta, A., Coppola, B., Nucci, D. D., & Tamburri, D. A. (2024). [Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories.](https://dl.acm.org/doi/10.1145/3649590) ACM Trans. Softw. Eng. Methodol., 33(5). +* Marchand-Melsom, A., & Nguyen Mai, D. B. (2020). 
[Automatic repair of OWASP Top 10 security vulnerabilities: A survey.](https://dl.acm.org/doi/10.1145/3387940.3392200) Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 23–30. Presented at the Seoul, Republic of Korea. +* Sawadogo, A. D., Guimard, Q., Bissyandé, T. F., Kaboré, A. K., Klein, J., & Moha, N. (2021). [Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We?](http://arxiv.org/abs/2112.10123) +* Sun, S., Wang, S., Wang, X., Xing, Y., Zhang, E., & Sun, K. (2023). [Exploring Security Commits in Python.](http://arxiv.org/abs/2307.11853) +* Reis, S., Abreu, R., & Cruz, L. (2021). [Fixing Vulnerabilities Potentially Hinders Maintainability.](http://arxiv.org/abs/2106.03271) +* Andrade, R., & Santos, V. (2021). [Investigating vulnerability datasets.](https://sol.sbc.org.br/index.php/vem/article/view/17213) Anais Do IX Workshop de Visualização, Evolução e Manutenção de Software, 26–30. Presented at the Joinville. +* Nguyen, T. G., Le-Cong, T., Kang, H. J., Widyasari, R., Yang, C., Zhao, Z., … Lo, D. (2023). [Multi-Granularity Detector for Vulnerability Fixes.](https://arxiv.org/abs/2305.13884) +* Siddiq, M. L., & Santos, J. C. S. (2022). [SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques.](https://dl.acm.org/doi/abs/10.1145/3549035.3561184) Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, 29–33. Presented at the Singapore, Singapore.] +* Sawadogo, A. D., Bissyandé, T. F., Moha, N., Allix, K., Klein, J., Li, L., & Traon, Y. L. (2020). [Learning to Catch Security Patches.](https://arxiv.org/abs/2001.09148) +* Dunlap, T., Lin, E., Enck, W., & Reaves, B. (2023). [VFCFinder: Seamlessly Pairing Security Advisories and Patches.](http://arxiv.org/abs/2311.01532) +* Bao, L., Xia, X., Hassan, A. E., & Yang, X. (2022). 
[V-SZZ: automatic identification of version ranges affected by CVE vulnerabilities.](https://dl.acm.org/doi/10.1145/3510003.3510113) Proceedings of the 44th International Conference on Software Engineering, 2352–2364. Presented at the Pittsburgh, Pennsylvania. +* Fan, J., Li, Y., Wang, S., & Nguyen, T. N. (2020). [A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries.](https://dl.acm.org/doi/10.1145/3379597.3387501) Proceedings of the 17th International Conference on Mining Software Repositories, 508–512. Presented at the Seoul, Republic of Korea. +* Zhang, Q., Fang, C., Ma, Y., Sun, W., & Chen, Z. (2023). [A Survey of Learning-based Automated Program Repair.](http://arxiv.org/abs/2301.03270) +* Alzubaidi, L., Bai, J., Al-Sabaawi, A., Santamaría, J. I., Albahri, A. S., Al-dabbagh, B. S. N., … Gu, Y. (2023). [A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications.](https://www.semanticscholar.org/paper/A-survey-on-deep-learning-tools-dealing-with-data-Alzubaidi-Bai/4a07ded5f56aa76c75e844f353e046414b427cc2) Journal of Big Data, 10, 1–82. +* Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2024). [A survey on machine learning techniques applied to source code.](https://discovery.ucl.ac.uk/id/eprint/10184342/) Journal of Systems and Software, 209, 111934. +* Elder, S., Rahman, M. R., Fringer, G., Kapoor, K., & Williams, L. (2024). [A Survey on Software Vulnerability Exploitability Assessment.](https://dl.acm.org/doi/10.1145/3648610) ACM Comput. Surv., 56(8). +* Aladics, T., Hegedűs, P., & Ferenc, R. (2023). [An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction.](https://arxiv.org/abs/2303.16591) +* Singhal, A., & Goel, P. K. (2023). 
[Analysis and Identification of Malicious Mobile Applications.](https://www.researchgate.net/publication/378257226_Analysis_and_Identification_of_Malicious_Mobile_Applications) 2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE), 1045–1050. +* Senanayake, J., Kalutarage, H., & Al-Kadri, M. O. (2021). [Android Mobile Malware Detection Using Machine Learning: A Systematic Review.](https://www.mdpi.com/2079-9292/10/13/1606) Electronics, 10(13). +* Bui, Q.-C., Paramitha, R., Vu, D.-L., Massacci, F., & Scandariato, R. (12 2023). [APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities.](https://link.springer.com/article/10.1007/s10664-023-10415-7) Empirical Software Engineering, 29. +* Senanayake, J., Kalutarage, H., Al-Kadri, M. O., Petrovski, A., & Piras, L. (2023). [Android Source Code Vulnerability Detection: A Systematic Literature Review.](https://dl.acm.org/doi/10.1145/3556974) ACM Comput. Surv., 55(9). +* Reis, S., Abreu, R., & Pasareanu, C. (2023). [Are security commit messages informative? Not enough!](https://dl.acm.org/doi/10.1145/3593434.3593481) Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, 196–199. Presented at the Oulu, Finland. +* [B EYOND SYNTAX TREES : LEARNING EMBEDDINGS OF CODE EDITS BY COMBINING MULTIPLE SOURCE REP - RESENTATIONS.](https://api.semanticscholar.org/CorpusID:249038879) (2022). +* Challande, A., David, R., & Renault, G. (2022). [Building a Commit-level Dataset of Real-world Vulnerabilities.](https://dl.acm.org/doi/10.1145/3508398.3511495) Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, 101–106. Presented at the Baltimore, MD, USA. +* Wang, Song, & Nagappan, N. (2019). [Characterizing and Understanding Software Developer Networks in Security Development.](http://arxiv.org/abs/1907.12141) +* Harzevili, N. S., Shin, J., Wang, J., & Wang, S. (2022). 
[Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries.](http://arxiv.org/abs/2203.06502) +* Zhang, L., Liu, C., Xu, Z., Chen, S., Fan, L., Zhao, L., … Liu, Y. (2023). [Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects.](http://arxiv.org/abs/2301.08434) +* Lee, J. Y. D., & Chieu, H. L. (2021, November). [Co-training for Commit Classification.](https://aclanthology.org/2021.wnut-1.43/) +* In W. Xu, A. Ritter, T. Baldwin, & A. Rahimi (Eds.), [Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)](https://aclanthology.org/volumes/2021.wnut-1/) +* Nikitopoulos, G., Dritsa, K., Louridas, P., & Mitropoulos, D. (2021).[CrossVul: a cross-language vulnerability dataset with commit data.](https://dl.acm.org/doi/10.1145/3468264.3473122) Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1565–1569. Presented at the Athens, Greece. +* Bhandari, G., Naseer, A., & Moonen, L. (2021, August). [CVEfixes: automated collection of vulnerabilities and their fixes from open-source software.](https://arxiv.org/abs/2107.08760) Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering. +* Sonnekalb, T., Heinze, T. S., & Mäder, P. (2022). [Deep security analysis of program code: A systematic literature review.](https://link.springer.com/article/10.1007/s10664-021-10029-x) Empirical Softw. Engg., 27(1). +* Le, T. H. M., Hin, D., Croft, R., & Babar, M. A. (2021). [DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning.](http://arxiv.org/abs/2108.08041) +* Senanayake, J., Kalutarage, H., Petrovski, A., Piras, L., & Al-Kadri, M. O. (2024). 
[Defendroid: Real-time Android code vulnerability detection via blockchain federated neural network with XAI.](https://www.sciencedirect.com/science/article/pii/S2214212624000449) Journal of Information Security and Applications, 82, 103741. +* Stefanoni, A., Girdzijauskas, S., Jenkins, C., Kefato, Z. T., Sbattella, L., Scotti, V., & Wåreus, E. (2022). [Detecting Security Patches in Java Projects Using NLP Technology.](https://api.semanticscholar.org/CorpusID:256739262) International Conference on Natural Language and Speech Processing. +* Okutan, A., Mell, P., Mirakhorli, M., Khokhlov, I., Santos, J. C. S., Gonzalez, D., & Simmons, S. (2023). [Empirical Validation of Automated Vulnerability Curation and Characterization.](https://ieeexplore.ieee.org/document/10056768) IEEE Transactions on Software Engineering, 49(5), 3241–3260. +* Wang, J., Cao, L., Luo, X., Zhou, Z., Xie, J., Jatowt, A., & Cai, Y. (2023). [Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation.](http://arxiv.org/abs/2310.16263) +* Bottner, L., Hermann, A., Eppler, J., Thüm, T., & Kargl, F. (2023). [Evaluation of Free and Open Source Tools for Automated Software Composition Analysis.](https://dl.acm.org/doi/abs/10.1145/3631204.3631862) Proceedings of the 7th ACM Computer Science in Cars Symposium. Presented at the Darmstadt, Germany. +* Ganz, T., Härterich, M., Warnecke, A., & Rieck, K. (2021). [Explaining Graph Neural Networks for Vulnerability Discovery.](doi:10.1145/3474369.3486866) Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security, 145–156. Presented at the Virtual Event, Republic of Korea. +* Ram, A., Xin, J., Nagappan, M., Yu, Y., Lozoya, R. C., Sabetta, A., & Lin, J. (2019). [Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits.](http://arxiv.org/abs/1911.07620) +* Rahman, M. M., Watanobe, Y., Shirafuji, A., & Hamada, M. (2023). 
[Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey.](http://arxiv.org/abs/2307.08705) +* Zhang, Y., Song, W., Ji, Z., Danfeng, Yao, & Meng, N. (2023). [How well does LLM generate security tests?](http://arxiv.org/abs/2310.00710) +* Jing, D. (2022). [Improvement of Vulnerable Code Dataset Based on Program Equivalence Transformation.](https://iopscience.iop.org/article/10.1088/1742-6596/2363/1/012010/pdf) Journal of Physics: Conference Series, 2363(1), 012010. +* Wu, Yi, Jiang, N., Pham, H. V., Lutellier, T., Davis, J., Tan, L., … Shah, S. (2023, July). [How Effective Are Neural Networks for Fixing Security Vulnerabilities.](https://arxiv.org/abs/2305.18607) Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. +* Yang, G., Dineen, S., Lin, Z., & Liu, X. (2021). [Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models.](http://arxiv.org/abs/2108.06590) +* Zhou, J., Pacheco, M., Wan, Z., Xia, X., Lo, D., Wang, Y., & Hassan, A. E. (2021). [Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes.](https://ieeexplore.ieee.org/document/9678720) 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 705–716. +* Dunlap, T., Thorn, S., Enck, W., & Reaves, B. (2023). [Finding Fixed Vulnerabilities with Off-the-Shelf Static Analysis.](https://ieeexplore.ieee.org/document/10190493) 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), 489–505. +* Shestov, A., Levichev, R., Mussabayev, R., Maslov, E., Cheshkov, A., & Zadorozhny, P. (2024). [Finetuning Large Language Models for Vulnerability Detection.](http://arxiv.org/abs/2401.17010) +* Scalco, S., & Paramitha, R. (2024). 
[Hash4Patch: A Lightweight Low False Positive Tool for Finding Vulnerability Patch Commits.](https://dl.acm.org/doi/10.1145/3643991.3644871) Proceedings of the 21st International Conference on Mining Software Repositories, 733–737. Presented at the Lisbon, Portugal. +* Nguyen-Truong, G., Kang, H. J., Lo, D., Sharma, A., Santosa, A. E., Sharma, A., & Ang, M. Y. (2022). [HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits.](https://ieeexplore.ieee.org/document/9825835) 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 51–62. +* Wang, J., Luo, X., Cao, L., He, H., Huang, H., Xie, J., … Cai, Y. (2024). [Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval.](http://arxiv.org/abs/2407.02395) +* Tony, C., Mutas, M., Ferreyra, N. E. D., & Scandariato, R. (2023). [LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations.](http://arxiv.org/abs/2303.09384) +* Chen, Z., Kommrusch, S., & Monperrus, M. (2023). [Neural Transfer Learning for Repairing Security Vulnerabilities in C Code.](https://ieeexplore.ieee.org/document/9699412) IEEE Transactions on Software Engineering, 49(1), 147–165. +* Papotti, A., Paramitha, R., & Massacci, F. (2022). [On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools.](http://arxiv.org/abs/2209.07211) +* Mir, A. M., Keshani, M., & Proksch, S. (2024). [On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study.](http://arxiv.org/abs/2402.07294) +* Dietrich, J., Rasheed, S., Jordan, A., & White, T. (2023). [On the Security Blind Spots of Software Composition Analysis.](http://arxiv.org/abs/2306.05534) +* Le, T. H. M., & Babar, M. A. (2022). [On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models.](http://arxiv.org/abs/2203.08417) +* Chapman, J., & Venugopalan, H. (2022). 
[Open Source Software Computed Risk Framework.](https://www.bibsonomy.org/bibtex/1c114d6756c609391db2f66919f237261) 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 172–175. +* Canfora, G., Di Sorbo, A., Forootani, S., Martinez, M., & Visaggio, C. A. (2022). [Patchworking: Exploring the code changes induced by vulnerability fixing activities.](https://www.sciencedirect.com/science/article/abs/pii/S0950584921001932) Information and Software Technology, 142, 106745. +* Garg, S., Moghaddam, R. Z., Sundaresan, N., & Wu, C. (2021). [PerfLens: a data-driven performance bug detection and fix platform.](https://dl.acm.org/doi/10.1145/3460946.3464318) Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, 19–24. Presented at the Virtual, Canada. +* Coskun, T., Halepmollasi, R., Hanifi, K., Fouladi, R. F., De Cnudde, P. C., & Tosun, A. (2022). [Profiling developers to predict vulnerable code changes.](https://dl.acm.org/doi/10.1145/3558489.3559069) Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering, 32–41. Presented at the Singapore, Singapore. +* Bhuiyan, M. H. M., Parthasarathy, A. S., Vasilakis, N., Pradel, M., & Staicu, C.-A. (2023). [SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript.](https://ieeexplore.ieee.org/document/10172577) 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 1059–1070. +* Reis, S., Abreu, R., Erdogmus, H., & Păsăreanu, C. (2022). [SECOM: towards a convention for security commit messages.](https://dl.acm.org/doi/abs/10.1145/3524842.3528513) Proceedings of the 19th International Conference on Mining Software Repositories, 764–765. Presented at the Pittsburgh, Pennsylvania. +* Bennett, G., Hall, T., Winter, E., & Counsell, S. (2024). 
[Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools.](https://dl.acm.org/doi/10.1145/3661167.3661262) Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 614–623. Presented at the Salerno, Italy. +* Chi, J., Qu, Y., Liu, T., Zheng, Q., & Yin, H. (2022). [SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning.](http://arxiv.org/abs/2010.10805) +* Ahmed, A., Said, A., Shabbir, M., & Koutsoukos, X. (2023). [Sequential Graph Neural Networks for Source Code Vulnerability Identification.](http://arxiv.org/abs/2306.05375) +* Sun, J., Xing, Z., Lu, Q., Xu, X., Zhu, L., Hoang, T., & Zhao, D. (2023). [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation.](http://arxiv.org/abs/2302.07445) +* Zhao, L., Chen, S., Xu, Z., Liu, C., Zhang, L., Wu, J., … Liu, Y. (2023). [Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects.](https://dl.acm.org/doi/10.1145/3611643.3616299) Proceedings of the 31st ACM Joint European Software Engineering Conference and * Symposium on the Foundations of Software Engineering, 960–972. Presented at the San Francisco, CA, USA. +* ZHAN, Q., PAN S-Y., HU X., BAO L-F., XIA, X. (2024). [Survey on Vulnerability Awareness of Open Source Software.](https://www.jos.org.cn/josen/article/abstract/6935) Journal of Software, 35(1), 19. +* Li, X., Moreschini, S., Zhang, Z., Palomba, F., & Taibi, D. (2023). [The anatomy of a vulnerability database: A systematic mapping study.](https://www.sciencedirect.com/science/article/pii/S0164121223000742) Journal of Systems and Software, 201, 111679. +* Al Debeyan, F., Madeyski, L., Hall, T., & Bowes, D. (2024). [The impact of hard and easy negative training data on vulnerability prediction performance.](https://www.sciencedirect.com/science/article/pii/S0164121224000463) Journal of Systems and Software, 211, 112003. 
+* Xu, C., Chen, B., Lu, C., Huang, K., Peng, X., & Liu, Y. (2023). [Tracking Patches for Open Source Software Vulnerabilities.](http://arxiv.org/abs/2112.02240) +* Risse, N., & Böhme, M. (2024). [Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection.](http://arxiv.org/abs/2306.17193) +* Nie, X., Li, N., Wang, K., Wang, S., Luo, X., & Wang, H. (2023). [Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper).](https://dl.acm.org/doi/10.1145/3597926.3598037) Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 52–63. Presented at the Seattle, WA, USA. +* Wu, Yulun, Yu, Z., Wen, M., Li, Q., Zou, D., & Jin, H. (2023). [Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem.](https://dl.acm.org/doi/10.1109/ICSE48619.2023.00095) 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 1046–1058. +* Esposito, M., & Falessi, D. (2024). [VALIDATE: A deep dive into vulnerability prediction datasets.](https://dl.acm.org/doi/abs/10.1016/j.infsof.2024.107448) Information and Software Technology, 170, 107448. +* Wang, Shichao, Zhang, Y., Bao, L., Xia, X., & Wu, M. (2022). [VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities.](https://ieeexplore.ieee.org/document/9825908) 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 589–600. +* Sun, Q., Xu, L., Xiao, Y., Li, F., Su, H., Liu, Y., … Huo, W. (2022). [VERJava: Vulnerable Version Identification for Java OSS with a Two-Stage Analysis.](https://ieeexplore.ieee.org/document/9978189) 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), 329–339. +* Nguyen, S., Vu, T. T., & Vo, H. D. (2023). 
[VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification.](http://arxiv.org/abs/2309.01971) +* Piran, A., Chang, C.-P., & Fard, A. M. (2021). [Vulnerability Analysis of Similar Code.](https://ieeexplore.ieee.org/document/9724745) 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), 664–671. +* Keller, P., Plein, L., Bissyandé, T. F., Klein, J., & Traon, Y. L. (2020). [What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning.](http://arxiv.org/abs/2002.02650) +* Akhoundali, J., Nouri, S. R., Rietveld, K., & Gadyatskaya, O. (2024). [MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery.](https://dl.acm.org/doi/10.1145/3663533.3664036) Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering, 42–51. Presented at the Porto de Galinhas, Brazil. ## Star History From 6f8fbe04a1ef0293aad845e3cec0cedb00f3eacf Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Fri, 19 Jul 2024 15:08:16 +0200 Subject: [PATCH 61/83] ToC: removed level 3 heading, excluded Description as description is already on top of the md file --- README.md | 31 +++++++++++++------------------ 1 file changed, 13 insertions(+), 18 deletions(-) diff --git a/README.md b/README.md index 19f491e20..9dc272c85 100644 --- a/README.md +++ b/README.md @@ -10,21 +10,16 @@ [![Pytest](https://github.com/SAP/project-kb/actions/workflows/python.yml/badge.svg)](https://github.com/SAP/project-kb/actions/workflows/python.yml) # Table of contents -1. [Description](#desc) -2. [Motivations](#motiv) -3. [Kaybee](#kaybee) -4. [Prospector](#prosp) -5. [Vulnerability data](#vuldata) -6. [Publications](#publi) -7. [Star history](#starhist) -8. [Credits](#credit) -9. [EU funded research projects](#eu_funded) -10. [Vulnerability data sources](#vul_data) -11. 
[Limitations and known issues](#limit) -12. [Support](#support) -13. [Contributing](#contrib) - -## Description +1. [Kaybee](#kaybee) +2. [Prospector](#prosp) +3. [Vulnerability data](#vuldata) +4. [Publications](#publi) +5. [Star history](#starhist) +6. [Limitations and known issues](#limit) +7. [Support](#support) +8. [Contributing](#contrib) + +## Description The goal of `Project KB` is to enable the creation, management and aggregation of a distributed, collaborative knowledge base of vulnerabilities affecting @@ -34,7 +29,7 @@ open-source software. as well as set of tools to support the mining, curation and management of such data. -### Motivations +### Motivations In order to feed [Eclipse Steady](https://github.com/eclipse/steady/) with fresh data, we have spent a considerable amount of time, in the past few years, mining @@ -213,7 +208,7 @@ ___ ## Credits -### EU-funded research projects +### EU-funded research projects The development of Project KB is partially supported by the following projects: @@ -221,7 +216,7 @@ The development of Project KB is partially supported by the following projects: * [AssureMOSS](https://assuremoss.eu) (Grant No. 952647). * [Sparta](https://www.sparta.eu/) (Grant No. 830892). -### Vulnerability data sources +### Vulnerability data sources Vulnerability information from NVD and MITRE might have been used as input for building parts of this knowledge base. See MITRE's [CVE Usage license](http://cve.mitre.org/about/termsofuse.html) for more information. From 7c9630ba74ade34062d334403f229787890303df Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Fri, 19 Jul 2024 15:53:12 +0200 Subject: [PATCH 62/83] Added two more papers --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 9dc272c85..fa76ff963 100644 --- a/README.md +++ b/README.md @@ -118,6 +118,8 @@ ___ **Papers citing our work** * Aladics, T., Hegedüs, P., & Ferenc, R. (2022). 
[A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach.](https://api.semanticscholar.org/CorpusID:250566828) International Conference on Software and Data Technologies * Bui, Q.-C., Scandariato, R., & Ferreyra, N. E. D. (2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques.](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) Proceedings of the 19th International Conference on Mining Software Repositories, 464–468. +* S. R. Tate, M. Bollinadi, and J. Moore. (2020). [Characterizing Vulnerabilities in a Major Linux Distribution](https://home.uncg.edu/cmp/faculty/srtate/pubs/vulnerabilities/Vulnerabilities-SEKE2020.pdf) 32nd International Conference on Software Engineering \& Knowledge Engineering (SEKE), pp. 538-543. +* Galvão, P. (2022). [Analysis and Aggregation of Vulnerability Databases with Code-Level Data. Dissertation de Master's Degree.](https://repositorio-aberto.up.pt/bitstream/10216/144796/2/588886.pdf) Faculdade de Engenharia da Universidade do Porto. * Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2022). [A Survey on Machine Learning Techniques for Source Code Analysis.](http://arxiv.org/abs/2110.09610) * Hommersom, D., Sabetta, A., Coppola, B., Nucci, D. D., & Tamburri, D. A. (2024). [Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories.](https://dl.acm.org/doi/10.1145/3649590) ACM Trans. Softw. Eng. Methodol., 33(5). * Marchand-Melsom, A., & Nguyen Mai, D. B. (2020). [Automatic repair of OWASP Top 10 security vulnerabilities: A survey.](https://dl.acm.org/doi/10.1145/3387940.3392200) Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 23–30. Presented at the Seoul, Republic of Korea. 
From 97a2e6814873a3322b8fe1702721c36daeb313eb Mon Sep 17 00:00:00 2001
From: Adrien Linares <76013394+adlina1@users.noreply.github.com>
Date: Fri, 19 Jul 2024 16:36:10 +0200
Subject: [PATCH 63/83] One paper added

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index fa76ff963..df8d946ca 100644
--- a/README.md
+++ b/README.md
@@ -104,6 +104,7 @@ scripts described in that paper](MSR2019)
 ___
 
 **Our papers related to Project KB**
+* Sabetta, A., Ponta, S. E., Cabrera Lozoya, R., Bezzi, M., Sacchetti, T., Greco, M., … Massacci, F. (2024). [Known Vulnerabilities of Open Source Projects: Where Are the Fixes?](https://ieeexplore.ieee.org/document/10381645) IEEE Security & Privacy, 22(2), 49–59.
 * Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625.
 * Cabrera Lozoya, R., Baumann, A., Sabetta, A., & Bezzi, M. (2021). [Commit2Vec: Learning Distributed Representations of Code Changes.](https://link.springer.com/article/10.1007/s42979-021-00566-z) SN Computer Science, 2(3).
 * Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering

From 91f3e94b52ee18582bce01a4df80abb88c0c5ec0 Mon Sep 17 00:00:00 2001
From: Adrien Linares <76013394+adlina1@users.noreply.github.com>
Date: Fri, 19 Jul 2024 22:36:59 +0200
Subject: [PATCH 64/83] Corrected reference date of a paper

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index df8d946ca..340f59104 100644
--- a/README.md
+++ b/README.md
@@ -104,10 +104,10 @@ scripts described in that paper](MSR2019)
 ___
 
 **Our papers related to Project KB**
-* Sabetta, A., Ponta, S. E., Cabrera Lozoya, R., Bezzi, M., Sacchetti, T., Greco, M., … Massacci, F. (2024). [Known Vulnerabilities of Open Source Projects: Where Are the Fixes?](https://ieeexplore.ieee.org/document/10381645) IEEE Security & Privacy, 22(2), 49–59.
+* Sabetta, A., Ponta, S. E., Cabrera Lozoya, R., Bezzi, M., Sacchetti, T., Greco, M., … Massacci, F. (2024). [Known Vulnerabilities of Open Source Projects: Where Are the Fixes?](https://ieeexplore.ieee.org/document/10381645) IEEE Security & Privacy, 22(2), 49–59.
+* Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2024). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
 * Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625.
 * Cabrera Lozoya, R., Baumann, A., Sabetta, A., & Bezzi, M. (2021). [Commit2Vec: Learning Distributed Representations of Code Changes.](https://link.springer.com/article/10.1007/s42979-021-00566-z) SN Computer Science, 2(3).
-* Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2021). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
 * Ponta, S. E., Fischer, W., Plate, H., & Sabetta, A. (2021). [The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application.](https://www.computer.org/csdl/proceedings-article/icsme/2021/288200a555/1yNhfKb2TBe) 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)
 * Iannone, E., Nucci, D. D., Sabetta, A., & De Lucia, A. (2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries.](https://ieeexplore.ieee.org/document/9462983) 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), 396–400.
 * Ponta, S. E., Plate, H., & Sabetta, A. (2020). [Detection, assessment and mitigation of vulnerabilities in open source dependencies.](https://api.semanticscholar.org/CorpusID:220259876) Empirical Software Engineering, 25, 3175–3215.

From e775302e7d12cc914ff0d46e4081718d4d1740d7 Mon Sep 17 00:00:00 2001
From: Antonino Sabetta
Date: Tue, 23 Jul 2024 17:34:22 +0200
Subject: [PATCH 65/83] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 340f59104..46867bb78 100644
--- a/README.md
+++ b/README.md
@@ -185,7 +185,7 @@ ___
 * Coskun, T., Halepmollasi, R., Hanifi, K., Fouladi, R. F., De Cnudde, P. C., & Tosun, A. (2022). [Profiling developers to predict vulnerable code changes.](https://dl.acm.org/doi/10.1145/3558489.3559069) Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering, 32–41. Presented at the Singapore, Singapore.
 * Bhuiyan, M. H. M., Parthasarathy, A. S., Vasilakis, N., Pradel, M., & Staicu, C.-A. (2023). [SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript.](https://ieeexplore.ieee.org/document/10172577) 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 1059–1070.
 * Reis, S., Abreu, R., Erdogmus, H., & Păsăreanu, C. (2022). [SECOM: towards a convention for security commit messages.](https://dl.acm.org/doi/abs/10.1145/3524842.3528513) Proceedings of the 19th International Conference on Mining Software Repositories, 764–765. Presented at the Pittsburgh, Pennsylvania.
-* Bennett, G., Hall, T., Winter, E., & Counsell, S. (2024). [Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools.](https://dl.acm.org/doi/10.1145/3661167.3661262) Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 614–623. Presented at the Salerno, Italy.
+* Bennett, G., Hall, T., Winter, E., & Counsell, S. (2024). [Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools.](https://dl.acm.org/doi/10.1145/3661167.3661262) Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 614–623, Salerno, Italy.
 * Chi, J., Qu, Y., Liu, T., Zheng, Q., & Yin, H. (2022). [SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning.](http://arxiv.org/abs/2010.10805)
 * Ahmed, A., Said, A., Shabbir, M., & Koutsoukos, X. (2023). [Sequential Graph Neural Networks for Source Code Vulnerability Identification.](http://arxiv.org/abs/2306.05375)
 * Sun, J., Xing, Z., Lu, Q., Xu, X., Zhu, L., Hoang, T., & Zhao, D. (2023). [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation.](http://arxiv.org/abs/2302.07445)

From 2e99112dcc3a3b1d986ff0ff2da3656c3020d268 Mon Sep 17 00:00:00 2001
From: Antonino Sabetta
Date: Tue, 23 Jul 2024 17:35:22 +0200
Subject: [PATCH 66/83] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 46867bb78..e5cbd05c8 100644
--- a/README.md
+++ b/README.md
@@ -103,7 +103,7 @@ scripts described in that paper](MSR2019)
 
 ___
 
-**Our papers related to Project KB**
+###Our papers related to Project KB
 * Sabetta, A., Ponta, S. E., Cabrera Lozoya, R., Bezzi, M., Sacchetti, T., Greco, M., … Massacci, F. (2024). [Known Vulnerabilities of Open Source Projects: Where Are the Fixes?](https://ieeexplore.ieee.org/document/10381645) IEEE Security & Privacy, 22(2), 49–59.
 * Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2024). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
 * Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625.
@@ -116,7 +116,7 @@ ___
 
 
 
-**Papers citing our work**
+###Papers citing our work
 * Aladics, T., Hegedüs, P., & Ferenc, R. (2022). [A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach.](https://api.semanticscholar.org/CorpusID:250566828) International Conference on Software and Data Technologies
 * Bui, Q.-C., Scandariato, R., & Ferreyra, N. E. D. (2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques.](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) Proceedings of the 19th International Conference on Mining Software Repositories, 464–468.
 * S. R. Tate, M. Bollinadi, and J. Moore. (2020). [Characterizing Vulnerabilities in a Major Linux Distribution](https://home.uncg.edu/cmp/faculty/srtate/pubs/vulnerabilities/Vulnerabilities-SEKE2020.pdf) 32nd International Conference on Software Engineering \& Knowledge Engineering (SEKE), pp. 538-543.

From 9733ac526bcfbf0134f5c530978cf2211b6a37a4 Mon Sep 17 00:00:00 2001
From: Antonino Sabetta
Date: Tue, 23 Jul 2024 17:35:51 +0200
Subject: [PATCH 67/83] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e5cbd05c8..46b7a484a 100644
--- a/README.md
+++ b/README.md
@@ -103,7 +103,7 @@ scripts described in that paper](MSR2019)
 
 ___
 
-###Our papers related to Project KB
+### Our papers related to Project KB
 * Sabetta, A., Ponta, S. E., Cabrera Lozoya, R., Bezzi, M., Sacchetti, T., Greco, M., … Massacci, F. (2024). [Known Vulnerabilities of Open Source Projects: Where Are the Fixes?](https://ieeexplore.ieee.org/document/10381645) IEEE Security & Privacy, 22(2), 49–59.
 * Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2024). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
 * Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625.
@@ -116,7 +116,7 @@ ___
 
 
 
-###Papers citing our work
+### Papers citing our work
 * Aladics, T., Hegedüs, P., & Ferenc, R.
(2022). [A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach.](https://api.semanticscholar.org/CorpusID:250566828) International Conference on Software and Data Technologies * Bui, Q.-C., Scandariato, R., & Ferreyra, N. E. D. (2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques.](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) Proceedings of the 19th International Conference on Mining Software Repositories, 464–468. * S. R. Tate, M. Bollinadi, and J. Moore. (2020). [Characterizing Vulnerabilities in a Major Linux Distribution](https://home.uncg.edu/cmp/faculty/srtate/pubs/vulnerabilities/Vulnerabilities-SEKE2020.pdf) 32nd International Conference on Software Engineering \& Knowledge Engineering (SEKE), pp. 538-543. From c5e577e7981a6bfa74c029e05a9de1d8a7ba53e5 Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Tue, 23 Jul 2024 10:37:30 +0200 Subject: [PATCH 68/83] Tool taking bibtex file and giving output md rf output md rf = output markdown reference --- scripts/bib2md.py | 163 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 163 insertions(+) create mode 100644 scripts/bib2md.py diff --git a/scripts/bib2md.py b/scripts/bib2md.py new file mode 100644 index 000000000..d3f870b57 --- /dev/null +++ b/scripts/bib2md.py @@ -0,0 +1,163 @@ +# Install the library before, with the following command: +# pip install bibtexparser --pre + +# to run on the CLI: +# python bib2md.py -f reference_file.bib -ord desc|asc + +import bibtexparser +import sys +import argparse + +def format_simple(entry_str, order): + library = bibtexparser.parse_string(entry_str) + formatted_entries = [] + unprocessed_entries = [] + + for entry in library.entries: + try: + authors = entry['author'].split(' and ') + if len(authors) > 1: + authors[-1] = 'and ' + authors[-1] + + authors_formatted = ', '.join([a.replace('\n', ' ').strip() for a in 
authors]) + title = entry['title'] + year = int(entry['year']) + venue = entry.get('journal') or entry.get('booktitle') or entry.get('archivePrefix') + + if not venue: + id_unprocessed = "[" + entry.key + " - " + entry.entry_type + "]" + unprocessed_entries.append(id_unprocessed) + continue + + formatted_entries.append((year, f"{authors_formatted}. {title}. {venue.value}. ({year}).")) + + except KeyError as e: + print(f"One or more necessary fields {str(e)} not present in this BibTeX entry.") + continue + + if order=='asc': + formatted_entries.sort(key=lambda x: x[0]) + elif order=='desc': + formatted_entries.sort(key=lambda x: x[0], reverse=True) + + if len(unprocessed_entries) > 0: + print('Warning: Some entries were not processed due to unknown type', file=sys.stderr) + print("List of unprocessed entrie(s): ", unprocessed_entries) + + return [entry[1] for entry in formatted_entries] + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument('-f', '--file', type=str, + help='a .bib file as argument', required=True) + parser.add_argument('-ord', '--order', type=str, + choices=['asc', 'desc'], + help='here we set a sort order. We have the choice between "asc" and "desc"', + required=True) + args = parser.parse_args() + + with open(args.file, 'r') as bibtex_file: + bibtex_str = bibtex_file.read() + + apa_citations = format_simple(bibtex_str, args.order) + for citation in apa_citations: + print() + print(citation) + +if __name__ == "__main__": + main() + + +# bibtex_str = """ +# @comment{ +# This is my example comment. +# } + +# @ARTICLE{Cesar2013, +# author = {Jean César}, +# title = {An amazing title}, +# year = {2013}, +# volume = {12}, +# pages = {12--23}, +# journal = {Nice Journal} +# } + +# @article{CitekeyArticle, +# author = "P. J. 
Cohen", +# title = "The independence of the continuum hypothesis", +# journal = "Proceedings of the National Academy of Sciences", +# year = 1963, +# volume = "50", +# number = "6", +# pages = "1143--1148", +# } + +# @misc{sharma2022surveymachinelearningtechniques, +# title={A Survey on Machine Learning Techniques for Source Code Analysis}, +# author={Tushar Sharma and Maria Kechagia and Stefanos Georgiou and Rohit Tiwari and Indira Vats and Hadi Moazen and Federica Sarro}, +# year={2022}, +# eprint={2110.09610}, +# archivePrefix={arXiv}, +# primaryClass={cs.SE}, +# url={https://arxiv.org/abs/2110.09610}, +# } + +# @inproceedings{10.1145/3593434.3593481, +# author = {Reis, Sofia and Abreu, Rui and Pasareanu, Corina}, +# title = {Are security commit messages informative? Not enough!}, +# year = {2023}, +# isbn = {9798400700446}, +# publisher = {Association for Computing Machinery}, +# address = {New York, NY, USA}, +# url = {https://doi.org/10.1145/3593434.3593481}, +# doi = {10.1145/3593434.3593481}, +# abstract = {The fast distribution and deployment of security patches are important to protect users against cyberattacks...}, +# booktitle = {Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering}, +# pages = {196–199}, +# numpages = {4}, +# keywords = {Security, Patch Management Process, Convention, Commit Messages, Best Practices}, +# location = {Oulu, Finland}, +# series = {EASE '23} +# } + +# @inproceedings{lee-chieu-2021-co, +# title = "Co-training for Commit Classification", +# author = "Lee, Jian Yi David and +# Chieu, Hai Leong", +# editor = "Xu, Wei and +# Ritter, Alan and +# Baldwin, Tim and +# Rahimi, Afshin", +# booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)", +# month = nov, +# year = "2021", +# address = "Online", +# publisher = "Association for Computational Linguistics", +# url = "https://aclanthology.org/2021.wnut-1.43", +# doi = "10.18653/v1/2021.wnut-1.43", 
+# pages = "389--395", +# abstract = "Commits in version control systems (e.g. Git) track changes in a software project. Commits comprise noisy user-generated natural language and code patches. Automatic commit classification (CC) has been used to determine the type of code maintenance activities performed, as well as to detect bug fixes in code repositories. Much prior work occurs in the fully-supervised setting {--} a setting that can be a stretch in resource-scarce situations presenting difficulties in labeling commits. In this paper, we apply co-training, a semi-supervised learning method, to take advantage of the two views available {--} the commit message (natural language) and the code changes (programming language) {--} to improve commit classification.", +# } + +# @misc{ponta2021usedbloatedvulnerablereducing, +# title={The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application}, +# author={Serena Elisa Ponta and Wolfram Fischer and Henrik Plate and Antonino Sabetta}, +# year={2021}, +# eprint={2108.05115}, +# archivePrefix={arXiv}, +# primaryClass={cs.SE}, +# url={https://arxiv.org/abs/2108.05115}, +# } + +# @misc{aladics2023astbasedcodechangerepresentation, +# title={An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction}, +# author={Tamás Aladics and Péter Hegedűs and Rudolf Ferenc}, +# year={2023}, +# eprint={2303.16591}, +# archivePrefix={arXiv}, +# primaryClass={cs.SE}, +# url={https://arxiv.org/abs/2303.16591}, +# } + From 5effcec7e6409dbc5e4e15ea2f8f453b529438a7 Mon Sep 17 00:00:00 2001 From: Linares Date: Tue, 23 Jul 2024 18:29:51 +0200 Subject: [PATCH 69/83] Refactored our bibtex converter file, added a requirement file --- scripts/bib2md.py | 77 +++++++++++++++++++++------------------ scripts/requirements.txt | Bin 0 -> 9826 bytes 2 files changed, 42 insertions(+), 35 deletions(-) create mode 100644 scripts/requirements.txt diff --git a/scripts/bib2md.py 
b/scripts/bib2md.py index d3f870b57..dc5fb4820 100644 --- a/scripts/bib2md.py +++ b/scripts/bib2md.py @@ -2,74 +2,81 @@ # pip install bibtexparser --pre # to run on the CLI: -# python bib2md.py -f reference_file.bib -ord desc|asc +# python bib2md.py your_referenceFile.bib +# default order: desc. +# To change add at the end of your command: -ord "asc" import bibtexparser import sys import argparse +import html + +def process_entry(entry): + try: + authors = entry['author'].split(' and ') + if len(authors) > 1: + authors[-1] = 'and ' + authors[-1] + + authors_formatted = ', '.join([a.replace('\n', ' ').strip() for a in authors]) + title = html.unescape(entry['title']) + year = int(entry['year']) + venue = entry.get('journal') or entry.get('booktitle') or entry.get('archivePrefix') + + if not venue: + id_unprocessed = "[" + entry.key + " - " + entry.entry_type + "]" + return None, id_unprocessed + + return (year, f"{authors_formatted}. {title}. {venue.value}. ({year})."), None + + except KeyError as e: + print(f"One or more necessary fields {str(e)} not present in this BibTeX entry.") + return None, None -def format_simple(entry_str, order): +def format_simple(entry_str, order='desc'): library = bibtexparser.parse_string(entry_str) formatted_entries = [] unprocessed_entries = [] for entry in library.entries: - try: - authors = entry['author'].split(' and ') - if len(authors) > 1: - authors[-1] = 'and ' + authors[-1] - - authors_formatted = ', '.join([a.replace('\n', ' ').strip() for a in authors]) - title = entry['title'] - year = int(entry['year']) - venue = entry.get('journal') or entry.get('booktitle') or entry.get('archivePrefix') - - if not venue: - id_unprocessed = "[" + entry.key + " - " + entry.entry_type + "]" - unprocessed_entries.append(id_unprocessed) - continue - - formatted_entries.append((year, f"{authors_formatted}. {title}. {venue.value}. 
({year}).")) + processed_entry, unprocessed_entry = process_entry(entry) + if processed_entry: + formatted_entries.append(processed_entry) + elif unprocessed_entry: + unprocessed_entries.append(unprocessed_entry) - except KeyError as e: - print(f"One or more necessary fields {str(e)} not present in this BibTeX entry.") - continue - - if order=='asc': + if order == 'asc': formatted_entries.sort(key=lambda x: x[0]) - elif order=='desc': + elif order == 'desc': formatted_entries.sort(key=lambda x: x[0], reverse=True) if len(unprocessed_entries) > 0: - print('Warning: Some entries were not processed due to unknown type', file=sys.stderr) - print("List of unprocessed entrie(s): ", unprocessed_entries) + print('Warning: Some entries were not processed due to unknown type', file=sys.stderr) + print("List of unprocessed entrie(s):", unprocessed_entries) return [entry[1] for entry in formatted_entries] def main(): parser = argparse.ArgumentParser() - - parser.add_argument('-f', '--file', type=str, - help='a .bib file as argument', required=True) + + parser.add_argument('file', type=str, help='a .bib file as argument') parser.add_argument('-ord', '--order', type=str, choices=['asc', 'desc'], help='here we set a sort order. We have the choice between "asc" and "desc"', - required=True) + default='desc', required=False) args = parser.parse_args() - with open(args.file, 'r') as bibtex_file: + with open(args.file, 'r', encoding='utf-8') as bibtex_file: bibtex_str = bibtex_file.read() - apa_citations = format_simple(bibtex_str, args.order) - for citation in apa_citations: + citations = format_simple(bibtex_str, args.order) + for cit in citations: print() - print(citation) + print(cit) if __name__ == "__main__": main() - # bibtex_str = """ # @comment{ # This is my example comment. 
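The author-joining and year-sorting logic that this patch splits into `process_entry` and `format_simple` can be tried without installing `bibtexparser`. The following is a minimal sketch under stated assumptions, not the script itself: entries are plain dicts (a hypothetical stand-in for parsed BibTeX entries, so the venue is a plain string), and the helper names `format_authors` and `format_entries` are illustrative only.

```python
def format_authors(author_field):
    """Turn a BibTeX author field ("A and B and C") into "A, B, and C"."""
    authors = [a.replace("\n", " ").strip() for a in author_field.split(" and ")]
    if len(authors) > 1:
        # Same convention as the patch: prefix the last author with "and".
        authors[-1] = "and " + authors[-1]
    return ", ".join(authors)


def format_entries(entries, order="desc"):
    """Format dict entries (assumed shape) and sort them by year."""
    formatted = []
    for entry in entries:
        # Same venue fallback chain as the patch: journal, booktitle, archivePrefix.
        venue = entry.get("journal") or entry.get("booktitle") or entry.get("archivePrefix")
        if not venue:
            continue  # the patch collects these and reports them as unprocessed
        year = int(entry["year"])
        text = f"{format_authors(entry['author'])}. {entry['title']}. {venue}. ({year})."
        formatted.append((year, text))
    formatted.sort(key=lambda item: item[0], reverse=(order == "desc"))
    return [text for _, text in formatted]


if __name__ == "__main__":
    sample = [
        {"author": "P. J. Cohen", "title": "The independence of the continuum hypothesis",
         "journal": "Proceedings of the National Academy of Sciences", "year": 1963},
        {"author": "Jean César", "title": "An amazing title",
         "journal": "Nice Journal", "year": "2013"},
    ]
    for citation in format_entries(sample):
        print(citation)
```

Sorting on the integer year rather than on the formatted string keeps the order stable regardless of author names, which matches the `(year, text)` tuples used in the patch.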
diff --git a/scripts/requirements.txt b/scripts/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..2facbe8e8799cd0d0ea387b4f701db443d73c4ee
GIT binary patch
literal 9826
[base85-encoded binary payload omitted]

Date: Wed, 24 Jul 2024 14:22:31 +0200
Subject: [PATCH 70/83] Bibtex file for references to project-KB
---
 all_references.bib | 1218 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1218 insertions(+)
 create mode 100644 all_references.bib

diff --git a/all_references.bib b/all_references.bib
new file mode 100644
index 000000000..cb6d35b60
--- /dev/null
+++ b/all_references.bib
@@ -0,0 +1,1218 @@
+@inproceedings{Aladics2022AVI,
+ title={A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach},
+ author={Tam{\'a}s Aladics and P{\'e}ter Heged{\"u}s and Rudolf Ferenc},
+ booktitle={International Conference on Software and Data Technologies},
+ year={2022},
+ url={https://api.semanticscholar.org/CorpusID:250566828}
+}
+
+@inproceedings{10.1145/3524842.3528482,
+author = {Bui, Quang-Cuong and Scandariato, Riccardo and Ferreyra, Nicol\'{a}s E.
D\'{\i}az}, +title = {Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques}, +year = {2022}, +isbn = {9781450393034}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3524842.3528482}, +doi = {10.1145/3524842.3528482}, +abstract = {In this work we present Vul4J, a Java vulnerability dataset where each vulnerability is associated to a patch and, most importantly, to a Proof of Vulnerability (PoV) test case. We analyzed 1803 fix commits from 912 real-world vulnerabilities in the Project KB knowledge base to extract the reproducible vulnerabilities, i.e., vulnerabilities that can be triggered by one or more PoV test cases. To this aim, we ran the test suite of the application in both, the vulnerable and secure versions, to identify the corresponding PoVs. Furthermore, if no PoV test case was spotted, then we wrote it ourselves. As a result, Vul4J includes 79 reproducible vulnerabilities from 51 open-source projects, spanning 25 different Common Weakness Enumeration (CWE) types. To the extent of our knowledge, this is the first dataset of its kind created for Java. Particularly, it targets the study of Automated Program Repair (APR) tools, where PoVs are often necessary in order to identify plausible patches. 
We made our dataset and related tools publically available on GitHub.}, +booktitle = {Proceedings of the 19th International Conference on Mining Software Repositories}, +pages = {464–468}, +numpages = {5}, +keywords = {java, program repair, vulnerability}, +location = {Pittsburgh, Pennsylvania}, +series = {MSR '22} +} + +@misc{sharma2022surveymachinelearningtechniques, + title={A Survey on Machine Learning Techniques for Source Code Analysis}, + author={Tushar Sharma and Maria Kechagia and Stefanos Georgiou and Rohit Tiwari and Indira Vats and Hadi Moazen and Federica Sarro}, + year={2022}, + eprint={2110.09610}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2110.09610}, +} + +@article{10.1145/3649590, +author = {Hommersom, Daan and Sabetta, Antonino and Coppola, Bonaventura and Nucci, Dario Di and Tamburri, Damian A.}, +title = {Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories}, +year = {2024}, +issue_date = {June 2024}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {33}, +number = {5}, +issn = {1049-331X}, +url = {https://doi.org/10.1145/3649590}, +doi = {10.1145/3649590}, +abstract = {The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this article, we present an approach that combines heuristics stemming from practical experience and machine-learning (ML)—specifically, natural language processing (NLP)—to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability that is extracted from an advisory, such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. 
Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project, by filtering out commits that can be identified as unrelated to the vulnerability at hand. Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fixing commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to easily interpret the predictions.We implemented our approach and we evaluated it on an open data set, built by manual curation, that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03\% of the vulnerabilities (with a fix commit on the first position for 65.06\% of the vulnerabilities). Our evaluation shows that our method can reduce considerably the manual effort needed to search open-source software (OSS) repositories for the commits that fix known vulnerabilities.}, +journal = {ACM Trans. Softw. Eng. 
Methodol.}, +month = {jun}, +articleno = {134}, +numpages = {28}, +keywords = {Open source software, software security, common vulnerabilities and exposures (CVE), National Vulnerability Database (NVD), mining software repositories, code-level vulnerability data, machine learning applied to software security} +} + +@inproceedings{10.1145/3387940.3392200, +author = {Marchand-Melsom, Alexander and Nguyen Mai, Duong Bao}, +title = {Automatic repair of OWASP Top 10 security vulnerabilities: A survey}, +year = {2020}, +isbn = {9781450379632}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3387940.3392200}, +doi = {10.1145/3387940.3392200}, +abstract = {Current work on automatic program repair has not focused on actually prevalent vulnerabilities in web applications, such as described in the OWASP Top 10 categories, leading to a scarcely explored field, which in turn leads to a gap between industry needs and research efforts. In order to assess the extent of this gap, we have surveyed and analyzed the literature on fully automatic source-code manipulating program repair of OWASP Top 10 vulnerabilities, as well as their corresponding test suites. We find that there is a significant gap in the coverage of the OWASP Top 10 vulnerabilities, and that the test suites used to test the analyzed approaches are highly inadequate. 
Few approaches cover multiple OWASP Top 10 vulnerabilities, and there is no combination of existing test suites that achieves a total coverage of OWASP Top 10.}, +booktitle = {Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops}, +pages = {23–30}, +numpages = {8}, +keywords = {OWASP Top 10, automatic program repair, survey}, +location = {Seoul, Republic of Korea}, +series = {ICSEW'20} +} + +@misc{sawadogo2021earlydetectionsecurityrelevantbug, + title={Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We?}, + author={Arthur D. Sawadogo and Quentin Guimard and Tegawendé F. Bissyandé and Abdoul Kader Kaboré and Jacques Klein and Naouel Moha}, + year={2021}, + eprint={2112.10123}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2112.10123}, +} + +@misc{sun2023exploringsecuritycommitspython, + title={Exploring Security Commits in Python}, + author={Shiyu Sun and Shu Wang and Xinda Wang and Yunlong Xing and Elisa Zhang and Kun Sun}, + year={2023}, + eprint={2307.11853}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2307.11853}, +} + +@misc{reis2021fixingvulnerabilitiespotentiallyhinders, + title={Fixing Vulnerabilities Potentially Hinders Maintainability}, + author={Sofia Reis and Rui Abreu and Luis Cruz}, + year={2021}, + eprint={2106.03271}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2106.03271}, +} + +@inproceedings{vem, + author = {Rodrigo Andrade and Vinícius Santos}, + title = { Investigating vulnerability datasets}, + booktitle = {Anais do IX Workshop de Visualização, Evolução e Manutenção de Software}, + location = {Joinville}, + year = {2021}, + keywords = {}, + issn = {0000-0000}, + pages = {26--30}, + publisher = {SBC}, + address = {Porto Alegre, RS, Brasil}, + doi = {10.5753/vem.2021.17213}, + url = {https://sol.sbc.org.br/index.php/vem/article/view/17213} +} + 
+@misc{nguyen2023multigranularitydetectorvulnerabilityfixes, + title={Multi-Granularity Detector for Vulnerability Fixes}, + author={Truong Giang Nguyen and Thanh Le-Cong and Hong Jin Kang and Ratnadira Widyasari and Chengran Yang and Zhipeng Zhao and Bowen Xu and Jiayuan Zhou and Xin Xia and Ahmed E. Hassan and Xuan-Bach D. Le and David Lo}, + year={2023}, + eprint={2305.13884}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2305.13884}, +} + +@inproceedings{10.1145/3549035.3561184, +author = {Siddiq, Mohammed Latif and Santos, Joanna C. S.}, +title = {SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques}, +year = {2022}, +isbn = {9781450394574}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3549035.3561184}, +doi = {10.1145/3549035.3561184}, +abstract = {Automated source code generation is currently a popular machine-learning-based task. It can be helpful for software developers to write functionally correct code from a given context. However, just like human developers, a code generation model can produce vulnerable code, which the developers can mistakenly use. For this reason, evaluating the security of a code generation model is a must. In this paper, we describe SecurityEval, an evaluation dataset to fulfill this purpose. It contains 130 samples for 75 vulnerability types, which are mapped to the Common Weakness Enumeration (CWE). 
We also demonstrate using our dataset to evaluate one open-source (i.e., InCoder) and one closed-source code generation model (i.e., GitHub Copilot).}, +booktitle = {Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security}, +pages = {29–33}, +numpages = {5}, +keywords = {security, dataset, common weakness enumeration, code generation}, +location = {Singapore, Singapore}, +series = {MSR4P&S 2022} +} + +@misc{sawadogo2020learningcatchsecuritypatches, + title={Learning to Catch Security Patches}, + author={Arthur D. Sawadogo and Tegawendé F. Bissyandé and Naouel Moha and Kevin Allix and Jacques Klein and Li Li and Yves Le Traon}, + year={2020}, + eprint={2001.09148}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2001.09148}, +} + +@misc{dunlap2023vfcfinderseamlesslypairingsecurity, + title={VFCFinder: Seamlessly Pairing Security Advisories and Patches}, + author={Trevor Dunlap and Elizabeth Lin and William Enck and Bradley Reaves}, + year={2023}, + eprint={2311.01532}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2311.01532}, +} + +@misc{dunlap2023vfcfinderseamlesslypairingsecurity, + title={VFCFinder: Seamlessly Pairing Security Advisories and Patches}, + author={Trevor Dunlap and Elizabeth Lin and William Enck and Bradley Reaves}, + year={2023}, + eprint={2311.01532}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2311.01532}, +} + +@inproceedings{10.1145/3510003.3510113, +author = {Bao, Lingfeng and Xia, Xin and Hassan, Ahmed E. 
and Yang, Xiaohu}, +title = {V-SZZ: automatic identification of version ranges affected by CVE vulnerabilities}, +year = {2022}, +isbn = {9781450392211}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3510003.3510113}, +doi = {10.1145/3510003.3510113}, +abstract = {Vulnerabilities publicly disclosed in the National Vulnerability Database (NVD) are assigned with CVE (Common Vulnerabilities and Exposures) IDs and associated with specific software versions. Many organizations, including IT companies and government, heavily rely on the disclosed vulnerabilities in NVD to mitigate their security risks. Once a software is claimed as vulnerable by NVD, these organizations would examine the presence of the vulnerable versions of the software and assess the impact on themselves. However, the version information about vulnerable software in NVD is not always reliable. Nguyen et al. find that the version information of many CVE vulnerabilities is spurious and propose an approach based on the original SZZ algorithm (i.e., an approach to identify bug-introducing commits) to assess the software versions affected by CVE vulnerabilities.However, SZZ algorithms are designed for common bugs, while vulnerabilities and bugs are different. Many bugs are introduced by a recent bug-fixing commit, but vulnerabilities are usually introduced in their initial versions. Thus, the current SZZ algorithms often fail to identify the inducing commits for vulnerabilities. Therefore, in this study, we propose an approach based on an improved SZZ algorithm to refine software versions affected by CVE vulnerabilities. 
Our proposed SZZ algorithm leverages the line mapping algorithms to identify the earliest commit that modified the vulnerable lines, and then considers these commits to be the vulnerability-inducing commits, as opposed to the previous SZZ algorithms that assume the commits that last modified the buggy lines as the inducing commits. To evaluate our proposed approach, we manually annotate the true inducing commits and verify the vulnerable versions for 172 CVE vulnerabilities with fixing commits from two publicly available datasets with five C/C++ and 41 Java projects, respectively. We find that 99 out of 172 vulnerabilities whose version information is spurious. The experiment results show that our proposed approach can identify more vulnerabilities with the true inducing commits and correct vulnerable versions than the previous SZZ algorithms. Our approach outperforms the previous SZZ algorithms in terms of F1-score for identifying vulnerability-inducing commits on both C/C++ and Java projects (0.736 and 0.630, respectively). For refining vulnerable versions, our approach also achieves the best performance on the two datasets in terms of F1-score (0.928 and 0.952).}, +booktitle = {Proceedings of the 44th International Conference on Software Engineering}, +pages = {2352–2364}, +numpages = {13}, +keywords = {CVE, SZZ, vulnerability}, +location = {Pittsburgh, Pennsylvania}, +series = {ICSE '22} +} + +@inproceedings{10.1145/3379597.3387501, +author = {Fan, Jiahao and Li, Yi and Wang, Shaohua and Nguyen, Tien N.}, +title = {A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries}, +year = {2020}, +isbn = {9781450375177}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3379597.3387501}, +doi = {10.1145/3379597.3387501}, +abstract = {We collected a large C/C++ code vulnerability dataset from open-source Github projects, namely Big-Vul. 
We crawled the public Common Vulnerabilities and Exposures (CVE) database and CVE-related source code repositories. Specifically, we collected the descriptive information of the vulnerabilities from the CVE database, e.g., CVE IDs, CVE severity scores, and CVE summaries. With the CVE information and its related published Github code repository links, we downloaded all of the code repositories and extracted vulnerability related code changes. In total, Big-Vul contains 3,754 code vulnerabilities spanning 91 different vulnerability types. All these code vulnerabilities are extracted from 348 Github projects. All information is stored in the CSV format. We linked the code changes with the CVE descriptive information. Thus, our Big-Vul can be used for various research topics, e.g., detecting and fixing vulnerabilities, analyzing the vulnerability related code changes. Big-Vul is publicly available on Github.}, +booktitle = {Proceedings of the 17th International Conference on Mining Software Repositories}, +pages = {508–512}, +numpages = {5}, +keywords = {C/C++ Code, Code Changes, Common Vulnerabilities and Exposures}, +location = {Seoul, Republic of Korea}, +series = {MSR '20} +} + +@misc{zhang2023surveylearningbasedautomatedprogram, + title={A Survey of Learning-based Automated Program Repair}, + author={Quanjun Zhang and Chunrong Fang and Yuxiang Ma and Weisong Sun and Zhenyu Chen}, + year={2023}, + eprint={2301.03270}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2301.03270}, +} + +@article{Alzubaidi2023ASO, + title={A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications}, + author={Laith Alzubaidi and Jinshuai Bai and Aiman Al-Sabaawi and Jos{\'e} I. Santamar{\'i}a and Ahmed Shihab Albahri and Bashar Sami Nayyef Al-dabbagh and Mohammed Abdulraheem Fadhel and Mohamed Manoufali and Jinglan Zhang and Ali H. 
Al-timemy and Ye Duan and Amjed Abdullah and Laith Farhan and Yi Lu and Ashish Gupta and Felix Albu and Amin Abbosh and Yuantong Gu}, + journal={Journal of Big Data}, + year={2023}, + volume={10}, + pages={1-82}, + url={https://api.semanticscholar.org/CorpusID:258137181} +} + +@article{SHARMA2024111934, +title = {A survey on machine learning techniques applied to source code}, +journal = {Journal of Systems and Software}, +volume = {209}, +pages = {111934}, +year = {2024}, +issn = {0164-1212}, +doi = {https://doi.org/10.1016/j.jss.2023.111934}, +url = {https://www.sciencedirect.com/science/article/pii/S0164121223003291}, +author = {Tushar Sharma and Maria Kechagia and Stefanos Georgiou and Rohit Tiwari and Indira Vats and Hadi Moazen and Federica Sarro}, +keywords = {Machine learning for software engineering, Source code analysis, Deep learning, Datasets, Tools}, +abstract = {The advancements in machine learning techniques have encouraged researchers to apply these techniques to a myriad of software engineering tasks that use source code analysis, such as testing and vulnerability detection. Such a large number of studies hinders the community from understanding the current research landscape. This paper aims to summarize the current knowledge in applied machine learning for source code analysis. We review studies belonging to twelve categories of software engineering tasks and corresponding machine learning techniques, tools, and datasets that have been applied to solve them. To do so, we conducted an extensive literature search and identified 494 studies. We summarize our observations and findings with the help of the identified studies. Our findings suggest that the use of machine learning techniques for source code analysis tasks is consistently increasing. We synthesize commonly used steps and the overall workflow for each task and summarize machine learning techniques employed. 
We identify a comprehensive list of available datasets and tools useable in this context. Finally, the paper discusses perceived challenges in this area, including the availability of standard datasets, reproducibility and replicability, and hardware resources. Editor’s note: Open Science material was validated by the Journal of Systems and Software Open Science Board.} +} + +@article{10.1145/3648610, +author = {Elder, Sarah and Rahman, Md Rayhanur and Fringer, Gage and Kapoor, Kunal and Williams, Laurie}, +title = {A Survey on Software Vulnerability Exploitability Assessment}, +year = {2024}, +issue_date = {August 2024}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {56}, +number = {8}, +issn = {0360-0300}, +url = {https://doi.org/10.1145/3648610}, +doi = {10.1145/3648610}, +abstract = {Knowing the exploitability and severity of software vulnerabilities helps practitioners prioritize vulnerability mitigation efforts. Researchers have proposed and evaluated many different exploitability assessment methods. The goal of this research is to assist practitioners and researchers in understanding existing methods for assessing vulnerability exploitability through a survey of exploitability assessment literature. We identify three exploitability assessment approaches: assessments based on original, manual Common Vulnerability Scoring System, automated Deterministic assessments, and automated Probabilistic assessments. Other than the original Common Vulnerability Scoring System, the two most common sub-categories are Deterministic, Program State based, and Probabilistic learning model assessments.}, +journal = {ACM Comput. 
Surv.}, +month = {apr}, +articleno = {205}, +numpages = {41}, +keywords = {Exploitability, software vulnerability} +} + +@misc{aladics2023astbasedcodechangerepresentation, + title={An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction}, + author={Tamás Aladics and Péter Hegedűs and Rudolf Ferenc}, + year={2023}, + eprint={2303.16591}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2303.16591}, +} + +@INPROCEEDINGS{10428519, + author={Singhal, Amit and Goel, Pawan Kumar}, + booktitle={2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE)}, + title={Analysis and Identification of Malicious Mobile Applications}, + year={2023}, + volume={}, + number={}, + pages={1045-1050}, + keywords={Ecosystems;Mobile security;Threat assessment;Mobile applications;Software reliability;Security;Smart phones;Mobile Applications;digital landscape;malicious software;mobile threats}, + doi={10.1109/AECE59614.2023.10428519}} + +@Article{electronics10131606, +AUTHOR = {Senanayake, Janaka and Kalutarage, Harsha and Al-Kadri, Mhd Omar}, +TITLE = {Android Mobile Malware Detection Using Machine Learning: A Systematic Review}, +JOURNAL = {Electronics}, +VOLUME = {10}, +YEAR = {2021}, +NUMBER = {13}, +ARTICLE-NUMBER = {1606}, +URL = {https://www.mdpi.com/2079-9292/10/13/1606}, +ISSN = {2079-9292}, +ABSTRACT = {With the increasing use of mobile devices, malware attacks are rising, especially on Android phones, which account for 72.2% of the total market share. Hackers try to attack smartphones with various methods such as credential theft, surveillance, and malicious advertising. 
Among numerous countermeasures, machine learning (ML)-based methods have proven to be an effective means of detecting these attacks, as they are able to derive a classifier from a set of training examples, thus eliminating the need for an explicit definition of the signatures when developing malware detectors. This paper provides a systematic review of ML-based Android malware detection techniques. It critically evaluates 106 carefully selected articles and highlights their strengths and weaknesses as well as potential improvements. Finally, the ML-based methods for detecting source code vulnerabilities are discussed, because it might be more difficult to add security after the app is deployed. Therefore, this paper aims to enable researchers to acquire in-depth knowledge in the field and to identify potential future research and development directions.}, +DOI = {10.3390/electronics10131606} +} + +@article{article, +author = {Bui, Quang-Cuong and Paramitha, Ranindya and Vu, Duc-Ly and Massacci, Fabio and Scandariato, Riccardo}, +year = {2023}, +month = {12}, +pages = {}, +title = {APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities}, +volume = {29}, +journal = {Empirical Software Engineering}, +doi = {10.1007/s10664-023-10415-7} +} + +@article{10.1145/3556974, +author = {Senanayake, Janaka and Kalutarage, Harsha and Al-Kadri, Mhd Omar and Petrovski, Andrei and Piras, Luca}, +title = {Android Source Code Vulnerability Detection: A Systematic Literature Review}, +year = {2023}, +issue_date = {September 2023}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {55}, +number = {9}, +issn = {0360-0300}, +url = {https://doi.org/10.1145/3556974}, +doi = {10.1145/3556974}, +abstract = {The use of mobile devices is rising daily in this technological era. 
A continuous and increasing number of mobile applications are constantly offered on mobile marketplaces to fulfil the needs of smartphone users. Many Android applications do not address the security aspects appropriately. This is often due to a lack of automated mechanisms to identify, test, and fix source code vulnerabilities at the early stages of design and development. Therefore, the need to fix such issues at the initial stages rather than providing updates and patches to the published applications is widely recognized. Researchers have proposed several methods to improve the security of applications by detecting source code vulnerabilities and malicious codes. This Systematic Literature Review (SLR) focuses on Android application analysis and source code vulnerability detection methods and tools by critically evaluating 118 carefully selected technical studies published between 2016 and 2022. It highlights the advantages, disadvantages, applicability of the proposed techniques, and potential improvements of those studies. Both Machine Learning (ML)-based methods and conventional methods related to vulnerability detection are discussed while focusing more on ML-based methods, since many recent studies conducted experiments with ML. Therefore, this article aims to enable researchers to acquire in-depth knowledge in secure mobile application development while minimizing the vulnerabilities by applying ML methods. Furthermore, researchers can use the discussions and findings of this SLR to identify potential future research and development directions.}, +journal = {ACM Comput. Surv.}, +month = {jan}, +articleno = {187}, +numpages = {37}, +keywords = {machine learning, Android security, software security, vulnerability detection, Source code vulnerability} +} + +@inproceedings{10.1145/3593434.3593481, +author = {Reis, Sofia and Abreu, Rui and Pasareanu, Corina}, +title = {Are security commit messages informative? 
Not enough!}, +year = {2023}, +isbn = {9798400700446}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3593434.3593481}, +doi = {10.1145/3593434.3593481}, +abstract = {The fast distribution and deployment of security patches are important to protect users against cyberattacks. These fixes can be detected automatically by patch management triage systems. However, previous work has shown that automating the task is not easy, in some cases, because of poor documentation or lack of information in security fixes. For many years, standard practices in the security community have steered engineers to provide cryptic commit messages (i.e., patch software vulnerabilities silently) to avoid potential attacks and reputation damage. However, not providing enough documentation on vulnerability fixes can hinder trust between vendors and users. Current efforts in the security community aim to increase the level of transparency during patch and disclosing times to help build trust in the development community and make patch management processes faster. In this paper, we evaluate how informative security commit messages (i.e., messages attached to security fixes) are and how different levels of information can affect different tasks in automated patch triage systems. We observed that security engineers, in general, do not provide enough detail to enable the three automated triage systems at the same time. In addition, results show that security commit messages need to be more informative—56.7\% of the messages analyzed were documented poorly. 
Best practices to write informative and well-structured security commit messages (such as SECOM) should become a standard practice in the security community.}, +booktitle = {Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering}, +pages = {196–199}, +numpages = {4}, +keywords = {Security, Patch Management Process, Convention, Commit Messages, Best Practices}, +location = {Oulu, Finland}, +series = {EASE '23} +} + +@inproceedings{2022BES, + title={Beyond Syntax Trees: Learning Embeddings of Code Edits by Combining Multiple Source Representations}, + author={}, + year={2022}, + url={https://api.semanticscholar.org/CorpusID:249038879} +} + +@inproceedings{10.1145/3508398.3511495, +author = {Challande, Alexis and David, Robin and Renault, Gu\'{e}na\"{e}l}, +title = {Building a Commit-level Dataset of Real-world Vulnerabilities}, +year = {2022}, +isbn = {9781450392204}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3508398.3511495}, +doi = {10.1145/3508398.3511495}, +abstract = {While CVE have become a de facto standard for publishing advisories on vulnerabilities, the state of current CVE databases is lackluster. Yet, CVE advisories are insufficient to bridge the gap with the vulnerability artifacts in the impacted program. Therefore, the community is lacking a public real-world vulnerabilities dataset providing such association. In this paper, we present a method restoring this missing link by analyzing the vulnerabilities from the AOSP, an aggregate of more than 1,800 projects. It is the perfect target for building a representative dataset of vulnerabilities, as it covers the full spectrum that may be encountered in a modern system where a variety of low-level and higher-level components interact. More specifically, our main contribution is a dataset of more than 1,900 vulnerabilities, associating generic metadata (e.g.
vulnerability type, impact level) with their respective patches at the commit granularity (e.g. fix commit-id, affected files, source code language). Finally, we also augment this dataset by providing precompiled binaries for a subset of the vulnerabilities. These binaries open various data usage, both for binary only analysis and at the interface between source and binary. In addition of providing a common baseline benchmark, our dataset release supports the community for data-driven software security research.}, +booktitle = {Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy}, +pages = {101–106}, +numpages = {6}, +keywords = {binary matching, dataset, patch detection, security vulnerabilities, vulnerability research}, +location = {Baltimore, MD, USA}, +series = {CODASPY '22} +} + +@misc{wang2019characterizingunderstandingsoftwaredeveloper, + title={Characterizing and Understanding Software Developer Networks in Security Development}, + author={Song Wang and Nachi Nagappan}, + year={2019}, + eprint={1907.12141}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/1907.12141}, +} + +@misc{harzevili2022characterizingunderstandingsoftwaresecurity, + title={Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries}, + author={Nima Shiri Harzevili and Jiho Shin and Junjie Wang and Song Wang}, + year={2022}, + eprint={2203.06502}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2203.06502}, +} + +@misc{zhang2023compatibleremediationvulnerabilitiesthirdparty, + title={Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects}, + author={Lyuye Zhang and Chengwei Liu and Zhengzi Xu and Sen Chen and Lingling Fan and Lida Zhao and Jiahui Wu and Yang Liu}, + year={2023}, + eprint={2301.08434}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2301.08434}, +} + 
+@inproceedings{lee-chieu-2021-co, + title = "Co-training for Commit Classification", + author = "Lee, Jian Yi David and + Chieu, Hai Leong", + editor = "Xu, Wei and + Ritter, Alan and + Baldwin, Tim and + Rahimi, Afshin", + booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)", + month = nov, + year = "2021", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2021.wnut-1.43", + doi = "10.18653/v1/2021.wnut-1.43", + pages = "389--395", + abstract = "Commits in version control systems (e.g. Git) track changes in a software project. Commits comprise noisy user-generated natural language and code patches. Automatic commit classification (CC) has been used to determine the type of code maintenance activities performed, as well as to detect bug fixes in code repositories. Much prior work occurs in the fully-supervised setting {--} a setting that can be a stretch in resource-scarce situations presenting difficulties in labeling commits. In this paper, we apply co-training, a semi-supervised learning method, to take advantage of the two views available {--} the commit message (natural language) and the code changes (programming language) {--} to improve commit classification.", +} + +@inproceedings{10.1145/3468264.3473122, +author = {Nikitopoulos, Georgios and Dritsa, Konstantina and Louridas, Panos and Mitropoulos, Dimitris}, +title = {CrossVul: a cross-language vulnerability dataset with commit data}, +year = {2021}, +isbn = {9781450385626}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3468264.3473122}, +doi = {10.1145/3468264.3473122}, +abstract = {Examining the characteristics of software vulnerabilities and the code that contains them can lead to the development of more secure software. 
We present a dataset (∼1.4 GB) containing vulnerable source code files together with the corresponding, patched versions. Contrary to other existing vulnerability datasets, ours includes vulnerable files written in more than 40 programming languages. Each file is associated to (1) a Common Vulnerability Exposures identifier (CVE ID) and (2) the repository it came from. Further, our dataset can be the basis for machine learning applications that identify defects, as we show in specific examples. We also present a supporting dataset that contains commit messages derived from Git commits that serve as security patches. This dataset can be used to train ML models that in turn, can be used to detect security patch commits as we highlight in a specific use case.}, +booktitle = {Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, +pages = {1565–1569}, +numpages = {5}, +keywords = {vulnerabilities, security patches, commit messages, Dataset}, +location = {Athens, Greece}, +series = {ESEC/FSE 2021} +} + +@inproceedings{Bhandari_2021, series={PROMISE ’21}, + title={CVEfixes: automated collection of vulnerabilities and their fixes from open-source software}, + url={http://dx.doi.org/10.1145/3475960.3475985}, + DOI={10.1145/3475960.3475985}, + booktitle={Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering}, + publisher={ACM}, + author={Bhandari, Guru and Naseer, Amara and Moonen, Leon}, + year={2021}, + month=aug, collection={PROMISE ’21} } + +@article{10.1007/s10664-021-10029-x, +author = {Sonnekalb, Tim and Heinze, Thomas S. 
and M\"{a}der, Patrick}, +title = {Deep security analysis of program code: A systematic literature review}, +year = {2022}, +issue_date = {Jan 2022}, +publisher = {Kluwer Academic Publishers}, +address = {USA}, +volume = {27}, +number = {1}, +issn = {1382-3256}, +url = {https://doi.org/10.1007/s10664-021-10029-x}, +doi = {10.1007/s10664-021-10029-x}, +abstract = {Due to the continuous digitalization of our society, distributed and web-based applications become omnipresent and making them more secure gains paramount relevance. Deep learning (DL) and its representation learning approach are increasingly been proposed for program code analysis potentially providing a powerful means in making software systems less vulnerable. This systematic literature review (SLR) is aiming for a thorough analysis and comparison of 32 primary studies on DL-based vulnerability analysis of program code. We found a rich variety of proposed analysis approaches, code embeddings and network topologies. We discuss these techniques and alternatives in detail. By compiling commonalities and differences in the approaches, we identify the current state of research in this area and discuss future directions. We also provide an overview of publicly available datasets in order to foster a stronger benchmarking of approaches. This SLR provides an overview and starting point for researchers interested in deep vulnerability analysis on program code.}, +journal = {Empirical Softw. Engg.}, +month = {jan}, +numpages = {39}, +keywords = {Code inspection, Software security, Vulnerability detection, Deep learning, Supervised learning} +} + +@misc{le2021deepcvaautomatedcommitlevelvulnerability, + title={DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning}, + author={Triet H. M. Le and David Hin and Roland Croft and M. 
Ali Babar}, + year={2021}, + eprint={2108.08041}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2108.08041}, +} + +@article{SENANAYAKE2024103741, +title = {Defendroid: Real-time Android code vulnerability detection via blockchain federated neural network with XAI}, +journal = {Journal of Information Security and Applications}, +volume = {82}, +pages = {103741}, +year = {2024}, +issn = {2214-2126}, +doi = {https://doi.org/10.1016/j.jisa.2024.103741}, +url = {https://www.sciencedirect.com/science/article/pii/S2214212624000449}, +author = {Janaka Senanayake and Harsha Kalutarage and Andrei Petrovski and Luca Piras and Mhd Omar Al-Kadri}, +keywords = {Android application protection, Code vulnerability, Neural network, Federated learning, Source code privacy, Explainable AI, Blockchain}, +abstract = {Ensuring strict adherence to security during the phases of Android app development is essential, primarily due to the prevalent issue of apps being released without adequate security measures in place. While a few automated tools are employed to reduce potential vulnerabilities during development, their effectiveness in detecting vulnerabilities may fall short. To address this, “Defendroid”, a blockchain-based federated neural network enhanced with Explainable Artificial Intelligence (XAI) is introduced in this work. Trained on the LVDAndro dataset, the vanilla neural network model achieves a 96% accuracy and 0.96 F1-Score in binary classification for vulnerability detection. Additionally, in multi-class classification, the model accurately identifies Common Weakness Enumeration (CWE) categories with a 93% accuracy and 0.91 F1-Score. In a move to foster collaboration and model improvement, the model has been deployed within a blockchain-based federated environment. This environment enables community-driven collaborative training and enhancements in partnership with other clients. 
The extended model demonstrates improved accuracy of 96% and F1-Score of 0.96 in both binary and multi-class classifications. The use of XAI plays a pivotal role in presenting vulnerability detection results to developers, offering prediction probabilities for each word within the code. This model has been integrated into an Application Programming Interface (API) as the backend and further incorporated into Android Studio as a plugin, facilitating real-time vulnerability detection. Notably, Defendroid exhibits high efficiency, delivering prediction probabilities for a single code line in an average processing time of a mere 300 ms. The weight-sharing transparency in the blockchain-driven federated model enhances trust and traceability, fostering community engagement while preserving source code privacy and contributing to accuracy improvement.} +} + +@inproceedings{Stefanoni2022DetectingSP, + title={Detecting Security Patches in Java Projects Using NLP Technology}, + author={Andrea Stefanoni and Sarunas Girdzijauskas and Christina Jenkins and Zekarias T. Kefato and Licia Sbattella and Vincenzo Scotti and Emil W{\aa}reus}, + booktitle={International Conference on Natural Language and Speech Processing}, + year={2022}, + url={https://api.semanticscholar.org/CorpusID:256739262} +} + +@ARTICLE{10056768, + author={Okutan, Ahmet and Mell, Peter and Mirakhorli, Mehdi and Khokhlov, Igor and Santos, Joanna C. S. 
and Gonzalez, Danielle and Simmons, Steven}, + journal={IEEE Transactions on Software Engineering}, + title={Empirical Validation of Automated Vulnerability Curation and Characterization}, + year={2023}, + volume={49}, + number={5}, + pages={3241-3260}, + keywords={Security;NIST;Databases;Virtual machine monitors;Software;Feature extraction;Codes;CVE;NIST vulnerability description ontology;software vulnerability;vulnerability characterization}, + doi={10.1109/TSE.2023.3250479}} + +@misc{wang2023enhancinglargelanguagemodels, + title={Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation}, + author={Jiexin Wang and Liuwen Cao and Xitong Luo and Zhiping Zhou and Jiayuan Xie and Adam Jatowt and Yi Cai}, + year={2023}, + eprint={2310.16263}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2310.16263}, +} + +@inproceedings{10.1145/3631204.3631862, +author = {Bottner, Laura and Hermann, Artur and Eppler, Jeremias and Th\"{u}m, Thomas and Kargl, Frank}, +title = {Evaluation of Free and Open Source Tools for Automated Software Composition Analysis}, +year = {2023}, +isbn = {9798400704543}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3631204.3631862}, +doi = {10.1145/3631204.3631862}, +abstract = {Vulnerable or malicious third-party components introduce vulnerabilities into the software supply chain. Software Composition Analysis (SCA) is a method to identify direct and transitive dependencies in software projects and assess their security risks and vulnerabilities. In this paper, we investigate two open source SCA tools, Eclipse Steady (ES) and OWASP Dependency Check (ODC), with respect to vulnerability detection in Java projects. Both tools use different vulnerability detection methods. ES implements a code-centric and ODC a metadata-based approach. Our study reveals that both tools suffer from false positives. 
Furthermore, we discover that the success of the vulnerability detection depends on the underlying vulnerability database. Especially ES suffered from false negatives because of the insufficient vulnerability information in the database. While code-centric and metadata-based approaches offer significant potential, they also come with their respective downsides. We propose a hybrid approach assuming that combining both detection methods will lead to less false negatives and false positives.}, +booktitle = {Proceedings of the 7th ACM Computer Science in Cars Symposium}, +articleno = {3}, +numpages = {11}, +keywords = {Vulnerable Dependency Identification, Software Supply Chain Security, Software Composition Analysis, Secure Software Development Life Cycle}, +location = {Darmstadt, Germany}, +series = {CSCS '23} +} + +@inproceedings{10.1145/3474369.3486866, +author = {Ganz, Tom and H\"{a}rterich, Martin and Warnecke, Alexander and Rieck, Konrad}, +title = {Explaining Graph Neural Networks for Vulnerability Discovery}, +year = {2021}, +isbn = {9781450386579}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3474369.3486866}, +doi = {10.1145/3474369.3486866}, +abstract = {Graph neural networks (GNNs) have proven to be an effective tool for vulnerability discovery that outperforms learning-based methods working directly on source code. Unfortunately, these neural networks are uninterpretable models, whose decision process is completely opaque to security experts, which obstructs their practical adoption. Recently, several methods have been proposed for explaining models of machine learning. However, it is unclear whether these methods are suitable for GNNs and support the task of vulnerability discovery. In this paper we present a framework for evaluating explanation methods on GNNs. We develop a set of criteria for comparing graph explanations and linking them to properties of source code. 
Based on these criteria, we conduct an experimental study of nine regular and three graph-specific explanation methods. Our study demonstrates that explaining GNNs is a non-trivial task and all evaluation criteria play a role in assessing their efficacy. We further show that graph-specific explanations relate better to code semantics and provide more information to a security expert than regular methods.}, +booktitle = {Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security}, +pages = {145–156}, +numpages = {12}, +keywords = {machine learning, software security}, +location = {Virtual Event, Republic of Korea}, +series = {AISec '21} +} + +@misc{ram2019exploitingtokenpathbasedrepresentations, + title={Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits}, + author={Achyudh Ram and Ji Xin and Meiyappan Nagappan and Yaoliang Yu and Rocío Cabrera Lozoya and Antonino Sabetta and Jimmy Lin}, + year={2019}, + eprint={1911.07620}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/1911.07620}, +} + +@misc{rahman2023exploringautomatedcodeevaluation, + title={Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey}, + author={Md. 
Mostafizer Rahman and Yutaka Watanobe and Atsushi Shirafuji and Mohamed Hamada}, + year={2023}, + eprint={2307.08705}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2307.08705}, +} + +@misc{zhang2023doesllmgeneratesecurity, + title={How well does LLM generate security tests?}, + author={Ying Zhang and Wenjia Song and Zhengjie Ji and Danfeng Yao and Na Meng}, + year={2023}, + eprint={2310.00710}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2310.00710}, +} + +@article{Jing_2022, +doi = {10.1088/1742-6596/2363/1/012010}, +url = {https://dx.doi.org/10.1088/1742-6596/2363/1/012010}, +year = {2022}, +month = {nov}, +publisher = {IOP Publishing}, +volume = {2363}, +number = {1}, +pages = {012010}, +author = {Dejiang Jing}, +title = {Improvement of Vulnerable Code Dataset Based on Program Equivalence Transformation}, +journal = {Journal of Physics: Conference Series}, +abstract = {Code vulnerability dataset plays an important role in the development and evaluation of vulnerability detection tools. Aiming at the problem that programs in existing vulnerability datasets usually have simple structures and small scales, which is not satisfying for testing, we proposed a generator of complex code vulnerability dataset based on program equivalence transformation in this paper. It improves the complexity of program structure and code size while preserving the labels of the original case. Based on the PHP vulnerability test suite in Software Assurance Reference Dataset (SARD), a large number of complex cases were constructed and tested using two open-source PHP vulnerability detection tools RIPS and WAP. Experimental results show that the cyclomatic complexity and code size of the generated cases increase by about five times after transformation. The detection accuracy of the tools on the generated dataset decreases significantly compared with the results on the original dataset.
The false positive rate increases by about 30%, and the false negative rate increases by about 10%.} +} + +@inproceedings{Wu_2023, series={ISSTA ’23}, + title={How Effective Are Neural Networks for Fixing Security Vulnerabilities}, + url={http://dx.doi.org/10.1145/3597926.3598135}, + DOI={10.1145/3597926.3598135}, + booktitle={Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis}, + publisher={ACM}, + author={Wu, Yi and Jiang, Nan and Pham, Hung Viet and Lutellier, Thibaud and Davis, Jordan and Tan, Lin and Babkin, Petr and Shah, Sameena}, + year={2023}, + month=jul, collection={ISSTA ’23} } + +@misc{yang2021fewsamplenamedentityrecognition, + title={Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models}, + author={Guanqun Yang and Shay Dineen and Zhipeng Lin and Xueqing Liu}, + year={2021}, + eprint={2108.06590}, + archivePrefix={arXiv}, + primaryClass={cs.CL}, + url={https://arxiv.org/abs/2108.06590}, +} + +@INPROCEEDINGS{9678720, + author={Zhou, Jiayuan and Pacheco, Michael and Wan, Zhiyuan and Xia, Xin and Lo, David and Wang, Yuan and Hassan, Ahmed E.}, + booktitle={2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)}, + title={Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes}, + year={2021}, + volume={}, + number={}, + pages={705-716}, + keywords={Measurement;Codes;Semantics;Transformers;Needles;Security;Probes;Software Security;Vulnerability Fix;Open Source Software;Deep Learning}, + doi={10.1109/ASE51524.2021.9678720}} + +@INPROCEEDINGS{10190493, + author={Dunlap, Trevor and Thorn, Seaver and Enck, William and Reaves, Bradley}, + booktitle={2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P)}, + title={Finding Fixed Vulnerabilities with Off-the-Shelf Static Analysis}, + year={2023}, + volume={}, + number={}, + pages={489-505}, + keywords={Java;Codes;Databases;Ecosystems;Static 
analysis;Security;Noise measurement}, + doi={10.1109/EuroSP57164.2023.00036}} + +@misc{shestov2024finetuninglargelanguagemodels, + title={Finetuning Large Language Models for Vulnerability Detection}, + author={Alexey Shestov and Rodion Levichev and Ravil Mussabayev and Evgeny Maslov and Anton Cheshkov and Pavel Zadorozhny}, + year={2024}, + eprint={2401.17010}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2401.17010}, +} + +@inproceedings{10.1145/3643991.3644871, +author = {Scalco, Simone and Paramitha, Ranindya}, +title = {Hash4Patch: A Lightweight Low False Positive Tool for Finding Vulnerability Patch Commits}, +year = {2024}, +isbn = {9798400705878}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3643991.3644871}, +doi = {10.1145/3643991.3644871}, +abstract = {[Context:] Patch commits are useful to complete vulnerability datasets for training ML models and for developers to find a safe version for their dependencies. [Objective:] However, there is a gap in the state-of-the-art (SOTA) for a lightweight low False Positive patch commit finder. [Method:] We implemented Hash4Patch, a new tool to be used along with a current SOTA patch finder. We then validated it with a dataset of 160 CVEs. [Results:] Our approach significantly reduced the False Positives produced by a state-of-the-art tool with only 1 minute of additional running time on average. 
[Conclusions:] Our tool is able to effectively and efficiently reduce the number of alerts found by other patch commit finders, thus minimizing the manual effort needed by developers.}, +booktitle = {Proceedings of the 21st International Conference on Mining Software Repositories}, +pages = {733–737}, +numpages = {5}, +keywords = {vulnerability, patch commit, lightweight, hash search}, +location = {Lisbon, Portugal}, +series = {MSR '24} +} + +@INPROCEEDINGS{9825835, + author={Nguyen-Truong, Giang and Kang, Hong Jin and Lo, David and Sharma, Abhishek and Santosa, Andrew E. and Sharma, Asankhaya and Ang, Ming Yi}, + booktitle={2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)}, + title={HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits}, + year={2022}, + volume={}, + number={}, + pages={51-62}, + keywords={Conferences;Computer bugs;Machine learning;Libraries;Software;Security;vulnerability curation;silent fixes;commit classification;commit-issue link recovery}, + doi={10.1109/SANER53432.2022.00018}} + +@misc{wang2024aigeneratedcodereallysafe, + title={Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval}, + author={Jiexin Wang and Xitong Luo and Liuwen Cao and Hongkui He and Hailin Huang and Jiayuan Xie and Adam Jatowt and Yi Cai}, + year={2024}, + eprint={2407.02395}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2407.02395}, +} + +@misc{sawadogo2020learningcatchsecuritypatches, + title={Learning to Catch Security Patches}, + author={Arthur D. Sawadogo and Tegawendé F. 
Bissyandé and Naouel Moha and Kevin Allix and Jacques Klein and Li Li and Yves Le Traon}, + year={2020}, + eprint={2001.09148}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2001.09148}, +} + +@misc{tony2023llmsecevaldatasetnaturallanguage, + title={LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations}, + author={Catherine Tony and Markus Mutas and Nicolás E. Díaz Ferreyra and Riccardo Scandariato}, + year={2023}, + eprint={2303.09384}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2303.09384}, +} + +@misc{wang2019characterizingunderstandingsoftwaredeveloper, + title={Characterizing and Understanding Software Developer Networks in Security Development}, + author={Song Wang and Nachi Nagappan}, + year={2019}, + eprint={1907.12141}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/1907.12141}, +} + +@article{Chen_2023, + title={Neural Transfer Learning for Repairing Security Vulnerabilities in C Code}, + volume={49}, + ISSN={2326-3881}, + url={http://dx.doi.org/10.1109/TSE.2022.3147265}, + DOI={10.1109/tse.2022.3147265}, + number={1}, + journal={IEEE Transactions on Software Engineering}, + publisher={Institute of Electrical and Electronics Engineers (IEEE)}, + author={Chen, Zimin and Kommrusch, Steve and Monperrus, Martin}, + year={2023}, + month=jan, pages={147–165} } + +@misc{papotti2022acceptancecodereviewerscandidate, + title={On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools}, + author={Aurora Papotti and Ranindya Paramitha and Fabio Massacci}, + year={2022}, + eprint={2209.07211}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2209.07211}, +} + +@misc{mir2024effectivenessmachinelearningbasedgraph, + title={On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study}, + author={Amir M. 
Mir and Mehdi Keshani and Sebastian Proksch}, + year={2024}, + eprint={2402.07294}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2402.07294}, +} + +@misc{dietrich2023securityblindspotssoftware, + title={On the Security Blind Spots of Software Composition Analysis}, + author={Jens Dietrich and Shawn Rasheed and Alexander Jordan and Tim White}, + year={2023}, + eprint={2306.05534}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2306.05534}, +} + +@misc{le2022usefinegrainedvulnerablecode, + title={On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models}, + author={Triet H. M. Le and M. Ali Babar}, + year={2022}, + eprint={2203.08417}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2203.08417}, +} + +@INPROCEEDINGS{10000561, + author={Chapman, Jon and Venugopalan, Hari}, + booktitle={2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT)}, + title={Open Source Software Computed Risk Framework}, + year={2022}, + volume={}, + number={}, + pages={172-175}, + keywords={Measurement;Correlation;Codes;Force;Ecosystems;Information technology;Computer security;Big Data;Computer Security;Prediction Methods;Data Analysis}, + doi={10.1109/CSIT56902.2022.10000561}} + +@article{CANFORA2022106745, +title = {Patchworking: Exploring the code changes induced by vulnerability fixing activities}, +journal = {Information and Software Technology}, +volume = {142}, +pages = {106745}, +year = {2022}, +issn = {0950-5849}, +doi = {https://doi.org/10.1016/j.infsof.2021.106745}, +url = {https://www.sciencedirect.com/science/article/pii/S0950584921001932}, +author = {Gerardo Canfora and Andrea {Di Sorbo} and Sara Forootani and Matias Martinez and Corrado A. 
Visaggio}, +keywords = {Software vulnerabilities, Software maintenance, Empirical study}, +abstract = {Context: +Identifying and repairing vulnerable code is a critical software maintenance task. Change impact analysis plays an important role during software maintenance, as it helps software maintainers to figure out the potential effects of a change before it is applied. However, while the software engineering community has extensively studied techniques and tools for performing impact analysis of change requests, there are no approaches for estimating the impact when the change involves the resolution of a vulnerability bug. +Objective: +We hypothesize that similar vulnerabilities may present similar strategies for patching. More specifically, our work aims at understanding whether the class of the vulnerability to fix may determine the type of impact on the system to repair. +Method: +To verify our conjecture, in this paper, we examine 524 security patches applied to vulnerabilities belonging to ten different weakness categories and extracted from 98 different open-source projects written in Java. +Results: +We obtain empirical evidence that vulnerabilities of the same types are often resolved by applying similar code transformations, and, thus, produce almost the same impact on the codebase. +Conclusion: +On the one hand, our findings open the way to better management of software maintenance activities when dealing with software vulnerabilities. Indeed, vulnerability class information could be exploited to better predict how much code will be affected by the fixing, how the structural properties of the code (i.e., complexity, coupling, cohesion, size) will change, and the effort required for the fix. 
On the other hand, our results can be leveraged for improving automated strategies supporting developers when they have to deal with security flaws.} +} + +@inproceedings{10.1145/3460946.3464318, +author = {Garg, Spandan and Moghaddam, Roshanak Zilouchian and Sundaresan, Neel and Wu, Chen}, +title = {PerfLens: a data-driven performance bug detection and fix platform}, +year = {2021}, +isbn = {9781450384681}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3460946.3464318}, +doi = {10.1145/3460946.3464318}, +abstract = {The wealth of open-source software development artifacts available online creates a great opportunity to learn the patterns of performance improvements from data. In this paper, we present a data-driven approach to software performance improvement in C#. We first compile a large dataset of hundreds of performance improvements made in open source projects. We then leverage this data to build a tool called PerfLens for performance improvement recommendations via code search. PerfLens indexes the performance improvements, takes a codebase as an input and searches a pool of performance improvements for similar code. We show that when our system is further augmented with profiler data information our recommendations are more accurate. 
Our experiments show that PerfLens can suggest performance improvements with 90\% accuracy when profiler data is available and 55\% accuracy when it analyzes source code only.}, +booktitle = {Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis}, +pages = {19–24}, +numpages = {6}, +keywords = {Machine Learning, Software Performance}, +location = {Virtual, Canada}, +series = {SOAP 2021} +} + +@inproceedings{10.1145/3558489.3559069, +author = {Coskun, Tugce and Halepmollasi, Rusen and Hanifi, Khadija and Fouladi, Ramin Fadaei and De Cnudde, Pinar Comak and Tosun, Ayse}, +title = {Profiling developers to predict vulnerable code changes}, +year = {2022}, +isbn = {9781450398602}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3558489.3559069}, +doi = {10.1145/3558489.3559069}, +abstract = {Software vulnerability prediction and management have caught the interest of researchers and practitioners, recently. Various techniques that are usually based on characteristics of the code artefacts are also offered to predict software vulnerabilities. While other studies achieve promising results, the role of developers in inducing vulnerabilities has not been studied yet. We aim to profile the vulnerability inducing and vulnerability fixing behaviors of developers in software projects using Heterogeneous Information Network (HIN) analysis. We also investigate the impact of developer profiles in predicting vulnerability inducing commits, and compare the findings against the approach based on the code metrics. We adopt Random Walk with Restart (RWR) algorithm on HIN and the aggregation of code metrics for extracting all the input features. 
We utilize traditional machine learning algorithms namely, Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) to build the prediction models. We report our empirical analysis to predict vulnerability inducing commits of four Apache projects. The technique based on code metrics achieves 90\% success for the recall measure, whereas the technique based on profiling developer behavior achieves 71\% success. When we use the feature sets obtained with the two techniques together, we achieve 89\% success.}, +booktitle = {Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering}, +pages = {32–41}, +numpages = {10}, +keywords = {profiling developers, technical debt, vulnerability, vulnerability prediction}, +location = {Singapore, Singapore}, +series = {PROMISE 2022} +} + +@INPROCEEDINGS{10172577, + author={Bhuiyan, Masudul Hasan Masud and Parthasarathy, Adithya Srinivas and Vasilakis, Nikos and Pradel, Michael and Staicu, Cristian-Alexandru}, + booktitle={2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)}, + title={SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript}, + year={2023}, + volume={}, + number={}, + pages={1059-1070}, + keywords={Fault diagnosis;Codes;Benchmark testing;Software;Safety;Security;Public policy}, + doi={10.1109/ICSE48619.2023.00096}} + +@inproceedings{10.1145/3524842.3528513, +author = {Reis, Sofia and Abreu, Rui and Erdogmus, Hakan and P\u{a}s\u{a}reanu, Corina}, +title = {SECOM: towards a convention for security commit messages}, +year = {2022}, +isbn = {9781450393034}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3524842.3528513}, +doi = {10.1145/3524842.3528513}, +abstract = {One way to detect and assess software vulnerabilities is by extracting security-related information from commit messages. 
Automating the detection and assessment of vulnerabilities upon security commit messages is still challenging due to the lack of structured and clear messages. We created a convention, called SECOM, for security commit messages that structure and include bits of security-related information that are essential for detecting and assessing vulnerabilities for both humans and tools. The full convention and details are available here: https://tqrg.github.io/secom/.}, +booktitle = {Proceedings of the 19th International Conference on Mining Software Repositories}, +pages = {764–765}, +numpages = {2}, +keywords = {standard, security commit messages, convention, best practices}, +location = {Pittsburgh, Pennsylvania}, +series = {MSR '22} +} + +@inproceedings{10.1145/3661167.3661262, +author = {Bennett, Gareth and Hall, Tracy and Winter, Emily and Counsell, Steve}, +title = {Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools}, +year = {2024}, +isbn = {9798400717017}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3661167.3661262}, +doi = {10.1145/3661167.3661262}, +abstract = {Vulnerabilities in code should be detected and patched quickly to reduce the time in which they can be exploited. There are many automated approaches to assist developers in detecting vulnerabilities, most notably Static Application Security Testing (SAST) tools. However, no single tool detects all vulnerabilities and so relying on any one tool may leave vulnerabilities dormant in code. In this study, we use a manually curated dataset to evaluate four SAST tools on production code with known vulnerabilities. Our results show that the vulnerability detection rates of individual tools range from 11.2\% to 26.5\%, but combining these four tools can detect 38.8\% of vulnerabilities. 
We investigate why SAST tools are unable to detect 61.2\% of vulnerabilities and identify missing vulnerable code patterns from tool rule sets. Based on our findings, we create new rules for Semgrep, a popular configurable SAST tool. Our newly configured Semgrep tool detects 44.7\% of vulnerabilities, more than using a combination of tools, and a 181\% improvement in Semgrep’s detection rate.}, +booktitle = {Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering}, +pages = {614–623}, +numpages = {10}, +location = {Salerno, Italy}, +series = {EASE '24} +} + +@misc{chi2022seqtransautomaticvulnerabilityfix, + title={SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning}, + author={Jianlei Chi and Yu Qu and Ting Liu and Qinghua Zheng and Heng Yin}, + year={2022}, + eprint={2010.10805}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2010.10805}, +} + +@misc{ahmed2023sequentialgraphneuralnetworks, + title={Sequential Graph Neural Networks for Source Code Vulnerability Identification}, + author={Ammar Ahmed and Anwar Said and Mudassir Shabbir and Xenofon Koutsoukos}, + year={2023}, + eprint={2306.05375}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2306.05375}, +} + +@misc{sun2023silentvulnerabledependencyalert, + title={Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation}, + author={Jiamou Sun and Zhenchang Xing and Qinghua Lu and Xiwei Xu and Liming Zhu and Thong Hoang and Dehai Zhao}, + year={2023}, + eprint={2302.07445}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2302.07445}, +} + +@inproceedings{10.1145/3611643.3616299, +author = {Zhao, Lida and Chen, Sen and Xu, Zhengzi and Liu, Chengwei and Zhang, Lyuye and Wu, Jiahui and Sun, Jun and Liu, Yang}, +title = {Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects}, +year = {2023}, 
+isbn = {9798400703270}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3611643.3616299}, +doi = {10.1145/3611643.3616299}, +abstract = {Software composition analysis (SCA) tools are proposed to detect potential vulnerabilities introduced by open-source software (OSS) imported as third-party libraries (TPL). With the increasing complexity of software functionality, SCA tools may encounter various scenarios during the dependency resolution process, such as diverse formats of artifacts, diverse dependency imports, and diverse dependency specifications. However, there still lacks a comprehensive evaluation of SCA tools for Java that takes into account the above scenarios. This could lead to a confined interpretation of comparisons, improper use of tools, and hinder further improvements of the tools. To fill this gap, we proposed an Evaluation Model which consists of Scan Modes, Scan Methods, and SCA Scope for Maven (SSM), for comprehensive assessments of the dependency resolving capabilities and effectiveness of SCA tools. Based on the Evaluation Model, we first qualitatively examined 6 SCA tools’ capabilities. Next, the accuracy of dependency and vulnerability is quantitatively evaluated with a large-scale dataset (21,130 Maven modules with 73,499 unique dependencies) under two Scan Modes (i.e., build scan and pre-build scan). The results show that most tools do not fully support SSM, which leads to compromised accuracy. For dependency detection, the average F1-score is 0.890 and 0.692 for build and pre-build respectively, and for vulnerability accuracy, the average F1-score is 0.475. However, proper support for SSM reduces dependency detection false positives by 34.24\% and false negatives by 6.91\%. 
This further leads to a reduction of 18.28\% in false positives and 8.72\% in false negatives in vulnerability reports.}, +booktitle = {Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, +pages = {960–972}, +numpages = {13}, +keywords = {Package manager, SCA, Vulnerability detection}, +location = {San Francisco, CA, USA}, +series = {ESEC/FSE 2023} +} + +@Article{202419, +title = {Survey on Vulnerability Awareness of Open Source Software}, +author = {Zhan, Qi and Pan, Sheng-Yi and Hu, Xing and Bao, Ling-Feng and Xia, Xin}, + journal = {Journal of Software}, + volume = {35}, + number = {1}, + pages = {19}, + numpages = {19.0000}, + year = {2024}, + month = {01}, + doi = {10.13328/j.cnki.jos.006935}, + publisher = {} +} + +@article{LI2023111679, +title = {The anatomy of a vulnerability database: A systematic mapping study}, +journal = {Journal of Systems and Software}, +volume = {201}, +pages = {111679}, +year = {2023}, +issn = {0164-1212}, +doi = {https://doi.org/10.1016/j.jss.2023.111679}, +url = {https://www.sciencedirect.com/science/article/pii/S0164121223000742}, +author = {Xiaozhou Li and Sergio Moreschini and Zheying Zhang and Fabio Palomba and Davide Taibi}, +keywords = {Software security, Vulnerability databases, Systematic mapping studies, Software evolution}, +abstract = {Software vulnerabilities play a major role, as there are multiple risks associated, including loss and manipulation of private data. The software engineering research community has been contributing to the body of knowledge by proposing several empirical studies on vulnerabilities and automated techniques to detect and remove them from source code. The reliability and generalizability of the findings heavily depend on the quality of the information mineable from publicly available datasets of vulnerabilities as well as on the availability and suitability of those databases. 
In this paper, we seek to understand the anatomy of the currently available vulnerability databases through a systematic mapping study where we analyze (1) what are the popular vulnerability databases adopted; (2) what are the goals for adoption; (3) what are the other sources of information adopted; (4) what are the methods and techniques; (5) which tools are proposed. An improved understanding of these aspects might not only allow researchers to take informed decisions on the databases to consider when doing research but also practitioners to establish reliable sources of information to inform their security policies and standards.} +} + +@article{ALDEBEYAN2024112003, +title = {The impact of hard and easy negative training data on vulnerability prediction performance}, +journal = {Journal of Systems and Software}, +volume = {211}, +pages = {112003}, +year = {2024}, +issn = {0164-1212}, +doi = {https://doi.org/10.1016/j.jss.2024.112003}, +url = {https://www.sciencedirect.com/science/article/pii/S0164121224000463}, +author = {Fahad {Al Debeyan} and Lech Madeyski and Tracy Hall and David Bowes}, +keywords = {Software vulnerability prediction, Vulnerability datasets, Machine learning}, +abstract = {Vulnerability prediction models have been shown to perform poorly in the real world. We examine how the composition of negative training data influences vulnerability prediction model performance. Inspired by other disciplines (e.g. image processing), we focus on whether distinguishing between negative training data that is ‘easy’ to recognise from positive data (very different from positive data) and negative training data that is ‘hard’ to recognise from positive data (very similar to positive data) impacts on vulnerability prediction performance. We use a range of popular machine learning algorithms, including deep learning, to build models based on vulnerability patch data curated by Reis and Abreu, as well as the MSR dataset. 
Our results suggest that models trained on higher ratios of easy negatives perform better, plateauing at 15 easy negatives per positive instance. We also report that different ML algorithms work better based on the negative sample used. Overall, we found that the negative sampling approach used significantly impacts model performance, potentially leading to overly optimistic results. The ratio of ‘easy’ versus ‘hard’ negative training data should be explicitly considered when building vulnerability prediction models for the real world.} +} + +@misc{xu2023trackingpatchesopensource, + title={Tracking Patches for Open Source Software Vulnerabilities}, + author={Congying Xu and Bihuan Chen and Chenhao Lu and Kaifeng Huang and Xin Peng and Yang Liu}, + year={2023}, + eprint={2112.02240}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2112.02240}, +} + +@misc{risse2024uncoveringlimitsmachinelearning, + title={Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection}, + author={Niklas Risse and Marcel Böhme}, + year={2024}, + eprint={2306.17193}, + archivePrefix={arXiv}, + primaryClass={cs.CR}, + url={https://arxiv.org/abs/2306.17193}, +} + +@inproceedings{10.1145/3597926.3598037, +author = {Nie, Xu and Li, Ningke and Wang, Kailong and Wang, Shangguang and Luo, Xiapu and Wang, Haoyu}, +title = {Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)}, +year = {2023}, +isbn = {9798400702211}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3597926.3598037}, +doi = {10.1145/3597926.3598037}, +abstract = {Software system complexity and security vulnerability diversity are plausible sources of the persistent challenges in software vulnerability research. Applying deep learning methods for automatic vulnerability detection has been proven an effective means to complement traditional detection approaches. 
Unfortunately, lacking well-qualified benchmark datasets could critically restrict the effectiveness of deep learning-based vulnerability detection techniques. Specifically, the long-term existence of erroneous labels in the existing vulnerability datasets may lead to inaccurate, biased, and even flawed results. In this paper, we aim to obtain an in-depth understanding and explanation of the label error causes. To this end, we systematically analyze the diversified datasets used by state-of-the-art learning-based vulnerability detection approaches, and examine their techniques for collecting vulnerable source code datasets. We find that label errors heavily impact the mainstream vulnerability detection models, with a worst-case average F1 drop of 20.7\%. As mitigation, we introduce two approaches to dataset denoising, which will enhance the model performance by an average of 10.4\%. Leveraging dataset denoising methods, we provide a feasible solution to obtain high-quality labeled datasets.}, +booktitle = {Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis}, +pages = {52–63}, +numpages = {12}, +keywords = {vulnerability detection, denoising, deep learning}, +location = {Seattle, WA, USA}, +series = {ISSTA 2023} +} + +@INPROCEEDINGS{10172868, + author={Wu, Yulun and Yu, Zeliang and Wen, Ming and Li, Qiang and Zou, Deqing and Jin, Hai}, + booktitle={2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)}, + title={Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem}, + year={2023}, + volume={}, + number={}, + pages={1046-1058}, + keywords={Codes;Databases;Source coding;Ecosystems;Estimation;Software systems;Libraries;Maven;Ecosystem Security;Vulnerability}, + doi={10.1109/ICSE48619.2023.00095}} + +@article{ESPOSITO2024107448, +title = {VALIDATE: A deep dive into vulnerability prediction datasets}, +journal = {Information and Software Technology}, +volume = 
{170}, +pages = {107448}, +year = {2024}, +issn = {0950-5849}, +doi = {https://doi.org/10.1016/j.infsof.2024.107448}, +url = {https://www.sciencedirect.com/science/article/pii/S0950584924000533}, +author = {Matteo Esposito and Davide Falessi}, +keywords = {Security, Replicability, Vulnerability, Machine learning, Repository, Dataset}, +abstract = {Context: +Vulnerabilities are an essential issue today, as they cause economic damage to the industry and endanger our daily life by threatening critical national security infrastructures. Vulnerability prediction supports software engineers in preventing the use of vulnerabilities by malicious attackers, thus improving the security and reliability of software. Datasets are vital to vulnerability prediction studies, as machine learning models require a dataset. Dataset creation is time-consuming, error-prone, and difficult to validate. +Objectives: +This study aims to characterise the datasets of prediction studies in terms of availability and features. Moreover, to support researchers in finding and sharing datasets, we provide the first VulnerAbiLty predIction DatAseT rEpository (VALIDATE). +Methods: +We perform a systematic literature review of the datasets of vulnerability prediction studies. +Results: +Our results show that out of 50 primary studies, only 22 studies (i.e., 38%) provide a reachable dataset. Of these 22 studies, only one study provides a dataset in a stable repository. 
+Conclusions: +Our repository of 31 datasets, 22 reachable plus nine datasets provided by authors via email, supports researchers in finding datasets of interest, hence avoiding reinventing the wheel; this translates into less effort, more reliability, and more reproducibility in dataset creation and use.} +} + +@INPROCEEDINGS{9825908, + author={Wang, Shichao and Zhang, Yun and Bao, Lingfeng and Xia, Xin and Wu, Minghui}, + booktitle={2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)}, + title={VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities}, + year={2022}, + volume={}, + number={}, + pages={589-600}, + keywords={Location awareness;Conferences;Semantics;Manuals;Feature extraction;Application security;Security;Security Patches;Vulnerability Analysis;Mining Software Repository}, + doi={10.1109/SANER53432.2022.00076}} + +@INPROCEEDINGS{9978189, + author={Sun, Qing and Xu, Lili and Xiao, Yang and Li, Feng and Su, He and Liu, Yiming and Huang, Hongyun and Huo, Wei}, + booktitle={2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)}, + title={VERJava: Vulnerable Version Identification for Java OSS with a Two-Stage Analysis}, + year={2022}, + volume={}, + number={}, + pages={329-339}, + keywords={Java;Software maintenance;Codes;Databases;Software algorithms;Manuals;Maintenance engineering;patch analysis;vulnerability;Java OSS;vulnerable version identification;code similarity}, + doi={10.1109/ICSME55016.2022.00037}} + +@misc{nguyen2023vffindergraphbasedapproachautomated, + title={VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification}, + author={Son Nguyen and Thanh Trong Vu and Hieu Dinh Vo}, + year={2023}, + eprint={2309.01971}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2309.01971}, +} + +@INPROCEEDINGS{9724745, + author={Piran, Azin and Chang, Che-Pin and Fard, Amin Milani}, + 
booktitle={2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS)}, + title={Vulnerability Analysis of Similar Code}, + year={2021}, + volume={}, + number={}, + pages={664-671}, + keywords={Access control;Codes;Cross-site scripting;Conferences;Cloning;Software quality;Libraries;Code vulnerability;static analysis;CWE;CVE}, + doi={10.1109/QRS54544.2021.00076}} + +@misc{keller2020meanssemanticrepresentationlearning, + title={What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning}, + author={Patrick Keller and Laura Plein and Tegawendé F. Bissyandé and Jacques Klein and Yves Le Traon}, + year={2020}, + eprint={2002.02650}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2002.02650}, +} + +@inproceedings{10.1145/3663533.3664036, +author = {Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga}, +title = {MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery}, +year = {2024}, +isbn = {9798400706752}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3663533.3664036}, +doi = {10.1145/3663533.3664036}, +abstract = {Vulnerability datasets have become an important instrument in software security research, being used to develop automated, machine learning-based vulnerability detection and patching approaches. Yet, any limitations of these datasets may translate into inadequate performance of the developed solutions. For example, the limited size of a vulnerability dataset may restrict the applicability of deep learning techniques. + + In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods related to CVE fix commits gathering. 
As a consequence of our improvements, we have been able to gather the largest programming language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. + + + Our dataset containing 26,617 unique CVEs coming from 6,945 unique GitHub projects is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 31,883 unique commits that fixed those vulnerabilities. Compared to prior work, our dataset brings about a 397\% increase in CVEs, a 295\% increase in covered open-source projects, and a 480\% increase in commit fixes. + + Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. + + We release to the community a 14GB PostgreSQL database that contains information on CVEs up to January 24, 2024, CWEs of each CVE, files and methods changed by each commit, and repository metadata. + + Additionally, patch files related to the fix commits are available as a separate package. 
Furthermore, we make our dataset collection tool also available to the community.}, +booktitle = {Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering}, +pages = {42–51}, +numpages = {10}, +keywords = {CVE, Vulnerability dataset, dataset, open-source, real-world vulnerability dataset, software repository mining}, +location = {Porto de Galinhas, Brazil}, +series = {PROMISE 2024} +} + + +@article{Cabrera_Lozoya_2021, + title={Commit2Vec: Learning Distributed Representations of Code Changes}, + volume={2}, + ISSN={2661-8907}, + url={http://dx.doi.org/10.1007/s42979-021-00566-z}, + DOI={10.1007/s42979-021-00566-z}, + number={3}, + journal={SN Computer Science}, + publisher={Springer Science and Business Media LLC}, + author={Cabrera Lozoya, Rocío and Baumann, Arnaud and Sabetta, Antonino and Bezzi, Michele}, + year={2021}, + month=mar } + +@misc{fehrer2021detectingsecurityfixesopensource, + title={Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers}, + author={Therese Fehrer and Rocío Cabrera Lozoya and Antonino Sabetta and Dario Di Nucci and Damian A. Tamburri}, + year={2021}, + eprint={2105.03346}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2105.03346}, +} + +@article{Ponta2020DetectionAA, + title={Detection, assessment and mitigation of vulnerabilities in open source dependencies}, + author={Serena Elisa Ponta and Henrik Plate and Antonino Sabetta}, + journal={Empirical Software Engineering}, + year={2020}, + volume={25}, + pages={3175 - 3215}, + url={https://api.semanticscholar.org/CorpusID:220259876} +} + +@ARTICLE {9506931, +author = {A. Dann and H. Plate and B. Hermann and S. Ponta and E. 
Bodden}, +journal = {IEEE Transactions on Software Engineering}, +title = {Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite}, +year = {2022}, +volume = {48}, +number = {09}, +issn = {1939-3520}, +pages = {3613-3625}, +abstract = {The use of vulnerable open-source dependencies is a known problem in today's software development. Several vulnerability scanners to detect known-vulnerable dependencies appeared in the last decade, however, there exists no case study investigating the impact of development practices, e.g., forking, patching, re-bundling, on their performance. This paper studies (i) types of modifications that may affect vulnerable open-source dependencies and (ii) their impact on the performance of vulnerability scanners. Through an empirical study on 7,024 Java projects developed at SAP, we identified four types of modifications: re-compilation, re-bundling, metadata-removal and re-packaging. In particular, we found that more than 87 percent (56 percent, resp.) of the vulnerable Java classes considered occur in Maven Central in re-bundled (re-packaged, resp.) form. We assessed the impact of these modifications on the performance of the open-source vulnerability scanners OWASP Dependency-Check (OWASP) and Eclipse Steady, GitHub Security Alerts, and three commercial scanners. The results show that none of the scanners is able to handle all the types of modifications identified. 
Finally, we present Achilles, a novel test suite with 2,505 test cases that allow replicating the modifications on open-source dependencies.}, +keywords = {open source software;databases;java;benchmark testing;tools;security;software}, +doi = {10.1109/TSE.2021.3101739}, +publisher = {IEEE Computer Society}, +address = {Los Alamitos, CA, USA}, +month = {sep} +} + +@misc{ponta2021usedbloatedvulnerablereducing, + title={The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application}, + author={Serena Elisa Ponta and Wolfram Fischer and Henrik Plate and Antonino Sabetta}, + year={2021}, + eprint={2108.05115}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2108.05115}, +} + +@INPROCEEDINGS{9462983, + author={Iannone, Emanuele and Nucci, Dario Di and Sabetta, Antonino and De Lucia, Andrea}, + booktitle={2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC)}, + title={Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries}, + year={2021}, + volume={}, + number={}, + pages={396-400}, + keywords={Java;Tools;Libraries;Security;Reachability analysis;Open source software;Genetic algorithms;Exploit Generation;Security Testing;Software Vulnerabilities}, + doi={10.1109/ICPC52881.2021.00046}} + + + + + + + From a0e9666b90eac98ca18baa92340d90d526b9d5e2 Mon Sep 17 00:00:00 2001 From: "Antonino Sabetta (i064196)" Date: Thu, 25 Jul 2024 15:57:16 +0200 Subject: [PATCH 71/83] fixed requirements for bib2md --- .reuse/dep5 | 2 +- scripts/requirements.txt | Bin 9826 -> 58 bytes 2 files changed, 1 insertion(+), 1 deletion(-) diff --git a/.reuse/dep5 b/.reuse/dep5 index 6ef15cd29..17a5485f9 100644 --- a/.reuse/dep5 +++ b/.reuse/dep5 @@ -7,6 +7,6 @@ Files: prospector/* kaybee/* docs/* scripts/* vulnerability-data/* MSR2019/* NO Copyright: 2019-2020 SAP SE or an SAP affiliate company and project "KB" contributors License: Apache-2.0 -Files: mkdocs.yml .chglog/* .github/* 
CHANGELOG.md CONTRIBUTING.md Makefile go.mod go.sum .pre-commit-config.yaml .gitignore */*.yaml Pipfile +Files: *.bib mkdocs.yml .chglog/* .github/* CHANGELOG.md CONTRIBUTING.md Makefile go.mod go.sum .pre-commit-config.yaml .gitignore */*.yaml Pipfile Copyright: 2019-2020 SAP SE or an SAP affiliate company and project "KB" contributors License: CC0-1.0 diff --git a/scripts/requirements.txt b/scripts/requirements.txt index 2facbe8e8799cd0d0ea387b4f701db443d73c4ee..e551b0bcef3b8be3584ecf1a4ea39caf0f0f728e 100644 GIT binary patch literal 58 zcmYewOe#sOC`c?SPA#&vHPSQCGe|P$DyYm!1PZ6-CEMB>>lq{(nF57CiZb)kK_Z5F GMqB_#TM~o- literal 9826 zcmai)OK)4r5rywMKz<4bA}Pz`#VjVsD#!o{93aR_(1Vg_NhD3umhB&(ZxPV%t86th>FX~*F~+U%uac)Syj`HAyM(`T>m<*%QQ zXm^r#FO9Aw6@K{BKC!RRX*YaqemL|fevRg6ycr&v=bqYcS4ZB1H!AEDV*(=GS0v+h5Sbv3ay zHsMTmB6%y>u1Dx=oY(z? zzH>V_pxlokSS>90E7 z=sx^X6`?9mMQgf5aUe=Rw-{1mc4MA_tBPdR0Mk%Q$)IR)%!3NxYKe3zX?*-F9C5lB zxf26uqQP3)Q>W~QRQ=rM)R>H9G;@}aUF1_C_7Jb-2Yd=Oo^s8gvci$*StDrf`*i@8oavnKu6!P3iFDQnDA{Pg|~q2K??D z>Zlv|GIa@Bur(XA?oNEjEs#fsnvSe|)1nePJ_>iv0K8~)+C#*dDzj>cz$54lRdts; z@!csd_}f-dwffTe2El$>AvyD4(wVEpY?aH##?qo>%&2bm<$D@Jlr$74if-jHRzpx7ve^iWr% z%N7N9|E;*E*1KxQo@Mi<>&!|tu7<89R4Z2$Q)N;_OLt0ZLByz+OyzVYV>6w?z962F z`9X51p7Bc-ZmpMyqGkY)X`_6@8mtz z%D2E;b(Rl`K~GIB^PB?5ir69DrL@7X=9>tE1LNy^YOq#)vAXXuC3&=M8ka9Uk@&6n zd-Rv1&Ce|!Ombzht2fv?jQ+pTyvl<+*pts#)s?#*#CN zI*KpNjedc?kFw@PUu4Wv=NYhw&Za^VHO6x-^``O#WVnuePsMwiI{z#jtR#ikU~al9 z8OcAonW#c*jeew$r(!%u2Z{`56n37H;FpH?0Goz5dM*g5$XasGVJB;_#5r4wj;HJMrPvm}uR=oE|2Th>fPGh> z-z0sj_}+@@WsDfHFRW7adx}YuRoj>##nQqL4~3pZCK z^W&VI2NdZO@Pm$uZe*OILQhXNbcuL|SR04WW9_eLnlpQa*yUC2yRkCo?4xd_(ed8G z`V&98oAv7Jc5)P~^YWzEQdd25xO$bBLc`mDsgaIYsC(ObcJk`E>GHKz+b`k=%GvVJ z)qDs0h`v2b4p2pEvM@5xeKjJludbX{%{B$RsqFM{JnWi^748F2F{k=fYZ{PMeVW+d z>4~(-h9VP#V^#VxnY}b_kE7Q1c1Z&xyxEuy{XVuvKccvl%m?|%J3%Z2r^Xcf6vx+- z6ZWAI^)3`_6M2^gU2hIN#Azl!V++stF7a=5r~(;95u`} zG@0#vz(!}-1U^5@qN=m_&vlgSAd1M!{?mvny?Y8?EocQcs&2WSo7Yb0s&`nA7Pd^t 
z3UZyty_}ndPU4epY)yl=uMM}}Hs>`=Hz|J{QV^AzIgRu~YTmC{bND`Rm$!%hArBo- zd*%d6%Zt3FiLSE}yw{!z&lrc^oUjEBvd-ia-fqrKHpFBnKBC)TYt=S#sC<$9t2}p> z^;+fqMXw|eKx^RzJJ7PM=;>X{=`UzN?t|u%U$m)iY_^$oc%#^LB z&5!NOGrDU(vO%ynL&S%TM5aSH%=IUjh7w<7b6uMo^nq{q`ExO9;!gR>>kIGWp?hVV zF^~wHNrRg)W=Oz?;3aQ6c+bapng5;W(QB^0GZ||EUI!86M-YYtBXKJWx${o=xcI|6 z#sqz--pF`?49OXv^I#iMEqRqAWWD=1b-`0QpDlStdeXm9hx+ZU`^mZDW-Cw8{v?em zw~3E2&kf`e$4qK?qGmhS+R$yJIemx?Y~D&W*M+Y&A#<1NZB%wXfuhqM_BLbikgoPM z@s2lcr+CttsHC-3L|dWo=hnw`}wj_8)Ye+Kju}`My6{L4r3O z$$4kk`{AF;^IsLeY>4Y~5Io1e)_myv%q6LpHCa~Qz`uCkn@Z4=h|RlR#4LR{ed2qh z7;ne{A=bp+OdsnJ$t;aD*GSOYcc;^or9HKTTI@GM^sLU+-=_;^3Me_QHpas#YQOS0 z@qF>)o4i5D9i&uu!2iV^8tN%8@|r8BT`xrvuPZ)L1D#k$9rHV@w=i5EL_xmXw(LMY zuRUuEhN@n}1(bR6bzH0S*s{cs3cZz~fzNzRt{TmeE-dv1>+|Fl>Rekth z5ygqQktzlc?pLRB{Z7^8%;(`xjVND&Vb}M)y2h=z!jTAEG~`BeJPJ0AaBJmB&UMDz zq2kq(*_ue07OMF{%!q*hhX#C!VRapEd!rX!ggTH~V~Lp{>b>tyQq#om%{Sf_Cz5B1 zLuQ4Z-yii-eCs{?nS6H0!83q!AsgG8-0IbCd8WRUm8bUv*!=yScmx!2>MpX^9?Q17 zRz=5DU8Yh#eW8E#T>}geQ&&h{3Vb3&WT;x1R)gBg5h&?|?8fG=h%8ipQmtVQAjfO} z=xXjN>$gZhe-!Tfo(CWH8uWE^&m?NRcq(&_zet`pJ;?;>sSyiSIcFkdUsU@*>ufLb zQf1ed>ST@Fr+Nxce0*x*7?$R}GS-%o8vq`gQMh^VPDTTS~4{O0W$U8|~m5oUR>2aV`iRl#pX RTi_~{S Date: Thu, 25 Jul 2024 15:58:20 +0200 Subject: [PATCH 72/83] tweaks to bib2md output format --- scripts/bib2md.py | 81 +++++++++++++++++++++++++++++++---------------- 1 file changed, 54 insertions(+), 27 deletions(-) diff --git a/scripts/bib2md.py b/scripts/bib2md.py index dc5fb4820..5c345c3e3 100644 --- a/scripts/bib2md.py +++ b/scripts/bib2md.py @@ -2,41 +2,60 @@ # pip install bibtexparser --pre # to run on the CLI: -# python bib2md.py your_referenceFile.bib -# default order: desc. +# python bib2md.py your_referenceFile.bib +# default order: desc. 
# To change add at the end of your command: -ord "asc" -import bibtexparser -import sys import argparse import html +import sys + +import bibtexparser + def process_entry(entry): try: - authors = entry['author'].split(' and ') + authors = entry["author"].split(" and ") if len(authors) > 1: - authors[-1] = 'and ' + authors[-1] + authors[-1] = "and " + authors[-1] - authors_formatted = ', '.join([a.replace('\n', ' ').strip() for a in authors]) - title = html.unescape(entry['title']) - year = int(entry['year']) - venue = entry.get('journal') or entry.get('booktitle') or entry.get('archivePrefix') + authors_formatted = ", ".join([a.replace("\n", " ").strip() for a in authors]) + + title = html.unescape(entry["title"]) + year = int(entry["year"]) + venue = ( + entry.get("journal") or entry.get("booktitle") or entry.get("archivePrefix") + ) + url = entry.get("url") + if url: + url = url.value + else: + url = None if not venue: id_unprocessed = "[" + entry.key + " - " + entry.entry_type + "]" return None, id_unprocessed - - return (year, f"{authors_formatted}. {title}. {venue.value}. ({year})."), None + + if url: + title = f"[{title}]({url})" + + return ( + year, + f" * {authors_formatted}. {title}. {venue.value}. ({year}).", + ), None except KeyError as e: - print(f"One or more necessary fields {str(e)} not present in this BibTeX entry.") + print( + f"One or more necessary fields {str(e)} not present in this BibTeX entry." 
+ ) return None, None -def format_simple(entry_str, order='desc'): + +def format_simple(entry_str, order="desc"): library = bibtexparser.parse_string(entry_str) formatted_entries = [] unprocessed_entries = [] - + for entry in library.entries: processed_entry, unprocessed_entry = process_entry(entry) if processed_entry: @@ -44,29 +63,37 @@ def format_simple(entry_str, order='desc'): elif unprocessed_entry: unprocessed_entries.append(unprocessed_entry) - if order == 'asc': + if order == "asc": formatted_entries.sort(key=lambda x: x[0]) - elif order == 'desc': + elif order == "desc": formatted_entries.sort(key=lambda x: x[0], reverse=True) - + if len(unprocessed_entries) > 0: - print('Warning: Some entries were not processed due to unknown type', file=sys.stderr) + print( + "Warning: Some entries were not processed due to unknown type", + file=sys.stderr, + ) print("List of unprocessed entrie(s):", unprocessed_entries) - + return [entry[1] for entry in formatted_entries] def main(): parser = argparse.ArgumentParser() - parser.add_argument('file', type=str, help='a .bib file as argument') - parser.add_argument('-ord', '--order', type=str, - choices=['asc', 'desc'], - help='here we set a sort order. We have the choice between "asc" and "desc"', - default='desc', required=False) + parser.add_argument("file", type=str, help="a .bib file as argument") + parser.add_argument( + "-ord", + "--order", + type=str, + choices=["asc", "desc"], + help='here we set a sort order. 
We have the choice between "asc" and "desc"', + default="desc", + required=False, + ) args = parser.parse_args() - with open(args.file, 'r', encoding='utf-8') as bibtex_file: + with open(args.file, "r", encoding="utf-8") as bibtex_file: bibtex_str = bibtex_file.read() citations = format_simple(bibtex_str, args.order) @@ -74,6 +101,7 @@ def main(): print() print(cit) + if __name__ == "__main__": main() @@ -167,4 +195,3 @@ def main(): # primaryClass={cs.SE}, # url={https://arxiv.org/abs/2303.16591}, # } - From 6e98d9db1df20d79f7739e126a943fd214e7b520 Mon Sep 17 00:00:00 2001 From: "Antonino Sabetta (i064196)" Date: Thu, 25 Jul 2024 15:59:08 +0200 Subject: [PATCH 73/83] replace list of pubs with autogenerated one (with bib2md) --- README.md | 274 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 176 insertions(+), 98 deletions(-) diff --git a/README.md b/README.md index 46b7a484a..2d2f196fb 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ [![REUSE status](https://api.reuse.software/badge/github.com/sap/project-kb)](https://api.reuse.software/info/github.com/sap/project-kb) [![Pytest](https://github.com/SAP/project-kb/actions/workflows/python.yml/badge.svg)](https://github.com/SAP/project-kb/actions/workflows/python.yml) -# Table of contents +# Table of contents 1. [Kaybee](#kaybee) 2. [Prospector](#prosp) 3. [Vulnerability data](#vuldata) @@ -19,7 +19,7 @@ 7. [Support](#support) 8. [Contributing](#contrib) -## Description +## Description The goal of `Project KB` is to enable the creation, management and aggregation of a distributed, collaborative knowledge base of vulnerabilities affecting @@ -29,7 +29,7 @@ open-source software. as well as set of tools to support the mining, curation and management of such data. -### Motivations +### Motivations In order to feed [Eclipse Steady](https://github.com/eclipse/steady/) with fresh data, we have spent a considerable amount of time, in the past few years, mining @@ -47,7 +47,7 @@ in early 2019. 
In June 2020, we made a further step releasing the `kaybee` tool to make the creation, aggregation, and consumption of vulnerability data much easier. In late 2020, we also released, as a proof-of-concept, the prototype `prospector`, whose goal is to automate the mapping of vulnerability advisories -onto their fix-commits. +onto their fix-commits. We hope this will encourage more contributors to join our efforts to build a collaborative, comprehensive knowledge base where each party remains in control @@ -106,104 +106,182 @@ ___ ### Our papers related to Project KB * Sabetta, A., Ponta, S. E., Cabrera Lozoya, R., Bezzi, M., Sacchetti, T., Greco, M., … Massacci, F. (2024). [Known Vulnerabilities of Open Source Projects: Where Are the Fixes?](https://ieeexplore.ieee.org/document/10381645) IEEE Security & Privacy, 22(2), 49–59. * Fehrer, T., Lozoya, R. C., Sabetta, A., Nucci, D. D., & Tamburri, D. A. (2024). [Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers.](http://arxiv.org/abs/2105.03346) EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering -* Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625. +* Dann, A., Plate, H., Hermann, B., Ponta, S., & Bodden, E. (2022). [Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite.](https://ris.uni-paderborn.de/record/31132) IEEE Transactions on Software Engineering, 48(09), 3613–3625. * Cabrera Lozoya, R., Baumann, A., Sabetta, A., & Bezzi, M. (2021). [Commit2Vec: Learning Distributed Representations of Code Changes.](https://link.springer.com/article/10.1007/s42979-021-00566-z) SN Computer Science, 2(3). * Ponta, S. E., Fischer, W., Plate, H., & Sabetta, A. (2021). 
[The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application.](https://www.computer.org/csdl/proceedings-article/icsme/2021/288200a555/1yNhfKb2TBe) 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME) * Iannone, E., Nucci, D. D., Sabetta, A., & De Lucia, A. (2021). [Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries.](https://ieeexplore.ieee.org/document/9462983) 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), 396–400. -* Ponta, S. E., Plate, H., & Sabetta, A. (2020). [Detection, assessment and mitigation of vulnerabilities in open source dependencies.](https://api.semanticscholar.org/CorpusID:220259876) Empirical Software Engineering, 25, 3175–3215. - +* Ponta, S. E., Plate, H., & Sabetta, A. (2020). [Detection, assessment and mitigation of vulnerabilities in open source dependencies.](https://api.semanticscholar.org/CorpusID:220259876) Empirical Software Engineering, 25, 3175–3215. +* Achyudh Ram, Ji Xin, Meiyappan Nagappan, Yaoliang Yu, Rocío Cabrera Lozoya, Antonino Sabetta, and Jimmy Lin. [Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits](https://arxiv.org/abs/1911.07620). arXiv. (2019). ___ ### Papers citing our work -* Aladics, T., Hegedüs, P., & Ferenc, R. (2022). [A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach.](https://api.semanticscholar.org/CorpusID:250566828) International Conference on Software and Data Technologies -* Bui, Q.-C., Scandariato, R., & Ferreyra, N. E. D. (2022). [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques.](https://dl.acm.org/doi/abs/10.1145/3524842.3528482) Proceedings of the 19th International Conference on Mining Software Repositories, 464–468. -* S. R. Tate, M. Bollinadi, and J. Moore. (2020). 
[Characterizing Vulnerabilities in a Major Linux Distribution](https://home.uncg.edu/cmp/faculty/srtate/pubs/vulnerabilities/Vulnerabilities-SEKE2020.pdf) 32nd International Conference on Software Engineering \& Knowledge Engineering (SEKE), pp. 538-543. -* Galvão, P. (2022). [Analysis and Aggregation of Vulnerability Databases with Code-Level Data. Dissertation de Master's Degree.](https://repositorio-aberto.up.pt/bitstream/10216/144796/2/588886.pdf) Faculdade de Engenharia da Universidade do Porto. -* Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2022). [A Survey on Machine Learning Techniques for Source Code Analysis.](http://arxiv.org/abs/2110.09610) -* Hommersom, D., Sabetta, A., Coppola, B., Nucci, D. D., & Tamburri, D. A. (2024). [Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories.](https://dl.acm.org/doi/10.1145/3649590) ACM Trans. Softw. Eng. Methodol., 33(5). -* Marchand-Melsom, A., & Nguyen Mai, D. B. (2020). [Automatic repair of OWASP Top 10 security vulnerabilities: A survey.](https://dl.acm.org/doi/10.1145/3387940.3392200) Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 23–30. Presented at the Seoul, Republic of Korea. -* Sawadogo, A. D., Guimard, Q., Bissyandé, T. F., Kaboré, A. K., Klein, J., & Moha, N. (2021). [Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We?](http://arxiv.org/abs/2112.10123) -* Sun, S., Wang, S., Wang, X., Xing, Y., Zhang, E., & Sun, K. (2023). [Exploring Security Commits in Python.](http://arxiv.org/abs/2307.11853) -* Reis, S., Abreu, R., & Cruz, L. (2021). [Fixing Vulnerabilities Potentially Hinders Maintainability.](http://arxiv.org/abs/2106.03271) -* Andrade, R., & Santos, V. (2021). [Investigating vulnerability datasets.](https://sol.sbc.org.br/index.php/vem/article/view/17213) Anais Do IX Workshop de Visualização, Evolução e Manutenção de Software, 26–30. 
Presented at the Joinville. -* Nguyen, T. G., Le-Cong, T., Kang, H. J., Widyasari, R., Yang, C., Zhao, Z., … Lo, D. (2023). [Multi-Granularity Detector for Vulnerability Fixes.](https://arxiv.org/abs/2305.13884) -* Siddiq, M. L., & Santos, J. C. S. (2022). [SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques.](https://dl.acm.org/doi/abs/10.1145/3549035.3561184) Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, 29–33. Presented at the Singapore, Singapore. -* Sawadogo, A. D., Bissyandé, T. F., Moha, N., Allix, K., Klein, J., Li, L., & Traon, Y. L. (2020). [Learning to Catch Security Patches.](https://arxiv.org/abs/2001.09148) -* Dunlap, T., Lin, E., Enck, W., & Reaves, B. (2023). [VFCFinder: Seamlessly Pairing Security Advisories and Patches.](http://arxiv.org/abs/2311.01532) -* Bao, L., Xia, X., Hassan, A. E., & Yang, X. (2022). [V-SZZ: automatic identification of version ranges affected by CVE vulnerabilities.](https://dl.acm.org/doi/10.1145/3510003.3510113) Proceedings of the 44th International Conference on Software Engineering, 2352–2364. Presented at the Pittsburgh, Pennsylvania. -* Fan, J., Li, Y., Wang, S., & Nguyen, T. N. (2020). [A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries.](https://dl.acm.org/doi/10.1145/3379597.3387501) Proceedings of the 17th International Conference on Mining Software Repositories, 508–512. Presented at the Seoul, Republic of Korea. -* Zhang, Q., Fang, C., Ma, Y., Sun, W., & Chen, Z. (2023). [A Survey of Learning-based Automated Program Repair.](http://arxiv.org/abs/2301.03270) -* Alzubaidi, L., Bai, J., Al-Sabaawi, A., Santamaría, J. I., Albahri, A. S., Al-dabbagh, B. S. N., … Gu, Y. (2023). 
[A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications.](https://www.semanticscholar.org/paper/A-survey-on-deep-learning-tools-dealing-with-data-Alzubaidi-Bai/4a07ded5f56aa76c75e844f353e046414b427cc2) Journal of Big Data, 10, 1–82. -* Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2024). [A survey on machine learning techniques applied to source code.](https://discovery.ucl.ac.uk/id/eprint/10184342/) Journal of Systems and Software, 209, 111934. -* Elder, S., Rahman, M. R., Fringer, G., Kapoor, K., & Williams, L. (2024). [A Survey on Software Vulnerability Exploitability Assessment.](https://dl.acm.org/doi/10.1145/3648610) ACM Comput. Surv., 56(8). -* Aladics, T., Hegedűs, P., & Ferenc, R. (2023). [An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction.](https://arxiv.org/abs/2303.16591) -* Singhal, A., & Goel, P. K. (2023). [Analysis and Identification of Malicious Mobile Applications.](https://www.researchgate.net/publication/378257226_Analysis_and_Identification_of_Malicious_Mobile_Applications) 2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE), 1045–1050. -* Senanayake, J., Kalutarage, H., & Al-Kadri, M. O. (2021). [Android Mobile Malware Detection Using Machine Learning: A Systematic Review.](https://www.mdpi.com/2079-9292/10/13/1606) Electronics, 10(13). -* Bui, Q.-C., Paramitha, R., Vu, D.-L., Massacci, F., & Scandariato, R. (12 2023). [APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities.](https://link.springer.com/article/10.1007/s10664-023-10415-7) Empirical Software Engineering, 29. -* Senanayake, J., Kalutarage, H., Al-Kadri, M. O., Petrovski, A., & Piras, L. (2023). [Android Source Code Vulnerability Detection: A Systematic Literature Review.](https://dl.acm.org/doi/10.1145/3556974) ACM Comput. Surv., 55(9). 
-* Reis, S., Abreu, R., & Pasareanu, C. (2023). [Are security commit messages informative? Not enough!](https://dl.acm.org/doi/10.1145/3593434.3593481) Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, 196–199. Presented at the Oulu, Finland. -* [Beyond Syntax Trees: Learning Embeddings of Code Edits by Combining Multiple Source Representations.](https://api.semanticscholar.org/CorpusID:249038879) (2022). -* Challande, A., David, R., & Renault, G. (2022). [Building a Commit-level Dataset of Real-world Vulnerabilities.](https://dl.acm.org/doi/10.1145/3508398.3511495) Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, 101–106. Presented at the Baltimore, MD, USA. -* Wang, Song, & Nagappan, N. (2019). [Characterizing and Understanding Software Developer Networks in Security Development.](http://arxiv.org/abs/1907.12141) -* Harzevili, N. S., Shin, J., Wang, J., & Wang, S. (2022). [Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries.](http://arxiv.org/abs/2203.06502) -* Zhang, L., Liu, C., Xu, Z., Chen, S., Fan, L., Zhao, L., … Liu, Y. (2023). [Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects.](http://arxiv.org/abs/2301.08434) -* Lee, J. Y. D., & Chieu, H. L. (2021, November). [Co-training for Commit Classification.](https://aclanthology.org/2021.wnut-1.43/) -* In W. Xu, A. Ritter, T. Baldwin, & A. Rahimi (Eds.), [Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)](https://aclanthology.org/volumes/2021.wnut-1/) -* Nikitopoulos, G., Dritsa, K., Louridas, P., & Mitropoulos, D. (2021). [CrossVul: a cross-language vulnerability dataset with commit data.](https://dl.acm.org/doi/10.1145/3468264.3473122) Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1565–1569. 
Presented at the Athens, Greece. -* Bhandari, G., Naseer, A., & Moonen, L. (2021, August). [CVEfixes: automated collection of vulnerabilities and their fixes from open-source software.](https://arxiv.org/abs/2107.08760) Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering. -* Sonnekalb, T., Heinze, T. S., & Mäder, P. (2022). [Deep security analysis of program code: A systematic literature review.](https://link.springer.com/article/10.1007/s10664-021-10029-x) Empirical Softw. Engg., 27(1). -* Le, T. H. M., Hin, D., Croft, R., & Babar, M. A. (2021). [DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning.](http://arxiv.org/abs/2108.08041) -* Senanayake, J., Kalutarage, H., Petrovski, A., Piras, L., & Al-Kadri, M. O. (2024). [Defendroid: Real-time Android code vulnerability detection via blockchain federated neural network with XAI.](https://www.sciencedirect.com/science/article/pii/S2214212624000449) Journal of Information Security and Applications, 82, 103741. -* Stefanoni, A., Girdzijauskas, S., Jenkins, C., Kefato, Z. T., Sbattella, L., Scotti, V., & Wåreus, E. (2022). [Detecting Security Patches in Java Projects Using NLP Technology.](https://api.semanticscholar.org/CorpusID:256739262) International Conference on Natural Language and Speech Processing. -* Okutan, A., Mell, P., Mirakhorli, M., Khokhlov, I., Santos, J. C. S., Gonzalez, D., & Simmons, S. (2023). [Empirical Validation of Automated Vulnerability Curation and Characterization.](https://ieeexplore.ieee.org/document/10056768) IEEE Transactions on Software Engineering, 49(5), 3241–3260. -* Wang, J., Cao, L., Luo, X., Zhou, Z., Xie, J., Jatowt, A., & Cai, Y. (2023). [Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation.](http://arxiv.org/abs/2310.16263) -* Bottner, L., Hermann, A., Eppler, J., Thüm, T., & Kargl, F. (2023). 
[Evaluation of Free and Open Source Tools for Automated Software Composition Analysis.](https://dl.acm.org/doi/abs/10.1145/3631204.3631862) Proceedings of the 7th ACM Computer Science in Cars Symposium. Presented at the Darmstadt, Germany. -* Ganz, T., Härterich, M., Warnecke, A., & Rieck, K. (2021). [Explaining Graph Neural Networks for Vulnerability Discovery.](doi:10.1145/3474369.3486866) Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security, 145–156. Presented at the Virtual Event, Republic of Korea. -* Ram, A., Xin, J., Nagappan, M., Yu, Y., Lozoya, R. C., Sabetta, A., & Lin, J. (2019). [Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits.](http://arxiv.org/abs/1911.07620) -* Rahman, M. M., Watanobe, Y., Shirafuji, A., & Hamada, M. (2023). [Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey.](http://arxiv.org/abs/2307.08705) -* Zhang, Y., Song, W., Ji, Z., Danfeng, Yao, & Meng, N. (2023). [How well does LLM generate security tests?](http://arxiv.org/abs/2310.00710) -* Jing, D. (2022). [Improvement of Vulnerable Code Dataset Based on Program Equivalence Transformation.](https://iopscience.iop.org/article/10.1088/1742-6596/2363/1/012010/pdf) Journal of Physics: Conference Series, 2363(1), 012010. -* Wu, Yi, Jiang, N., Pham, H. V., Lutellier, T., Davis, J., Tan, L., … Shah, S. (2023, July). [How Effective Are Neural Networks for Fixing Security Vulnerabilities.](https://arxiv.org/abs/2305.18607) Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. -* Yang, G., Dineen, S., Lin, Z., & Liu, X. (2021). [Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models.](http://arxiv.org/abs/2108.06590) -* Zhou, J., Pacheco, M., Wan, Z., Xia, X., Lo, D., Wang, Y., & Hassan, A. E. (2021). 
[Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes.](https://ieeexplore.ieee.org/document/9678720) 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 705–716. -* Dunlap, T., Thorn, S., Enck, W., & Reaves, B. (2023). [Finding Fixed Vulnerabilities with Off-the-Shelf Static Analysis.](https://ieeexplore.ieee.org/document/10190493) 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), 489–505. -* Shestov, A., Levichev, R., Mussabayev, R., Maslov, E., Cheshkov, A., & Zadorozhny, P. (2024). [Finetuning Large Language Models for Vulnerability Detection.](http://arxiv.org/abs/2401.17010) -* Scalco, S., & Paramitha, R. (2024). [Hash4Patch: A Lightweight Low False Positive Tool for Finding Vulnerability Patch Commits.](https://dl.acm.org/doi/10.1145/3643991.3644871) Proceedings of the 21st International Conference on Mining Software Repositories, 733–737. Presented at the Lisbon, Portugal. -* Nguyen-Truong, G., Kang, H. J., Lo, D., Sharma, A., Santosa, A. E., Sharma, A., & Ang, M. Y. (2022). [HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits.](https://ieeexplore.ieee.org/document/9825835) 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 51–62. -* Wang, J., Luo, X., Cao, L., He, H., Huang, H., Xie, J., … Cai, Y. (2024). [Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval.](http://arxiv.org/abs/2407.02395) -* Tony, C., Mutas, M., Ferreyra, N. E. D., & Scandariato, R. (2023). [LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations.](http://arxiv.org/abs/2303.09384) -* Chen, Z., Kommrusch, S., & Monperrus, M. (2023). [Neural Transfer Learning for Repairing Security Vulnerabilities in C Code.](https://ieeexplore.ieee.org/document/9699412) IEEE Transactions on Software Engineering, 49(1), 147–165. -* Papotti, A., Paramitha, R., & Massacci, F. (2022). 
[On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools.](http://arxiv.org/abs/2209.07211) -* Mir, A. M., Keshani, M., & Proksch, S. (2024). [On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study.](http://arxiv.org/abs/2402.07294) -* Dietrich, J., Rasheed, S., Jordan, A., & White, T. (2023). [On the Security Blind Spots of Software Composition Analysis.](http://arxiv.org/abs/2306.05534) -* Le, T. H. M., & Babar, M. A. (2022). [On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models.](http://arxiv.org/abs/2203.08417) -* Chapman, J., & Venugopalan, H. (2022). [Open Source Software Computed Risk Framework.](https://www.bibsonomy.org/bibtex/1c114d6756c609391db2f66919f237261) 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 172–175. -* Canfora, G., Di Sorbo, A., Forootani, S., Martinez, M., & Visaggio, C. A. (2022). [Patchworking: Exploring the code changes induced by vulnerability fixing activities.](https://www.sciencedirect.com/science/article/abs/pii/S0950584921001932) Information and Software Technology, 142, 106745. -* Garg, S., Moghaddam, R. Z., Sundaresan, N., & Wu, C. (2021). [PerfLens: a data-driven performance bug detection and fix platform.](https://dl.acm.org/doi/10.1145/3460946.3464318) Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, 19–24. Presented at the Virtual, Canada. -* Coskun, T., Halepmollasi, R., Hanifi, K., Fouladi, R. F., De Cnudde, P. C., & Tosun, A. (2022). [Profiling developers to predict vulnerable code changes.](https://dl.acm.org/doi/10.1145/3558489.3559069) Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering, 32–41. Presented at the Singapore, Singapore. -* Bhuiyan, M. H. M., Parthasarathy, A. S., Vasilakis, N., Pradel, M., & Staicu, C.-A. (2023). 
[SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript.](https://ieeexplore.ieee.org/document/10172577) 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 1059–1070. -* Reis, S., Abreu, R., Erdogmus, H., & Păsăreanu, C. (2022). [SECOM: towards a convention for security commit messages.](https://dl.acm.org/doi/abs/10.1145/3524842.3528513) Proceedings of the 19th International Conference on Mining Software Repositories, 764–765. Presented at the Pittsburgh, Pennsylvania. -* Bennett, G., Hall, T., Winter, E., & Counsell, S. (2024). [Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools.](https://dl.acm.org/doi/10.1145/3661167.3661262) Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 614–623, Salerno, Italy. -* Chi, J., Qu, Y., Liu, T., Zheng, Q., & Yin, H. (2022). [SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning.](http://arxiv.org/abs/2010.10805) -* Ahmed, A., Said, A., Shabbir, M., & Koutsoukos, X. (2023). [Sequential Graph Neural Networks for Source Code Vulnerability Identification.](http://arxiv.org/abs/2306.05375) -* Sun, J., Xing, Z., Lu, Q., Xu, X., Zhu, L., Hoang, T., & Zhao, D. (2023). [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation.](http://arxiv.org/abs/2302.07445) -* Zhao, L., Chen, S., Xu, Z., Liu, C., Zhang, L., Wu, J., … Liu, Y. (2023). [Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects.](https://dl.acm.org/doi/10.1145/3611643.3616299) Proceedings of the 31st ACM Joint European Software Engineering Conference and * Symposium on the Foundations of Software Engineering, 960–972. Presented at the San Francisco, CA, USA. -* ZHAN, Q., PAN S-Y., HU X., BAO L-F., XIA, X. (2024). 
[Survey on Vulnerability Awareness of Open Source Software.](https://www.jos.org.cn/josen/article/abstract/6935) Journal of Software, 35(1), 19. -* Li, X., Moreschini, S., Zhang, Z., Palomba, F., & Taibi, D. (2023). [The anatomy of a vulnerability database: A systematic mapping study.](https://www.sciencedirect.com/science/article/pii/S0164121223000742) Journal of Systems and Software, 201, 111679. -* Al Debeyan, F., Madeyski, L., Hall, T., & Bowes, D. (2024). [The impact of hard and easy negative training data on vulnerability prediction performance.](https://www.sciencedirect.com/science/article/pii/S0164121224000463) Journal of Systems and Software, 211, 112003. -* Xu, C., Chen, B., Lu, C., Huang, K., Peng, X., & Liu, Y. (2023). [Tracking Patches for Open Source Software Vulnerabilities.](http://arxiv.org/abs/2112.02240) -* Risse, N., & Böhme, M. (2024). [Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection.](http://arxiv.org/abs/2306.17193) -* Nie, X., Li, N., Wang, K., Wang, S., Luo, X., & Wang, H. (2023). [Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper).](https://dl.acm.org/doi/10.1145/3597926.3598037) Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 52–63. Presented at the Seattle, WA, USA. -* Wu, Yulun, Yu, Z., Wen, M., Li, Q., Zou, D., & Jin, H. (2023). [Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem.](https://dl.acm.org/doi/10.1109/ICSE48619.2023.00095) 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 1046–1058. -* Esposito, M., & Falessi, D. (2024). [VALIDATE: A deep dive into vulnerability prediction datasets.](https://dl.acm.org/doi/abs/10.1016/j.infsof.2024.107448) Information and Software Technology, 170, 107448. -* Wang, Shichao, Zhang, Y., Bao, L., Xia, X., & Wu, M. (2022). 
[VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities.](https://ieeexplore.ieee.org/document/9825908) 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 589–600. -* Sun, Q., Xu, L., Xiao, Y., Li, F., Su, H., Liu, Y., … Huo, W. (2022). [VERJava: Vulnerable Version Identification for Java OSS with a Two-Stage Analysis.](https://ieeexplore.ieee.org/document/9978189) 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), 329–339. -* Nguyen, S., Vu, T. T., & Vo, H. D. (2023). [VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification.](http://arxiv.org/abs/2309.01971) -* Piran, A., Chang, C.-P., & Fard, A. M. (2021). [Vulnerability Analysis of Similar Code.](https://ieeexplore.ieee.org/document/9724745) 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), 664–671. -* Keller, P., Plein, L., Bissyandé, T. F., Klein, J., & Traon, Y. L. (2020). [What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning.](http://arxiv.org/abs/2002.02650) -* Akhoundali, J., Nouri, S. R., Rietveld, K., & Gadyatskaya, O. (2024). [MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery.](https://dl.acm.org/doi/10.1145/3663533.3664036) Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering, 42–51. Presented at the Porto de Galinhas, Brazil. + + * Tushar Sharma, Maria Kechagia, Stefanos Georgiou, Rohit Tiwari, Indira Vats, Hadi Moazen, and Federica Sarro. [A survey on machine learning techniques applied to source code](https://www.sciencedirect.com/science/article/pii/S0164121223003291). Journal of Systems and Software. (2024). + + * Elder, Sarah, Rahman, Md Rayhanur, Fringer, Gage, Kapoor, Kunal, and Williams, Laurie. 
[A Survey on Software Vulnerability Exploitability Assessment](https://doi.org/10.1145/3648610). ACM Comput. Surv.. (2024). + + * Janaka Senanayake, Harsha Kalutarage, Andrei Petrovski, Luca Piras, and Mhd Omar Al-Kadri. [Defendroid: Real-time Android code vulnerability detection via blockchain federated neural network with XAI](https://www.sciencedirect.com/science/article/pii/S2214212624000449). Journal of Information Security and Applications. (2024). + + * Alexey Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Anton Cheshkov, and Pavel Zadorozhny. [Finetuning Large Language Models for Vulnerability Detection](https://arxiv.org/abs/2401.17010). arXiv. (2024). + + * Scalco, Simone, and Paramitha, Ranindya. [Hash4Patch: A Lightweight Low False Positive Tool for Finding Vulnerability Patch Commits](https://doi.org/10.1145/3643991.3644871). Proceedings of the 21st International Conference on Mining Software Repositories. (2024). + + * Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, and Yi Cai. [Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval](https://arxiv.org/abs/2407.02395). arXiv. (2024). + + * Amir M. Mir, Mehdi Keshani, and Sebastian Proksch. [On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study](https://arxiv.org/abs/2402.07294). arXiv. (2024). + + * Bennett, Gareth, Hall, Tracy, Winter, Emily, and Counsell, Steve. [Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools](https://doi.org/10.1145/3661167.3661262). Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. (2024). + + * ZHAN Qi, PAN Sheng-Yi, HU Xing, BAO Ling-Feng, XIA Xin. Survey on Vulnerability Awareness of Open Source Software. Journal of Software. (2024). + + * Fahad {Al Debeyan}, Lech Madeyski, Tracy Hall, and David Bowes. 
[The impact of hard and easy negative training data on vulnerability prediction performance](https://www.sciencedirect.com/science/article/pii/S0164121224000463). Journal of Systems and Software. (2024). + + * Niklas Risse, and Marcel Böhme. [Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection](https://arxiv.org/abs/2306.17193). arXiv. (2024). + + * Matteo Esposito, and Davide Falessi. [VALIDATE: A deep dive into vulnerability prediction datasets](https://www.sciencedirect.com/science/article/pii/S0950584924000533). Information and Software Technology. (2024). + + * Akhoundali, Jafar, Nouri, Sajad Rahim, Rietveld, Kristian, and Gadyatskaya, Olga. [MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery](https://doi.org/10.1145/3663533.3664036). Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering. (2024). + + * Shiyu Sun, Shu Wang, Xinda Wang, Yunlong Xing, Elisa Zhang, and Kun Sun. [Exploring Security Commits in Python](https://arxiv.org/abs/2307.11853). arXiv. (2023). + + * Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Chengran Yang, Zhipeng Zhao, Bowen Xu, Jiayuan Zhou, Xin Xia, Ahmed E. Hassan, Xuan-Bach D. Le, and David Lo. [Multi-Granularity Detector for Vulnerability Fixes](https://arxiv.org/abs/2305.13884). arXiv. (2023). + + * Trevor Dunlap, Elizabeth Lin, William Enck, and Bradley Reaves. [VFCFinder: Seamlessly Pairing Security Advisories and Patches](https://arxiv.org/abs/2311.01532). arXiv. (2023). + + * Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. [A Survey of Learning-based Automated Program Repair](https://arxiv.org/abs/2301.03270). arXiv. (2023). + + * Laith Alzubaidi, Jinshuai Bai, Aiman Al-Sabaawi, Jos{\'e} I. Santamar{\'i}a, Ahmed Shihab Albahri, Bashar Sami Nayyef Al-dabbagh, Mohammed Abdulraheem Fadhel, Mohamed Manoufali, Jinglan Zhang, Ali H. 
Al-timemy, Ye Duan, Amjed Abdullah, Laith Farhan, Yi Lu, Ashish Gupta, Felix Albu, Amin Abbosh, and Yuantong Gu. [A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications](https://api.semanticscholar.org/CorpusID:258137181). Journal of Big Data. (2023). + + * Tamás Aladics, Péter Hegedűs, and Rudolf Ferenc. [An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction](https://arxiv.org/abs/2303.16591). arXiv. (2023). + + * Singhal, Amit, and Goel, Pawan Kumar. Analysis and Identification of Malicious Mobile Applications. 2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE). (2023). + + * Bui, Quang-Cuong, Paramitha, Ranindya, Vu, Duc-Ly, Massacci, Fabio, and Scandariato, Riccardo. APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities. Empirical Software Engineering. (2023). + + * Senanayake, Janaka, Kalutarage, Harsha, Al-Kadri, Mhd Omar, Petrovski, Andrei, and Piras, Luca. [Android Source Code Vulnerability Detection: A Systematic Literature Review](https://doi.org/10.1145/3556974). ACM Comput. Surv.. (2023). + + * Reis, Sofia, Abreu, Rui, and Pasareanu, Corina. [Are security commit messages informative? Not enough!](https://doi.org/10.1145/3593434.3593481). Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. (2023). + + * Lyuye Zhang, Chengwei Liu, Zhengzi Xu, Sen Chen, Lingling Fan, Lida Zhao, Jiahui Wu, and Yang Liu. [Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects](https://arxiv.org/abs/2301.08434). arXiv. (2023). + + * Okutan, Ahmet, Mell, Peter, Mirakhorli, Mehdi, Khokhlov, Igor, Santos, Joanna C. S., Gonzalez, Danielle, and Simmons, Steven. Empirical Validation of Automated Vulnerability Curation and Characterization. IEEE Transactions on Software Engineering. (2023). 
+ + * Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, and Yi Cai. [Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation](https://arxiv.org/abs/2310.16263). arXiv. (2023). + + * Bottner, Laura, Hermann, Artur, Eppler, Jeremias, Th\"{u}m, Thomas, and Kargl, Frank. [Evaluation of Free and Open Source Tools for Automated Software Composition Analysis](https://doi.org/10.1145/3631204.3631862). Proceedings of the 7th ACM Computer Science in Cars Symposium. (2023). + + * Md. Mostafizer Rahman, Yutaka Watanobe, Atsushi Shirafuji, and Mohamed Hamada. [Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey](https://arxiv.org/abs/2307.08705). arXiv. (2023). + + * Ying Zhang, Wenjia Song, Zhengjie Ji, Danfeng, Yao, and Na Meng. [How well does LLM generate security tests?](https://arxiv.org/abs/2310.00710). arXiv. (2023). + + * Wu, Yi, Jiang, Nan, Pham, Hung Viet, Lutellier, Thibaud, Davis, Jordan, Tan, Lin, Babkin, Petr, and Shah, Sameena. [How Effective Are Neural Networks for Fixing Security Vulnerabilities](http://dx.doi.org/10.1145/3597926.3598135). Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. (2023). + + * Dunlap, Trevor, Thorn, Seaver, Enck, William, and Reaves, Bradley. Finding Fixed Vulnerabilities with Off-the-Shelf Static Analysis. 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P). (2023). + + * Catherine Tony, Markus Mutas, Nicolás E. Díaz Ferreyra, and Riccardo Scandariato. [LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations](https://arxiv.org/abs/2303.09384). arXiv. (2023). + + * Chen, Zimin, Kommrusch, Steve, and Monperrus, Martin. [Neural Transfer Learning for Repairing Security Vulnerabilities in C Code](http://dx.doi.org/10.1109/TSE.2022.3147265). IEEE Transactions on Software Engineering. (2023). 
+ + * Jens Dietrich, Shawn Rasheed, Alexander Jordan, and Tim White. [On the Security Blind Spots of Software Composition Analysis](https://arxiv.org/abs/2306.05534). arXiv. (2023). + + * Bhuiyan, Masudul Hasan Masud, Parthasarathy, Adithya Srinivas, Vasilakis, Nikos, Pradel, Michael, and Staicu, Cristian-Alexandru. SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). (2023). + + * Ammar Ahmed, Anwar Said, Mudassir Shabbir, and Xenofon Koutsoukos. [Sequential Graph Neural Networks for Source Code Vulnerability Identification](https://arxiv.org/abs/2306.05375). arXiv. (2023). + + * Jiamou Sun, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Liming Zhu, Thong Hoang, and Dehai Zhao. [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation](https://arxiv.org/abs/2302.07445). arXiv. (2023). + + * Zhao, Lida, Chen, Sen, Xu, Zhengzi, Liu, Chengwei, Zhang, Lyuye, Wu, Jiahui, Sun, Jun, and Liu, Yang. [Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects](https://doi.org/10.1145/3611643.3616299). Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. (2023). + + * Xiaozhou Li, Sergio Moreschini, Zheying Zhang, Fabio Palomba, and Davide Taibi. [The anatomy of a vulnerability database: A systematic mapping study](https://www.sciencedirect.com/science/article/pii/S0164121223000742). Journal of Systems and Software. (2023). + + * Congying Xu, Bihuan Chen, Chenhao Lu, Kaifeng Huang, Xin Peng, and Yang Liu. [Tracking Patches for Open Source Software Vulnerabilities](https://arxiv.org/abs/2112.02240). arXiv. (2023). + + * Nie, Xu, Li, Ningke, Wang, Kailong, Wang, Shangguang, Luo, Xiapu, and Wang, Haoyu. 
[Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)](https://doi.org/10.1145/3597926.3598037). Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. (2023). + + * Wu, Yulun, Yu, Zeliang, Wen, Ming, Li, Qiang, Zou, Deqing, and Jin, Hai. Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). (2023). + + * Son Nguyen, Thanh Trong Vu, and Hieu Dinh Vo. [VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification](https://arxiv.org/abs/2309.01971). arXiv. (2023). + + * Tam{\'a}s Aladics, P{\'e}ter Heged{\"u}s, and Rudolf Ferenc. [A Vulnerability Introducing Commit Dataset for Java: An Improved SZZ based Approach](https://api.semanticscholar.org/CorpusID:250566828). International Conference on Software and Data Technologies. (2022). + + * Bui, Quang-Cuong, Scandariato, Riccardo, and Ferreyra, Nicol\'{a}s E. D\'{\i}az. [Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques](https://doi.org/10.1145/3524842.3528482). Proceedings of the 19th International Conference on Mining Software Repositories. (2022). + + * Tushar Sharma, Maria Kechagia, Stefanos Georgiou, Rohit Tiwari, Indira Vats, Hadi Moazen, and Federica Sarro. [A Survey on Machine Learning Techniques for Source Code Analysis](https://arxiv.org/abs/2110.09610). arXiv. (2022). + + * Siddiq, Mohammed Latif, and Santos, Joanna C. S.. [SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques](https://doi.org/10.1145/3549035.3561184). Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. (2022). + + * Bao, Lingfeng, Xia, Xin, Hassan, Ahmed E., and Yang, Xiaohu. 
[V-SZZ: automatic identification of version ranges affected by CVE vulnerabilities](https://doi.org/10.1145/3510003.3510113). Proceedings of the 44th International Conference on Software Engineering. (2022). + + * Challande, Alexis, David, Robin, and Renault, Gu\'{e}na\"{e}l. [Building a Commit-level Dataset of Real-world Vulnerabilities](https://doi.org/10.1145/3508398.3511495). Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy. (2022). + + * Nima Shiri Harzevili, Jiho Shin, Junjie Wang, and Song Wang. [Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries](https://arxiv.org/abs/2203.06502). arXiv. (2022). + + * Sonnekalb, Tim, Heinze, Thomas S., and M\"{a}der, Patrick. [Deep security analysis of program code: A systematic literature review](https://doi.org/10.1007/s10664-021-10029-x). Empirical Softw. Engg.. (2022). + + * Andrea Stefanoni, Sarunas Girdzijauskas, Christina Jenkins, Zekarias T. Kefato, Licia Sbattella, Vincenzo Scotti, and Emil W{\aa}reus. [Detecting Security Patches in Java Projects Using NLP Technology](https://api.semanticscholar.org/CorpusID:256739262). International Conference on Natural Language and Speech Processing. (2022). + + * Dejiang Jing. [Improvement of Vulnerable Code Dataset Based on Program Equivalence Transformation](https://dx.doi.org/10.1088/1742-6596/2363/1/012010). Journal of Physics: Conference Series. (2022). + + * Nguyen-Truong, Giang, Kang, Hong Jin, Lo, David, Sharma, Abhishek, Santosa, Andrew E., Sharma, Asankhaya, and Ang, Ming Yi. HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits. 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). (2022). + + * Aurora Papotti, Ranindya Paramitha, and Fabio Massacci. [On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools](https://arxiv.org/abs/2209.07211). arXiv. (2022). + + * Triet H. 
M. Le, and M. Ali Babar. [On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models](https://arxiv.org/abs/2203.08417). arXiv. (2022). + + * Chapman, Jon, and Venugopalan, Hari. Open Source Software Computed Risk Framework. 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT). (2022). + + * Gerardo Canfora, Andrea {Di Sorbo}, Sara Forootani, Matias Martinez, and Corrado A. Visaggio. [Patchworking: Exploring the code changes induced by vulnerability fixing activities](https://www.sciencedirect.com/science/article/pii/S0950584921001932). Information and Software Technology. (2022). + + * Coskun, Tugce, Halepmollasi, Rusen, Hanifi, Khadija, Fouladi, Ramin Fadaei, De Cnudde, Pinar Comak, and Tosun, Ayse. [Profiling developers to predict vulnerable code changes](https://doi.org/10.1145/3558489.3559069). Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering. (2022). + + * Reis, Sofia, Abreu, Rui, Erdogmus, Hakan, and P\u{a}s\u{a}reanu, Corina. [SECOM: towards a convention for security commit messages](https://doi.org/10.1145/3524842.3528513). Proceedings of the 19th International Conference on Mining Software Repositories. (2022). + + * Jianlei Chi, Yu Qu, Ting Liu, Qinghua Zheng, and Heng Yin. [SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning](https://arxiv.org/abs/2010.10805). arXiv. (2022). + + * Wang, Shichao, Zhang, Yun, Bao, Liagfeng, Xia, Xin, and Wu, Minghui. VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities. 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). (2022). + + * Sun, Qing, Xu, Lili, Xiao, Yang, Li, Feng, Su, He, Liu, Yiming, Huang, Hongyun, and Huo, Wei. VERJava: Vulnerable Version Identification for Java OSS with a Two-Stage Analysis. 
2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). (2022). + + * A. Dann, H. Plate, B. Hermann, S. Ponta, and E. Bodden. Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite. IEEE Transactions on Software Engineering. (2022). + + * Arthur D. Sawadogo, Quentin Guimard, Tegawendé F. Bissyandé, Abdoul Kader Kaboré, Jacques Klein, and Naouel Moha. [Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We?](https://arxiv.org/abs/2112.10123). arXiv. (2021). + + * Sofia Reis, Rui Abreu, and Luis Cruz. [Fixing Vulnerabilities Potentially Hinders Maintainability](https://arxiv.org/abs/2106.03271). arXiv. (2021). + + * Rodrigo Andrade, and Vinícius Santos. [ Investigating vulnerability datasets](https://sol.sbc.org.br/index.php/vem/article/view/17213). Anais do IX Workshop de Visualização, Evolução e Manutenção de Software. (2021). + + * Lee, Jian Yi David and Chieu, Hai Leong. [Co-training for Commit Classification](https://aclanthology.org/2021.wnut-1.43). Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). (2021). + + * Nikitopoulos, Georgios, Dritsa, Konstantina, Louridas, Panos, and Mitropoulos, Dimitris. [CrossVul: a cross-language vulnerability dataset with commit data](https://doi.org/10.1145/3468264.3473122). Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. (2021). + + * Bhandari, Guru, Naseer, Amara, and Moonen, Leon. [CVEfixes: automated collection of vulnerabilities and their fixes from open-source software](http://dx.doi.org/10.1145/3475960.3475985). Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering. (2021). + + * Triet H. M. Le, David Hin, Roland Croft, and M. Ali Babar. [DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning](https://arxiv.org/abs/2108.08041). 
arXiv. (2021). + + * Ganz, Tom, H\"{a}rterich, Martin, Warnecke, Alexander, and Rieck, Konrad. [Explaining Graph Neural Networks for Vulnerability Discovery](https://doi.org/10.1145/3474369.3486866). Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security. (2021). + + * Guanqun Yang, Shay Dineen, Zhipeng Lin, and Xueqing Liu. [Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models](https://arxiv.org/abs/2108.06590). arXiv. (2021). + + * Zhou, Jiayuan, Pacheco, Michael, Wan, Zhiyuan, Xia, Xin, Lo, David, Wang, Yuan, and Hassan, Ahmed E.. Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes. 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). (2021). + + * Garg, Spandan, Moghaddam, Roshanak Zilouchian, Sundaresan, Neel, and Wu, Chen. [PerfLens: a data-driven performance bug detection and fix platform](https://doi.org/10.1145/3460946.3464318). Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis. (2021). + + * Piran, Azin, Chang, Che-Pin, and Fard, Amin Milani. Vulnerability Analysis of Similar Code. 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). (2021). + + * Marchand-Melsom, Alexander, and Nguyen Mai, Duong Bao. [Automatic repair of OWASP Top 10 security vulnerabilities: A survey](https://doi.org/10.1145/3387940.3392200). Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops. (2020). + + * Arthur D. Sawadogo, Tegawendé F. Bissyandé, Naouel Moha, Kevin Allix, Jacques Klein, Li Li, and Yves Le Traon. [Learning to Catch Security Patches](https://arxiv.org/abs/2001.09148). arXiv. (2020). + + * Fan, Jiahao, Li, Yi, Wang, Shaohua, and Nguyen, Tien N.. [A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries](https://doi.org/10.1145/3379597.3387501). 
Proceedings of the 17th International Conference on Mining Software Repositories. (2020). + + * Patrick Keller, Laura Plein, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. [What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning](https://arxiv.org/abs/2002.02650). arXiv. (2020). + + * Song Wang, and Nachi Nagappan. [Characterizing and Understanding Software Developer Networks in Security Development](https://arxiv.org/abs/1907.12141). arXiv. (2019). + + + ## Star History @@ -211,7 +289,7 @@ ___ ## Credits -### EU-funded research projects +### EU-funded research projects The development of Project KB is partially supported by the following projects: @@ -219,21 +297,21 @@ The development of Project KB is partially supported by the following projects: * [AssureMOSS](https://assuremoss.eu) (Grant No. 952647). * [Sparta](https://www.sparta.eu/) (Grant No. 830892). -### Vulnerability data sources +### Vulnerability data sources Vulnerability information from NVD and MITRE might have been used as input for building parts of this knowledge base. See MITRE's [CVE Usage license](http://cve.mitre.org/about/termsofuse.html) for more information. ## Limitations and Known Issues -This project is **work-in-progress**, you can find the list of known issues [here](https://github.com/SAP/project-kb/issues). +This project is **work-in-progress**, you can find the list of known issues [here](https://github.com/SAP/project-kb/issues). Currently the vulnerability knowledge base only contains information about vulnerabilities in Java and Python open source components. ## Support For the time being, please use [GitHub -issues](https://github.com/SAP/project-kb/issues) to report bugs, request new features and ask for support. +issues](https://github.com/SAP/project-kb/issues) to report bugs, request new features and ask for support. 
## Contributing From f1efef34bf339140cc603a496674bcb0e7111343 Mon Sep 17 00:00:00 2001 From: "Antonino Sabetta (i064196)" Date: Thu, 25 Jul 2024 16:06:54 +0200 Subject: [PATCH 74/83] drop incomplete entry --- all_references.bib | 21 +++++---------------- 1 file changed, 5 insertions(+), 16 deletions(-) diff --git a/all_references.bib b/all_references.bib index cb6d35b60..1d446e637 100644 --- a/all_references.bib +++ b/all_references.bib @@ -349,12 +349,6 @@ @inproceedings{10.1145/3593434.3593481 series = {EASE '23} } -@inproceedings{2022BES, - title={B EYOND SYNTAX TREES : LEARNING EMBEDDINGS OF CODE EDITS BY COMBINING MULTIPLE SOURCE REP - RESENTATIONS}, - author={}, - year={2022}, - url={https://api.semanticscholar.org/CorpusID:249038879} -} @inproceedings{10.1145/3508398.3511495, author = {Challande, Alexis and David, Robin and Renault, Gu\'{e}na\"{e}l}, @@ -499,7 +493,7 @@ @inproceedings{Stefanoni2022DetectingSP author={Andrea Stefanoni and Sarunas Girdzijauskas and Christina Jenkins and Zekarias T. 
Kefato and Licia Sbattella and Vincenzo Scotti and Emil W{\aa}reus}, booktitle={International Conference on Natural Language and Speech Processing}, year={2022}, - url={https://api.semanticscholar.org/CorpusID:256739262} + url={https://aclanthology.org/2022.icnlsp-1.6.pdf} } @ARTICLE{10056768, @@ -633,7 +627,8 @@ @INPROCEEDINGS{9678720 number={}, pages={705-716}, keywords={Measurement;Codes;Semantics;Transformers;Needles;Security;Probes;Software Security;Vulnerability Fix;Open Source Software;Deep Learning}, - doi={10.1109/ASE51524.2021.9678720}} + doi={10.1109/ASE51524.2021.9678720}, + url = {https://ieeexplore.ieee.org/document/9678720}} @INPROCEEDINGS{10190493, author={Dunlap, Trevor and Thorn, Seaver and Enck, William and Reaves, Bradley}, @@ -1169,7 +1164,7 @@ @article{Ponta2020DetectionAA year={2020}, volume={25}, pages={3175 - 3215}, - url={https://api.semanticscholar.org/CorpusID:220259876} + url={https://link.springer.com/article/10.1007/s10664-020-09830-x} } @ARTICLE {9506931, @@ -1184,6 +1179,7 @@ @ARTICLE {9506931 abstract = {The use of vulnerable open-source dependencies is a known problem in today's software development. Several vulnerability scanners to detect known-vulnerable dependencies appeared in the last decade, however, there exists no case study investigating the impact of development practices, e.g., forking, patching, re-bundling, on their performance. This paper studies (i) types of modifications that may affect vulnerable open-source dependencies and (ii) their impact on the performance of vulnerability scanners. Through an empirical study on 7,024 Java projects developed at SAP, we identified four types of modifications: re-compilation, re-bundling, metadata-removal and re-packaging. In particular, we found that more than 87 percent (56 percent, resp.) of the vulnerable Java classes considered occur in Maven Central in re-bundled (re-packaged, resp.) form. 
We assessed the impact of these modifications on the performance of the open-source vulnerability scanners OWASP Dependency-Check (OWASP) and Eclipse Steady, GitHub Security Alerts, and three commercial scanners. The results show that none of the scanners is able to handle all the types of modifications identified. Finally, we present Achilles, a novel test suite with 2,505 test cases that allow replicating the modifications on open-source dependencies.}, keywords = {open source software;databases;java;benchmark testing;tools;security;software}, doi = {10.1109/TSE.2021.3101739}, +url = {https://ieeexplore.ieee.org/document/9506931}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, month = {sep} @@ -1209,10 +1205,3 @@ @INPROCEEDINGS{9462983 pages={396-400}, keywords={Java;Tools;Libraries;Security;Reachability analysis;Open source software;Genetic algorithms;Exploit Generation;Security Testing;Software Vulnerabilities}, doi={10.1109/ICPC52881.2021.00046}} - - - - - - - From 6192ba6cc84b4c15a6cac0392e8b23e08217f00d Mon Sep 17 00:00:00 2001 From: "Antonino Sabetta (i064196)" Date: Thu, 25 Jul 2024 16:07:16 +0200 Subject: [PATCH 75/83] minor improvement --- scripts/bib2md.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/bib2md.py b/scripts/bib2md.py index 5c345c3e3..990ecb16d 100644 --- a/scripts/bib2md.py +++ b/scripts/bib2md.py @@ -73,7 +73,7 @@ def format_simple(entry_str, order="desc"): "Warning: Some entries were not processed due to unknown type", file=sys.stderr, ) - print("List of unprocessed entrie(s):", unprocessed_entries) + print("List of unprocessed entrie(s):", [e for e in unprocessed_entries]) return [entry[1] for entry in formatted_entries] From e4d050a5e92de629472ce8ac2b5773afa8e6eac2 Mon Sep 17 00:00:00 2001 From: "Antonino Sabetta (i064196)" Date: Fri, 26 Jul 2024 08:29:14 +0200 Subject: [PATCH 76/83] moved bib files to separate folder --- all_references.bib => references/others.bib | 92 
-------------- references/ours.bib | 125 ++++++++++++++++++++ 2 files changed, 125 insertions(+), 92 deletions(-) rename all_references.bib => references/others.bib (92%) create mode 100644 references/ours.bib diff --git a/all_references.bib b/references/others.bib similarity index 92% rename from all_references.bib rename to references/others.bib index 1d446e637..c43c6ba49 100644 --- a/all_references.bib +++ b/references/others.bib @@ -34,25 +34,6 @@ @misc{sharma2022surveymachinelearningtechniques url={https://arxiv.org/abs/2110.09610}, } -@article{10.1145/3649590, -author = {Hommersom, Daan and Sabetta, Antonino and Coppola, Bonaventura and Nucci, Dario Di and Tamburri, Damian A.}, -title = {Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories}, -year = {2024}, -issue_date = {June 2024}, -publisher = {Association for Computing Machinery}, -address = {New York, NY, USA}, -volume = {33}, -number = {5}, -issn = {1049-331X}, -url = {https://doi.org/10.1145/3649590}, -doi = {10.1145/3649590}, -abstract = {The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this article, we present an approach that combines heuristics stemming from practical experience and machine-learning (ML)—specifically, natural language processing (NLP)—to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability that is extracted from an advisory, such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project, by filtering out commits that can be identified as unrelated to the vulnerability at hand. 
Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fixing commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to easily interpret the predictions.We implemented our approach and we evaluated it on an open data set, built by manual curation, that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03\% of the vulnerabilities (with a fix commit on the first position for 65.06\% of the vulnerabilities). Our evaluation shows that our method can reduce considerably the manual effort needed to search open-source software (OSS) repositories for the commits that fix known vulnerabilities.}, -journal = {ACM Trans. Softw. Eng. 
Methodol.}, -month = {jun}, -articleno = {134}, -numpages = {28}, -keywords = {Open source software, software security, common vulnerabilities and exposures (CVE), National Vulnerability Database (NVD), mining software repositories, code-level vulnerability data, machine learning applied to software security} -} @inproceedings{10.1145/3387940.3392200, author = {Marchand-Melsom, Alexander and Nguyen Mai, Duong Bao}, @@ -1132,76 +1113,3 @@ @inproceedings{10.1145/3663533.3664036 location = {Porto de Galinhas, Brazil}, series = {PROMISE 2024} } - - -@article{Cabrera_Lozoya_2021, - title={Commit2Vec: Learning Distributed Representations of Code Changes}, - volume={2}, - ISSN={2661-8907}, - url={http://dx.doi.org/10.1007/s42979-021-00566-z}, - DOI={10.1007/s42979-021-00566-z}, - number={3}, - journal={SN Computer Science}, - publisher={Springer Science and Business Media LLC}, - author={Cabrera Lozoya, Rocío and Baumann, Arnaud and Sabetta, Antonino and Bezzi, Michele}, - year={2021}, - month=mar } - -@misc{fehrer2021detectingsecurityfixesopensource, - title={Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers}, - author={Therese Fehrer and Rocío Cabrera Lozoya and Antonino Sabetta and Dario Di Nucci and Damian A. Tamburri}, - year={2021}, - eprint={2105.03346}, - archivePrefix={arXiv}, - primaryClass={cs.SE}, - url={https://arxiv.org/abs/2105.03346}, -} - -@article{Ponta2020DetectionAA, - title={Detection, assessment and mitigation of vulnerabilities in open source dependencies}, - author={Serena Elisa Ponta and Henrik Plate and Antonino Sabetta}, - journal={Empirical Software Engineering}, - year={2020}, - volume={25}, - pages={3175 - 3215}, - url={https://link.springer.com/article/10.1007/s10664-020-09830-x} -} - -@ARTICLE {9506931, -author = {A. Dann and H. Plate and B. Hermann and S. Ponta and E. 
Bodden}, -journal = {IEEE Transactions on Software Engineering}, -title = {Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite}, -year = {2022}, -volume = {48}, -number = {09}, -issn = {1939-3520}, -pages = {3613-3625}, -abstract = {The use of vulnerable open-source dependencies is a known problem in today's software development. Several vulnerability scanners to detect known-vulnerable dependencies appeared in the last decade, however, there exists no case study investigating the impact of development practices, e.g., forking, patching, re-bundling, on their performance. This paper studies (i) types of modifications that may affect vulnerable open-source dependencies and (ii) their impact on the performance of vulnerability scanners. Through an empirical study on 7,024 Java projects developed at SAP, we identified four types of modifications: re-compilation, re-bundling, metadata-removal and re-packaging. In particular, we found that more than 87 percent (56 percent, resp.) of the vulnerable Java classes considered occur in Maven Central in re-bundled (re-packaged, resp.) form. We assessed the impact of these modifications on the performance of the open-source vulnerability scanners OWASP Dependency-Check (OWASP) and Eclipse Steady, GitHub Security Alerts, and three commercial scanners. The results show that none of the scanners is able to handle all the types of modifications identified. 
Finally, we present Achilles, a novel test suite with 2,505 test cases that allow replicating the modifications on open-source dependencies.}, -keywords = {open source software;databases;java;benchmark testing;tools;security;software}, -doi = {10.1109/TSE.2021.3101739}, -url = {https://ieeexplore.ieee.org/document/9506931}, -publisher = {IEEE Computer Society}, -address = {Los Alamitos, CA, USA}, -month = {sep} -} - -@misc{ponta2021usedbloatedvulnerablereducing, - title={The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application}, - author={Serena Elisa Ponta and Wolfram Fischer and Henrik Plate and Antonino Sabetta}, - year={2021}, - eprint={2108.05115}, - archivePrefix={arXiv}, - primaryClass={cs.SE}, - url={https://arxiv.org/abs/2108.05115}, -} - -@INPROCEEDINGS{9462983, - author={Iannone, Emanuele and Nucci, Dario Di and Sabetta, Antonino and De Lucia, Andrea}, - booktitle={2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC)}, - title={Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries}, - year={2021}, - volume={}, - number={}, - pages={396-400}, - keywords={Java;Tools;Libraries;Security;Reachability analysis;Open source software;Genetic algorithms;Exploit Generation;Security Testing;Software Vulnerabilities}, - doi={10.1109/ICPC52881.2021.00046}} diff --git a/references/ours.bib b/references/ours.bib new file mode 100644 index 000000000..bdc61a5b4 --- /dev/null +++ b/references/ours.bib @@ -0,0 +1,125 @@ +@article{10.1145/3649590, +author = {Hommersom, Daan and Sabetta, Antonino and Coppola, Bonaventura and Nucci, Dario Di and Tamburri, Damian A.}, +title = {Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories}, +year = {2024}, +issue_date = {June 2024}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {33}, +number = {5}, +issn = {1049-331X}, +url = 
{https://doi.org/10.1145/3649590}, +doi = {10.1145/3649590}, +abstract = {The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this article, we present an approach that combines heuristics stemming from practical experience and machine-learning (ML)—specifically, natural language processing (NLP)—to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability that is extracted from an advisory, such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project, by filtering out commits that can be identified as unrelated to the vulnerability at hand. Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fixing commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to easily interpret the predictions.We implemented our approach and we evaluated it on an open data set, built by manual curation, that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03\% of the vulnerabilities (with a fix commit on the first position for 65.06\% of the vulnerabilities). 
Our evaluation shows that our method can reduce considerably the manual effort needed to search open-source software (OSS) repositories for the commits that fix known vulnerabilities.}, +journal = {ACM Trans. Softw. Eng. Methodol.}, +month = {jun}, +articleno = {134}, +numpages = {28}, +keywords = {Open source software, software security, common vulnerabilities and exposures (CVE), National Vulnerability Database (NVD), mining software repositories, code-level vulnerability data, machine learning applied to software security} +} + + +@misc{ram2019exploitingtokenpathbasedrepresentations, + title={Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits}, + author={Achyudh Ram and Ji Xin and Meiyappan Nagappan and Yaoliang Yu and Rocío Cabrera Lozoya and Antonino Sabetta and Jimmy Lin}, + year={2019}, + eprint={1911.07620}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/1911.07620}, +} + + + + +@article{Cabrera_Lozoya_2021, + title={Commit2Vec: Learning Distributed Representations of Code Changes}, + volume={2}, + ISSN={2661-8907}, + url={http://dx.doi.org/10.1007/s42979-021-00566-z}, + DOI={10.1007/s42979-021-00566-z}, + number={3}, + journal={SN Computer Science}, + publisher={Springer Science and Business Media LLC}, + author={Cabrera Lozoya, Rocío and Baumann, Arnaud and Sabetta, Antonino and Bezzi, Michele}, + year={2021}, + month=mar } + +@misc{fehrer2021detectingsecurityfixesopensource, + title={Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers}, + author={Therese Fehrer and Rocío Cabrera Lozoya and Antonino Sabetta and Dario Di Nucci and Damian A. 
Tamburri}, + year={2021}, + eprint={2105.03346}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2105.03346}, +} + +@article{Ponta2020DetectionAA, + title={Detection, assessment and mitigation of vulnerabilities in open source dependencies}, + author={Serena Elisa Ponta and Henrik Plate and Antonino Sabetta}, + journal={Empirical Software Engineering}, + year={2020}, + volume={25}, + pages={3175 - 3215}, + url={https://link.springer.com/article/10.1007/s10664-020-09830-x} +} + +@ARTICLE {9506931, +author = {A. Dann and H. Plate and B. Hermann and S. Ponta and E. Bodden}, +journal = {IEEE Transactions on Software Engineering}, +title = {Identifying Challenges for OSS Vulnerability Scanners - A Study & Test Suite}, +year = {2022}, +volume = {48}, +number = {09}, +issn = {1939-3520}, +pages = {3613-3625}, +abstract = {The use of vulnerable open-source dependencies is a known problem in today's software development. Several vulnerability scanners to detect known-vulnerable dependencies appeared in the last decade, however, there exists no case study investigating the impact of development practices, e.g., forking, patching, re-bundling, on their performance. This paper studies (i) types of modifications that may affect vulnerable open-source dependencies and (ii) their impact on the performance of vulnerability scanners. Through an empirical study on 7,024 Java projects developed at SAP, we identified four types of modifications: re-compilation, re-bundling, metadata-removal and re-packaging. In particular, we found that more than 87 percent (56 percent, resp.) of the vulnerable Java classes considered occur in Maven Central in re-bundled (re-packaged, resp.) form. We assessed the impact of these modifications on the performance of the open-source vulnerability scanners OWASP Dependency-Check (OWASP) and Eclipse Steady, GitHub Security Alerts, and three commercial scanners. 
The results show that none of the scanners is able to handle all the types of modifications identified. Finally, we present Achilles, a novel test suite with 2,505 test cases that allow replicating the modifications on open-source dependencies.}, +keywords = {open source software;databases;java;benchmark testing;tools;security;software}, +doi = {10.1109/TSE.2021.3101739}, +url = {https://ieeexplore.ieee.org/document/9506931}, +publisher = {IEEE Computer Society}, +address = {Los Alamitos, CA, USA}, +month = {sep} +} + +@misc{ponta2021usedbloatedvulnerablereducing, + title={The Used, the Bloated, and the Vulnerable: Reducing the Attack Surface of an Industrial Application}, + author={Serena Elisa Ponta and Wolfram Fischer and Henrik Plate and Antonino Sabetta}, + year={2021}, + eprint={2108.05115}, + archivePrefix={arXiv}, + primaryClass={cs.SE}, + url={https://arxiv.org/abs/2108.05115}, +} + +@INPROCEEDINGS{9462983, + author={Iannone, Emanuele and Nucci, Dario Di and Sabetta, Antonino and De Lucia, Andrea}, + booktitle={2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC)}, + title={Toward Automated Exploit Generation for Known Vulnerabilities in Open-Source Libraries}, + year={2021}, + volume={}, + number={}, + pages={396-400}, + keywords={Java;Tools;Libraries;Security;Reachability analysis;Open source software;Genetic algorithms;Exploit Generation;Security Testing;Software Vulnerabilities}, + doi={10.1109/ICPC52881.2021.00046}} + +@article{10.1145/3649590, +author = {Hommersom, Daan and Sabetta, Antonino and Coppola, Bonaventura and Nucci, Dario Di and Tamburri, Damian A.}, +title = {Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories}, +year = {2024}, +issue_date = {June 2024}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {33}, +number = {5}, +issn = {1049-331X}, +url = {https://doi.org/10.1145/3649590}, +doi = {10.1145/3649590}, +abstract = {The 
lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this article, we present an approach that combines heuristics stemming from practical experience and machine-learning (ML)—specifically, natural language processing (NLP)—to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability that is extracted from an advisory, such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project, by filtering out commits that can be identified as unrelated to the vulnerability at hand. Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fixing commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to easily interpret the predictions.We implemented our approach and we evaluated it on an open data set, built by manual curation, that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03\% of the vulnerabilities (with a fix commit on the first position for 65.06\% of the vulnerabilities). Our evaluation shows that our method can reduce considerably the manual effort needed to search open-source software (OSS) repositories for the commits that fix known vulnerabilities.}, +journal = {ACM Trans. Softw. Eng. 
Methodol.}, +month = {jun}, +articleno = {134}, +numpages = {28}, +keywords = {Open source software, software security, common vulnerabilities and exposures (CVE), National Vulnerability Database (NVD), mining software repositories, code-level vulnerability data, machine learning applied to software security} +} From afc87040c296d6953d2fde275547cd8c0119620b Mon Sep 17 00:00:00 2001 From: Antonino Sabetta Date: Fri, 26 Jul 2024 08:30:49 +0200 Subject: [PATCH 77/83] Update README.md - dropped gitter badge --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index 2d2f196fb..8003f63cf 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,6 @@ [![Go](https://github.com/sap/project-kb/workflows/Go/badge.svg)](https://github.com/SAP/project-kb/actions?query=workflow%3AGo) [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/SAP/project-kb/blob/master/LICENSE.txt) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/sap/project-kb/#contributing) -[![Join the chat at https://gitter.im/project-kb/general](https://badges.gitter.im/project-kb/general.svg)](https://gitter.im/project-kb/general?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) ![GitHub All Releases](https://img.shields.io/github/downloads/SAP/PROJECT-KB/total) [![REUSE status](https://api.reuse.software/badge/github.com/sap/project-kb)](https://api.reuse.software/info/github.com/sap/project-kb) [![Pytest](https://github.com/SAP/project-kb/actions/workflows/python.yml/badge.svg)](https://github.com/SAP/project-kb/actions/workflows/python.yml) From 579f881c6a99a6ebe73510be64751fb5decde9b7 Mon Sep 17 00:00:00 2001 From: matteogreek Date: Mon, 24 Jul 2023 16:41:06 +0200 Subject: [PATCH 78/83] Add option to exclude diff in json report --- .pre-commit-config.yaml | 6 +++--- prospector/cli/main.py | 6 +++++- prospector/config-sample.yaml | 2 ++ prospector/core/report.py | 13 
+++++++++---- prospector/datamodel/commit.py | 10 ++++++---- prospector/util/config_parser.py | 25 ++++++++++++++++++++----- 6 files changed, 45 insertions(+), 17 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 64dd2891d..9f1f5b5cf 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,7 +1,7 @@ fail_fast: true repos: - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v3.2.0 + rev: v4.3.0 hooks: - id: trailing-whitespace - id: end-of-file-fixer @@ -30,11 +30,11 @@ repos: # - id: go-unit-tests # - id: go-build - repo: https://github.com/psf/black - rev: 19.10b0 + rev: 22.10.0 hooks: - id: black - repo: https://github.com/pycqa/isort - rev: 5.6.4 + rev: 5.12.0 hooks: - id: isort args: ["--profile", "black", "--filter-files"] diff --git a/prospector/cli/main.py b/prospector/cli/main.py index eae7d01ae..0ef9e7217 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -111,7 +111,11 @@ def main(argv): # noqa: C901 return report.generate_report( - results, advisory_record, config.report, config.report_filename + results, + advisory_record, + config.report, + config.report_filename, + config.report_diff, ) execution_time = execution_statistics["core"]["execution time"][0] diff --git a/prospector/config-sample.yaml b/prospector/config-sample.yaml index 86e4ecad4..eeb14102b 100644 --- a/prospector/config-sample.yaml +++ b/prospector/config-sample.yaml @@ -62,6 +62,8 @@ enabled_rules: report: format: html name: prospector-report + no_diff: False + # Log level: "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL" log_level: INFO diff --git a/prospector/core/report.py b/prospector/core/report.py index 0cd43d871..16897608b 100644 --- a/prospector/core/report.py +++ b/prospector/core/report.py @@ -25,12 +25,15 @@ def json_( results: List[Commit], advisory_record: AdvisoryRecord, filename: str = "prospector-report.json", + no_diff: bool = False, ): fn = filename if filename.endswith(".json") else 
f"{filename}.json" data = { "advisory_record": advisory_record.__dict__, - "commits": [r.as_dict(no_hash=True, no_rules=False) for r in results], + "commits": [ + r.as_dict(no_hash=True, no_rules=False, no_diff=no_diff) for r in results + ], } logger.info(f"Writing results to {fn}") file = Path(fn) @@ -102,17 +105,19 @@ def format_annotations(commit: Commit) -> str: print(f"Found {count} candidates\nAdvisory record\n{advisory_record}") -def generate_report(results, advisory_record, report_type, report_filename): +def generate_report( + results, advisory_record, report_type, report_filename, report_diff=False +): with ConsoleWriter("Generating report\n") as console: match report_type: case "console": console_(results, advisory_record, get_level() < logging.INFO) case "json": - json_(results, advisory_record, report_filename) + json_(results, advisory_record, report_filename, report_diff) case "html": html_(results, advisory_record, report_filename) case "all": - json_(results, advisory_record, report_filename) + json_(results, advisory_record, report_filename, report_diff) html_(results, advisory_record, report_filename) case _: logger.warning("Invalid report type specified, using 'console'") diff --git a/prospector/datamodel/commit.py b/prospector/datamodel/commit.py index 0f1fd1fe8..d92a558b9 100644 --- a/prospector/datamodel/commit.py +++ b/prospector/datamodel/commit.py @@ -1,4 +1,4 @@ -from typing import Any, Dict, List, Optional, Tuple +from typing import Any, Dict, List, Optional from pydantic import BaseModel, Field @@ -85,15 +85,15 @@ def serialize_minhash(self): def deserialize_minhash(self, binary_minhash): self.minhash = decode_minhash(binary_minhash) - # TODO: can i delete this? 
- def as_dict(self, no_hash: bool = True, no_rules: bool = True): + def as_dict( + self, no_hash: bool = True, no_rules: bool = True, no_diff: bool = True + ): out = { "commit_id": self.commit_id, "repository": self.repository, "timestamp": self.timestamp, "hunks": self.hunks, "message": self.message, - "diff": self.diff, "changed_files": self.changed_files, "message_reference_content": self.message_reference_content, "jira_refs": self.jira_refs, @@ -102,6 +102,8 @@ def as_dict(self, no_hash: bool = True, no_rules: bool = True): "twins": self.twins, "tags": self.tags, } + if not no_diff: + out["diff"] = self.diff if not no_hash: out["minhash"] = encode_minhash(self.minhash) if not no_rules: diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index a53a109b0..b5391d1ca 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -35,7 +35,9 @@ def parse_cli_args(args): help="Commit preprocessing only", ) - parser.add_argument("--pub-date", type=str, help="Publication date of the advisory") + parser.add_argument( + "--pub-date", type=str, help="Publication date of the advisory" + ) # Allow the user to manually supply advisory description parser.add_argument("--description", type=str, help="Advisory description") @@ -154,7 +156,9 @@ def parse_config_file(filename: str = "config.yaml"): logger.error(f"Type error in {filename}: {e}") except Exception as e: # General exception catch block for any other exceptions - logger.error(f"An unexpected error occurred when parsing config.yaml: {e}") + logger.error( + f"An unexpected error occurred when parsing config.yaml: {e}" + ) else: logger.error("No configuration file found, cannot proceed.") @@ -202,7 +206,11 @@ class ConfigSchema: enabled_rules: List[str] = MISSING nvd_token: Optional[str] = None database: DatabaseConfig = DatabaseConfig( - user="postgres", password="example", host="db", port=5432, dbname="postgres" + user="postgres", + password="example", + host="db", 
+ port=5432, + dbname="postgres", ) llm_service: Optional[LLMServiceConfig] = None github_token: Optional[str] = None @@ -230,6 +238,7 @@ def __init__( backend: str, report: ReportConfig, report_filename: str, + report_diff: bool, ping: bool, log_level: str, git_cache: str, @@ -245,8 +254,12 @@ def __init__( self.description = description self.max_candidates = max_candidates # self.tag_interval = tag_interval - self.version_interval = version_interval if version_interval else "None:None" - self.modified_files = modified_files.split(",") if modified_files else [] + self.version_interval = ( + version_interval if version_interval else "None:None" + ) + self.modified_files = ( + modified_files.split(",") if modified_files else [] + ) self.filter_extensions = filter_extensions self.keywords = keywords.split(",") if keywords else [] self.use_nvd = use_nvd @@ -255,6 +268,7 @@ def __init__( self.use_backend = use_backend self.report = report self.report_filename = report_filename + self.report_diff = report_diff self.ping = ping self.log_level = log_level self.git_cache = git_cache @@ -292,6 +306,7 @@ def get_configuration(argv): use_backend=args.use_backend or conf.use_backend, report=args.report or conf.report.format, report_filename=args.report_filename or conf.report.name, + report_diff=conf.report.no_diff, ping=args.ping, git_cache=conf.git_cache, enabled_rules=conf.enabled_rules, From 00ba4f81ca2d684bbf22d29090a156dab35587d0 Mon Sep 17 00:00:00 2001 From: matteogreek Date: Tue, 25 Jul 2023 11:41:27 +0200 Subject: [PATCH 79/83] Add prospector run parameters in JSON report. Add cli option to exclude diff. 
--- prospector/cli/main.py | 35 +++++++++++++++++--------------- prospector/core/report.py | 25 ++++++++++++++++++++--- prospector/util/config_parser.py | 6 ++++++ 3 files changed, 47 insertions(+), 19 deletions(-) diff --git a/prospector/cli/main.py b/prospector/cli/main.py index 0ef9e7217..633754cdb 100644 --- a/prospector/cli/main.py +++ b/prospector/cli/main.py @@ -88,24 +88,26 @@ def main(argv): # noqa: C901 logger.debug("Vulnerability ID: " + config.vuln_id) - results, advisory_record = prospector( - vulnerability_id=config.vuln_id, - repository_url=config.repository, - publication_date=config.pub_date, - vuln_descr=config.description, - version_interval=config.version_interval, - modified_files=config.modified_files, - advisory_keywords=config.keywords, - use_nvd=config.use_nvd, + params = { + "vulnerability_id": config.vuln_id, + "repository_url": config.repository, + "publication_date": config.pub_date, + "vuln_descr": config.description, + "version_interval": config.version_interval, + "modified_files": config.modified_files, + "advisory_keywords": config.keywords, + "use_nvd": config.use_nvd, # fetch_references=config.fetch_references, - backend_address=config.backend, - use_backend=config.use_backend, - git_cache=config.git_cache, - limit_candidates=config.max_candidates, + "backend_address": config.backend, + "use_backend": config.use_backend, + "git_cache": config.git_cache, + "limit_candidates": config.max_candidates, # ignore_adv_refs=config.ignore_refs, - use_llm_repository_url=config.llm_service.use_llm_repository_url, - enabled_rules=config.enabled_rules, - ) + "use_llm_repository_url": config.llm_service.use_llm_repository_url, + "enabled_rules": config.enabled_rules, + } + + results, advisory_record = prospector(**params) if config.preprocess_only: return @@ -115,6 +117,7 @@ def main(argv): # noqa: C901 advisory_record, config.report, config.report_filename, + params, config.report_diff, ) diff --git a/prospector/core/report.py 
b/prospector/core/report.py index 16897608b..fb9006271 100644 --- a/prospector/core/report.py +++ b/prospector/core/report.py @@ -24,12 +24,14 @@ def default(self, obj): def json_( results: List[Commit], advisory_record: AdvisoryRecord, + params, filename: str = "prospector-report.json", no_diff: bool = False, ): fn = filename if filename.endswith(".json") else f"{filename}.json" data = { + "parameters": params, "advisory_record": advisory_record.__dict__, "commits": [ r.as_dict(no_hash=True, no_rules=False, no_diff=no_diff) for r in results @@ -106,18 +108,35 @@ def format_annotations(commit: Commit) -> str: def generate_report( - results, advisory_record, report_type, report_filename, report_diff=False + results, + advisory_record, + report_type, + report_filename, + prospector_params, + report_diff=False, ): with ConsoleWriter("Generating report\n") as console: match report_type: case "console": console_(results, advisory_record, get_level() < logging.INFO) case "json": - json_(results, advisory_record, report_filename, report_diff) + json_( + results, + advisory_record, + prospector_params, + report_filename, + report_diff, + ) case "html": html_(results, advisory_record, report_filename) case "all": - json_(results, advisory_record, report_filename, report_diff) + json_( + results, + advisory_record, + prospector_params, + report_filename, + report_diff, + ) html_(results, advisory_record, report_filename) case _: logger.warning("Invalid report type specified, using 'console'") diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index b5391d1ca..593bd676a 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -84,6 +84,12 @@ def parse_cli_args(args): help="Get data from NVD", ) + parser.add_argument( + "--no-diff", + action="store_true", + help="Do not include diff field in JSON report", + ) + parser.add_argument( "--fetch-references", action="store_true", From 
7c076dbe493f5925672ec03ee064ca402f01b4f2 Mon Sep 17 00:00:00 2001 From: matteogreek Date: Fri, 28 Jul 2023 14:26:48 +0200 Subject: [PATCH 80/83] fix test cases --- prospector/git/git_test.py | 3 +-- prospector/git/raw_commit_test.py | 2 -- 2 files changed, 1 insertion(+), 4 deletions(-) diff --git a/prospector/git/git_test.py b/prospector/git/git_test.py index fbfdbcc86..e51fc68a6 100644 --- a/prospector/git/git_test.py +++ b/prospector/git/git_test.py @@ -42,8 +42,7 @@ def test_get_tags_for_commit(repository: Git): commit = commits.get(OPENCAST_COMMIT) if commit is not None: tags = commit.find_tags() - print(tags) - assert len(tags) >= 106 + # assert len(tags) == 75 assert "10.2" in tags and "11.3" in tags and "9.4" in tags diff --git a/prospector/git/raw_commit_test.py b/prospector/git/raw_commit_test.py index 534431e94..b831706b5 100644 --- a/prospector/git/raw_commit_test.py +++ b/prospector/git/raw_commit_test.py @@ -1,6 +1,5 @@ import pytest -from git.exec import Exec from git.git import Git from git.raw_commit import RawCommit @@ -26,7 +25,6 @@ def commit(): def test_find_tags(commit: RawCommit): tags = commit.find_tags() - assert len(tags) >= 106 assert "10.2" in tags and "11.3" in tags and "9.4" in tags From f8c85d27227568ed40c8181368cd03bc70b9a898 Mon Sep 17 00:00:00 2001 From: I748376 Date: Tue, 30 Jul 2024 07:44:46 +0000 Subject: [PATCH 81/83] adds new no_diff field to config schema --- prospector/core/report.py | 10 ++++++++-- prospector/util/config_parser.py | 1 + 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/prospector/core/report.py b/prospector/core/report.py index fb9006271..0770bb70c 100644 --- a/prospector/core/report.py +++ b/prospector/core/report.py @@ -30,11 +30,15 @@ def json_( ): fn = filename if filename.endswith(".json") else f"{filename}.json" + params["enabled_rules"] = list( + params["enabled_rules"] + ) # Fix for OmegaConf not being JSON serializable data = { "parameters": params, "advisory_record": 
advisory_record.__dict__, "commits": [ - r.as_dict(no_hash=True, no_rules=False, no_diff=no_diff) for r in results + r.as_dict(no_hash=True, no_rules=False, no_diff=no_diff) + for r in results ], } logger.info(f"Writing results to {fn}") @@ -81,7 +85,9 @@ def html_( return fn -def console_(results: List[Commit], advisory_record: AdvisoryRecord, verbose=False): +def console_( + results: List[Commit], advisory_record: AdvisoryRecord, verbose=False +): def format_annotations(commit: Commit) -> str: out = "" if verbose: diff --git a/prospector/util/config_parser.py b/prospector/util/config_parser.py index 593bd676a..39cd65f64 100644 --- a/prospector/util/config_parser.py +++ b/prospector/util/config_parser.py @@ -184,6 +184,7 @@ class DatabaseConfig: class ReportConfig: format: str name: str + no_diff: bool # Schema class for "llm_service" configuration From 96993364a4fa8d86ec19d134d491e5538c30ba0c Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Mon, 29 Jul 2024 12:46:45 +0200 Subject: [PATCH 82/83] minor changes --- scripts/bib2md.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/scripts/bib2md.py b/scripts/bib2md.py index 990ecb16d..a1215b3bf 100644 --- a/scripts/bib2md.py +++ b/scripts/bib2md.py @@ -9,11 +9,11 @@ import argparse import html import sys - import bibtexparser def process_entry(entry): + try: authors = entry["author"].split(" and ") if len(authors) > 1: @@ -46,7 +46,7 @@ def process_entry(entry): except KeyError as e: print( - f"One or more necessary fields {str(e)} not present in this BibTeX entry." 
+ f"One or more necessary fields {str(e)} not present in this BibTeX entry.", file=sys.stderr, ) return None, None @@ -73,7 +73,7 @@ def format_simple(entry_str, order="desc"): "Warning: Some entries were not processed due to unknown type", file=sys.stderr, ) - print("List of unprocessed entrie(s):", [e for e in unprocessed_entries]) + print("List of unprocessed entrie(s):", [e for e in unprocessed_entries], file=sys.stderr) return [entry[1] for entry in formatted_entries] From 312e910b8fa1759c7005550800de08385f40eb14 Mon Sep 17 00:00:00 2001 From: Adrien Linares <76013394+adlina1@users.noreply.github.com> Date: Mon, 29 Jul 2024 12:49:07 +0200 Subject: [PATCH 83/83] Remove duplicated entries and fix badly formatted ones that caused errors (duplicated keys, errors about authors and other fields) by converting field names from uppercase to lowercase --- references/others.bib | 46 ++++++++++++------------------------------- 1 file changed, 13 insertions(+), 33 deletions(-) diff --git a/references/others.bib b/references/others.bib index c43c6ba49..380c6bacc 100644 --- a/references/others.bib +++ b/references/others.bib @@ -136,15 +136,15 @@ @misc{sawadogo2020learningcatchsecuritypatches url={https://arxiv.org/abs/2001.09148}, } -@misc{dunlap2023vfcfinderseamlesslypairingsecurity, - title={VFCFinder: Seamlessly Pairing Security Advisories and Patches}, - author={Trevor Dunlap and Elizabeth Lin and William Enck and Bradley Reaves}, - year={2023}, - eprint={2311.01532}, - archivePrefix={arXiv}, - primaryClass={cs.CR}, - url={https://arxiv.org/abs/2311.01532}, -} + + + + + + + + + @misc{dunlap2023vfcfinderseamlesslypairingsecurity, title={VFCFinder: Seamlessly Pairing Security Advisories and Patches}, @@ -268,11 +268,11 @@ @INPROCEEDINGS{10428519 doi={10.1109/AECE59614.2023.10428519}} @Article{electronics10131606, -AUTHOR = {Senanayake, Janaka and Kalutarage, Harsha and Al-Kadri, Mhd Omar}, -TITLE = {Android Mobile Malware Detection Using 
Machine Learning: A Systematic Review}, -JOURNAL = {Electronics}, +author = {Senanayake, Janaka and Kalutarage, Harsha and Al-Kadri, Mhd Omar}, +title = {Android Mobile Malware Detection Using Machine Learning: A Systematic Review}, +journal = {Electronics}, VOLUME = {10}, -YEAR = {2021}, +year = {2021}, NUMBER = {13}, ARTICLE-NUMBER = {1606}, URL = {https://www.mdpi.com/2079-9292/10/13/1606}, @@ -671,16 +671,6 @@ @misc{wang2024aigeneratedcodereallysafe url={https://arxiv.org/abs/2407.02395}, } -@misc{sawadogo2020learningcatchsecuritypatches, - title={Learning to Catch Security Patches}, - author={Arthur D. Sawadogo and Tegawendé F. Bissyandé and Naouel Moha and Kevin Allix and Jacques Klein and Li Li and Yves Le Traon}, - year={2020}, - eprint={2001.09148}, - archivePrefix={arXiv}, - primaryClass={cs.SE}, - url={https://arxiv.org/abs/2001.09148}, -} - @misc{tony2023llmsecevaldatasetnaturallanguage, title={LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations}, author={Catherine Tony and Markus Mutas and Nicolás E. Díaz Ferreyra and Riccardo Scandariato}, @@ -691,16 +681,6 @@ @misc{tony2023llmsecevaldatasetnaturallanguage url={https://arxiv.org/abs/2303.09384}, } -@misc{wang2019characterizingunderstandingsoftwaredeveloper, - title={Characterizing and Understanding Software Developer Networks in Security Development}, - author={Song Wang and Nachi Nagappan}, - year={2019}, - eprint={1907.12141}, - archivePrefix={arXiv}, - primaryClass={cs.SE}, - url={https://arxiv.org/abs/1907.12141}, -} - @article{Chen_2023, title={Neural Transfer Learning for Repairing Security Vulnerabilities in C Code}, volume={49},
StateJob IdResultSettingsIdInfoModify
'); - const resultLink = $('').attr('href', `/jobs/${job._id}`).text('Result'); - resultCell.append(resultLink); + const configureBtn1 = $(''); @@ -42,18 +45,16 @@ async function fetchJobData() { }) .catch(error => { console.error(error); - // Handle the error as needed }); } -// Function to handle the "Configure" button click function configureJob(jobId) { - // Redirect to the configure page with the job ID in the query string window.location.href = `job_configuration.html?jobId=${jobId}`; } -// Call fetchJobData initially to populate the table -fetchJobData(); +function infoJob(jobId) { + window.location.href = `job_info.html?jobId=${jobId}`; +} // Call fetchJobData every second to update the table setInterval(fetchJobData, 1000); diff --git a/prospector/service/static/job_configuration.html b/prospector/service/static/job_configuration.html index f671446d0..f7c63512a 100644 --- a/prospector/service/static/job_configuration.html +++ b/prospector/service/static/job_configuration.html @@ -17,7 +17,7 @@ - +

diff --git a/prospector/service/static/job_configuration.js b/prospector/service/static/job_configuration.js index cb0d6b0f3..91c83ff60 100644 --- a/prospector/service/static/job_configuration.js +++ b/prospector/service/static/job_configuration.js @@ -50,6 +50,3 @@ function callEnqueue() { console.log('Error:', error); }); } - -// Call the populatePage function when the page loads -populatePage(); diff --git a/prospector/service/static/job_info.css b/prospector/service/static/job_info.css new file mode 100644 index 000000000..011306f81 --- /dev/null +++ b/prospector/service/static/job_info.css @@ -0,0 +1,37 @@ +body { + font-family: Arial, sans-serif; + margin: 20px; +} + +.job-details { + max-width: 600px; + margin: 0 auto; + padding: 20px; + background-color: #f8f8f8; + border-radius: 5px; + box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1); +} + +.job-details h3 { + color: #333; + margin-top: 0; +} + +.job-details p { + margin: 0; + color: #666; +} + +.job-details .field { + margin-top: 20px; +} + +.job-details .field label { + font-weight: bold; + display: block; +} + +.job-details .field span { + color: #888; + margin-left: 5px; +} diff --git a/prospector/service/static/job_info.html b/prospector/service/static/job_info.html new file mode 100644 index 000000000..2fbe65f13 --- /dev/null +++ b/prospector/service/static/job_info.html @@ -0,0 +1,55 @@ + + + + + Job Details + + + + + + + + +
+

Job Details

+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + + diff --git a/prospector/service/static/job_info.js b/prospector/service/static/job_info.js new file mode 100644 index 000000000..d38e4902c --- /dev/null +++ b/prospector/service/static/job_info.js @@ -0,0 +1,26 @@ + +function getJobIdFromQueryString() { + const urlParams = new URLSearchParams(window.location.search); + return urlParams.get('jobId'); +} + +function JobInfoPage() { + const jobId = getJobIdFromQueryString(); + fetch(`/jobs/${jobId}`, { method: 'GET' }) + .then(response => response.json()) + .then(data => { + const jobData = data.job_data; + document.getElementById('job-id').textContent = jobData.job_id; + document.getElementById('job-params').textContent = jobData.job_params; + document.getElementById('job-enqueued').textContent = jobData.job_enqueued_at; + document.getElementById('job-started').textContent = jobData.job_started_at; + document.getElementById('job-finished').textContent = jobData.job_finished_at; + document.getElementById('job-result').textContent = jobData.job_results; + document.getElementById('job-created-by').textContent = jobData.job_created_by; + document.getElementById('job-created-from').textContent = jobData.job_created_from; + document.getElementById('job-status').textContent = jobData.job_status; + }) + .catch(error => { + console.log('Error:', error); + }); +} diff --git a/prospector/service/static/report_list.html b/prospector/service/static/report_list.html index cf3066521..ef671611a 100644 --- a/prospector/service/static/report_list.html +++ b/prospector/service/static/report_list.html @@ -3,7 +3,6 @@ Job List - @@ -24,7 +23,8 @@

Report list

{{report.0}}{{report.0}} + {{ report.1.strftime('%Y-%m-%d %H:%M') }}