This repository contains datasets and scripts used to build the ML pipeline for my Bachelor's Degree thesis project.
Requirements:

- Python 3 (3.11.2 was used during development and is guaranteed to work)
- BeautifulSoup 4
- NumPy
- Pandas
- Pickle
- Requests
- SciKit-Learn
- Selenium
- Tabulate

Repository structure:

- `data/` contains all the datasets, structured in the following subfolders:
  - `exploitdb/` contains the final outputs of the scripts responsible for data mining from Exploit Database
  - `nvd/` contains the raw JSON dump obtainable from NVD and the final outputs of the scripts responsible for data mining from this JSON and from the circl.lu / NVD APIs
  - `merged/` contains the output of the merging of the two datasets
  - `final/` contains the dataset the ML pipeline is going to use
- `scripts/` contains all the scripts, structured in the following subfolders:
  - `exploitdb/` contains all the scripts interfacing with Exploit Database:
    - `scraper_multithreaded.py` - web scraping from Exploit DB, with multithreading support for faster scraping (see the scraping sketch after the structure listing)
    - `scraper.py` - first implementation of the web scraper, without multithreading support
    - `dataframe.py` - manipulates the dataset obtained from the scraper and returns the final Exploit Database dataset
  - `nvd/` contains all the scripts interfacing with NVD and circl.lu:
    - `parser_circl.py` - collects data from the circl.lu API for every CVE available in the raw dump and returns an output dataset (see the circl.lu sketch after the structure listing)
    - `parser_nvd.py` - collects data from the NVD API for every CVE available in the raw dump and returns an output dataset
    - `converter_circl.py` - converts the circl.lu output JSON to a CSV
    - `converter_nvd.py` - converts the NVD output JSON to a CSV
  - `merge/` contains all the scripts related to the merging of the datasets:
    - `positives_count.py` - returns the number of rows where `exploitable` (our target variable) is true
    - `merger.py` - merges the datasets obtained from Exploit Database and NVD/circl.lu to obtain the final dataset the ML pipeline is going to run on (see the merging sketch after the structure listing)
    - `metrics_are_na.py` - used for data cleaning; returns the number of rows where the CVSS metrics are NA
  - `ml_pipeline/` contains all the scripts related to the actual machine learning pipeline, its configuration steps and metrics collection:
    - `ml_pipeline.py` - runs the ML pipeline (see the pipeline sketch after the structure listing)
    - `models.py` - includes functions for hyperparameter tuning, baseline scoring and samplers scoring
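
The sketch below illustrates the multithreaded scraping pattern used for the Exploit Database step: a small thread pool of workers, each downloading and parsing one page with BeautifulSoup. It is only an approximation of `scraper_multithreaded.py`; the URL scheme, request header, selectors and extracted fields are assumptions, and the actual scrapers in this repository also rely on Selenium.

```python
# Illustrative sketch only: URL scheme, header and selectors are assumptions,
# not the ones used by scraper_multithreaded.py (which also relies on Selenium).
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent: the site tends to reject default HTTP clients.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def scrape_exploit(exploit_id: int) -> dict:
    """Download one Exploit-DB page and extract a few basic fields."""
    url = f"https://www.exploit-db.com/exploits/{exploit_id}"  # assumed URL scheme
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")
    return {
        "id": exploit_id,
        "title": title.get_text(strip=True) if title else None,
    }

def scrape_many(exploit_ids, max_workers: int = 8) -> list[dict]:
    """Scrape several pages in parallel using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_exploit, exploit_ids))

if __name__ == "__main__":
    print(scrape_many(range(50000, 50005)))
```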
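
Next, a minimal sketch of the parse-then-convert flow for the circl.lu data, assuming the public CVE Search endpoint `https://cve.circl.lu/api/cve/<CVE-ID>` and `pandas.json_normalize` for flattening; the file names and stored fields are illustrative and may differ from what `parser_circl.py` and `converter_circl.py` actually do.

```python
# Illustrative sketch only: endpoint, file names and fields are assumptions.
import json

import pandas as pd
import requests

def fetch_cve(cve_id: str) -> dict:
    """Query the public circl.lu CVE Search API for a single CVE."""
    resp = requests.get(f"https://cve.circl.lu/api/cve/{cve_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def collect(cve_ids: list[str], out_path: str = "circl_dump.json") -> None:
    """Fetch every CVE in the list and store the raw records as one JSON file."""
    records = [fetch_cve(cve_id) for cve_id in cve_ids]
    with open(out_path, "w") as f:
        json.dump(records, f)

def json_to_csv(json_path: str = "circl_dump.json", csv_path: str = "circl.csv") -> None:
    """Flatten the stored JSON records into a CSV (the converter step)."""
    with open(json_path) as f:
        records = json.load(f)
    pd.json_normalize(records).to_csv(csv_path, index=False)

if __name__ == "__main__":
    collect(["CVE-2021-44228"])
    json_to_csv()
```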
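
The merging step can be sketched as a pandas join plus the two data-cleaning checks. This assumes a `cve_id` join key, an `exploitable` target column derived from whether a CVE appears in the Exploit Database dataset, and a single CVSS metric column; the real column names and the labelling rule used by `merger.py` may differ.

```python
# Illustrative sketch only: file names, join key and column names are assumptions.
import pandas as pd

def merge_datasets(exploitdb_csv: str, nvd_csv: str, out_csv: str) -> pd.DataFrame:
    """Left-join the NVD/circl.lu dataset with the Exploit-DB dataset on the CVE id;
    here a CVE is labelled exploitable when it appears in the Exploit-DB dataset."""
    nvd = pd.read_csv(nvd_csv)
    exploitdb = pd.read_csv(exploitdb_csv)
    merged = nvd.merge(exploitdb, on="cve_id", how="left", indicator=True)
    merged["exploitable"] = merged["_merge"].eq("both")
    merged = merged.drop(columns="_merge")
    merged.to_csv(out_csv, index=False)
    return merged

def positives_count(df: pd.DataFrame) -> int:
    """Number of rows where the target variable is true (cf. positives_count.py)."""
    return int(df["exploitable"].sum())

def metrics_are_na(df: pd.DataFrame, metric_cols=("cvss_base_score",)) -> int:
    """Number of rows with missing CVSS metrics (cf. metrics_are_na.py)."""
    return int(df[list(metric_cols)].isna().any(axis=1).sum())
```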
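
Finally, a minimal sketch of two of the pipeline steps, baseline scoring and hyperparameter tuning with SciKit-Learn. The dataset path, target column, feature selection, model and parameter grid are all assumptions, and the samplers-scoring step from `models.py` is not shown.

```python
# Illustrative sketch only: path, target, model and grid are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

def load_data(csv_path: str = "data/final/final.csv"):
    """Load the final dataset; the file name and target column are assumptions."""
    df = pd.read_csv(csv_path)
    y = df["exploitable"].astype(int)
    X = df.drop(columns=["exploitable"]).select_dtypes("number")  # numeric features only
    return X, y

def baseline_score(X, y) -> float:
    """Cross-validated F1 of an untuned model (baseline scoring)."""
    model = RandomForestClassifier(random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="f1").mean()

def tune(X, y) -> GridSearchCV:
    """Grid-search a small hyperparameter space (hyperparameter tuning)."""
    grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          grid, cv=5, scoring="f1", n_jobs=-1)
    return search.fit(X, y)

if __name__ == "__main__":
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    print("baseline F1:", baseline_score(X_train, y_train))
    best = tune(X_train, y_train)
    print("best params:", best.best_params_, "test F1:", best.score(X_test, y_test))
```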