This repository contains datasets and scripts used to build the ML pipeline for my Bachelor's Degree thesis project.
Requirements:

- Python 3 (3.11.2 was used during development and is guaranteed to work)
- BeautifulSoup 4
- NumPy
- Pandas
- Pickle
- Requests
- SciKit-Learn
- Selenium
- Tabulate

Repository structure:

- `data/` contains all the datasets, structured in the following subfolders:
  - `exploitdb/` contains the final outputs of the scripts responsible for data mining from Exploit Database
  - `nvd/` contains the raw JSON dump obtainable from NVD and the final outputs of the scripts responsible for data mining from this JSON and from the circl.lu / NVD APIs
  - `merged/` contains the output of the merging of the two datasets
  - `final/` contains the dataset the ML pipeline is going to use
- `scripts/` contains all the scripts, structured in the following subfolders:
  - `exploitdb/` contains all the scripts interfacing with Exploit Database:
    - `scraper_multithreaded.py` - web scraping from Exploit DB, with multithreading support for faster scraping (see the scraping sketch after the structure listing)
    - `scraper.py` - first implementation of the web scraper, without multithreading support
    - `dataframe.py` - manipulates the dataset obtained from the scraper and returns the final Exploit Database dataset
  - `nvd/` contains all the scripts interfacing with NVD and circl.lu:
    - `parser_circl.py` - collects data from the circl.lu API for every CVE available in the raw dump and returns an output dataset (see the circl.lu sketch after the structure listing)
    - `parser_nvd.py` - collects data from the NVD API for every CVE available in the raw dump and returns an output dataset
    - `converter_circl.py` - converts the circl.lu output JSON to a CSV
    - `converter_nvd.py` - converts the NVD output JSON to a CSV
  - `merge/` contains all the scripts related to the merging of the datasets:
    - `positives_count.py` - returns the number of rows where `exploitable` (our target variable) is true
    - `merger.py` - merges the datasets obtained from Exploit Database and NVD/circl.lu to obtain the final dataset the ML pipeline is going to run on (see the merging sketch after the structure listing)
    - `metrics_are_na.py` - used for data cleaning; returns the number of rows where the CVSS metrics are NA
  - `ml_pipeline/` contains all the scripts related to the actual machine learning pipeline, its configuration steps and metrics collection:
    - `ml_pipeline.py` - runs the ML pipeline (see the pipeline sketch after the structure listing)
    - `models.py` - includes functions for hyperparameter tuning, baseline scoring and samplers scoring
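
The sketch below illustrates the multithreaded scraping pattern used for the Exploit Database step: a small thread pool of workers, each downloading and parsing one page with BeautifulSoup. It is only an approximation of `scraper_multithreaded.py`; the URL scheme, request header, selectors and extracted fields are assumptions, and the actual scrapers in this repository also rely on Selenium.

```python
# Illustrative sketch only: URL scheme, header and selectors are assumptions,
# not the ones used by scraper_multithreaded.py (which also relies on Selenium).
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent: the site tends to reject default HTTP clients.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def scrape_exploit(exploit_id: int) -> dict:
    """Download one Exploit-DB page and extract a few basic fields."""
    url = f"https://www.exploit-db.com/exploits/{exploit_id}"  # assumed URL scheme
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")
    return {
        "id": exploit_id,
        "title": title.get_text(strip=True) if title else None,
    }

def scrape_many(exploit_ids, max_workers: int = 8) -> list[dict]:
    """Scrape several pages in parallel using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_exploit, exploit_ids))

if __name__ == "__main__":
    print(scrape_many(range(50000, 50005)))
```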
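
Next, a minimal sketch of the parse-then-convert flow for the circl.lu data, assuming the public CVE Search endpoint `https://cve.circl.lu/api/cve/<CVE-ID>` and `pandas.json_normalize` for flattening; the file names and stored fields are illustrative and may differ from what `parser_circl.py` and `converter_circl.py` actually do.

```python
# Illustrative sketch only: endpoint, file names and fields are assumptions.
import json

import pandas as pd
import requests

def fetch_cve(cve_id: str) -> dict:
    """Query the public circl.lu CVE Search API for a single CVE."""
    resp = requests.get(f"https://cve.circl.lu/api/cve/{cve_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def collect(cve_ids: list[str], out_path: str = "circl_dump.json") -> None:
    """Fetch every CVE in the list and store the raw records as one JSON file."""
    records = [fetch_cve(cve_id) for cve_id in cve_ids]
    with open(out_path, "w") as f:
        json.dump(records, f)

def json_to_csv(json_path: str = "circl_dump.json", csv_path: str = "circl.csv") -> None:
    """Flatten the stored JSON records into a CSV (the converter step)."""
    with open(json_path) as f:
        records = json.load(f)
    pd.json_normalize(records).to_csv(csv_path, index=False)

if __name__ == "__main__":
    collect(["CVE-2021-44228"])
    json_to_csv()
```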
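
The merging step can be sketched as a pandas join plus the two data-cleaning checks. This assumes a `cve_id` join key, an `exploitable` target column derived from whether a CVE appears in the Exploit Database dataset, and a single CVSS metric column; the real column names and the labelling rule used by `merger.py` may differ.

```python
# Illustrative sketch only: file names, join key and column names are assumptions.
import pandas as pd

def merge_datasets(exploitdb_csv: str, nvd_csv: str, out_csv: str) -> pd.DataFrame:
    """Left-join the NVD/circl.lu dataset with the Exploit-DB dataset on the CVE id;
    here a CVE is labelled exploitable when it appears in the Exploit-DB dataset."""
    nvd = pd.read_csv(nvd_csv)
    exploitdb = pd.read_csv(exploitdb_csv)
    merged = nvd.merge(exploitdb, on="cve_id", how="left", indicator=True)
    merged["exploitable"] = merged["_merge"].eq("both")
    merged = merged.drop(columns="_merge")
    merged.to_csv(out_csv, index=False)
    return merged

def positives_count(df: pd.DataFrame) -> int:
    """Number of rows where the target variable is true (cf. positives_count.py)."""
    return int(df["exploitable"].sum())

def metrics_are_na(df: pd.DataFrame, metric_cols=("cvss_base_score",)) -> int:
    """Number of rows with missing CVSS metrics (cf. metrics_are_na.py)."""
    return int(df[list(metric_cols)].isna().any(axis=1).sum())
```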
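
Finally, a minimal sketch of two of the pipeline steps, baseline scoring and hyperparameter tuning with SciKit-Learn. The dataset path, target column, feature selection, model and parameter grid are all assumptions, and the samplers-scoring step from `models.py` is not shown.

```python
# Illustrative sketch only: path, target, model and grid are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

def load_data(csv_path: str = "data/final/final.csv"):
    """Load the final dataset; the file name and target column are assumptions."""
    df = pd.read_csv(csv_path)
    y = df["exploitable"].astype(int)
    X = df.drop(columns=["exploitable"]).select_dtypes("number")  # numeric features only
    return X, y

def baseline_score(X, y) -> float:
    """Cross-validated F1 of an untuned model (baseline scoring)."""
    model = RandomForestClassifier(random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="f1").mean()

def tune(X, y) -> GridSearchCV:
    """Grid-search a small hyperparameter space (hyperparameter tuning)."""
    grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          grid, cv=5, scoring="f1", n_jobs=-1)
    return search.fit(X, y)

if __name__ == "__main__":
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    print("baseline F1:", baseline_score(X_train, y_train))
    best = tune(X_train, y_train)
    print("best params:", best.best_params_, "test F1:", best.score(X_test, y_test))
```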