DDoS2Vec Artefacts

This repository contains the code necessary to repeat the experiments in the paper DDoS2Vec: Flow-Level Characterisation of Volumetric DDoS Attacks at Scale that appeared in CoNEXT 2023.

Pre-requisites

There are a number of prerequisites, covering both hardware and software, that must be met to run the experiments.

Hardware

The experiments were run on a machine with an AMD EPYC 7702 CPU (64 cores, 128 hardware threads) and roughly 1 TB of RAM. That much memory is not strictly necessary, but we recommend at least 512 GB of RAM, especially if you have a year's worth of flow samples (sampled at a ratio of 1:4096). More cores will make the experiments run faster, but much of the underlying code (mostly from external libraries) is not heavily parallelised, so run time will not scale linearly with core count. More memory (above a certain minimum) lets you run more cross-validation folds in parallel, which greatly speeds up the experiments. You also need a large amount of storage space, as the corpora can become quite large (depending on your flow record dataset).

Software

The code is written in Python 3 (inside Jupyter notebooks) and requires the following main packages:

  • NumPy (ideally 1.24.3)
  • scikit-learn (ideally 1.2.1)
  • Gensim (ideally 4.3.0)
  • iterative-stratification (ideally 0.1.7)

The exact Python version used in our experiments is 3.10.9, but 3.8 and above will most likely work. We recommend the latest version of Python 3 (3.11 at the time of writing) due to its performance improvements. The packages can be installed via pip3 install -r requirements.txt. This also installs Matplotlib, which was used for all plot generation (only some of the plotting code is included, so that the actual results code is simpler to grasp and repeat). To run the notebooks you will also need Jupyter; its version is not important and any recent version should work.
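To see how an existing environment compares with the versions listed above, a small check along the following lines may help. It is only a sketch: the helper name check_versions is made up for illustration, and it assumes the packages are installed under their usual PyPI distribution names.

```python
from importlib import metadata

# Versions this README lists as ideal, keyed by PyPI distribution name.
RECOMMENDED = {
    "numpy": "1.24.3",
    "scikit-learn": "1.2.1",
    "gensim": "4.3.0",
    "iterative-stratification": "0.1.7",
}


def check_versions(recommended):
    """Return {package: (installed_version_or_None, recommended_version)}."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions(RECOMMENDED).items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name}: installed={installed}, recommended={wanted} ({status})")
```

A differing version is not necessarily a problem, since the listed versions are described as ideal rather than required.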

An important part of our paper was the use of the filtering rules from the IXP Scrubber work for labelling flow samples. The linked IXP Scrubber resource contains the exact version of the rules that we used. That said, you could in principle use a different set of rules, as long as each flow record has a unique string label associated with it.

Process

Three main steps, each with an accompanying Jupyter notebook, repeat our experiments. We have significantly streamlined the notebooks so that they are easy to follow and understand. The cells in each notebook can be run sequentially, but note that the first notebook is a prerequisite for the other two.

A few variables near the start of each step's notebook control its options. The main one to note is ARTEFACTS_PATH, which defaults to "./Artefacts/" and is the path where everything generated by the notebooks is stored. The other variables are more specific and do not need to be changed unless noted.
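As an illustration of how a single path variable like this can be used, the sketch below builds output paths under ARTEFACTS_PATH and creates the directory on demand. The helper artefact_file is hypothetical and is not taken from the notebooks, which may handle the path differently.

```python
from pathlib import Path

# Default value from this README; point it at a disk with plenty of free space.
ARTEFACTS_PATH = "./Artefacts/"


def artefact_file(name, base=ARTEFACTS_PATH):
    """Create the artefacts directory if needed and return a path inside it.

    Hypothetical helper for illustration only.
    """
    directory = Path(base)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / name
```

For example, artefact_file("example.pickle") would return a path to example.pickle inside the artefacts directory, creating that directory first if it does not exist.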

Corpora Generation (1_Corpora_Generation.ipynb)

This notebook generates the corpora that are used in the later notebooks. This step cannot be skipped, as the corpora are not included in this repository and we cannot share them publicly due to Non-Disclosure Agreements (NDAs). The cells are split into two groups: cells in the first group can be run in any order and finish quickly, whereas the second group must be run after the first group and is computationally intensive.

Note: As we cannot release our dataset and your dataset will most likely differ from ours in format, you must modify the code in this notebook to suit your needs. For the most part, you must implement flow_reader so that it returns values of the Flow type. Additionally, you should modify the datetime values in the code to match the time ranges of your flow record dataset.
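To give a rough idea of what such an adaptation could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the fields of Flow, the CSV input format, the column names, the generator-style flow_reader, the example labels, and the START/END values. The notebook's real Flow type and flow_reader signature take precedence.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple


class Flow(NamedTuple):
    """Hypothetical flow record; the notebook's real Flow type may differ."""
    timestamp: datetime
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    label: str  # string label assigned by your filtering rules


# Placeholder time range; replace with the range covered by your dataset.
START = datetime(2023, 1, 1, tzinfo=timezone.utc)
END = datetime(2023, 2, 1, tzinfo=timezone.utc)


def flow_reader(lines: Iterable[str], start=START, end=END) -> Iterator[Flow]:
    """Yield flows with start <= timestamp < end from CSV text lines."""
    for row in csv.DictReader(lines):
        # Assumes the timestamp column holds Unix seconds (UTC).
        timestamp = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc)
        if not (start <= timestamp < end):
            continue
        yield Flow(
            timestamp=timestamp,
            src_ip=row["src_ip"],
            dst_ip=row["dst_ip"],
            src_port=int(row["src_port"]),
            dst_port=int(row["dst_port"]),
            protocol=int(row["protocol"]),
            label=row["label"],
        )


# Made-up sample data using documentation-only IP ranges.
SAMPLE = """timestamp,src_ip,dst_ip,src_port,dst_port,protocol,label
1672531200,192.0.2.1,198.51.100.7,123,44321,17,ntp_amplification
1680000000,192.0.2.2,198.51.100.7,53,40000,17,dns_amplification
"""

# Only the first row falls inside [START, END).
flows = list(flow_reader(io.StringIO(SAMPLE)))
```

Keeping the reader as a generator avoids holding the whole dataset in memory at once, which may matter given the dataset sizes discussed under Hardware.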

We recommend creating new flow corpus generation approaches/methods if you wish to experiment further regarding this step.

Baseline Comparison (2_Baseline_Comparison.ipynb)

This notebook runs the baseline comparison experiments, the most important part of the experiments, which compare DDoS2Vec (LSA) to other approaches. Adjustments to the code are unlikely to be necessary, but for convenience the notebook exposes several options, such as the classifier used and the number of cross-validation folds.

Running the experiments generates a result for each cross-validation fold, which is saved to the artefacts path. You can analyse the results by loading the pickled dictionary and inspecting it further.
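Loading and summarising such results could look roughly like the sketch below. The file naming pattern (fold_*.pickle), the one-file-per-fold layout, and the dictionary key "f1" are all assumptions for the sake of the example; check the notebook for the actual file names and dictionary layout. The demonstration writes stand-in data to a temporary directory so that it runs without any real results.

```python
import pickle
import tempfile
from pathlib import Path
from statistics import mean


def load_fold_results(directory, pattern="fold_*.pickle"):
    """Load every pickled dictionary matching pattern, sorted by file name."""
    results = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open("rb") as handle:
            results.append(pickle.load(handle))
    return results


def average_metric(results, key):
    """Average one numeric entry across the per-fold dictionaries."""
    return mean(result[key] for result in results)


# Demonstration with stand-in data (not real experiment results).
with tempfile.TemporaryDirectory() as tmp:
    for fold, score in enumerate([0.5, 0.75, 1.0]):
        with open(Path(tmp) / f"fold_{fold}.pickle", "wb") as handle:
            pickle.dump({"fold": fold, "f1": score}, handle)
    demo_results = load_fold_results(tmp)
    demo_mean_f1 = average_metric(demo_results, "f1")
```

As with any pickle file, only load results that come from a source you trust, since unpickling can execute arbitrary code.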

Longitudinal Analysis (3_Longitudinal_Analysis.ipynb)

This notebook runs the longitudinal analysis experiments, which cover everything in Section 6 of the paper. The experiments are specific to our dataset's time ranges, so you will need to modify more code here than in the previous notebook. The code for the plots is also included but, like the rest of the notebook, it is not well documented because it is again specific to our dataset. It is left in for convenience and does not need to be run to generate experiment results.
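When adapting the time ranges, it can help to generate them programmatically rather than hard-coding each one. The sketch below splits a period into calendar-month windows. The monthly granularity and the helper name month_windows are assumptions for illustration; this README does not state which window sizes the notebook uses, so adjust to whatever Section 6 of the paper and your dataset call for.

```python
from datetime import datetime


def month_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) by calendar month.

    The first window may be shorter if start is not the first day of a month,
    and the last window is clipped to end.
    """
    current = start
    while current < end:
        if current.month == 12:
            following = datetime(current.year + 1, 1, 1, tzinfo=current.tzinfo)
        else:
            following = datetime(current.year, current.month + 1, 1, tzinfo=current.tzinfo)
        yield current, min(following, end)
        current = following


# Example period (placeholder dates, not those of the paper's dataset).
windows = list(month_windows(datetime(2022, 11, 15), datetime(2023, 2, 1)))
```

Each pair can then be used as a half-open interval when selecting flow records for one step of the analysis.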

Paper Results

If you require our exact results on our dataset, we can provide the pickled dictionaries for the baseline comparison and longitudinal analysis experiments on request. Note that they are only a raw view of the classification and time performance metrics already shown in the paper.