Scripts and notebooks to benchmark one-class sampling strategies.
This repository contains scripts and notebooks to reproduce the experiments and analyses of the paper:
Adrian Englhardt, Holger Trittenbach, Daniel Kottke, Bernhard Sick, Klemens Böhm, "Efficient SVDD sampling with approximation guarantees for the decision boundary", Machine Learning (2022), DOI: 10.1007/s10994-022-06149-0.
For more information about this research project, see also the one-class sampling project website.
The analysis and main results of the experiments can be found under notebooks:
- example_intro.ipynb: Figure 1
- example.ipynb: Figure 4
- eval_synthetic.ipynb: Figure 5
- eval_dami.ipynb: Figure 6 and Table 2
To execute the notebooks, first complete the setup and download the raw results into data/output/.
The experiments are implemented in Julia; some of the evaluation notebooks are written in Python. This repository contains code to set up the experiments, to execute them, and to analyze the results. The one-class classifiers and some other helper methods are implemented in two separate Julia packages: SVDD.jl and OneClassActiveLearning.jl. The one-class sampling strategies are implemented in OneClassSampling.jl.
Just clone the repo.
$ git clone https://github.com/englhardt/ocs-evaluation.git
- Experiments require Julia 1.3.1; requirements are defined in Manifest.toml. To instantiate, start julia in the ocs-evaluation directory with julia --project and run ]instantiate at the julia> prompt. See the Julia documentation for general information on how to set up this project.
- Notebooks require
  - Julia 1.3.1 (dependencies are already installed in the previous step)
  - Python 3.8 and pipenv. Run pipenv install to install all dependencies.
- data
  - input
    - raw: contains unprocessed data set collections literature and semantic downloaded from the DAMI repository
    - dami: output directory of preprocess_data.jl
    - synthetic: output directory of generate_synthetic_data.jl
  - output: output directory of experiments; generate_experiments.jl creates the folder structure and experiments; run_experiments.jl writes results and log files
- notebooks: jupyter notebooks to analyze experimental results
  - eval_dami.ipynb: Figure 6 and Table 2
  - eval_synthetic.ipynb: Figure 5
  - example_intro.ipynb: Figure 1
  - example.ipynb: Figure 4
- scripts
  - config: configuration files for experiments
    - config.jl: high-level configuration for DAMI experiments, e.g., for number of workers
    - config_syn.jl: high-level configuration for synthetic data experiments, e.g., for number of workers
    - config_dami_large.jl: experiment config for large DAMI data sets
    - config_dami.jl: experiment config for small DAMI data sets
    - config_dami_baseline_gt.jl: experiment config for the ground-truth baseline
    - config_dami_baseline_prefiltering.jl: experiment config for the prefiltering baseline
    - config_dami_baseline_rand.jl: experiment config for the random sample baseline
    - config_dami_large_outperc.jl: experiment config for varying the outlier percentage on large DAMI data sets
    - config_dami_outperc.jl: experiment config for varying the outlier percentage on small DAMI data sets
    - config_synthetic.jl: experiment config for synthetic data
    - config_precompute_parameters.jl: experiment config to precompute classifier hyperparameters for DAMI data
    - config_precompute_parameters_gt.jl: experiment config to precompute classifier hyperparameters for DAMI data with ground truth
    - config_precompute_parameters_syn.jl: experiment config to precompute classifier hyperparameters for synthetic data
    - config_warmup.jl: experiment config for precomputation warmup experiments
  - util/setup_workers.jl: utility script to set up multiple workers, see Infrastructure and Parallelization
  - util/evaluate.jl: utility script to evaluate SVDD classifiers on samples
  - generate_experiments.jl: generate experiments for one type of query strategy, e.g. DAMI
  - generate_synthetic_data.jl: generate synthetic data sets
  - precompute_parameters.jl: precompute classifier hyperparameters
  - precompute_parameters_gt.jl: precompute classifier hyperparameters with ground truth
  - preprocess_data.jl: preprocess DAMI data
  - run_experiments.jl: executes experiments
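The directory overview above can be checked against a local checkout with a short script. This is a minimal sketch, not part of the repository: the helper name missing_dirs and the list EXPECTED_DIRS are hypothetical, and the nesting under data/input is inferred from the descriptions above. Some directories (for example data/input/dami and data/output) are presumably only created once the corresponding scripts have run, so a "missing" entry is informational rather than an error.

```python
from pathlib import Path

# Directories named in the overview. The nesting under data/input is an
# assumption inferred from the descriptions; adjust if your checkout differs.
EXPECTED_DIRS = [
    "data/input/raw",
    "data/input/dami",
    "data/input/synthetic",
    "data/output",
    "notebooks",
    "scripts/config",
    "scripts/util",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing if every directory exists.
    for d in missing_dirs("."):
        print("missing:", d)
```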
Here, we specify how to reproduce our experiments after completing the setup steps above.
- Experiment execution
To manually rerun all our experiments, we provide two scripts: run.sh for the DAMI experiments and run_syn.sh for the experiments on synthetic data. Since experiment execution takes several days on modern machines, we provide the raw results as a download. One can then skip the experiment execution and head straight to Step 2. The downloaded raw results must be extracted into data/output/, e.g., data/output/dami.
To reproduce the DAMI experiments, download semantic.tar.gz and literature.tar.gz containing the .arff files from the DAMI benchmark repository and extract into data/input/raw/.../<data set> (e.g. data/input/raw/literature/ALOI/ or data/input/raw/semantic/Annthyroid).
- Experiment evaluation
To analyze the results run the jupyter notebooks in the notebooks directory. Run the following to produce the figures and tables in the experiment section of the paper:
pipenv run eval
pipenv run eval_syn
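For the DAMI data preparation in the experiment execution step, the extraction can be scripted. The following is a minimal sketch, assuming the archives have already been downloaded manually; the helper name extract_and_list_arff is hypothetical, and the internal folder layout of the archives is not specified here, which is why the function returns the .arff paths it finds so the result can be compared with the expected layout.

```python
import tarfile
from pathlib import Path


def extract_and_list_arff(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the .arff files found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Only unpack archives from a source you trust (here: the DAMI repository).
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    # Return paths relative to dest_dir so the folder layout is easy to inspect.
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.arff"))
```

A call such as extract_and_list_arff("literature.tar.gz", "data/input/raw/literature") is one possible usage; the destination is an assumption. If the returned paths do not match the layout of the examples (e.g. data/input/raw/literature/ALOI/), move the folders accordingly before running preprocess_data.jl.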
Experiment execution can be parallelized over several workers. In general, one can use any ClusterManager. In this case, the node that executes run_experiments.jl is the driver node. The driver node loads the experiments.jser, and initiates a function call for each experiment on one of the workers via pmap. Edit scripts/config/config_syn.jl and scripts/config/config.jl to add remote machines and workers.
This package is developed and maintained by Adrian Englhardt.