Here I summarize some publicly available datasets for time series anomaly detection.
The ODDS webpage is here. Note that the datasets contain not only time series, but also other data types (videos, texts, and graphs).
Mainpage is here. The dataset contains transactions made by credit cards in September 2013 by European cardholders. For privacy and security reasons, the published features are the result of a PCA transformation rather than the original transaction attributes.
Request access to this dataset here.
Contains 4 folders: A1, A2, A3, and A4.
A1Benchmark is based on the real production traffic to some of the Yahoo! properties. The other 3 benchmarks are based on synthetic time series. The A2 and A3 Benchmarks include outliers, while the A4Benchmark includes change-point anomalies. The benchmarks based on real data have property and geos removed. Fields in each data file are delimited with commas (",").
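Since the files are comma-delimited, they can be read with a standard CSV parser. The snippet below is a minimal sketch, not the official loader: the column names `timestamp`, `value`, and `is_anomaly` are assumptions about the file header (check the files you actually receive), and the sample rows are made up for illustration.

```python
import csv
import io


def read_benchmark_csv(text):
    """Parse comma-delimited text into parallel lists of values and anomaly flags."""
    reader = csv.DictReader(io.StringIO(text))
    values, flags = [], []
    for row in reader:
        # Column names are assumed; adjust them to the header of the real file.
        values.append(float(row["value"]))
        flags.append(int(row["is_anomaly"]))
    return values, flags


# Tiny made-up sample in the assumed layout (not real benchmark data).
sample = "timestamp,value,is_anomaly\n1,0.5,0\n2,0.7,0\n3,9.9,1\n"
values, flags = read_benchmark_csv(sample)
print(values, flags)
```

For a file on disk, the same function can be fed the result of reading the file as text.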
Description of NAB can be found here.
Dataset repository is here.
Multivariate time series datasets collected by “iTrust, Centre for Research in Cyber Security, Singapore University of Technology and Design”. See website here to request access to the dataset and check usage requirements.
Also collected by “iTrust, Centre for Research in Cyber Security, Singapore University of Technology and Design”. See website here to request access to the dataset (it can be requested together with SWaT in the same request) and check usage requirements.
Dataset released here as a part of the authors' repository of their KDD 2019 paper "Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network".
Contains over 250 datasets. The link to download the dataset is here.
The maintainers of the archive also recommend reading the following papers "The UEA multivariate time series classification archive, 2018" and "Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress" before using the dataset.
Dataset webpage is here. Check the dataset description here.
- The KDD 2018 paper "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding" by NASA is the first paper to use this dataset. The authors provide a download link to the dataset in their repo.
- Note that the authors of "Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network" have also used the same versions of SMAP and MSL in their repo.
- The dataset version used by the above two papers can be downloaded using the following commands:
wget https://s3-us-west-2.amazonaws.com/telemanom/data.zip && unzip data.zip && rm data.zip
cd data && wget https://raw.githubusercontent.com/khundman/telemanom/master/labeled_anomalies.csv
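Once `labeled_anomalies.csv` is downloaded, the labeled anomaly ranges usually need to be expanded into one label per time step before evaluation. The following is a minimal sketch under stated assumptions: the column names `chan_id`, `anomaly_sequences`, and `num_values`, the list-of-`[start, end]` string format, and the inclusive treatment of `end` are all assumptions to verify against the real file; the sample row is made up.

```python
import ast
import csv
import io


def pointwise_labels(anomaly_sequences, num_values):
    """Expand [start, end] index ranges (assumed inclusive) into 0/1 labels."""
    labels = [0] * num_values
    for start, end in ast.literal_eval(anomaly_sequences):
        # Clamp the end index so a range never runs past the series length.
        for i in range(start, min(end, num_values - 1) + 1):
            labels[i] = 1
    return labels


# Made-up row in the assumed layout (not real data).
sample = (
    "chan_id,spacecraft,anomaly_sequences,class,num_values\n"
    'X-1,SMAP,"[[2, 3], [6, 7]]","[point, contextual]",10\n'
)
rows = list(csv.DictReader(io.StringIO(sample)))
labels = {
    r["chan_id"]: pointwise_labels(r["anomaly_sequences"], int(r["num_values"]))
    for r in rows
}
print(labels["X-1"])
```

`ast.literal_eval` is used instead of `eval` so that only plain Python literals in the CSV cell are accepted.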
Dataset webpage is here.
- The KDD 2018 paper "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding" by NASA is the first paper to use this dataset. The authors provide a download link to the dataset in their repo.
- Note that the authors of "Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network" have also used the same versions of SMAP and MSL in their repo.
- The dataset version used by the above two papers can be downloaded using the following commands:
wget https://s3-us-west-2.amazonaws.com/telemanom/data.zip && unzip data.zip && rm data.zip
cd data && wget https://raw.githubusercontent.com/khundman/telemanom/master/labeled_anomalies.csv
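Because the same download serves both SMAP and MSL, the channels have to be separated by spacecraft before working on one dataset alone. A rough sketch of that split is shown here; it assumes the label file has `spacecraft` and `chan_id` columns (an assumption to check against the actual header), and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def channels_by_spacecraft(csv_text):
    """Group channel ids by the value of the (assumed) spacecraft column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        chan = row["chan_id"]
        # A channel may be listed more than once; keep each id only once.
        if chan not in groups[row["spacecraft"]]:
            groups[row["spacecraft"]].append(chan)
    return dict(groups)


# Invented rows in the assumed layout (not real data).
sample = "chan_id,spacecraft\nA-1,SMAP\nB-1,MSL\nA-2,SMAP\nA-1,SMAP\n"
groups = channels_by_spacecraft(sample)
print(groups)
```

The resulting channel ids can then be used to pick the matching per-channel data files for each spacecraft.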
Dataset repo is here.
Datasets maintained by the Netman Lab at Tsinghua University. Their group's GitHub profile can be found here.
The KPI dataset from their 2018 challenge is here, and the 2020 data is here.
This dataset was collected by eBay and released here, in the repository for RANSynCoders, an anomaly detection model they proposed.
Check the PhysioNet Data webpage here. These datasets are all medicine-related.
One of the datasets, the MIT-BIH Supraventricular Arrhythmia Database, was used in the VLDB 2022 paper "TranAD: deep transformer networks for anomaly detection in multivariate time series data".
15. Datasets Related to Power Systems from IEEE Dataport
a) CYBER-PHYSICAL DATASET OF HARDWARE-IN-THE-LOOP CYBER-PHYSICAL POWER SYSTEMS TESTBED UNDER MITM ATTACKS
Dataset main page is here.
This dataset was collected by performing different Man-in-the-Middle (MiTM) attacks in the synthetic cyber-physical electric grid in the RESLab Testbed at Texas A&M University, US.
Dataset main page is here.
The dataset was generated by performing four scenarios of port scanning attacks on an 8-substation supervisory control and data acquisition (SCADA) system in three different environments: minimega at Sandia National Lab (SNL), the Common Open Research Emulator (CORE) at Texas A&M University, and the hardware-in-the-loop RESLab Testbed at Texas A&M University.
Dataset main page is here. Dataset contains both normal traffic and communication with anomalies (cyber attacks, link failure, etc.).
Download the dataset here.
The dataset can be found here, within the code repository of a KDD 2021 paper.
Another common approach I see is to use time series classification datasets for anomaly detection: you can preprocess a dataset by selecting one or a few minority classes and labeling them as anomalies.
Look for time series datasets for classification tasks on the UCI repo webpage here.
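The relabeling idea described above (treat the rarest classes of a classification dataset as anomalies) can be sketched as follows. This is only an illustration: the function name `minority_as_anomaly`, the `n_minority` parameter, and the toy class labels are hypothetical, and picking classes purely by frequency is one possible choice among several.

```python
from collections import Counter


def minority_as_anomaly(class_labels, n_minority=1):
    """Map the n_minority rarest classes to 1 (anomaly) and all others to 0."""
    counts = Counter(class_labels)
    # Rarest classes first; ties are broken by label text for determinism.
    rarest = sorted(counts, key=lambda c: (counts[c], str(c)))[:n_minority]
    anomalous = set(rarest)
    binary = [1 if c in anomalous else 0 for c in class_labels]
    return binary, anomalous


# Toy labels, one per time series (hypothetical class names).
labels = ["walk", "walk", "run", "walk", "fall", "run", "walk"]
binary, anomalous = minority_as_anomaly(labels)
print(binary, anomalous)
```

A larger `n_minority` marks several rare classes as anomalous at once; it is worth checking the resulting anomaly ratio, since a "minority" class in a balanced dataset may still be far too frequent to count as anomalous.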
Dataset mainpage is here.
Dataset webpage is here.
The dataset main page is here. The dataset providers have published a paper, "Residential load and rooftop PV generation: an Australian distribution network dataset", describing their dataset. There is also a GitHub repo that analyzes this dataset's characteristics. A paper that uses this dataset for anomaly detection, titled "Anomaly Detection in Smart Meter Data for Preventing Potential Smart Grid Imbalance", is here.
Dataset webpage is here. It contains energy consumption readings for a sample of 5,567 London households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014. Readings were taken at half-hourly intervals. The customers in the trial were recruited as a balanced sample representative of the Greater London population. The CSV file (energy consumption in kWh per half hour, unique household identifier, date, and time) is around 10 GB when unzipped and contains around 167 million rows.
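At roughly 10 GB and 167 million rows, the CSV is better processed as a stream than loaded into memory in one go. Below is a minimal sketch of row-by-row aggregation with the standard library; the column names `household_id` and `kwh` are placeholders (the real header should be checked and passed in), and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def total_kwh_per_household(lines, id_col="household_id", kwh_col="kwh"):
    """Stream rows and sum consumption per household without loading the file."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        try:
            totals[row[id_col]] += float(row[kwh_col])
        except ValueError:
            # Skip readings that are not numeric (e.g. missing-value markers).
            continue
    return dict(totals)


# Made-up rows with placeholder column names (not real data).
sample = io.StringIO(
    "household_id,timestamp,kwh\n"
    "H1,2012-01-01 00:00,0.25\n"
    "H1,2012-01-01 00:30,0.5\n"
    "H2,2012-01-01 00:00,0.125\n"
    "H2,2012-01-01 00:30,n/a\n"
)
totals = total_kwh_per_household(sample)
print(totals)
```

With the real file, an open file object (for example from `open(path, newline="")`) can be passed in place of the in-memory sample, so only one row is held in memory at a time.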