elisejiuqizhang/TS-AD-Datasets

Public Datasets for Time Series Anomaly Detection

Time Series Anomaly Detection Datasets

Here I summarize some publicly available datasets for time series anomaly detection.

1. Outlier Detection DataSets (ODDS)

The ODDS webpage is here. Note that the collection contains not only time series but also other data types (videos, texts, and graphs).

2. Kaggle Credit Card Fraud Detection DataSet (CCFD)

The main page is here. The dataset contains credit card transactions made by European cardholders in September 2013. For privacy and security reasons, the published features are the result of a PCA transformation rather than the original transaction attributes.

3. Yahoo Time Series Anomaly Detection Benchmark

Request access to this dataset here.

The benchmark contains four folders: A1, A2, A3, and A4.

A1Benchmark is based on real production traffic to some of the Yahoo! properties. The other three benchmarks are based on synthetic time series. The A2 and A3 benchmarks include outliers, while A4Benchmark includes change-point anomalies. The benchmarks based on real data have property and geos removed. Fields in each data file are delimited with comma (",") characters.
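Because each file is plain comma-delimited text, it can be read with any CSV parser. Below is a minimal Python sketch, assuming a header row with columns named `timestamp`, `value`, and `is_anomaly`. Those names are an assumption to check against the files you receive, and they may differ between folders. The sample text is made up for illustration and is not real benchmark data.

```python
import csv
import io

# Made-up sample in the assumed layout; not real benchmark data.
SAMPLE = """timestamp,value,is_anomaly
1,0.52,0
2,0.49,0
3,9.80,1
4,0.51,0
"""

def read_series(handle, value_col="value", label_col="is_anomaly"):
    """Return (values, labels) from a comma-delimited file with a header row."""
    values, labels = [], []
    for row in csv.DictReader(handle):
        values.append(float(row[value_col]))
        labels.append(int(row[label_col]))
    return values, labels

values, labels = read_series(io.StringIO(SAMPLE))
# Indices of the points flagged as anomalous.
anomaly_idx = [i for i, flag in enumerate(labels) if flag == 1]
print(len(values), anomaly_idx)
```

For a real file, pass `open(path, newline="")` instead of the in-memory sample, and adjust `value_col` and `label_col` if the header differs.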

4. Numenta Anomaly Benchmark (NAB)

Description of NAB can be found here.

Dataset repository is here.

5. Secure Water Treatment (SWaT) Dataset

Multivariate time series datasets collected by “iTrust, Centre for Research in Cyber Security, Singapore University of Technology and Design”. See website here to request access to the dataset and check usage requirements.

6. Water Distribution (WADI) Dataset

Also collected by “iTrust, Centre for Research in Cyber Security, Singapore University of Technology and Design”. See the website here to request access to the dataset (it can be requested in the same application as SWaT) and to check usage requirements.

7. Server Machine Dataset (SMD)

The dataset is released here as part of the authors' repository for their KDD 2019 paper "Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network".

8. UCR Time Series Anomaly Archive

Contains over 250 datasets. The link to download the dataset is here.

The maintainers of the archive also recommend reading the following papers before using the dataset: "The UEA multivariate time series classification archive, 2018" and "Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress".

9. Soil Moisture Active Passive (SMAP) Satellite Dataset

Dataset webpage is here. Check the dataset description here.

wget https://s3-us-west-2.amazonaws.com/telemanom/data.zip && unzip data.zip && rm data.zip

cd data && wget https://raw.githubusercontent.com/khundman/telemanom/master/labeled_anomalies.csv

10. Mars Science Laboratory (MSL) Curiosity Rover Dataset

Dataset webpage is here.

wget https://s3-us-west-2.amazonaws.com/telemanom/data.zip && unzip data.zip && rm data.zip

cd data && wget https://raw.githubusercontent.com/khundman/telemanom/master/labeled_anomalies.csv
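SMAP and MSL share the same archive and the same `labeled_anomalies.csv`, so detectors that need per-timestep labels usually expand the listed anomaly ranges into 0/1 vectors. The Python sketch below shows one way to do that. It assumes the CSV has the columns `chan_id`, `spacecraft`, `anomaly_sequences` (a string holding a list of `[start, end]` index pairs), and `num_values`, and that both endpoints are inclusive. Verify these assumptions against the downloaded file. The sample rows are made up.

```python
import ast
import csv
import io

# Made-up rows in the assumed layout of labeled_anomalies.csv.
SAMPLE = '''chan_id,spacecraft,anomaly_sequences,class,num_values
X-1,SMAP,"[[3, 5]]",[point],10
Y-1,MSL,"[[0, 1], [7, 8]]","[point, contextual]",12
'''

def pointwise_labels(handle, spacecraft):
    """Map each channel of one spacecraft to a 0/1 label list."""
    out = {}
    for row in csv.DictReader(handle):
        if row["spacecraft"] != spacecraft:
            continue
        labels = [0] * int(row["num_values"])
        # The ranges are stored as text, so parse them safely.
        for start, end in ast.literal_eval(row["anomaly_sequences"]):
            # Assumption: both endpoints are inclusive.
            last = min(end, len(labels) - 1)
            for i in range(start, last + 1):
                labels[i] = 1
        out[row["chan_id"]] = labels
    return out

smap = pointwise_labels(io.StringIO(SAMPLE), "SMAP")
msl = pointwise_labels(io.StringIO(SAMPLE), "MSL")
print(smap["X-1"])
print(msl["Y-1"])
```

Filtering on the `spacecraft` column is what separates the SMAP channels from the MSL channels after the shared download.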

11. Skoltech Anomaly Benchmark (SKAB)

Dataset repo is here.

12. Artificial Intelligence for IT Operations (AIOps) Challenge Datasets

These datasets are maintained by the Netman Lab at Tsinghua University. The group's GitHub profile can be found here.

The KPI dataset from their 2018 challenge is here, and the 2020 data is here.

13. Pooled Server Metric (PSM) Dataset

This dataset was collected by eBay and released here, in the repository for RANSynCoders, an anomaly detection model they proposed.

14. PhysioNet Open Access Databases

Check the PhysioNet Data webpage here. These datasets are all medicine-related.

One of these datasets, the MIT-BIH Supraventricular Arrhythmia Database, was used in the VLDB 2022 paper "TranAD: deep transformer networks for anomaly detection in multivariate time series data".

15. Datasets Related to Power Systems from IEEE Dataport

a) CYBER-PHYSICAL DATASET OF HARDWARE-IN-THE-LOOP CYBER-PHYSICAL POWER SYSTEMS TESTBED UNDER MITM ATTACKS

Dataset main page is here.

This dataset was collected by performing different Man-in-the-Middle (MiTM) attacks on the synthetic cyber-physical electric grid in the RESLab Testbed at Texas A&M University, US.

b) DATASET OF PORT SCANNING ATTACKS ON EMULATION TESTBED AND HARDWARE-IN-THE-LOOP TESTBED

Dataset main page is here.

The dataset was generated by performing four scenarios of port scanning attacks on an 8-substation supervisory control and data acquisition (SCADA) system in three different environments: minimega at Sandia National Lab (SNL), the Common Open Research Emulator (CORE) at Texas A&M University, and the hardware-in-the-loop RESLab Testbed at Texas A&M University.

c) ICS DATASET FOR SMART GRID ANOMALY DETECTION

Dataset main page is here. The dataset contains both normal traffic and communication with anomalies (cyber attacks, link failure, etc.).

16. Water Quality Dataset at GECCO 2018 Challenge

Download the dataset here.

17. Application Server Dataset (ASD)

The dataset can be found here, within the code repository of a KDD 2021 paper.

Time Series Classification Datasets That Could Potentially Be Used for Anomaly Detection

Another common approach I see is to repurpose time series classification datasets for anomaly detection: preprocess the dataset by selecting one or a few minority classes and labeling them as anomalies, with the remaining classes treated as normal.
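The relabeling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `minority_as_anomaly`, the choice to pick classes purely by frequency, and the toy labels are all hypothetical, and in practice which classes count as anomalous is a judgment call for each dataset.

```python
from collections import Counter

def minority_as_anomaly(class_labels, n_minority=1):
    """Return (binary_labels, anomalous_classes).

    The n_minority rarest classes are mapped to 1 (anomaly),
    all other classes to 0 (normal).
    """
    counts = Counter(class_labels)
    # Rarest first; ties are broken by class name so the result is deterministic.
    ranked = sorted(counts, key=lambda c: (counts[c], str(c)))
    anomalous = set(ranked[:n_minority])
    binary = [1 if c in anomalous else 0 for c in class_labels]
    return binary, anomalous

# Toy class labels, one per time series (made up for illustration).
y = ["walk"] * 6 + ["run"] * 3 + ["fall"]
binary, anomalous = minority_as_anomaly(y)
binary2, anomalous2 = minority_as_anomaly(y, n_minority=2)
print(anomalous, sum(binary))
print(sorted(anomalous2), sum(binary2))
```

A frequent follow-up choice is to train only on series labeled 0 and keep the series labeled 1 for the test split, so the detector never sees the anomalous classes during training.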

1. UCI Machine Learning Repository Dataset - Time Series Classification

Look for time series datasets for classification tasks on the UCI repo webpage here.

2. UEA & UCR Time Series Classification Repository

Dataset mainpage is here.

3. Industrial Control System (ICS) Cyber Attack Datasets

Dataset webpage is here.

4. Ausgrid Solar Home Electricity Dataset

The dataset main page is here. The dataset providers have published a paper, "Residential load and rooftop PV generation: an Australian distribution network dataset", describing the dataset. There is also a GitHub repo that analyzes the dataset's characteristics. A paper titled "Anomaly Detection in Smart Meter Data for Preventing Potential Smart Grid Imbalance" (here) uses this dataset for anomaly detection.

5. SmartMeter Energy Consumption Data in London Households

Dataset webpage is here. It contains energy consumption readings for a sample of 5,567 London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. Readings were taken at half-hourly intervals. The customers in the trial were recruited as a balanced sample representative of the Greater London population. The CSV file (energy consumption in kWh per half hour, unique household identifier, date, and time) is around 10 GB when unzipped and contains around 167 million rows.
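A file of this size may not fit in memory, so one option is to stream it row by row and aggregate as you go. The Python sketch below does this with the standard `csv` module. The column names `household_id`, `date`, `time`, and `kwh` are hypothetical placeholders (check the real header before use), and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up rows with hypothetical column names; not the real file header.
SAMPLE = """household_id,date,time,kwh
H001,2012-01-01,00:00,0.12
H001,2012-01-01,00:30,0.10
H002,2012-01-01,00:00,0.30
"""

def total_kwh_per_household(handle, id_col="household_id", kwh_col="kwh"):
    """Sum consumption per household while holding only one row in memory."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):  # the reader yields one row at a time
        totals[row[id_col]] += float(row[kwh_col])
    return dict(totals)

totals = total_kwh_per_household(io.StringIO(SAMPLE))
print(sorted(totals))
```

For the real file, pass `open(path, newline="")` as the handle. If the data contains missing or non-numeric readings, the `float` conversion would need a guard.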
