Open a terminal and navigate to the `data_generator` folder using the following command:

`cd data_generator`

From there, run:

`docker-compose up`
This will populate the `dataset` folder with 10 artificially generated leak scenarios, which takes around 5 minutes. Once the data has been generated, the container exits automatically.
For each pipeline in the `config.ini` file, ensure that `train_model` is set to `true` so that all models are generated and trained when the application starts. This is the default when the repository is cloned directly. After the first run, `train_model` can be set to `false` to save time when testing with the same scenario.
Start the application with the following command:

`docker-compose up`
On the first run, all models are trained before detection starts. `pipeline0` and `pipeline1` train very quickly. The training progress of `pipeline2` and `pipeline3` can be monitored by following the output of the corresponding containers. As soon as training finishes, each pipeline starts the leak detection process.
Data should start flowing into the database; this can be monitored using Grafana, which is available at http://localhost:3000 with the following credentials:

- username: admin
- password: bitnami

InfluxDB can also be accessed directly at http://localhost:8086 with the following credentials:

- username: admin
- password: bitnami123
The daily operation of a water distribution network involves many tasks, such as ensuring the quality and quantity of the delivered water. A significant problem for network operators is the formation of leaks, caused for example by wear and tear of pipes.
There is already a large body of research dedicated to the challenge of detecting these leaks automatically. The tool at hand provides a way to experiment with different leak detection methods on synthesized data.
Name | Model name | Model type | Time series compatible | Leak localization | Output | Thresholding method |
---|---|---|---|---|---|---|
pipeline0 | Fault Sensitivity Matrix | Statistical | No | Yes | Correlation matrix | Static scaler |
pipeline1 | Random Forest Classifier | Machine Learning | No | No | Binary label | Non-applicable |
pipeline2 | LSTM Neural Network | Deep Learning | Yes | Yes | Flow predictions | Confidence interval |
pipeline3 | Facebook Prophet | Statistical | Yes | Yes | Flow predictions | Confidence interval |
The tool uses a number of Docker containers that communicate with each other in different ways. A visual overview of the technologies used is shown below.
The data generator uses WNTR to synthesize a dataset, which is then fed to Kafka to manage the distribution of the data.
Apache Kafka is an event streaming platform. In this project it serves to simulate the real-time streaming of data taken from the synthesized dataset. Kafka sends the data to the pipelines as messages at a set interval.
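As a rough illustration of this streaming step (not the project's actual producer code), the sketch below sends rows of a pandas DataFrame as JSON messages at a fixed interval using the kafka-python client; the broker address, topic name and file path are assumptions.

```python
# Illustrative sketch only: streams rows of a CSV as Kafka messages at a fixed
# interval. The broker address, topic name and file path are hypothetical.
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("dataset/scenario-1/pressures.csv")          # hypothetical path
for _, row in df.iterrows():
    producer.send("sensor-data", row.to_dict())               # hypothetical topic
    time.sleep(0.5)                                           # message_frequency

producer.flush()
```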
InfluxDB serves as the data sink in this project. Predictions from the pipelines are saved there, along with other data such as the performance over time and the generated flow and pressure data from the data generator.
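Purely as an illustration of what a pipeline's write to the sink might look like (assuming InfluxDB 2.x and the influxdb-client library; the token, organisation, bucket and field names below are hypothetical, not taken from this project):

```python
# Hypothetical sketch of writing one prediction point with influxdb-client;
# token, org, bucket, measurement and field names are placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086",
                        token="example-token",                # placeholder
                        org="example-org")                    # placeholder
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("predictions")                                      # placeholder measurement
    .tag("pipeline", "pipeline2")
    .field("leak_detected", 1)
    .field("predicted_flow", 42.7)
)
write_api.write(bucket="leak_detection", record=point)        # placeholder bucket
```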
Grafana is our main tool for visualizing the data. Provisioned dashboards are set up within Grafana, enabling the user to view the generated predictions in real time. There is also a separate dashboard for viewing the performance of the algorithms over time.
`wdn_input_file_name` - Name of the file to use as the input network. Input network files should be kept in the `wdn_input_files` folder.

`message_frequency` - This sets the delay for streaming the data using Kafka. If it is set to 0.5, pressure and flow data will be sent every 0.5 seconds.

`scenario_name` - Name of the scenario to use. This is also the name used for data synthesis, so it should be set in line with the `scenario_path` value in most cases.

`experiment_start_time` - This is the time from which Kafka should start streaming the data.

`scenario_path` - Path of the scenario that is used for all the experiments.
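The exact section layout of config.ini is not shown here; as a hedged sketch, a pipeline could read these values with Python's configparser roughly as follows (the section name is an assumption, the keys are those listed above):

```python
# Sketch of reading the streaming settings from config.ini with configparser;
# the section name "kafka" is an assumption, the keys match those listed above.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

section = config["kafka"]                                     # assumed section name
wdn_input_file_name = section["wdn_input_file_name"]
message_frequency = section.getfloat("message_frequency")     # e.g. 0.5 seconds
scenario_name = section["scenario_name"]
experiment_start_time = section["experiment_start_time"]
scenario_path = section["scenario_path"]
```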
The data is synthesized using WNTR, which is built upon EPANET, the industry standard for water distribution network simulation. A short synthesis sketch is shown after the parameter list below.
`demand_input_file_path` - This is the path to the file that serves as the demand pattern for the synthesized scenario.

`simulation_start_time` - A date from which the simulation starts.

`train_start` - The date from which the training set starts.

`train_end` - The date that denotes the end of the training set.

`val_start` - The date from which the validation set starts.

`val_end` - The date that denotes the end of the validation set.

`test_start` - The date from which the test set starts.

`test_end` - The date that denotes the end of the test set.

`leak_diameter` - Size of the leak; LeakDB takes this to be in the range [0.02, 0.2).

`skip_nodes` - A list of node names describing which node data to leave out of the final file. This is helpful since reservoir nodes provide no useful data in terms of pressure and thus pollute the dataset.

`synthesize_data` - Whether to freshly synthesize data. If this is set to true, a new dataset will be generated upon starting the tool.
`is_leak_scenario` - Whether the synthesized dataset should be a leak scenario or not.

`leak_node` - The node at which the leak should occur.
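To give an idea of what the generator does with these settings, here is a minimal, hedged sketch of synthesizing a single leak scenario with WNTR; the input file, node name and times are placeholders rather than project defaults.

```python
# Minimal sketch of a WNTR leak scenario; file name, node name and times are
# placeholders, not values taken from this project's configuration.
import math

import wntr

wn = wntr.network.WaterNetworkModel("wdn_input_files/network.inp")  # wdn_input_file_name

# WNTR expects a leak area, so convert from a leak diameter in the
# LeakDB range [0.02, 0.2).
leak_diameter = 0.05
leak_area = math.pi * (leak_diameter / 2) ** 2

node = wn.get_node("J-10")                        # hypothetical leak_node
node.add_leak(wn, area=leak_area,
              start_time=2 * 24 * 3600,           # leak starts on day 2
              end_time=5 * 24 * 3600)             # and ends on day 5

# The pressure-dependent WNTRSimulator supports the leak model.
sim = wntr.sim.WNTRSimulator(wn)
results = sim.run_sim()

pressure = results.node["pressure"]               # DataFrame: time x node
flow = results.link["flowrate"]                   # DataFrame: time x link
```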
In total there are four implemented leak detection methods, each with its own Docker container. A description of the different methods follows.
This pipeline uses a method based on a paper by Puig et al. A leak scenario is simulated for each node to generate a pressure signature for a leak at that node. We then construct a fault sensitivity matrix from the signatures of all the nodes. Finally, to determine whether there is a leak, we compute the correlation between the current pressure signature and the fault sensitivity matrix. If the correlation exceeds a certain threshold, the algorithm classifies the current time as a leak.
`train_model` - Determines whether the algorithm should be trained again.

`correlation_threshold` - This is used as the threshold for the correlation.
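This is not the project's implementation, but the final correlate-and-threshold step could look roughly like the following sketch, assuming the fault sensitivity matrix has already been built from the per-node leak simulations (one column per candidate leak node):

```python
# Hedged sketch of the correlation step; `sensitivity_matrix` is assumed to
# hold one simulated leak signature per candidate node (columns).
import numpy as np

def detect_leak(residual, sensitivity_matrix, correlation_threshold=0.8):
    """Correlate the current pressure residual with each column of the fault
    sensitivity matrix; flag a leak (and its most likely node) if the best
    correlation exceeds the threshold."""
    correlations = np.array([
        np.corrcoef(residual, sensitivity_matrix[:, j])[0, 1]
        for j in range(sensitivity_matrix.shape[1])
    ])
    best = int(np.argmax(correlations))
    return correlations[best] > correlation_threshold, best

# Toy usage with random numbers standing in for simulated signatures.
rng = np.random.default_rng(0)
fsm = rng.normal(size=(32, 5))            # 32 sensors x 5 candidate leak nodes
residual = fsm[:, 2] + rng.normal(scale=0.1, size=32)
print(detect_leak(residual, fsm, correlation_threshold=0.8))
```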
A common approach seen in the literature is the use of a machine learning classifier for the detection of leaks. This pipeline is dedicated to that type of approach. We have a basic `RandomForestClassifier` from the `sklearn` library implemented here. The idea is to train the classifier on time-series data from a leak scenario, treating each time point as a sample for classification. The classifier is fed binary labels which simply reflect whether there is a leak or not.
`train_scenario_path` - This is used as the path for the training set. This should be set to a different scenario than the one we are experimenting with, since we want the model to be tested on novel data.
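As a hedged sketch of this approach (the feature layout, file paths and label column below are assumptions, not the project's actual preprocessing):

```python
# Sketch of the classifier approach with scikit-learn; file paths and the
# column layout (sensor readings plus a binary "leak" column) are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# One row per time step: pressures/flows as features, binary leak label.
train = pd.read_csv("dataset/train_scenario/features.csv")    # hypothetical path
test = pd.read_csv("dataset/test_scenario/features.csv")      # hypothetical path

X_train, y_train = train.drop(columns=["leak"]), train["leak"]
X_test, y_test = test.drop(columns=["leak"]), test["leak"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```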
This pipeline uses a method based on a paper by Lee and Yoo. The method involves predicting flow using one of the state-of-the-art methods for time-series prediction: a long short-term memory (LSTM) neural network. The flow is first predicted by the network; then, based on the performance of the network on the validation set, we generate a confidence interval for the prediction. If the measured flow falls outside the confidence interval of the prediction, we say there is a leak.
`train_model` - Determines whether the algorithm should be trained again.

`train_scenario_path` - This is used as the path for the training set. This should be set to a different scenario than the one we are experimenting with, since we want the model to be tested on novel data.
`z_value` - This is our z-value for calculating the confidence interval. A larger z-value means a larger confidence interval.

`sequence_length` - This is our look-back for the time-series prediction. If it is 3, we look at the 3 previous values to predict the current value.

`sampling_rate` - This determines how often we sample to get our series for prediction. If it is 48, we sample every 48 values. The data synthesizer generates half-hourly data, so a `sampling_rate` of 48 corresponds to one sample per day, meaning we predict based on the values from the previous `sequence_length` days.
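For orientation, here is a minimal sketch of the prediction-plus-interval idea using Keras; the layer sizes, training settings and synthetic series are illustrative assumptions, not this pipeline's actual model.

```python
# Hedged sketch of the LSTM prediction + confidence interval idea using Keras;
# network size, training settings and the synthetic series are illustrative.
import numpy as np
import tensorflow as tf

sequence_length = 3        # look-back, as in the sequence_length setting
z_value = 1.96             # as in the z_value setting

def make_windows(series, length):
    """Turn a 1-D series into (samples, look-back, 1) windows and targets."""
    X = np.array([series[i:i + length] for i in range(len(series) - length)])
    y = series[length:]
    return X[..., np.newaxis], y

# Placeholder series; in the tool these would be flows from the dataset.
rng = np.random.default_rng(0)
train_flow = np.sin(np.linspace(0, 60, 2000)) + rng.normal(scale=0.05, size=2000)
val_flow = np.sin(np.linspace(60, 75, 500)) + rng.normal(scale=0.05, size=500)

X_train, y_train = make_windows(train_flow, sequence_length)
X_val, y_val = make_windows(val_flow, sequence_length)

model = tf.keras.Sequential([tf.keras.layers.LSTM(32), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=5, verbose=0)

# Confidence interval width derived from the validation residuals; a measured
# flow outside [prediction - interval, prediction + interval] counts as a leak.
residuals = y_val - model.predict(X_val, verbose=0).ravel()
interval = z_value * residuals.std()
print(f"half-width of the confidence interval: {interval:.3f}")
```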
This pipeline uses Facebook's Prophet model for flow prediction. We also use the model's built-in confidence interval values to generate the thresholds for leak detection.
`train_model` - Determines whether the algorithm should be trained again.
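A minimal sketch of this idea with the Prophet library follows; the series, interval width and forecast horizon are illustrative assumptions, not this pipeline's actual settings.

```python
# Hedged sketch of Prophet-based flow prediction with its built-in interval;
# the series, interval width and horizon are placeholders.
import numpy as np
import pandas as pd
from prophet import Prophet

# Prophet expects a DataFrame with columns "ds" (timestamp) and "y" (value).
timestamps = pd.date_range("2021-01-01", periods=48 * 30, freq="30min")
history = pd.DataFrame({
    "ds": timestamps,
    "y": 10 + np.sin(np.arange(len(timestamps)) * 2 * np.pi / 48),  # fake flow
})

model = Prophet(interval_width=0.95)       # controls the confidence interval
model.fit(history)

future = model.make_future_dataframe(periods=48, freq="30min")
forecast = model.predict(future)

# Measured flow outside [yhat_lower, yhat_upper] would be flagged as a leak.
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(3))
```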