A multithreaded package to validate, curate, and transform large heterogeneous datasets using reproducible recipes, which can be written either in human-readable TOML format or in Julia.
- Installation
- Two Simple Examples
- Walkthrough
- Conditions & Rules: your toolbox of checks to run and actions to perform
- Usage -- How to use the Julia API if you need to
- Remote usage: posting to Slack, Owncloud, SSH/SCP
- Using Python or R
DataCurator is a Swiss army knife that ensures:
- pipelines can focus on the algorithm/problem solving
- human-readable recipes for future reproducibility
- validation of huge datasets at high speed
- out-of-the-box operation without the need for code or dependencies
We'll show two simple examples of how to get started.
DataCurator works on recipes: TOML text files (see examples), which we include inline here to illustrate how to use them.
Note: all the examples are tested automatically, so you can rest assured that they work.
Let's say you have a dataset that you expect to contain only CSV files. At this point, you don't care much about the structure or hierarchy of the files, or the naming patterns. You want to create one report (text file) listing all CSV files, and another listing the files or directories that are not.
```toml
[global]
inputdirectory = "testdir"

[any]
conditions = ["is_csv_file"]
actions = [["log_to_file", "non_csvs.txt"]]
counter_actions = [["log_to_file", "csvs.txt"]]
```
Execute:
```
./DataCurator.sif -r myrecipe.toml
```
When it completes, you will have two text files in your current working directory: non_csvs.txt and csvs.txt.
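Note the inversion in the recipe above: the conditions test for CSV files, yet actions writes non_csvs.txt. This only makes sense if, by default, actions fire when a condition fails and counter_actions fire when it holds; the second example below sets act_on_success=true explicitly, which supports that reading. Under that assumption, here is a sketch of the same report with the logic flipped:

```toml
[global]
# Assumption: act_on_success = true makes actions fire on matching files,
# as in the tif example further below, so the two log files swap roles
act_on_success = true
inputdirectory = "testdir"

[any]
conditions = ["is_csv_file"]
actions = [["log_to_file", "csvs.txt"]]
counter_actions = [["log_to_file", "non_csvs.txt"]]
```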
So far we've been looking at file names and types, but DataCurator can look inside files as well, and transform their contents. Where validation only verifies datasets without changing them, curation can change the data. Curation is often step 2 after validation: it's nice to check whether your expectations match the data, but if they don't, you still need to intervene.
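To make the distinction concrete, here is a minimal sketch built only from primitives shown in this walkthrough; the active actions line merely reports (validation), while the commented-out alternative would rewrite file names on copies (curation). The combination is our illustration, not a recipe from the official examples:

```toml
[global]
act_on_success = true
inputdirectory = "testdir"

[any]
conditions = ["is_csv_file"]
# Validation: report matching files, leave the data untouched
actions = [["log_to_file", "csvs.txt"]]
# Curation (alternative): write lowercase-renamed copies, changing the dataset
# actions = [{name_transform = ["tolowercase"], mode = "copy"}]
```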
Let's say you have a dataset with image files in tif format. Rather than just building lists of them, or checking that they're there, we want to do some pre-processing on the right files. We also want to change the file names, because they mix upper and lower case, and the analysis pipeline we will later feed them into expects lowercase only.
Note that `#` starts a comment line.
```toml
# Start of the recipe, this configures global options
[global]
act_on_success = true
inputdirectory = "testdir"

# Your rules, `any` means you do not care at what level/depth files are checked
[any]
# When to act, in this case, you want to only work on tif files
conditions = ["is_tif_file"]
# What to do
actions = [{name_transform = ["tolowercase"],
            content_transform = [["gaussian", 3],
                                 "laplacian",
                                 ["threshold_image", "abs >", 0.01],
                                 ["apply_to_image", ["abs"]],
                                 "otsu_threshold_image",
                                 "erode_image"],
            mode = "copy"}]
```
This is already fairly complex, but it shows that you can stack any number of actions on top of any number of conditions, giving you a lot of freedom: the content_transform steps run in the order listed (blurring, filtering, thresholding, erosion), and mode="copy" suggests the transforms are applied to copies rather than in place. A further stacking sketch follows after the next sentence.
And yet, you did not need to write any code.
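As that further illustration, here is a sketch of stacking a reporting action next to a transform on the same condition. Both action forms appear separately in the examples above; combining them in one list is our assumption, not something taken from the official examples:

```toml
[any]
conditions = ["is_tif_file"]
actions = [["log_to_file", "processed_tifs.txt"],   # assumption: log and transform can share a list
           {name_transform = ["tolowercase"],
            content_transform = ["otsu_threshold_image", "erode_image"],
            mode = "copy"}]
```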
In full_api.toml you can see an example of how to specify an entire image processing pipeline with a simple recipe.
If you experience any problems, please create an issue with the DataCurator version, the recipe/template, and sample data to reproduce it, including the Julia version and OS.
DataCurator could not work without packages such as:
- Slack.jl
- Images.jl
- PyCall.jl/Conda.jl
- RCall.jl
- SlurmMonitor.jl

and many more; see the dependencies.