A multithreaded package to validate, curate, and transform large heterogeneous datasets using reproducible recipes, which can be written either in human-readable TOML format or in Julia.
- Installation
- Two Simple Examples
- Walkthrough
- Conditions & Rules: your toolbox of checks to run and actions to perform
- Usage -- How to use the Julia API if you need to
- Remote usage: posting to Slack, Owncloud, SSH/SCP
- Using Python or R
DataCurator is a Swiss army knife that ensures:
- pipelines can focus on the algorithm/problem solving
- human-readable recipes for future reproducibility
- validation of huge datasets at high speed
- out-of-the-box operation without the need for code or dependencies
We'll show two simple examples of how to get started.
DataCurator works on recipes: TOML text files (see examples), which we include inline here to illustrate how to use them.
Note: all the examples are tested automatically, so you can rest assured that they work.
Let's say you have a dataset that you expect to contain only CSV files. At this point, you don't care much about the structure or hierarchy of the files, or the naming patterns. You want to create one report (text file) listing all CSV files, and another listing the files or directories that are not.
```toml
[global]
inputdirectory = "testdir"

[any]
conditions = ["is_csv_file"]
actions = [["log_to_file", "non_csvs.txt"]]
counter_actions = [["log_to_file", "csvs.txt"]]
```
Execute:
```
./DataCurator.sif -r myrecipe.toml
```
When it completes, you will have two text files in your current working directory: non_csvs.txt and csvs.txt.
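Note the inversion in the recipe above: the conditions test for CSV files, yet actions writes non_csvs.txt. This only makes sense if, by default, actions fire when a condition fails and counter_actions fire when it holds; the second example below sets act_on_success=true explicitly, which supports that reading. Under that assumption, here is a sketch of the same report with the logic flipped:

```toml
[global]
# Assumption: act_on_success = true makes actions fire on matching files,
# as in the tif example further below, so the two log files swap roles
act_on_success = true
inputdirectory = "testdir"

[any]
conditions = ["is_csv_file"]
actions = [["log_to_file", "csvs.txt"]]
counter_actions = [["log_to_file", "non_csvs.txt"]]
```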
So far we've been looking at file names and types, but DataCurator can look inside files as well, and transform their contents. Where validation only verifies datasets without changing them, curation can change the data. Curation is often step 2 after validation: it's nice to check whether your expectations match the data, but if they don't, you still need to intervene.
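To make the distinction concrete, here is a minimal sketch built only from primitives shown in this walkthrough; the active actions line merely reports (validation), while the commented-out alternative would rewrite file names on copies (curation). The combination is our illustration, not a recipe from the official examples:

```toml
[global]
act_on_success = true
inputdirectory = "testdir"

[any]
conditions = ["is_csv_file"]
# Validation: report matching files, leave the data untouched
actions = [["log_to_file", "csvs.txt"]]
# Curation (alternative): write lowercase-renamed copies, changing the dataset
# actions = [{name_transform = ["tolowercase"], mode = "copy"}]
```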
Let's say you have a dataset with image files in tif format. Rather than just building lists of them, or checking that they're there, we want to do some pre-processing on the right files. We also want to change the file names, because they mix upper and lower case, and the analysis pipeline we will later feed them into expects lowercase only.
Note that `#` starts a comment line.
```toml
# Start of the recipe, this configures global options
[global]
act_on_success = true
inputdirectory = "testdir"

# Your rules, `any` means you do not care at what level/depth files are checked
[any]
# When to act, in this case, you want to only work on tif files
conditions = ["is_tif_file"]
# What to do
actions = [{name_transform = ["tolowercase"],
            content_transform = [["gaussian", 3],
                                 "laplacian",
                                 ["threshold_image", "abs >", 0.01],
                                 ["apply_to_image", ["abs"]],
                                 "otsu_threshold_image",
                                 "erode_image"],
            mode = "copy"}]
```
This is already fairly complex, but it shows that you can stack any number of actions on top of any number of conditions, giving you a lot of freedom: the content_transform steps run in the order listed (blurring, filtering, thresholding, erosion), and mode="copy" suggests the transforms are applied to copies rather than in place. A further stacking sketch follows after the next sentence.
And yet, you did not need to write any code.
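As that further illustration, here is a sketch of stacking a reporting action next to a transform on the same condition. Both action forms appear separately in the examples above; combining them in one list is our assumption, not something taken from the official examples:

```toml
[any]
conditions = ["is_tif_file"]
actions = [["log_to_file", "processed_tifs.txt"],   # assumption: log and transform can share a list
           {name_transform = ["tolowercase"],
            content_transform = ["otsu_threshold_image", "erode_image"],
            mode = "copy"}]
```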
In full_api.toml you can see an example of how to specify an entire image processing pipeline with a simple recipe.
If you experience any problems, please create an issue with the DataCurator version, the recipe/template, and sample data to reproduce it, including the Julia version and OS.
DataCurator could not work without packages such as:
- Slack.jl
- Images.jl
- PyCall.jl/Conda.jl
- RCall.jl
- SlurmMonitor.jl

and many more; see the dependencies.