Snakemake workflow to set up external data for data analyses. The data sources can be local or remote files.
The easiest way to install is via pip:
python -m pip install git+https://github.com/percyfal/datasources-smk@main
Alternatively, grab a copy of the source distribution and make a local install:
git clone https://github.com/percyfal/datasources-smk.git
cd datasources-smk
python -m pip install -e .
The workflow and additional commands are run via the main entry point:
datasources -h
datasources run -j 1
datasources run --configfile datasources.yaml
See the subcommand help for more information.
This workflow reads a datasources YAML file whose list elements
consist of data and source keys, or alternatively a tab-separated
file with columns data and source. The data and source keys define a
file URI mapping from a source to a Snakemake target. The currently
supported URI schemes are rsync, file, sftp, http and https.
There are two optional keys: description is a free-text field for
provenance information, and tag is a label for grouping data types so
that subsets of datasources can be targeted.
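
The key layout and scheme list described above can be checked with a few lines of Python. This is a minimal sketch, not the workflow's own validation code: the helper name check_entry and the messages it returns are hypothetical, and only the key names and schemes are taken from the text.

```python
from urllib.parse import urlsplit

# Schemes and keys as listed in the text above.
SUPPORTED_SCHEMES = {"rsync", "file", "sftp", "http", "https"}
REQUIRED_KEYS = {"data", "source"}
OPTIONAL_KEYS = {"description", "tag"}


def check_entry(entry):
    """Return a list of problems found in one datasources entry.

    Hypothetical helper for illustration; the workflow may validate
    its input differently.
    """
    problems = []
    keys = set(entry)
    missing = REQUIRED_KEYS - keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    unknown = keys - REQUIRED_KEYS - OPTIONAL_KEYS
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    source = entry.get("source", "")
    scheme = urlsplit(source).scheme
    if source and scheme not in SUPPORTED_SCHEMES:
        problems.append(f"unsupported scheme: {scheme!r}")
    return problems
```

An entry with both required keys and a supported scheme yields an empty list; anything else yields one message per problem.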
The datasources file can be provided via the --configfile option. If
unset, the workflow will look for the files datasources.yaml,
datasources.tsv, config/datasources.yaml and config/datasources.tsv,
in that order.
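
The lookup order can be illustrated with a short Python sketch. The function name find_datasources and its root parameter are assumptions made for this example; the actual lookup in the workflow may be implemented differently.

```python
from pathlib import Path

# Candidate files, in the order given in the text above.
CANDIDATES = (
    "datasources.yaml",
    "datasources.tsv",
    "config/datasources.yaml",
    "config/datasources.tsv",
)


def find_datasources(root=".", configfile=None):
    """Return the datasources file to use, or None if none is found.

    Hypothetical helper: an explicit configfile takes precedence,
    otherwise the first existing candidate under root is returned.
    """
    if configfile is not None:
        return Path(configfile)
    for name in CANDIDATES:
        path = Path(root) / name
        if path.is_file():
            return path
    return None
```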
URIs are given according to the generic URI syntax. For instance, a
local file is given as file:relative/path/to/source, whereas examples
of remote files are rsync://example.com:80/absolute/path/to/source and
sftp://example.com:80/absolute/path/to/source.
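
To see how such URIs break down into components, the example URIs can be split with Python's standard library. This is only an illustration of the generic syntax; the helper describe_uri is hypothetical and the workflow is not claimed to parse URIs this way.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the parts relevant for a datasources entry."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # None when the URI has no authority part
        "port": parts.port,      # None when no port is given
        "path": parts.path,
    }


local = describe_uri("file:relative/path/to/source")
remote = describe_uri("rsync://example.com:80/absolute/path/to/source")
```

Here local has scheme file, no host or port, and the relative path relative/path/to/source, while remote has scheme rsync, host example.com, port 80 and the absolute path /absolute/path/to/source.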
A tsv-formatted datasources file can look like
data source
data/foo1.txt rsync:external_resources/foo1.txt
data/foo2.txt file:external_resources/foo2.txt
data/README.md https://raw.githubusercontent.com/percyfal/datasources-smk/main/README.md
data/foo/foo*txt file:external_resources/
and the corresponding yaml file
- data: data/foo1.txt
  source: rsync:external_resources/foo1.txt
  description: foo1 data file to copy
- data: data/foo2.txt
  source: file:external_resources/foo2.txt
  description: foo2 data file to link
- data: data/README.md
  source: https://raw.githubusercontent.com/percyfal/datasources-smk/main/README.md
  description: Grab readme file from github
- data: data/foo/foo*txt
  source: file:external_resources/
  description: >-
    link all *txt files from directory external_resources to directory
    data/foo
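
The correspondence between the two formats can be sketched in Python: reading the tab-separated table row by row gives the same list-of-mappings structure as the YAML file. The function read_tsv_datasources is a hypothetical name for this example, and the sketch assumes a header row with tab-separated columns.

```python
import csv
import io


def read_tsv_datasources(handle):
    """Read a tab-separated datasources table into a list of dicts.

    Hypothetical helper; empty cells are dropped so that optional
    columns behave like absent YAML keys.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [{k: v for k, v in row.items() if v} for row in reader]


# In-memory stand-in for a datasources.tsv file.
example = io.StringIO(
    "data\tsource\n"
    "data/foo1.txt\trsync:external_resources/foo1.txt\n"
    "data/foo2.txt\tfile:external_resources/foo2.txt\n"
)
entries = read_tsv_datasources(example)
```

Each element of entries is a mapping with data and source keys, matching one list element of the YAML form.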
- Per Unneberg (@percyfal)
Test cases are in the subfolder src/datasources/.test. They are
automatically executed via continuous integration with GitHub Actions.