Eval.nf input data formatting
The eval.nf variant callset evaluation workflow accepts its input data through one of two mutually exclusive parameters: --sample_sheet or --callsets_dir. Both are explained in greater detail below.
(🟢 Likely the easier option to understand.) With the --sample_sheet parameter the user can specify a sample sheet (a CSV file) as the eval.nf workflow's input data. The sample sheet must contain a table of comma-separated values. The first line of the table must be a header of the form "index,truthset,callset". Each following line lists one truthset and its corresponding callset. The column order is not fixed, but it must be the same on every line, including the header. In general, the sample sheet looks like:
index,truthset,callset
1,/path/to/truthsetOne.vcf,/path/to/callsetOne.vcf
2,/path/to/truthsetTwo.vcf,/path/to/callsetTwo.vcf
3,/path/to/truthsetThree.vcf,/path/to/callsetThree.vcf
4,/path/to/truthsetFour.vcf,/path/to/callsetFour.vcf
...
Here is a more concrete example of a sample sheet. It uses CIEVaD's test data from the aux/ci_data/ subdirectory and truthset data from a default hap.nf run, and assumes CIEVaD was downloaded to your home (~) directory:
index,truthset,callset
1,~/cievad/results/simulated_hap1.vcf,~/cievad/aux/ci_data/callset_1.vcf.gz
2,~/cievad/results/simulated_hap2.vcf,~/cievad/aux/ci_data/callset_2.vcf.gz
3,~/cievad/results/simulated_hap3.vcf,~/cievad/aux/ci_data/callset_3.vcf.gz
As of May 7th, 2024 (version 0.3.0), functionality has only been tested and confirmed with absolute file paths. In contrast to the --callsets_dir option below (section "Input directory"), the truth- and callsets do not need to follow a naming convention. It is up to the user to verify that the truthset and callset on each line correspond to each other for evaluation.
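Because only absolute paths are confirmed to work, it can help to generate the sheet rather than type it by hand. The following is a minimal shell sketch and not part of CIEVaD: the function name make_sample_sheet and its three arguments are hypothetical. It assumes truthsets are named simulated_hap<N>.vcf and callsets callset_<N>.vcf.gz, as in the test data example above. It does not check that the files exist.

```shell
# Hypothetical helper (not part of CIEVaD).
# Usage: make_sample_sheet <truthset_dir> <callset_dir> <number_of_samples>
make_sample_sheet() {
    local truth_dir call_dir count n
    # Resolve both directories to absolute paths, since only absolute
    # paths are confirmed to work in the sample sheet.
    truth_dir=$(cd "$1" >/dev/null && pwd) || return 1
    call_dir=$(cd "$2" >/dev/null && pwd) || return 1
    count=$3

    echo "index,truthset,callset"
    n=1
    while [ "$n" -le "$count" ]; do
        # Assumed naming: simulated_hap<N>.vcf pairs with callset_<N>.vcf.gz
        echo "${n},${truth_dir}/simulated_hap${n}.vcf,${call_dir}/callset_${n}.vcf.gz"
        n=$((n + 1))
    done
}
```

For the test data, a call from the CIEVaD root directory might look like: make_sample_sheet results aux/ci_data 3 > my_samples.csv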
Finally, with a correctly formatted sample sheet (e.g. my_samples.csv in the CIEVaD root directory), a run command for the evaluation workflow looks like:
nextflow run eval.nf -profile local,conda --sample_sheet my_samples.csv
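The formatting rules for the sample sheet can be checked mechanically before starting such a run. The sketch below is a hypothetical helper, not part of CIEVaD, and the name check_sample_sheet is an assumption. It checks that the header names the three columns in any order, that every line has three fields, and that truthset and callset paths start with / or ~ as in the examples above. It does not verify that the files exist or that the pairs on each line really belong together.

```shell
# Hypothetical pre-flight check (not part of CIEVaD).
# Usage: check_sample_sheet my_samples.csv
# Prints one message per problem and returns non-zero if any were found.
check_sample_sheet() {
    awk -F',' '
        BEGIN { bad = 0 }
        { sub(/\r$/, "") }      # tolerate Windows line endings
        NR == 1 {
            # Remember the column names; they may appear in any order.
            for (i = 1; i <= NF; i++) { name[i] = $i; seen[$i] = 1 }
            if (NF != 3 || !("index" in seen) || !("truthset" in seen) || !("callset" in seen)) {
                print "line 1: header must name index, truthset and callset"
                bad = 1
            }
            next
        }
        NF != 3 {
            printf "line %d: expected 3 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            # Every column except the index column should hold a path.
            for (i = 1; i <= NF; i++) {
                first = substr($i, 1, 1)
                if (name[i] != "index" && first != "/" && first != "~") {
                    printf "line %d: %s is not an absolute path\n", NR, $i
                    bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
}
```

A possible pattern is to chain the check and the run, so the workflow only starts when the sheet passes: check_sample_sheet my_samples.csv && nextflow run eval.nf -profile local,conda --sample_sheet my_samples.csv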
Input directory
With the --callsets_dir parameter the user can specify a directory in which the workflow automatically detects variant callsets (VCF files). Each VCF file has to comply with the naming format callset_<X>.vcf[.gz], where <X> is the index of the corresponding truthset. Callsets can optionally be gzip compressed. For example, CIEVaD comes with some test data in the aux/ci_data/ subdirectory:
$ tree aux/ci_data/
aux/ci_data/
├── callset_1.vcf.gz
├── callset_2.vcf.gz
├── callset_3.vcf.gz
└── README.md
These callsets were generated from the NGS data of three simulated haplotypes from the hap.nf workflow. Hence, the index [1-3] in the filenames callset_[1-3].vcf.gz corresponds to the index of the truthsets results/simulated_hap[1-3].vcf. To use the test data with eval.nf, the run command looks like:
nextflow run eval.nf -profile local,conda --callsets_dir aux/ci_data/
Here, the corresponding truthsets are assumed to be in the default location (the results directory) and are found automatically. Tip: callsets can also be UNIX symlinks, which comes in handy when dealing with larger numbers of truth- and callsets.
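The symlink tip can be sketched as a small shell function. This is a hypothetical helper, not part of CIEVaD: the name link_callset, its arguments, and the paths in the usage comment are placeholders. It links an existing VCF into a callsets directory under the callset_<X>.vcf[.gz] name that --callsets_dir expects, so the original file keeps its own name and location.

```shell
# Hypothetical helper (not part of CIEVaD).
# Usage: link_callset <index> <path/to/file.vcf[.gz]> <callsets_dir>
link_callset() {
    local index=$1 src=$2 dest_dir=$3 ext src_dir

    # Keep the original extension so gzip-compressed files stay recognizable.
    case "$src" in
        *.vcf.gz) ext="vcf.gz" ;;
        *.vcf)    ext="vcf" ;;
        *) echo "not a .vcf or .vcf.gz file: $src" >&2; return 1 ;;
    esac
    [ -e "$src" ] || { echo "no such file: $src" >&2; return 1; }

    # Link to an absolute target so the symlink does not depend on
    # the directory it is created from.
    src_dir=$(cd "$(dirname "$src")" >/dev/null && pwd) || return 1
    mkdir -p "$dest_dir"
    ln -s "$src_dir/$(basename "$src")" "$dest_dir/callset_${index}.${ext}"
}

# Example with placeholder paths:
#   link_callset 1 /data/run_a/sampleA.vcf.gz my_callsets
#   nextflow run eval.nf -profile local,conda --callsets_dir my_callsets/
```

The index passed as the first argument still has to match the index of the corresponding truthset. The helper only takes care of the file name.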