main.nf makes the predictions.
Install Java, Groovy, Nextflow, Docker, and Git. Create accounts on GitHub and Docker Hub. Add 'docker.enabled = true' and 'docker.fixOwnership = true' to your Nextflow configuration (e.g., $HOME/.nextflow/config). Make sure Docker is running and you are logged in to Docker Hub.
Input
There are two input options.
1. An ID along with a folder of fasta or fastq files, optionally gzipped. (--raw and --id)
2. A two-column text file, where the first column is an ID and the second column is a path to a fasta or fastq file (--map). Each ID may have multiple rows. The paths to the files may be absolute or relative, but the files must be in the same directory as the map file or under it. If using relative paths, the paths must start with the _parent_ folder of the map file.
Option 1 is more efficient with respect to disk space.
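As a rough illustration of option 2, the sketch below reads a map file and groups the file paths by ID. It assumes the two columns are separated by whitespace such as a tab (the delimiter is not specified here), and the function name `read_map` is hypothetical; it is not part of KPI.

```python
from collections import defaultdict
from pathlib import Path


def read_map(map_path):
    """Group file paths by ID from a two-column map file.

    Assumption: the columns are separated by whitespace (e.g. a tab).
    """
    ids_to_files = defaultdict(list)
    lines = Path(map_path).read_text().splitlines()
    for line_no, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(None, 1)
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected an ID and a path")
        sample_id, file_path = fields[0], fields[1].strip()
        # An ID may appear on several rows, so collect its paths in a list.
        ids_to_files[sample_id].append(file_path)
    return dict(ids_to_files)
```

Because each ID may have multiple rows, every ID maps to a list of paths rather than a single path.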
Output
For each input ID, an output text file will be created named '<ID>_prediction.txt'. Each ID's output file contains a header line and a second line with the haplotype pair predictions and gene predictions.
The two haplotypes within a pair are separated by '+'. If the prediction is ambiguous, each pair of haplotypes is separated by '|'. For example,
'cA01~tA01+cA01~tB01|cA01~tA01+cB05~tB01|cA01~tB01+cB05~tA01' means haplotypes
'cA01~tA01 and cA01~tB01' or 'cA01~tA01 and cB05~tB01' or 'cA01~tB01 and cB05~tA01'.
The reference haplotypes are defined at https://github.com/droeatumn/kpi/blob/master/input/haps.txt
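The separator rules for a haplotype pair prediction can be sketched in a few lines of Python. This is only an illustration of the '|' and '+' convention described above; `parse_prediction` is a hypothetical helper, not part of KPI, and the haplotype names are written with a plain '~'.

```python
def parse_prediction(prediction):
    """Split a prediction string into a list of haplotype pairs.

    '|' separates alternative pairs; '+' separates the two haplotypes in a pair.
    """
    return [tuple(pair.split("+")) for pair in prediction.strip().split("|")]


example = "cA01~tA01+cA01~tB01|cA01~tA01+cB05~tB01|cA01~tB01+cB05~tA01"
for first, second in parse_prediction(example):
    print(f"{first} and {second}")
# prints:
# cA01~tA01 and cA01~tB01
# cA01~tA01 and cB05~tB01
# cA01~tB01 and cB05~tA01
```

An unambiguous prediction contains no '|' and therefore yields a list with a single pair.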
Running
Use 'raw' to indicate the input directory, and 'output' to indicate the
directory to put the output. The defaults are 'raw' and 'output' under the
location where KPI was pulled.
Use 'filetype' to indicate the input type; the default is 'fq' (FASTQ).
f<a/q/m/bam/kmc> - input in FASTA format (fa), FASTQ format (fq), multi-FASTA (fm), BAM (fbam), or KMC (fkmc); default: fq (FASTQ)
Option 1: Provide an ID (--id) and a folder (--raw) with its raw data
./main.nf --id ID --raw inDir --output outDir --filetype fq
e.g., ./main.nf --id id1 --raw ~/input --output ~/output
Option 2: Provide a file with a map (--map) from IDs to their raw data
./main.nf --map mapFile.txt --output outDir --filetype fq
e.g., ./main.nf --map ~/input/idstoRaw.txt --output ~/output
In this example, the paths to the files in idstoRaw.txt are somewhere under ~/input/.
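When many runs are scripted, the two command forms above can be assembled programmatically. The sketch below is a minimal, hypothetical helper (`build_command` is not part of KPI). It uses only the parameters shown above and assumes that --raw and --map are mutually exclusive, which follows from the two input options but is not enforced by this document.

```python
import shlex

# The filetype values listed above.
FILETYPES = {"fa", "fq", "fm", "fbam", "fkmc"}


def build_command(output, filetype="fq", sample_id=None, raw=None, map_file=None):
    """Return the argument list for one main.nf run (option 1 or option 2)."""
    if filetype not in FILETYPES:
        raise ValueError(f"unknown filetype: {filetype}")
    # Assumption: exactly one of --raw and --map is given.
    if (raw is None) == (map_file is None):
        raise ValueError("give exactly one of raw or map_file")
    if raw is not None and sample_id is None:
        raise ValueError("option 1 needs an ID together with raw")
    cmd = ["./main.nf"]
    if sample_id is not None:
        cmd += ["--id", sample_id]
    if raw is not None:
        cmd += ["--raw", raw]
    else:
        cmd += ["--map", map_file]
    cmd += ["--output", output, "--filetype", filetype]
    return cmd


print(shlex.join(build_command("outDir", sample_id="ID", raw="inDir")))
print(shlex.join(build_command("outDir", map_file="mapFile.txt")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting problems with paths that contain spaces.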
The following examples use data in the image, so no input is required.
Example 1: cA01~tA01+cB01~tB01 with --raw.
Run the following command for an example of interpreting synthetic reads created from sequences with GenBank IDs KP420439 and KP420440 (https://www.ncbi.nlm.nih.gov/nuccore/KP420439 and https://www.ncbi.nlm.nih.gov/nuccore/KP420440). These two haplotypes contain all the genes except KIR2DS5, so the haplotype predictions are very ambiguous.
./main.nf --id ex1 --raw ~/git/kpi/input/example1 --output ~/output
To run another example, replace 'example1' with 'example2'.
Example 2: cA01~tA01+cA01~tB01 with --map and --id.
Run the following command for an example of interpreting synthetic reads created from sequences with GenBank IDs KP420439 and KU645197 (https://www.ncbi.nlm.nih.gov/nuccore/KP420439 and https://www.ncbi.nlm.nih.gov/nuccore/KU645197).
./main.nf --id ex2 --map ~/git/kpi/input/example2/example2.txt --output ~/output
To run another example, replace 'example2' with 'example1'.
Example 3: combine Example 1 and 2 with --map and --id.
./main.nf --id ex12 --map ~/git/kpi/input/example1-2.txt --output ~/output
Miscellaneous
Hardware
For targeted sequencing, KPI requires approximately 4 CPUs, 8 GB of RAM, and 20 GB of disk space. For WGS, it requires around 13 CPUs, 16 GB of RAM in total, and 100 GB of temporary disk space.
Raw data
The software assumes average coverage for both chromosomes is less than 255. If this is not the case for your data, please downsample before running. Support for high coverage data is a future enhancement.
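If you already have an estimate of average coverage, the fraction of reads to keep can be worked out with simple arithmetic. The sketch below is an illustration only: `downsample_fraction` and its `margin` safety factor are hypothetical, and the coverage estimate and the downsampling tool are left to you.

```python
def downsample_fraction(average_coverage, limit=255, margin=0.9):
    """Return the fraction of reads to keep so coverage falls below the limit.

    margin is a hypothetical safety factor: when downsampling is needed,
    the target coverage is limit * margin rather than the limit itself.
    """
    if average_coverage <= 0:
        raise ValueError("coverage must be positive")
    if average_coverage < limit:
        return 1.0  # already below the limit, keep every read
    return (limit * margin) / average_coverage


print(downsample_fraction(100))  # below the limit, so nothing is dropped
print(downsample_fraction(510))  # keep a bit less than half of the reads
```

The resulting fraction can then be given to whichever read-sampling tool you normally use.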
Containers
To run without a container, use the --nocontainer parameter. To use a
container other than the default (droeatumn/kpi:latest), use the --container parameter.
To run in a self-contained environment with the --id parameter, replace 'inDir' and 'outDir' in the following command:
docker run --rm -it -v inDir:/opt/kpi/raw/ -v outDir:/opt/kpi/output/ droeatumn/kpi:latest /opt/kpi/main.nf --id
Or
docker run --rm -it -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --id ex1 --raw /opt/kpi/input/example1/
Or
docker run --rm -it -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --map /opt/kpi/input/example1/example1.txt
Or, if your BAM file (for one individual) is stored locally in ~/data:
docker run --rm -it -v ~/data:/opt/kpi/raw -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --filetype fbam --id testid
Or, if you have a map to BAM files stored locally within ~/data:
docker run --rm -it -v ~/data:/opt/kpi/raw -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --filetype fbam --map /opt/kpi/raw/map.txt
Citation
Roe D, Kuang R. Accurate and Efficient KIR Gene and Haplotype Inference From Genome Sequencing Reads With Novel K-mer Signatures. Front Immunol (2020) 11:583013. (https://doi.org/10.3389/fimmu.2020.583013)