Subcommand: unchunkify
Unchunkify a set of jplace files using abundance map files and create per-sample jplace files.
Usage: gappa prepare unchunkify [options]
Input | |
---|---|
--abundances-path | Required. TEXT:PATH(existing)=[] ... List of abundances files or directories to process. For directories, only files with the extension .json[.gz] are processed. |
--jplace-path | TEXT:PATH(existing)=[] ... Excludes: --chunk-list-file, --chunk-file-expression. List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |
--sequence-path | TEXT:PATH(existing)=[] ... List of sequence files or directories to process. For directories, only files with the extension .(fasta\|fas\|fsa\|fna\|ffn\|faa\|frn\|phylip\|phy)[.gz] are processed. |
--chunk-list-file | TEXT Excludes: --jplace-path, --chunk-file-expression. If provided, needs to contain a list of chunk file paths in the numerical order that was produced by the chunkify command. |
--chunk-file-expression | TEXT Excludes: --jplace-path, --chunk-list-file. If provided, the expression is used to load jplace files by replacing any '@' character with the chunk number. |

Settings | |
---|---|
--jplace-cache-size | UINT=0 Cache size to determine how many jplace files are kept in memory. Default (0) means all. Use this if the command runs out of memory. It however comes at the cost of longer runtime. |
--hash-function | TEXT:{SHA1,SHA256,MD5}=SHA1 Hash function that was used for re-naming and identifying sequences in the chunkify command. |

Output | |
---|---|
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |

Global Options | |
---|---|
--allow-file-overwriting | FLAG Allow to overwrite existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |

The command reverses the effects of the chunkify command (see there for details on the workflow). That is, it takes the abundance map files and the per-chunk placement files as input, and creates a placement file for each of the original input sequence files, with all abundances and sequence names correctly restored. The command is thus one of the steps of our data preprocessing pipeline for phylogenetic placements as described here.
The easiest way to input the placement files to the command is the --jplace-path option, which takes a list of files or a directory containing .jplace files. This option works in all cases, including when sequences were moved between chunks or when chunks were merged later, because it identifies sequences solely by their hash names.
Optionally, when sequence files containing the chunked data (with hashed sequence names) are supplied to --sequence-path, per-sample sequence files are additionally written to the output folder.
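As a rough sketch of how the input options fit together, the following Python helper assembles an argument list from the options documented above and refuses combinations that the option table marks as mutually exclusive. The helper name `build_unchunkify_command` and all paths are hypothetical; only the flag names are taken from this page.

```python
def build_unchunkify_command(abundances, jplace_paths=None, chunk_list_file=None,
                             chunk_file_expression=None, sequence_paths=None,
                             out_dir="."):
    """Assemble an argument list for `gappa prepare unchunkify` (hypothetical helper)."""
    # --jplace-path, --chunk-list-file and --chunk-file-expression exclude each other.
    given = [opt for opt in (jplace_paths, chunk_list_file, chunk_file_expression)
             if opt is not None]
    if len(given) > 1:
        raise ValueError("Use only one of --jplace-path, --chunk-list-file, "
                         "--chunk-file-expression")

    cmd = ["gappa", "prepare", "unchunkify", "--abundances-path", *abundances]
    if jplace_paths is not None:
        cmd += ["--jplace-path", *jplace_paths]
    if chunk_list_file is not None:
        cmd += ["--chunk-list-file", chunk_list_file]
    if chunk_file_expression is not None:
        cmd += ["--chunk-file-expression", chunk_file_expression]
    if sequence_paths is not None:
        # Optional: also write per-sample sequence files.
        cmd += ["--sequence-path", *sequence_paths]
    cmd += ["--out-dir", out_dir]
    return cmd


# Example with made-up directory names.
print(build_unchunkify_command(["abundances/"], jplace_paths=["placements/"],
                               sequence_paths=["chunks/"], out_dir="samples/"))
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`, assuming gappa is installed and on the PATH.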
For large datasets, using the --jplace-path option might need too much memory, as all files have to be scanned for the sequence hash names first. This is necessary if the jplace files do not correspond exactly to the chunk files. However, if each jplace file was created from one chunk file, there is no need to scan for hashes in other files. Thus, we offer two memory- and time-saving alternatives:
The --chunk-list-file option takes a file that contains one jplace file path per line, in the order of the original chunks. For example, suppose the original sequence files were split into 13 chunks, chunk_0.fasta to chunk_12.fasta, by the chunkify command. Each of them was then placed on the reference tree, producing 13 jplace files. The list file could then look like this:
/path/to/chunk_0/result_0.jplace
/path/to/chunk_1/result_1.jplace
/path/to/chunk_2/result_2.jplace
/path/to/chunk_3/result_3.jplace
/path/to/chunk_4/result_4.jplace
/path/to/chunk_5/result_5.jplace
/path/to/chunk_6/result_6.jplace
/path/to/chunk_7/result_7.jplace
/path/to/chunk_8/result_8.jplace
/path/to/chunk_9/result_9.jplace
/path/to/chunk_10/result_10.jplace
/path/to/chunk_11/result_11.jplace
/path/to/chunk_12/result_12.jplace
That is, each line contains a path, in the original order of the chunks. Then, in order to create the placement entry for a sequence, the number n of the chunk in which the sequence was "chunkified" is used to find the correct jplace file by using the file in the n-th line of the list.
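The list file and the lookup it enables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `write_chunk_list` and `jplace_for_chunk` are hypothetical, the paths mirror the example above, and chunk numbers are assumed to start at 0 so that chunk n corresponds to the list entry at index n.

```python
from pathlib import Path
import tempfile


def write_chunk_list(list_path, num_chunks,
                     template="/path/to/chunk_{n}/result_{n}.jplace"):
    """Write one jplace path per line, in chunk order (hypothetical helper)."""
    lines = [template.format(n=n) for n in range(num_chunks)]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def jplace_for_chunk(list_path, chunk_number):
    """Return the jplace path for a chunk, assuming zero-based chunk numbers."""
    paths = Path(list_path).read_text().splitlines()
    return paths[chunk_number]


with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "chunk_list.txt"
    write_chunk_list(list_file, 13)
    print(jplace_for_chunk(list_file, 12))  # /path/to/chunk_12/result_12.jplace
```

Because the position in the file is the only link between a chunk and its jplace file, a missing or reordered line would silently point sequences at the wrong file, so generating the list programmatically is less error-prone than writing it by hand.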
Alternatively, if the per-chunk jplace files are named as straightforwardly as above, that is, the file names are simply numbered, it is also possible to use an expression instead of the list file, where the @ character is used as a placeholder for the chunk number:
--chunk-file-expression /path/to/chunk_@/result_@.jplace
This has the same effect as using the list file.
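The substitution described in the option table, replacing any '@' character with the chunk number, can be mimicked in Python as below. The function name `expand_chunk_expression` is hypothetical and the snippet is a sketch of the idea, not gappa's implementation.

```python
def expand_chunk_expression(expression, chunk_number):
    """Replace every '@' in the expression with the chunk number."""
    return expression.replace("@", str(chunk_number))


expression = "/path/to/chunk_@/result_@.jplace"

# Expanding the expression for chunks 0..12 reproduces the 13 example paths.
expanded = [expand_chunk_expression(expression, n) for n in range(13)]
print(expanded[0])   # /path/to/chunk_0/result_0.jplace
print(expanded[12])  # /path/to/chunk_12/result_12.jplace
```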
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767