Uses a compressed de Bruijn graph (implemented in GATB) to count unitigs in bacterial populations.
This is a slightly modified version of the unitig and graph steps in DBGWAS software, repurposed for input into pyseer.
NB: We cannot offer support for unitig-counter; it is provided 'as-is'. Please consider using unitig-caller instead, which offers the same functionality.
If you use this, please cite the DBGWAS paper:
Jaillard M., Lima L. et al. A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events. PLOS Genetics. 14, e1007758 (2018). doi:10.1371/journal.pgen.1007758.
- Changes the output format of step1 from a bugwas matrix to pyseer input (Rtab or kmers).
- Removes all code for step2 and step3 in DBGWAS.
- Removes unused dependencies.
- Changes the installation procedure ready for bioconda.
Recommended installation is through conda:
conda install unitig-counter
If the package cannot be found, ensure your channels are set up correctly for bioconda.
For compilation from source, see INSTALL.md.
Run:
unitig-counter -strains strain_list.txt -output output -nb-cores 4
Where strain_list.txt is a list of input files (assemblies) with a header, for example:
ID Path
6925_1_49 assemblies/6925_1#49.contigs_velvet.fa
6925_1_50 assemblies/6925_1#50.contigs_velvet.fa
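If the assemblies sit in one directory, the list can be generated rather than typed by hand. The following is a minimal Python sketch and not part of unitig-counter: the helper name build_strain_list, the tab separator, and the rule of deriving the ID from the file name (text before the first dot, with '#' replaced by '_') are assumptions chosen to match the example above.

```python
from pathlib import PurePosixPath


def build_strain_list(paths):
    """Return the text of a strain list: a header line, then one ID/path pair per line."""
    lines = ["ID\tPath"]  # assumption: columns are tab-separated
    for path in sorted(paths):
        # Assumption: the ID is the file name up to the first dot,
        # with '#' replaced by '_' (as in the example IDs above).
        name = PurePosixPath(path).name
        strain_id = name.split(".", 1)[0].replace("#", "_")
        lines.append(f"{strain_id}\t{path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [
        "assemblies/6925_1#50.contigs_velvet.fa",
        "assemblies/6925_1#49.contigs_velvet.fa",
    ]
    # Print to stdout; redirect to strain_list.txt to use it with -strains.
    print(build_strain_list(example), end="")
```

Check that the generated IDs are unique before running unitig-counter, since two files whose names share the same prefix would collide under this naming rule.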
Output is in output/unitigs.txt and can be used with --kmers in pyseer. You can also test just the unique patterns in output/unitigs.unique_rows.txt with the --Rtab option.
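Before running an association, it can be useful to inspect how common each unitig is. The sketch below is a hedged Python example, not part of unitig-counter or pyseer. It assumes each line of output/unitigs.txt has the layout "sequence | sample1:count sample2:count ..." that pyseer accepts as k-mer input; check a few lines of your own output before relying on it. The function names are hypothetical.

```python
def parse_unitig_line(line):
    """Split one line into (sequence, list of sample names)."""
    # Assumption: the sequence and the sample list are separated by '|'.
    sequence, _, samples_field = line.strip().partition("|")
    # Assumption: each entry looks like 'sample:count'; keep only the sample name.
    samples = [entry.rsplit(":", 1)[0] for entry in samples_field.split()]
    return sequence.strip(), samples


def unitig_frequencies(lines, n_strains):
    """Return {sequence: fraction of strains in which the unitig was found}."""
    freqs = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sequence, samples = parse_unitig_line(line)
        freqs[sequence] = len(set(samples)) / n_strains
    return freqs


if __name__ == "__main__":
    toy = ["ACGT | 6925_1_49:1 6925_1_50:1", "TTTT | 6925_1_49:1"]
    for seq, freq in unitig_frequencies(toy, n_strains=2).items():
        print(seq, freq)
```

To use it on real output, pass an open file handle for output/unitigs.txt as lines and the number of strains in strain_list.txt as n_strains.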
Some unitigs in the output may span multiple input contigs. If you wish to restrict your unitig calls to those appearing in assembled contigs, you can either:
- Run unitig-caller on the input genomes, using the unitig calls from your run.
- Run the script in the gatb/bcalm package, which will cut unitigs that span multiple contigs.
Thanks to @rchikhi and @apredeus for discovering and fixing this.
To get the shortest sequence distance between two unitigs:
cdbg-ops dist --graph test_data/graph --source GTAATAAACAAA --target AAAAAAAAAAGTTAAAAAT
Short unitigs can be extended by following paths in the graph to neighbouring nodes. This can help map sequences which on their own are difficult to align specifically.
Create a file unitigs.txt with the unitigs to extend (probably your significantly associated hits) and run:
cdbg-ops extend --graph output/graph --unitigs unitigs.txt > extended.txt
The output extended.txt will contain possible extensions, comma separated, with lines corresponding to unitigs in the input. See the help for more options.
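Because the output lines follow the order of the input, the two files can be read side by side to see which extensions belong to which unitig. The following is a minimal Python sketch under two assumptions: unitigs.txt holds one unitig per line, and a line in extended.txt may be empty when no extension was found. The function name pair_extensions is hypothetical.

```python
def pair_extensions(unitig_lines, extended_lines):
    """Map each input unitig to its list of possible extensions."""
    pairs = {}
    # Line i of the extensions is taken to correspond to line i of the input.
    for unitig, ext_line in zip(unitig_lines, extended_lines):
        # Extensions are comma separated; drop empty fields.
        extensions = [e for e in ext_line.strip().split(",") if e]
        pairs[unitig.strip()] = extensions
    return pairs


if __name__ == "__main__":
    # Toy data standing in for unitigs.txt and extended.txt.
    unitigs = ["ACGT\n", "TTGA\n"]
    extended = ["AACGTT,ACGTC\n", "\n"]
    for unitig, extensions in pair_extensions(unitigs, extended).items():
        print(unitig, len(extensions), extensions)
```

With real files, open unitigs.txt and extended.txt and pass the two file handles in place of the toy lists.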
A similar Python script can be found in unitig-graph:
python unitig-graph/extend_hits.py --prefix output/graph --unitigs unitigs.txt > extended.txt