Output format

Jim Shaw edited this page Oct 13, 2024 · 2 revisions

Output folder directory structure

In the output folder (specified by the -o option, or devider_output by default), the following files will be available.

devider_output/
  |
  |--- snp_haplotypes.fasta <- multiple sequence alignment of SNPs
  |
  |--- majority_vote_haplotypes.fasta <- base-level haplotypes 
  |
  |--- ids.txt <- assignment of reads to haplotypes
  |
  |--- hap_info.txt <- more information about haplotypes
  |
  |--- intermediate/ <- files for debugging (not important)
  |
  |--- pipeline_files/ <- bam + vcf files (only present if using run_devider_pipeline)

snp_haplotypes.fasta - sequences of SNPs as a multiple sequence alignment

>Contig:OR483991.1,Range:ALL-ALL,Haplotype:0,Abundance:5.52,Depth:8.43
11000110011010000110000100011011011000011000001110101101011121000100010010110100
1110100100011010011111111010000110010---------

>Contig:OR483991.1,Range:ALL-ALL,Haplotype:1,Abundance:9.27,Depth:14.17
10010110000010010110111010111000011111000001011110110001100001000001100000100110
1011100110011011111000010000111110011101010010

This is a valid multiple sequence alignment in fasta format.

  1. The > line contains haplotype information delimited by commas.
  • Contig: represents the contig identifier
  • Range: indicates the coordinates that were haplotyped, e.g., 3000-6000. ALL-ALL indicates no coordinates specified.
  • Haplotype: is a haplotype identifier starting from 0.
  • Abundance: indicates the normalized depth (grouped by Contig and Range) times 100.
  • Depth: is the approximate depth of coverage; will be underestimated slightly if reads are erroneous
  2. The characters 0, 1, ... represent the reference allele (0) or an alternate allele (1, 2, ...) within this haplotype. A - indicates that this SNP is not covered by reads within the haplotype. The base position of each SNP is given in the hap_info.txt file.
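
The header fields described above can be pulled apart with a few lines of Python. This is a minimal sketch rather than part of devider; the function name `parse_header` is hypothetical, and it assumes the contig name contains no commas or whitespace.

```python
def parse_header(line: str) -> dict:
    """Split a devider FASTA header into its comma-delimited Key:Value fields."""
    # Drop the leading '>' and any trailing description (e.g. " SimpleConsensus").
    name = line.lstrip(">").split()[0]
    # Assumption: the contig name itself contains no commas.
    info = dict(field.split(":", 1) for field in name.split(","))
    info["Haplotype"] = int(info["Haplotype"])
    info["Abundance"] = float(info["Abundance"])
    info["Depth"] = float(info["Depth"])
    return info


header = ">Contig:OR483991.1,Range:ALL-ALL,Haplotype:0,Abundance:5.52,Depth:8.43"
info = parse_header(header)
print(info["Contig"], info["Range"], info["Haplotype"], info["Abundance"], info["Depth"])
```

Because the split on `:` happens only once per field, a contig name that itself contains a colon would still parse correctly.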

Tip

Use --allele-output to output the actual base-level alleles instead of 0 or 1. This output can be fed into MSA visualizers.

You can also use this output to build a phylogenetic tree, but you may have to change the ids because : and , are not valid in the sequence names accepted by many tree-building programs.
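
One possible way to rename the ids is to replace every character outside a conservative set with an underscore. This is a sketch only; the helper name `tree_safe_id` and the choice of allowed characters are assumptions, so check what your tree-building program accepts.

```python
import re


def tree_safe_id(header: str) -> str:
    """Replace characters outside [A-Za-z0-9_.-] with underscores."""
    # Keep only the id itself: no leading '>' and no trailing description.
    name = header.lstrip(">").split()[0]
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)


safe = tree_safe_id(">Contig:OR483991.1,Range:ALL-ALL,Haplotype:0,Abundance:5.52,Depth:8.43")
print(safe)
# Contig_OR483991.1_Range_ALL-ALL_Haplotype_0_Abundance_5.52_Depth_8.43
```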

majority_vote_haplotypes.fasta - base-level consensus sequence for haplotypes

>Contig:OR483991.1,Range:ALL-ALL,Haplotype:0,Abundance:5.51839673547632,Depth:8.431375873903942 SimpleConsensus
....NNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAATTCGGCTAAGGCCANGGGGACGTNNAAAATATCAACTAAAACATT...

This is a fasta file (but not an MSA) representing base-level haplotypes. The bases are obtained by taking the majority base at each position according to the alignment against the reference.

  1. The header line is the same as in snp_haplotypes.fasta.
  2. N is output if the coverage at the base is < --min-cov OR the fraction of bases supporting the majority base is < --n-fraction.
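
The N rule can be illustrated for a single alignment column as follows. This is an illustrative sketch of the rule as stated above, not devider's actual implementation; the function name `consensus_base` and the threshold values in the example calls are hypothetical and are not devider's defaults.

```python
from collections import Counter


def consensus_base(pileup: str, min_cov: int, n_fraction: float) -> str:
    """Return the majority base of one pileup column, or 'N' if support is too weak."""
    # Too few reads cover this position (an empty column is always 'N').
    if not pileup or len(pileup) < min_cov:
        return "N"
    base, count = Counter(pileup).most_common(1)[0]
    # The majority base is not supported by a large enough fraction of reads.
    if count / len(pileup) < n_fraction:
        return "N"
    return base


# Hypothetical thresholds, chosen only for illustration.
print(consensus_base("AAAT", min_cov=3, n_fraction=0.6))  # A (3 of 4 reads agree)
print(consensus_base("AT", min_cov=3, n_fraction=0.6))    # N (coverage below min_cov)
print(consensus_base("AATT", min_cov=3, n_fraction=0.6))  # N (majority fraction is 0.5)
```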

ids.txt - assignments of reads to haplotypes

Contig:OR483991.1       Range:ALL-ALL   Haplotype:0     61440ba2-e383-ee56-9dcb-d15b0797ea01    8f58b524-0a68-aea0-447a-dd5d2d68925d    79c785fe-1e86-29ed-d496-b003898b91d6
Contig:OR483991.1       Range:ALL-ALL   Haplotype:1     41178954-b99c-02ed-164f-45d7e1b37bfd    34bf4b42-b8ef-2e30-7ae9-14e85e0a5395    9a558a06-f0c8-66c9-8033-8d60ae795ddd
...

This is a tab-delimited file.

  1. First column indicates contig.
  2. Second column indicates the range.
  3. Third column indicates the haplotype id.
  4. Fourth column to last column are identifiers of reads assigned to this haplotype.

Warning

A read may be assigned to more than one contig or haplotype (e.g., through supplementary alignments across contigs).
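
A small sketch of loading ids.txt into a read-to-haplotype lookup is shown here. Since a read may appear under more than one haplotype, each read maps to a list. The function name `parse_ids` and the read ids in the sample are hypothetical placeholders.

```python
from collections import defaultdict


def parse_ids(lines):
    """Map each read id to the (contig, range, haplotype) groups it is assigned to."""
    read_to_haps = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # skip blank or malformed lines
        # Strip the "Contig:", "Range:" and "Haplotype:" prefixes.
        contig, rng, hap = (c.split(":", 1)[1] for c in cols[:3])
        for read_id in cols[3:]:
            read_to_haps[read_id].append((contig, rng, int(hap)))
    return read_to_haps


# Placeholder rows in the ids.txt layout; read_b is assigned to two haplotypes.
sample = [
    "Contig:OR483991.1\tRange:ALL-ALL\tHaplotype:0\tread_a\tread_b\n",
    "Contig:OR483991.1\tRange:ALL-ALL\tHaplotype:1\tread_b\tread_c\n",
]
assignments = parse_ids(sample)
print(assignments["read_b"])
```

To use it on real output, pass an open file handle instead of the sample list, e.g. `parse_ids(open("devider_output/ids.txt"))`.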

Tip

If you want to haplotag your BAM file (i.e., add HP:i tags to the reads) for visualization, use the haplotag_bam script included in the conda install (or in the scripts/ folder).

hap_info.txt - information about SNPs and haplotypes

Contig:OR483991.1,Range:ALL-ALL Haplotype:0     Haplotype:1     Haplotype:2     Haplotype:3     Haplotype:4     Haplotype:5
286     1:0.83  1:1.00  1:0.80  1:1.00  1:1.00  1:1.00
322     1:1.00  0:1.00  1:1.00  1:1.00  1:0.93  1:1.00
476     0:1.00  0:1.00  0:1.00  0:1.00  0:1.00  1:0.89
491     0:1.00  1:0.52  0:1.00  0:1.00  1:0.96  0:1.00
726     0:1.00  0:1.00  0:1.00  0:0.95  1:0.91  1:1.00
756     1:1.00  1:1.00  1:1.00  1:0.85  1:0.98  0:1.00

This is a tab-delimited file (a TSV). The first line is a header. Each subsequent line, e.g. 286 1:0.83 1:1.00 1:0.80 1:1.00 1:1.00 1:1.00, is interpreted as:

  1. 286 - base-level location of the first SNP.
  2. 1:0.83 - For Haplotype:0, 83% of the reads support the 1 allele -- i.e., the first alternate allele.
  3. 1:1.00 - For Haplotype:1, 100% of the reads support the 1 allele.
  4. And so forth
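
Reading one SNP row of hap_info.txt could look like the sketch below. It assumes every cell has the allele:fraction form shown above; how SNPs that are not covered in a haplotype are written in this file is not documented here, so such cells may need extra handling. The function name `parse_hap_info_row` is hypothetical.

```python
def parse_hap_info_row(row: str):
    """Parse one SNP row into (position, [(allele, support_fraction), ...])."""
    cols = row.split()
    position = int(cols[0])
    calls = []
    for cell in cols[1:]:
        # Assumption: each cell looks like "1:0.83" (allele, then supporting fraction).
        allele, support = cell.split(":")
        calls.append((allele, float(support)))
    return position, calls


pos, calls = parse_hap_info_row("286\t1:0.83\t1:1.00\t1:0.80\t1:1.00\t1:1.00\t1:1.00")
print(pos, calls[0])  # 286 ('1', 0.83)
```

The list index of each call matches the haplotype id in the header line, so `calls[0]` belongs to Haplotype:0.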

pipeline_files/ - folder with BAM + VCF files from run_devider_pipeline

This folder is present only if run_devider_pipeline was used. It contains the BAM files produced by aligning your input reads to the reference with minimap2 (default minimap2 parameters), as well as the VCF file produced by LoFreq (--B parameter used). You can rerun devider on these files:

devider -b output/pipeline_files/mapping.bam -v output/pipeline_files/lofreq.vcf.gz -r reference.fa [other options]