output formats

annotation file headers
in-memory annotation dictionary keys
partition file headers

The annotate action writes a single csv file with annotations for each sequence in the input. The partition action, on the other hand, writes two csv files: one with a list of the most likely partitions and their relative likelihoods, and another with annotations for each cluster in the most likely partition. This cluster annotation file is by default written to the same path as the partition file, but with -cluster-annotations inserted before the suffix (you can change this with --cluster-annotation-fname). These two files will eventually be combined into a single yaml file.

If you just want to view an ascii representation of the results, use the partis view-annotations and view-partitions actions. If you want to directly access the csv columns, the partition and annotation headers are listed below. While by default a fairly minimal set of annotation information is written to file, many more keys are present in the dictionary in memory. Any of these keys, together with several additional ones, can be added to the output file by setting --extra-annotation-columns (for all the choices, see partis annotate --help|grep -C5 extra-annotation).

If, on the other hand, you need to do some additional calculations, there is a lot of existing code to facilitate this. To get you started, there are simple example parsing scripts for annotation and partition output files. If more detailed calculations are necessary, you can of course just add the necessary code to the example scripts.

annotation file headers

The annotation csv contains the following columns by default:

name	description
unique_ids	colon-separated list of sequence identification strings
v_gene	V gene in most likely annotation
d_gene	see v_gene
j_gene	see v_gene
cdr3_length	CDR3 length of most likely annotation (IMGT scheme, i.e. including both codons in their entirety)
mut_freqs	colon-separated list of sequence mutation frequencies
input_seqs	colon-separated list of input sequences, with constant regions (fv/jf insertions) removed
naive_seq	naive (unmutated ancestor) sequence corresponding to most likely annotation
v_3p_del	length of V 3' deletion in most likely annotation
d_5p_del	see v_3p_del
d_3p_del	see v_3p_del
j_5p_del	see v_3p_del
v_5p_del	length of an "effective" V 5' deletion in the most likely annotation, corresponding to a read which does not extend through the entire V segment
j_3p_del	see v_5p_del
vd_insertion	sequence of nucleotides corresponding to the non-templated insertion between the V and D segments
dj_insertion	sequence of nucleotides corresponding to the non-templated insertion between the D and J segments
fv_insertion	constant region on the 5' side of the V (accounts for reads which extend beyond the 5' end of V)
jf_insertion	constant region on the 3' side of the J (accounts for reads which extend beyond the 3' end of J)
codon_positions	zero-indexed V and J conserved codon positions in the indel-reversed sequence[s]
mutated_invariants	true if the conserved codons corresponding to the start and end of the CDR3 code for the same amino acid as in their original germline (cyst and tryp/phen, in IMGT numbering)
in_frames	true if the net effect of VDJ rearrangement and SHM indels leaves both the start and end of the CDR3 (IMGT cyst and tryp/phen) in frame with respect to the start of the germline V sequence
stops	true if there's a stop codon in frame with respect to the start of the germline V sequence
v_per_gene_support	approximate probability supporting the top V gene matches, as a semicolon-separated list of colon-separated gene:probability pairs (approximate: monotonically related to the actual probability, but not exactly one-to-one)
d_per_gene_support	see v_per_gene_support
j_per_gene_support	see v_per_gene_support
indel_reversed_seqs	colon-separated list of input sequences with indels "reversed" (i.e. undone), and with constant regions (fv/jf insertions) removed. Empty string if there are no indels, i.e. if it's the same as 'input_seqs'
gl_gap_seqs	colon-separated list of germline sequences with gaps at shm indel positions (alignment matches qr_gap_seqs)
qr_gap_seqs	colon-separated list of query sequences with gaps at shm indel positions (alignment matches gl_gap_seqs)
duplicates	colon-separated list of "duplicate" sequences for each sequence, i.e. sequences which, after trimming fv/jf insertions, were identical and were thus collapsed.

while the following (together with any in-memory key) can be added using --extra-annotation-columns:

name	description
cdr3_seqs	nucleotide CDR3 sequence, including bounding conserved codons
full_coding_naive_seq	in cases where the input reads do not extend through the entire V and J regions, the input_seqs and naive_seq keys will also not cover the whole coding regions. In such cases full_coding_naive_seq and full_coding_input_seqs can be used to tack on the missing bits.
full_coding_input_seqs	see full_coding_naive_seq

All columns listed as "colon-separated lists" are trivial/length one for single sequence annotation, i.e. are only length greater than one (contain actual colons) when the multi-hmm has been used for simultaneous annotation on several clonally-related sequences (typically in the cluster annotation file from partitioning, but can also be set using --n-simultaneous-seqs and --simultaneous-true-clonal-seqs).

deprecated keys (only present in old files):

column header	description
indelfos	colon-separated list of information on any SHM indels that were inferred in the Smith-Waterman step

in-memory annotation dictionary keys

Extra information available in the dictionary in memory, but not written to disc by default (can be written by setting --extra-annotation-columns):

key	value
codon_positions	zero-indexed indel-reversed-sequence positions of the conserved cyst and tryp/phen codons, e.g. `{'v': 285, 'j': 336}`
v_gl_seq	portion of v germline gene aligned to the indel-reversed sequence (i.e. with 5p and 3p deletions removed). Colon-separated list.
d_gl_seq	see v_gl_seq
j_gl_seq	see v_gl_seq
v_qr_seqs	porition of indel-reversed sequence aligned to the v region. Colon-separated list.
d_qr_seqs	see v_qr_seqs
j_qr_seqs	see v_qr_seqs
lengths	lengths aligned to each of the v, d, and j regions, e.g. `{'j': 48, 'd': 26, 'v': 296}`
regional_bounds	indices corresponding to the boundaries of the v, d, and j regions (python slice conventions), e.g. `{'j': (322, 370), 'd': (296, 322), 'v': (0, 296)}`
aligned_v_seqs	colon-separated list of indel-reversed sequences aligned to germline sequences given by `--aligned-germline-fname`. Only used for presto output
aligned_d_seqs	see aligned_v_seqs
aligned_j_seqs	see aligned_v_seqs
invalid	indicates an invalid rearrangement event

partition file headers

The partition action writes a list of partitions, with one line for the most likely partition (with the lowest logprob), as well as a number of lines for the surrounding less-likely partitions. It also writes the annotation for each cluster in the most likely partition to a separate file (by default <--outfname>.replace('.csv', '-cluster-annotations.csv'), you can change this file name with --cluster-annotation-fname).

column header	description
logprob	Total log probability of this partition
n_clusters	Number of clusters (clonal families in this partition)
partition	String representing the clusters, where clusters are separated by ; and sequences within clusters by :, e.g. 'a:b;c:d:e'
n_procs	Number of processes which were simultaneously running for this clusterpath. In practice, final output is usually only written for n_procs = 1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

output-formats.md

output-formats.md

output formats

annotation file headers

in-memory annotation dictionary keys

partition file headers

Files

output-formats.md

Latest commit

History

output-formats.md

File metadata and controls

output formats

annotation file headers

in-memory annotation dictionary keys

partition file headers