FindFlu is a bioinformatic tool designed to identify and characterize fragments of influenza A virus (IAV) genome from user-provided sequences. For each IAV genome fragment identified, FindFlu will report the segment of the IAV genome from which it originated, the subtype of that segment (for fragments of haemagglutinin and neuraminidase segments), and the length of the fragment. FindFlu identifies and characterizes fragments by aligning them to a database of IAV reference sequences and finding each fragment’s best-matching reference sequence.
As input, FindFlu expects a FASTA file containing pairs of consensus sequences describing both ends of distinct genomic fragments. The entries in this FASTA file must have headers that use the HopDropper (v1.0.0) format (https://github.com/KevinKuchinski/HopDropper).
FindFlu aligns input sequences to a database of annotated IAV reference sequences. Reference sequences must be in a FASTA file with the following header format:
>ref_seq_identifier|strain_name(strain_subtype)|segment|subtype|
Ref_seq_identifier is a unique identifier for the reference sequence, typically its GenBank accession number. Strain_name is the name of the strain (e.g. A/goose/Guangdong/1/1996), and strain_subtype is the HxNx subtype of the strain (regardless of which IAV genome segment the reference sequence represents). Segment is the IAV genome segment the reference sequence represents, and it must be encoded as one of the following: PB2, PB1, PA, HA, NP, NA, M, or NS. Subtype is the HA subtype (for HA segment sequences) or the NA subtype (for NA segment sequences). For internal segments, ‘none’ should be entered in the subtype field.
FindFlu aligns input sequences to the reference sequence database with blastn (v2.13.0). The following filtering is performed on the blastn alignments:
- Alignments are discarded if their percent identity is less than the minimum identity threshold (-I, default=90%).
- Alignments are discarded if their query coverage is less than the minimum coverage threshold (-c, default=95%).
- Each fragment end sequence can only align once to a reference sequence, otherwise all alignments between that fragment end and that reference sequence are discarded.
- Both ends of a fragment must align to a reference sequence, otherwise all alignments between either end of that fragment and that reference sequence are discarded.
- For each fragment-reference sequence pairing, one end must align in the plus sense and the other end must align in the minus sense.
The following steps are performed on the filtered blastn alignments to identify each fragment's best-matching reference sequences:
- A combined bitscore is calculated for each fragment-reference sequence pairing. The combined bitscore is determined by summing the bitscores for both fragment ends alignments to the reference sequence.
- The best reference sequences for each fragment are chosen based on the top n-ranked combined bitscores calculated in the previous step (-r, default=2).
- All best reference sequences for a fragment must share the same segment and subtype annotation, otherwise that fragment is discarded.
- The length of the fragment is estimated based on the distance between alignment coordinates for both fragment ends. If a fragment has multiple best reference sequences, a fragment length is estimated for each reference sequence, then the median fragment length is chosen.
Once the length of each fragment has been estimated, the consensus sequences for both ends of a fragment are merged into a single consensus sequence. If the fragment end sequences do not overlap each other, FindFlu inserts the appropriate number of N characters between them based on the distance between the two fragment ends and the number of positions sequenced at each fragment end. The merged fragment sequences are then re-aligned against the FindFlu database so that the number mismatches, insertions, and deletions between fragments and each of their best-matching reference sequences can be calculated.
FindFlu outputs 3 files to the directory specified at runtime by the -o parameter. File names are prepended with the analysis name provided at runtime by the -n parameter.
• flu_frags_seqs.fa: A FASTA file containing the IAV fragments identified by FindFlu. Each fragment is described by a single consensus sequence. Fragments inherit their FASTA headers from the input file, but with segment and subtype appended.
• top_ref_seqs_report.csv: This is FindFlu’s most detailed report. Each line describes the alignment of one fragment to one of its best-matching reference sequences. Columns are provided that indicate fragment length and the segment and subtype of the fragment/reference sequence. Alignment metrics (identity, coverage, mismatches, gaps, etc) are also provided.
• frag_report.csv: This report is a simplified version of the best reference sequences report; it is a line list wherein information about each fragment is summarized on one line.
Each line of this report describes the alignment of one fragment to one of its best-matching reference sequences. It contains the following columns:
• experiment: the name of the experiment to which this line's fragment belongs (extracted from the HopDropper header).
• lib_name: the name of the library in which this line's fragment was detected (extracted from the HopDropper header).
• frag_name: the fragment identifier assigned to this line's fragment by HopDropper (extracted from the HopDropper header).
• UMI_pair: the UMI pair describing this line's fragment (extracted from the HopDropper header). This can be used to identify the same fragment across different experiments (when it might have been assigned a different fragment identifier within its library).
• frag_copies: the number of times this line's fragment was sequenced (extracted from the HopDropper header).
• frag_length: The estimated length (in nucleotide positions) of this line's fragment based on this line's reference sequence.
• perc_sequenced: The estimated percentage of positions in the fragment that were sequenced (based on the fragment length estimate, the length of both fragment end sequences, and the length of any overlap between the fragment end sequences).
• ref_seq_name: the strain name of this line's reference sequence and its HxNx subtype in parentheses.
• ref_seq_length: the length of this line's reference sequence.
• segment: the IAV genome segment of this line's reference sequence (one of PB2, PB1, PA, HA, NP, NA, M, or NS).
• subtype: the HA or NA subtype of this line's reference sequence ("none" for internal segments).
• aligned: the number of positions in the fragment that aligned to this line's reference sequence.
• identical: the number of identical positions in the alignment between this line's fragment and reference sequence.
• mismatches: the number of mismatched positions in the alignment between this line's fragment and reference sequence (positions with Ns do NOT count as mismatches).
• deletions: the number of deletions in the this line's fragment when aligned this line's reference sequence.
• insertions: the number of insertions in the this line's fragment when aligned this line's reference sequence.
• ambiguous: the number of ambiguous positions in the alignment between this line's fragment and reference sequence.
• ambiguous: the number of ambiguous positions in the alignment between this line's fragment and reference sequence.
• perc_aligned: the percentage of positions in this line's fragment that aligned to this line's reference sequence.
• perc_identity: the percentage of positions in this line's fragment that aligned to this line's reference sequence and were identical.
• perc_ambiguous: the percentage of positions in this line's fragment that aligned to this line's reference sequence and were ambiguous.
• ref_seq_cov_sequenced: the percentage of positions in this line's reference sequence that were covered by this line's fragment and sequenced.
• ref_seq_cov_total: the percentage of positions in this line's reference sequence that were covered by this line's fragment including ambiguous positions and unsequenced positions between the fragment end sequences.
This report is a line list wherein information about each fragment is summarized on one line. It contains the following columns:
• experiment: the name of the experiment to which this line's fragment belongs (extracted from the HopDropper header).
• lib_name: the name of the library in which this line's fragment was detected (extracted from the HopDropper header).
• frag_name: the fragment identifier assigned to this line's fragment by HopDropper (extracted from the HopDropper header).
• UMI_pair: the UMI pair describing this line's fragment (extracted from the HopDropper header). This can be used to identify the same fragment across different experiments (when it might have been assigned a different fragment identifier within its library).
• frag_copies: the number of times this line's fragment was sequenced (extracted from the HopDropper header).
• segment: the IAV genome segment of this line's fragment (one of PB2, PB1, PA, HA, NP, NA, M, or NS).
• subtype: the HA or NA subtype of this line's fragment ("none" for internal segments).
• frag_length: The estimated length (in nucleotide positions) of this line's fragment (median of frag_length estimates based on all best reference sequences).
• perc_sequenced: The estimated percentage of positions in the fragment that were sequenced based on the fragment length estimate, the length of both fragment end sequences, and the length of any overlap between the fragment end sequences (median of perc_sequenced based on all best reference sequences).
• perc_aligned: the percentage of positions in this line's fragment that aligned to its best reference sequences (median of perc_aligned based on all best reference sequences).
• perc_identity: the percentage of positions in this line's fragment that aligned to its best reference sequences and were identical (median of perc_identity based on all best reference sequences).
• perc_ambiguous: the percentage of positions in this line's fragment that aligned to its best reference sequences and were ambiguous (median of perc_ambiguous based on all best reference sequences).
• ref_seq_cov_sequenced: the percentage of positions in the genome segment covered by this line's fragment and sequenced (median of ref_seq_cov_sequenced based on all best reference sequences).
• ref_seq_cov_total: the percentage of positions in the genome segment covered by this line's fragment including ambiguous positions and unsequenced positions between the fragment end sequences (median of ref_seq_cov_total based on all best reference sequences).
python find_flu.py -i <input FASTA file> -o <output dir> -n <analysis name> -d <database FASTA file> [-I <min alignment identity> -c <min alignment cov> -t <blastn threads>]
Required arguments:
-i : input FASTA file
-o : output directory
-n : analysis name
-d : database FASTA file
Optional arguments:
-I : minimum fragment end alignment identity percentage (default = 90)
-c : minimum fragment end alignment coverage percentage (default = 95)
-t : number of threads to use for blastn (default = 1)
-r : ranks of combined bitscore to consider as best matches (default=2)