Skip to content

1) Index creation

Grégoire Siekaniec edited this page Mar 28, 2022 · 4 revisions

Create your own index

⚠️ Caution: reference genomes must be in .fasta or .fna and in the same directory.

If you have the fasta files distributed in several subfolders, you have to redirect them to a single directory.

Repertory containing fasta files (must be the current directory) to run the following scripts:

cd path/to/the/fasta/files/repertory

1) Create the bloom filters (.bf) for each genome

As the last identification/quantification step requires the size of the genomes, it is preferable to calculate it when we have our genomes:

ORI.py length -g path/to/the/genomes -o length.txt
Parameters
Parameters Description Required
-g/--genomes path to the repertory containing genome (.fna or .fasta). Yes
-o/--outfile output file containing length of each genome. No. Default: length.txt
-fpr/--false_positive_rate false positive percent (between 0 and 1) wanted for the bloom filter containing the largest genomes. No. Default: 0.01
-s/--seed_size size of the spaced seed you want to use (not the weight). No. Default: None

The ORI.py length step also allows to calculate an effective size for the bloom filters. Use the -seed_size and -false_positive_rate options . The Bloom filter size is given in the bf_min_size.txt file. If you ever have large genomes it makes more sense to use ntCards. In fact the ORI.py length script gives the effective size of the Bloom filters for the total number of occurrences of k-mers in the largest genome. However, the calculation should be carried out with the number of distinct k-mers (longer to calculate).


BF creation

Then we create the bloom filters for all genomes:

howdesbt makebfQ --k=15 --qgram=path/to/seedfile.txt --bits=0.25G *.fna
Parameters
Parameters Description
--k length of the seed given in --qgram. You may modify your seedfile.txt and this parameter.
--qgram file containing the used spaced seed. It's a text file containing the seed used on the first line (here 111111001111111).
--bits size of bloom filters. A size that is too small will give too many false positives to be usable and a size that is too large will take up too much space and greatly increase the computation times.

Then, we get the names of the bf (Bloom filter) files used to create the index:

ls *.bf > leafname

Now that the bloom filters are created it is no longer necessary to keep the fastas files. If it is not necessary to keep them, they can be deleted to save space. In addition, if the fasta files cannot be completely downloaded on the machine due to lack of space, it is possible to download them little by little and create the filters as you go by deleting the fasta files once in the form of a filter (.bf).


Once the Bloom filters are created, two choices are possible (1) merge the very close genomes together (sibling strains) using a threshold t on the Hamming distance between the strains Bloom filters and then build the index or (2) directly build the index without merging the sibling strains.

Index creation

2) Merging close strains

If you want to merge sibling strains to identify a fine cluster of strains rather than a mixed list of single strains, you need to use:

howdesbt distance --list=leafname
Parameters
Parameters Description
--list list of the bloom filters names (one per line).

⚠️ Caution: the threshold parameter t depends on the genetic proximity of your strains based on the Hamming distance matrix which is itself dependent on the size of the Bloom filters used.

To adapt this parameter to your bacterial species, launch the following command once (ORI.py threshold). It gives three figures and one file helping to set a correct threshold. The files are the following:

  • threshold.png : contains the distribution of distances between the strains in the index (distances are multiplied by 1e05 in the figure).
  • hc_heatmap.svg: contains the heatmap of the distances between strains which can be compared to heatmap_threshold.svg which shows the same heatmap but with distances higher than the set threshold considered as identical, we can then already start to visualize the clusters which will be realized according to the threshold used.
  • heatmap_names.txt: contains the names of the strains having at least one sibling strain, in the order of the heatmaps (hc_heatmap.svg and heatmap_threshold.svg).

More generally, if your group of strains is too big and gives you too many clusters, try a lower t value.

ORI.py threshold -m hamming_matrix.tsv -t 0.0002
Parameters
Parameters Description Required
-m/--matrix path to the hamming distance matrix. It's the output of the first howdesbt distance Yes
-t/--threshold threshold that we want to set to merge close genomes. Be careful not to set this threshold too high or too low. Floating number between 0 and 1. No. Default: 0.0002

The default 0.0002 value is the value used to merge Streptococcus thermophilus strains using filters of size 0.5 G. This value must be modified in the case of using another species and/or another filter size.

Once you have defined your own t value, you can merge the sibling strains in clusters. For this ORI creates a graph where a node corresponds to a strain and an edge correspond to a sibling strain relation between two strains (Hamming distance lower than the threshold t). There are then two ways to merge the strains:

  1. merge the connected components of this graph (--merge).
  2. look for the maximal cliques of the graph and merge the overlapping cliques if their clustering coefficient (which is a value between 0.5 and 1 representing the completeness of the graph) is higher than a threshold (--merge=0.7). In case where the clustering coefficient is lower than the threshold the cliques are separated using the minimum cut.

The command line is as follows:

howdesbt distance --list=leafname --threshold=0.0002 --matrix=hamming_matrix.bin --merge (--merge=0.7)
Parameters
Parameters Description
--list list of the bloom filters names (one per line).
--matrix path to the hamming distance matrix. It's the output of the first howdesbt distance
--threshold hamming distance threshold between bloom filter for merging them. Floating number between 0 and 1.
--merge or --merge=0.7 merge maximal cliques.

Once the merged Bloom filters have been created, we can delete the filters of the sibling strains. The following ORI.py clean_merge script removes these filters and creates a list_number_file.txt file containing the list of strains and their corresponding numbers allowing to find which strain is present in each cluster (clusters are named using strain numbers, not full names).

ORI.py clean_merge -n leafname -r path/to/repository/with/bf/files -o list_number_file.txt
Parameters
Parameters Description Required
-n/--names list of the bloom filters names (one per line). Yes
-r/--repository repertory containing the bloom filter file (.bf). Yes
-o/--outfile output file. Out file containing one id number, one genome name and the corresponding number of sequence per line. Yes

Then, as the Bloom filters have changed, we get the new names of the bf (Bloom filter) files used to create the index afterwards:

ls *.bf > leafname_merge

Since the genomes of some strains have been merged, the genome sizes are recalculated (mean sizes for clusters):

ORI.py merge_length -b leafname_merge -l length.txt -c list_number_file.txt -o merge_length.txt
Parameters
Parameters Description Required
-b/--bflist list of the bloom filters names (one per line). Yes
-l/--lengthfile file containing length of each genome. It's the output of ORI.py length (default : length.txt). Yes
-c/--correspondance file containing correspondance between numbers and genomes. It's the output of ORI.py clean_merge (list_number_file.txt). Yes
-o/--outfile output file containing length of each genome or genomes cluster. No. Default: length_merge.txt

3) Create the index tree

The topology of the index tree is then calculated.
⚠️ Caution: To run these commands you must be in the directory containing the .bf files.

howdesbt cluster --list=leafname/or/leafname_merge --tree=union.sbt --nodename=node{number} --cull
Parameters
Parameters Description
--list list of the bloom filters names (one per line).
--tree name for tree toplogy file.
--nodename filename template for internal tree nodes this must contain the substring {number}.
--cull remove nodes from the binary tree; remove those for which saturation of determined is more than 2 standard deviations.

And the index is created:

howdesbt build --howde --tree=union.sbt --outtree=howde.sbt
Parameters
Parameters Description
--howde create tree nodes as determined/how, but only store active bits. Create the nodes as rrr-compressed bit vector(s).
--tree name for tree toplogy file.
--outtree name of topology file to write tree consisting of the filters built.

Once the compressed Bloom filters have been created, we can delete those that are not compressed (:warning: be careful, this step deletes files):

ls | grep -Pv 'detbrief.rrr.' | grep '.bf' | xargs rm --