-
Notifications
You must be signed in to change notification settings - Fork 0
1) Index creation
If you have the fasta files distributed in several subfolders, you have to redirect them to a single directory.
Repertory containing fasta files (must be the current directory) to run the following scripts:
cd path/to/the/fasta/files/repertory
As the last identification/quantification step requires the size of the genomes, it is preferable to calculate it when we have our genomes:
ORI.py length -g path/to/the/genomes -o length.txt
Parameters
Parameters | Description | Required |
---|---|---|
-g/--genomes | path to the repertory containing genome (.fna or .fasta). | Yes |
-o/--outfile | output file containing length of each genome. | No. Default: length.txt |
-fpr/--false_positive_rate | false positive percent (between 0 and 1) wanted for the bloom filter containing the largest genomes. | No. Default: 0.01 |
-s/--seed_size | size of the spaced seed you want to use (not the weight). | No. Default: None |
The ORI.py length
step also allows to calculate an effective size for the bloom filters. Use the -seed_size
and -false_positive_rate
options . The Bloom filter size is given in the bf_min_size.txt file. If you ever have large genomes it makes more sense to use ntCards. In fact the ORI.py length
script gives the effective size of the Bloom filters for the total number of occurrences of k-mers in the largest genome. However, the calculation should be carried out with the number of distinct k-mers (longer to calculate).
Then we create the bloom filters for all genomes:
howdesbt makebfQ --k=15 --qgram=path/to/seedfile.txt --bits=0.25G *.fna
Parameters
Parameters | Description |
---|---|
--k | length of the seed given in --qgram. You may modify your seedfile.txt and this parameter. |
--qgram | file containing the used spaced seed. It's a text file containing the seed used on the first line (here 111111001111111). |
--bits | size of bloom filters. A size that is too small will give too many false positives to be usable and a size that is too large will take up too much space and greatly increase the computation times. |
Then, we get the names of the bf (Bloom filter) files used to create the index:
ls *.bf > leafname
Now that the bloom filters are created it is no longer necessary to keep the fastas files. If it is not necessary to keep them, they can be deleted to save space. In addition, if the fasta files cannot be completely downloaded on the machine due to lack of space, it is possible to download them little by little and create the filters as you go by deleting the fasta files once in the form of a filter (.bf).
Once the Bloom filters are created, two choices are possible (1) merge the very close genomes together (sibling strains) using a threshold t on the Hamming distance between the strains Bloom filters and then build the index or (2) directly build the index without merging the sibling strains.
If you want to merge sibling strains to identify a fine cluster of strains rather than a mixed list of single strains, you need to use:
howdesbt distance --list=leafname
Parameters
Parameters | Description |
---|---|
--list | list of the bloom filters names (one per line). |
To adapt this parameter to your bacterial species, launch the following command once (ORI.py threshold
). It gives three figures and one file helping to set a correct threshold. The files are the following:
- threshold.png : contains the distribution of distances between the strains in the index (distances are multiplied by 1e05 in the figure).
- hc_heatmap.svg: contains the heatmap of the distances between strains which can be compared to heatmap_threshold.svg which shows the same heatmap but with distances higher than the set threshold considered as identical, we can then already start to visualize the clusters which will be realized according to the threshold used.
- heatmap_names.txt: contains the names of the strains having at least one sibling strain, in the order of the heatmaps (hc_heatmap.svg and heatmap_threshold.svg).
More generally, if your group of strains is too big and gives you too many clusters, try a lower t value.
ORI.py threshold -m hamming_matrix.tsv -t 0.0002
Parameters
Parameters | Description | Required |
---|---|---|
-m/--matrix | path to the hamming distance matrix. It's the output of the first howdesbt distance | Yes |
-t/--threshold | threshold that we want to set to merge close genomes. Be careful not to set this threshold too high or too low. Floating number between 0 and 1. | No. Default: 0.0002 |
The default 0.0002 value is the value used to merge Streptococcus thermophilus strains using filters of size 0.5 G. This value must be modified in the case of using another species and/or another filter size.
Once you have defined your own t value, you can merge the sibling strains in clusters. For this ORI creates a graph where a node corresponds to a strain and an edge correspond to a sibling strain relation between two strains (Hamming distance lower than the threshold t). There are then two ways to merge the strains:
- merge the connected components of this graph (
--merge
). - look for the maximal cliques of the graph and merge the overlapping cliques if their clustering coefficient (which is a value between 0.5 and 1 representing the completeness of the graph) is higher than a threshold (
--merge=0.7
). In case where the clustering coefficient is lower than the threshold the cliques are separated using the minimum cut.
The command line is as follows:
howdesbt distance --list=leafname --threshold=0.0002 --matrix=hamming_matrix.bin --merge (--merge=0.7)
Parameters
Parameters | Description |
---|---|
--list | list of the bloom filters names (one per line). |
--matrix | path to the hamming distance matrix. It's the output of the first howdesbt distance |
--threshold | hamming distance threshold between bloom filter for merging them. Floating number between 0 and 1. |
--merge or --merge=0.7 | merge maximal cliques. |
Once the merged Bloom filters have been created, we can delete the filters of the sibling strains. The following ORI.py clean_merge
script removes these filters and creates a list_number_file.txt
file containing the list of strains and their corresponding numbers allowing to find which strain is present in each cluster (clusters are named using strain numbers, not full names).
ORI.py clean_merge -n leafname -r path/to/repository/with/bf/files -o list_number_file.txt
Parameters
Parameters | Description | Required |
---|---|---|
-n/--names | list of the bloom filters names (one per line). | Yes |
-r/--repository | repertory containing the bloom filter file (.bf). | Yes |
-o/--outfile | output file. Out file containing one id number, one genome name and the corresponding number of sequence per line. | Yes |
Then, as the Bloom filters have changed, we get the new names of the bf (Bloom filter) files used to create the index afterwards:
ls *.bf > leafname_merge
Since the genomes of some strains have been merged, the genome sizes are recalculated (mean sizes for clusters):
ORI.py merge_length -b leafname_merge -l length.txt -c list_number_file.txt -o merge_length.txt
Parameters
Parameters | Description | Required |
---|---|---|
-b/--bflist | list of the bloom filters names (one per line). | Yes |
-l/--lengthfile | file containing length of each genome. It's the output of ORI.py length (default : length.txt). | Yes |
-c/--correspondance | file containing correspondance between numbers and genomes. It's the output of ORI.py clean_merge (list_number_file.txt). | Yes |
-o/--outfile | output file containing length of each genome or genomes cluster. | No. Default: length_merge.txt |
The topology of the index tree is then calculated.
howdesbt cluster --list=leafname/or/leafname_merge --tree=union.sbt --nodename=node{number} --cull
Parameters
Parameters | Description |
---|---|
--list | list of the bloom filters names (one per line). |
--tree | name for tree toplogy file. |
--nodename | filename template for internal tree nodes this must contain the substring {number}. |
--cull | remove nodes from the binary tree; remove those for which saturation of determined is more than 2 standard deviations. |
And the index is created:
howdesbt build --howde --tree=union.sbt --outtree=howde.sbt
Parameters
Parameters | Description |
---|---|
--howde | create tree nodes as determined/how, but only store active bits. Create the nodes as rrr-compressed bit vector(s). |
--tree | name for tree toplogy file. |
--outtree | name of topology file to write tree consisting of the filters built. |
Once the compressed Bloom filters have been created, we can delete those that are not compressed (:warning: be careful, this step deletes files):
ls | grep -Pv 'detbrief.rrr.' | grep '.bf' | xargs rm --