Inconsistent output across runs of all-versus-all ANI computation #67

apcamargo · 2020-06-19T21:35:15Z

I'm using FastANI to compare a set of approximately 500 MAGs. To do that, I'm executing:

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt

Across multiple runs I observed that the output varies significantly. For instance, in some cases a comparison of a genome with itself would (a) have a low aligned fraction (~40%), (b) have ~100% of the genome aligned, or (c) wouldn't even show in the output (presumably due to low coverage of the alignment). I've also seen different genomes with high ANI between them (~98%) sometimes appear in the output and sometimes not.

In all my 1 vs. 1 comparisons the output was consistent. The discrepant results appeared only when comparing two lists (in this case, the same list was used as both query and reference).

Here are the outputs of two independent runs:
dereplicated_mags_ani_raw_1.txt
dereplicated_mags_ani_raw_2.txt

EDIT: I performed a new test using the master branch. The results are still inconsistent and comparisons are missing from all the outputs I'm obtaining.
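
For anyone who wants to check two runs the same way, here is a minimal sketch that lists the genome pairs present in one output file but not in the other. It assumes the default tab-delimited FastANI output, with the query genome in column 1 and the reference genome in column 2. The function names `pair_keys` and `diff_pairs` are made up for this illustration.

```shell
#!/usr/bin/env bash
# Print the sorted, de-duplicated "query<TAB>reference" keys of one output file.
# Assumption: column 1 is the query genome and column 2 is the reference genome.
pair_keys() {
    awk -F'\t' -v OFS='\t' '{ print $1, $2 }' "$1" | LC_ALL=C sort -u
}

# Print pairs found only in the first file (prefixed "< ") and pairs found
# only in the second file (prefixed "> ").
# The body runs in a subshell so the temporary variables do not leak.
diff_pairs() (
    a=$(mktemp)
    b=$(mktemp)
    pair_keys "$1" > "$a"
    pair_keys "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
)
```

Usage would look like `diff_pairs dereplicated_mags_ani_raw_1.txt dereplicated_mags_ani_raw_2.txt`. It only compares which pairs are reported, not the ANI values.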

cjain7 · 2020-06-20T16:03:34Z

Thanks for sharing this problem.

I've noticed multiple issues that highlight this problem (also see #37 and #58); however, I've failed to reproduce it on our compute clusters. I'm willing to invest time in this problem, but I need help reproducing this behavior at my end for debugging.

Are you able to provide more details (e.g., macOS/Linux, GCC version, input data files)?

apcamargo · 2020-06-20T16:44:14Z

I got this issue with both the Conda version (I believe they use GCC 7.*) and a version statically compiled on my personal computer (master branch, Ubuntu 16.04, GCC 7.5.0). I executed the runs on a cluster with SUSE Linux Enterprise Server 15.

By the way, I hit a bug while compiling FastANI on my PC and submitted a PR fixing it: #68

I don't think I can share this specific dataset because it isn't mine. But I'll try to replicate the issue with my own genomes so I can send you the data. I can't promise that I'll be able to do that in the next few days, though.

Just to illustrate the extent of the inconsistency: I executed the all-versus-all comparison eight times and each run had ~16 comparisons that were not found in any of the other runs. I also noticed that this greatly influenced the definition of species (using an algorithm similar to the one used by GTDB).

cjain7 · 2020-06-20T16:53:30Z

Got it. Since one of the issues filed previously involved SLURM, I'm curious whether you are using SLURM too?

cjain7 · 2020-06-20T16:55:34Z

Is this a locally owned cluster? I'm wondering if you could arrange a temporary account for me (perhaps for a week).

apcamargo · 2020-06-20T16:58:14Z

Yes, I'm using SLURM.

Unfortunately it is a big shared cluster and I have no control over it, otherwise I'd be happy to give you access to it.

cjain7 · 2020-06-20T18:23:42Z

Thanks! I guess the bug might be related to SLURM. When you get a chance, can you send me your SLURM job script/commands and the job output log while running:

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
/usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv  # please add this extra command for me

cjain7 · 2020-06-20T18:27:43Z

cc'ing @luke-dt

apcamargo · 2020-06-20T18:56:47Z

#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --account=fnglanot
#SBATCH --qos=genepool
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --constraint=haswell

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
srun --cpus-per-task=64 --ntasks=1 /usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv

Here's the log: slurm-31935719.txt

apcamargo · 2020-06-20T23:34:49Z

I executed the same command twice on a different cluster that uses PBS. For some reason FastANI took much longer to finish there, and the two outputs still differed.

wc -l fastani_output_1.txt fastani_output_2.txt
   3130 fastani_output_1.txt
   3149 fastani_output_2.txt

luke-dt · 2020-06-21T00:22:37Z

Here is the script that I run with sbatch:

#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --mem=30G
#SBATCH --cpus-per-task=4
#SBATCH --output=slurm_out/fastani/z_fastani_%A.out
#SBATCH --error=slurm_out/fastani/z_fastani_%A.out

module load fastani/1.3.1a

basedir="$PWD"
outdir="${basedir}/d03_species_analysis/fastani"

# All-versus-all: the same genome list is used as query and reference.
fastANI --ql "${outdir}/genomepaths.txt" \
        --rl "${outdir}/genomepaths.txt" \
        -o "${outdir}/fastani_out.txt" \
        -t "${SLURM_CPUS_PER_TASK}"

log file for 4 threads (analysis worked)
log file for 8 threads (comparisons missing)

apcamargo · 2020-06-21T17:31:45Z

After executing with 4 cores I got consistent outputs. However, some comparisons are missing: the output from the 4-core run has 3149 lines, whereas a file that I built by aggregating multiple executions has 3271 lines.

Here are ANI vs. % aligned plots for these two files:

It seems that most of the missing comparisons are from pairs with high ANI and low % aligned.

cjain7 · 2020-06-21T18:13:10Z

Thanks! I'm able to reproduce inconsistent output at my end on a cluster with SLURM, which is good! Will reach out if I need more info.

I replicated a single publicly available genome and ran an all-to-all comparison among the copies. For a few pairs, I do see <100% ANI reported in an inconsistent manner. Please give me some time to investigate.

apcamargo · 2020-06-21T18:22:41Z

You're welcome!

For further context: to build the first figure I executed FastANI with 64 cores in a PBS cluster and aggregated the results into a single file. For lines in which the first and second genomes were the same but a different ANI was reported, I chose the one with the highest % aligned (which usually corresponded to the lowest ANI).
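
The aggregation described above can be sketched roughly as follows. This is an illustration, not the script that produced the figure. It assumes the default tab-delimited FastANI output and takes "% aligned" to be column 4 divided by column 5 (bidirectional fragment mappings over total query fragments). The function name `merge_best_aligned` is hypothetical.

```shell
#!/usr/bin/env bash
# Merge several FastANI output files and keep one row per (query, reference)
# pair: the row with the highest aligned fraction.
# Assumption: aligned fraction = column 4 / column 5.
# Rows with a zero or missing column 5 are skipped; ties keep the first row seen.
merge_best_aligned() {
    awk -F'\t' '
        $5 > 0 {
            key = $1 FS $2
            frac = $4 / $5
            if (!(key in best) || frac > best[key]) {
                best[key] = frac
                row[key] = $0
            }
        }
        END { for (key in row) print row[key] }
    ' "$@" | LC_ALL=C sort
}
```

It would be called as `merge_best_aligned run_1.txt run_2.txt > aggregated.txt`, where the file names are placeholders. The final sort only makes the output order reproducible.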

cjain7 · 2020-07-04T22:28:43Z

Hi @apcamargo , @luke-dt ,

Thanks again for your help! There was a bug in my code associated with file I/O. I've committed the fix to the master branch. When you get a chance, please run the code again and let me know if it also fixes the issue at your end. I will create a new fastANI version after I hear from you.

cjain7 · 2020-07-10T20:09:33Z

Hi guys (@apcamargo, @luke-dt),
Let me know if you were able to check.

apcamargo · 2020-07-11T21:02:16Z

Hi @cjain7!
I just submitted the job and I'll let you know when I get the results.

apcamargo · 2020-07-11T23:09:31Z

The SLURM cluster I have access to is under maintenance, so I executed fastANI on a PBS cluster with -t 120.

$ sha256sum dereplicated_mags_ani_raw_*

  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_1.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_2.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_3.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_4.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_5.txt

The bug seems to be fixed! Thank you @cjain7!
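
For repeating this check on other datasets, a small sketch that succeeds only when all the given files share one SHA-256 digest. It assumes GNU `sha256sum` is available. The function name `all_identical` is made up for this example.

```shell
#!/usr/bin/env bash
# Exit with status 0 only if every file given has the same SHA-256 digest.
# Also fails if sha256sum could not hash one of the files, because the number
# of digest lines then differs from the number of arguments.
all_identical() {
    sha256sum "$@" | awk -v expected="$#" '
        { seen[$1] = 1 }
        END {
            n = 0
            for (digest in seen) n++
            exit ((NR == expected && n == 1) ? 0 : 1)
        }'
}
```

For example: `all_identical dereplicated_mags_ani_raw_*.txt && echo "runs are identical"`.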

cjain7 · 2020-07-11T23:15:32Z

Good to know. Thanks!
Closing the issue now. I'll create v1.4; please use that going forward.

apcamargo · 2020-07-15T11:16:23Z

Hey @cjain7

Even though the results are now consistent across runs, I noticed that there are still many comparisons missing from the output. I know that fastANI won't report comparisons of genomes with low % of alignment, but some of the missing comparisons were present in previous runs. Is this behaviour expected?

cjain7 · 2020-07-15T17:41:30Z

Yeah, I think (or at least I hope) that the output will be consistent from now on. The cases you mention are probably borderline cases that cleared the ~80% cutoff by a small margin because of the previous bug.

apcamargo · 2020-07-16T12:49:34Z

The strange thing is that the number of genomes in the output (520) is lower than the total number of genomes (522), meaning that two genomes have no comparison with themselves, even though a self-comparison is certainly above 80%.
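
One way to find which genomes lack a self-comparison is sketched below. It assumes the paths in the genome list are written exactly as they appear in columns 1 and 2 of the tab-delimited FastANI output. The function name `missing_self` is hypothetical.

```shell
#!/usr/bin/env bash
# Print every genome path from the input list that has no self-comparison row
# (column 1 equal to column 2) in a FastANI output file.
# Usage: missing_self genome_list.txt fastani_output.txt
missing_self() (
    list=$1
    output=$2
    awk -F'\t' '
        # First file (the FastANI output): remember genomes with a self hit.
        FILENAME == ARGV[1] { if ($1 == $2) self[$1] = 1; next }
        # Second file (the genome list): report non-empty lines without one.
        $0 != "" && !($0 in self) { print }
    ' "$output" "$list"
)
```

If this prints nothing, every listed genome has a self-comparison, and any discrepancy in genome counts comes from somewhere else, such as downstream processing.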

cjain7 · 2020-07-16T16:30:56Z

Can you check if they have the same file names?
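
A quick way to run that check on a list of genome paths is sketched here. It only looks at the base file name of each path. The function name `duplicate_basenames` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print each base file name that occurs more than once in a list of paths
# (one path per line). Empty lines are ignored.
duplicate_basenames() {
    awk -F'/' 'NF { print $NF }' "$1" | LC_ALL=C sort | uniq -d
}
```

For example, `duplicate_basenames genome_list.txt` prints nothing when all file names are unique.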

cjain7 · 2020-07-16T16:32:01Z

Please create a new issue with more information (e.g., log files, input command etc.) if you would like me to look further.

apcamargo · 2020-07-16T19:59:07Z

I was just preparing a bug report and I noticed that the bug was in the script I was using to process the output. Sorry for the trouble!

Valentin-Bio · 2023-10-10T03:51:53Z

Hello, I'm having the same inconsistency problem, but I'm not running FastANI via SLURM. I'm running it on an Ubuntu machine, using a version compiled from the master branch.

~/Downloads/FastANI/fastANI --ql filespaths.txt --rl filespaths.txt -t 7 -o ~/Documents/miriam/fastani_results.txt

The output table has different ANI values for the same compared genomes.

cjain7 closed this as completed Jul 11, 2020

This was referenced Jul 11, 2020

Results missing from output #37

Closed

fastANI skips reference genomes #58

Closed

apcamargo mentioned this issue Jul 16, 2020

Update fastANI requirement to 1.31 wwood/galah#6

Closed

apcamargo mentioned this issue Oct 2, 2020

[Documentation] Update recommended FastANI version in the documentation Ecogenomics/GTDBTk#279

Closed
