Inconsistent output across runs of all-versus-all ANI computation #67

Closed
apcamargo opened this issue Jun 19, 2020 · 25 comments

@apcamargo

apcamargo commented Jun 19, 2020

Hi @cjain7!

I'm using FastANI to compare a set of approximately 500 MAGs. To do that, I'm executing:

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt

Across multiple runs I observed that the output varies significantly. For instance, in some cases a comparison of a genome with itself would (a) have a low aligned fraction (~40%), (b) have ~100% of the genome aligned, or (c) wouldn't even show in the output (presumably due to low coverage of the alignment). I've also seen different genomes with high ANI between them (~98%) sometimes appear in the output and sometimes not.
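
A minimal sketch of how self-comparisons with a low aligned fraction could be pulled out of one output file. It assumes FastANI's documented tab-separated columns (query, reference, ANI, mapped fragments, total query fragments) and takes mapped/total fragments as the aligned fraction. The function name `self_hits_below` and the 0.9 threshold are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print self-comparisons whose aligned fraction is
# below a threshold. Assumes tab-separated FastANI output with columns:
# query, reference, ANI, mapped fragments, total query fragments.
self_hits_below() {
    # $1: FastANI output file, $2: minimum aligned fraction (e.g. 0.9)
    awk -F'\t' -v min="$2" '
        $1 == $2 && $5 > 0 && ($4 / $5) < min {
            # path, reported ANI, aligned fraction
            printf "%s\t%s\t%.3f\n", $1, $3, $4 / $5
        }' "$1"
}

# Usage (not run here): self_hits_below fastani_output.txt 0.9
```

Self-comparisons that are absent from the output altogether (case c) are not caught by this filter, since it only inspects rows that exist.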

In all my 1 vs. 1 comparisons the output was consistent. The discrepant results appeared only when comparing two lists (in this case, the same list was used as both query and reference).

Here are the outputs of two independent runs:
dereplicated_mags_ani_raw_1.txt
dereplicated_mags_ani_raw_2.txt

EDIT: I performed a new test using the master branch. The results are still inconsistent and comparisons are missing from all the outputs I'm obtaining.

@cjain7
Member

cjain7 commented Jun 20, 2020

Thanks for sharing this problem.

I've noticed multiple issues that highlight this problem (see also #37 and #58); unfortunately, I've failed to reproduce it on our compute clusters. I'm willing to invest time in this problem, but I need help reproducing the behavior at my end so that I can debug it.

Are you able to provide more details (e.g., Mac/Linux, GCC version, input data files)?

@apcamargo
Author

apcamargo commented Jun 20, 2020

I got this issue with both the Conda version (I believe they use GCC 7.*) and a statically compiled version built on my personal computer (master branch, Ubuntu 16.04, GCC 7.5.0). I executed the runs on a cluster with SUSE Linux Enterprise Server 15.

By the way, I hit a bug while compiling FastANI on my PC and I submitted a PR fixing it: #68

I don't think I can share this specific dataset because it isn't mine. But I'll try to replicate the issue with my own genomes so I can send you the data. I can't promise that I'll be able to do that in the next few days, though.

Just to illustrate the extent of the inconsistency: I executed the all-versus-all comparison eight times and each run had ~16 comparisons that were not found in any of the other runs. I also noticed that this greatly influenced the definition of species (using an algorithm similar to the one used by GTDB).
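
A count like the one above could be obtained with something along these lines. This is only a sketch: the function name `unique_pairs_per_run` is hypothetical, and it assumes tab-separated FastANI output in which columns 1 and 2 identify the query/reference pair.

```shell
# Hypothetical helper: for each run, count the query/reference pairs that
# appear in no other run.
unique_pairs_per_run() {
    # "$@": two or more FastANI output files (distinct paths)
    awk -F'\t' '
        {
            key = $1 FS $2
            if (!((key, FILENAME) in seen)) {   # count each pair once per file
                seen[key, FILENAME] = 1
                nfiles[key]++
                owner[key] = FILENAME
            }
        }
        END {
            # A pair seen in exactly one file is unique to that file.
            for (key in nfiles)
                if (nfiles[key] == 1) uniq[owner[key]]++
            for (i = 1; i < ARGC; i++)
                printf "%s\t%d\n", ARGV[i], uniq[ARGV[i]]
        }' "$@"
}

# Usage (not run here): unique_pairs_per_run run_1.txt run_2.txt run_3.txt
```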

@cjain7
Member

cjain7 commented Jun 20, 2020

Got it. One of the previously filed issues involved SLURM, so I'm curious: are you using SLURM too?

@cjain7
Member

cjain7 commented Jun 20, 2020

Is this a locally owned cluster? I'm wondering if you could arrange a temporary account for me (perhaps for a week)?

@apcamargo
Author

Yes, I'm using SLURM.

Unfortunately it is a big shared cluster and I have no control over it, otherwise I'd be happy to give you access to it.

@cjain7
Member

cjain7 commented Jun 20, 2020

Thanks! I guess the bug might be related to SLURM. When you get a chance, can you send me your SLURM job script/commands and the job output log from running:

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
/usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv  # please add this extra command for me

@cjain7
Member

cjain7 commented Jun 20, 2020

cc'ing @luke-dt

@apcamargo
Author

#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --account=fnglanot
#SBATCH --qos=genepool
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --constraint=haswell

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
srun --cpus-per-task=64 --ntasks=1 /usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv

Here's the log: slurm-31935719.txt

@apcamargo
Author

I executed the same command twice on a different cluster that uses PBS. For some reason FastANI took much longer to finish there, but the two outputs still differed from each other.

wc -l fastani_output_1.txt fastani_output_2.txt
   3130 fastani_output_1.txt
   3149 fastani_output_2.txt

@luke-dt

luke-dt commented Jun 21, 2020

Here is the script that I run with sbatch:

#SBATCH --job-name=fastani
#SBATCH --mem=30G
#SBATCH --cpus-per-task=4
#SBATCH --output=slurm_out/fastani/z_fastani_%A.out
#SBATCH --error=slurm_out/fastani/z_fastani_%A.out

module load fastani/1.3.1a

basedir="$PWD"
outdir="${basedir}/d03_species_analysis/fastani"

fastANI --ql ${outdir}/genomepaths.txt \
        --rl ${outdir}/genomepaths.txt \
        -o ${outdir}/fastani_out.txt \
        -t ${SLURM_CPUS_PER_TASK}

log file for 4 threads (analysis worked)
log file for 8 threads (comparisons missing)

@apcamargo
Author

After running with 4 cores I got consistent outputs. However, some comparisons are missing: the output from the 4-core run has 3149 lines, whereas a file that I built by aggregating multiple runs has 3271 lines.

Here are ANI vs. % aligned plots for these two files:

[Figure: ANI vs. % aligned (image "index2")]

[Figure: ANI vs. % aligned (image "index")]

It seems that most of the missing comparisons are from pairs with high ANI and low % aligned.

@cjain7
Member

cjain7 commented Jun 21, 2020

Thanks! I'm able to reproduce inconsistent output at my end on a cluster with SLURM, which is good! Will reach out if I need more info.

I replicated a single publicly available genome and ran an all-versus-all comparison among the copies. For a few pairs, I do see <100% ANI reported in an inconsistent manner. Please give me some time to investigate.
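
A reproduction along these lines could be set up as sketched below. The comment above does not say how many copies were used or how they were made, so the helper `make_replicates`, the `copy_N.fna` naming, and the copy count in the usage note are all hypothetical. The fastANI flags are the ones already used in this thread.

```shell
# Hypothetical helper: copy one genome N times and print the absolute path
# of each copy, one per line, suitable for use as a genome list.
make_replicates() {
    # $1: genome FASTA, $2: number of copies, $3: output directory
    mkdir -p "$3" || return 1
    i=1
    while [ "$i" -le "$2" ]; do
        cp "$1" "$3/copy_$i.fna" || return 1
        printf '%s\n' "$(cd "$3" && pwd)/copy_$i.fna"
        i=$((i + 1))
    done
}

# Usage (not run here):
#   make_replicates genome.fna 50 replicates > replicate_list.txt
#   fastANI --rl replicate_list.txt --ql replicate_list.txt -t 64 -o replicates_out.txt
```

Because every file is byte-identical, any pair whose reported values change between runs points at the tool rather than at the data.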

@apcamargo
Author

apcamargo commented Jun 21, 2020

You're welcome!

For further context: to build the first figure I executed FastANI with 64 cores in a PBS cluster and aggregated the results into a single file. For lines in which the first and second genomes were the same but a different ANI was reported, I chose the one with the highest % aligned (which usually corresponded to the lowest ANI).
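
The aggregation described above might look roughly like this. It is a sketch, not the script actually used: the function name `merge_keep_best_aligned` is hypothetical, and "% aligned" is assumed to be column 4 (mapped fragments) divided by column 5 (total query fragments) of the FastANI output.

```shell
# Hypothetical helper: merge several FastANI outputs, keeping for each
# query/reference pair the row with the highest aligned fraction
# (assumed here to be column 4 / column 5).
merge_keep_best_aligned() {
    # "$@": FastANI output files from separate runs
    awk -F'\t' '
        $5 > 0 {
            key = $1 FS $2
            frac = $4 / $5
            if (!(key in best) || frac > best[key]) {
                best[key] = frac
                row[key] = $0
            }
        }
        END { for (key in row) print row[key] }' "$@" | sort
}

# Usage (not run here): merge_keep_best_aligned run_*.txt > aggregated.txt
```

The final `sort` is there only to make the merged file's row order reproducible, since awk does not guarantee an iteration order for `for (key in row)`.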

@cjain7
Member

cjain7 commented Jul 4, 2020

Hi @apcamargo, @luke-dt,

Thanks again for your help! There was a bug in my code associated with file I/O. I've committed the fix to the master branch. When you get a chance, please run the code again and let me know if it also fixes the issue at your end. I will create a new fastANI version after I hear from you.

@cjain7
Member

cjain7 commented Jul 10, 2020

Hi guys (@apcamargo, @luke-dt),
let me know if you were able to check.

@apcamargo
Author

Hi @cjain7!
I just submitted the job and I'll let you know when I get the results.

@apcamargo
Author

The SLURM cluster I have access to is under maintenance, so I executed fastANI on a PBS cluster with -t 120.

$ sha256sum dereplicated_mags_ani_raw_*

  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_1.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_2.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_3.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_4.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_5.txt

The bug seems to be fixed! Thank you @cjain7!
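
The same check can be scripted so that it yields a pass/fail result instead of a list of digests to read by eye. A minimal sketch, assuming GNU `sha256sum` as used above; both function names are hypothetical. The second variant hashes a sorted copy of each file, which would help tell a difference in row order apart from a difference in content, should row order ever vary between runs.

```shell
# Hypothetical helper: succeed only if all given files have the same
# SHA-256 digest.
all_identical() {
    n=$(sha256sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
    [ "$n" -eq 1 ]
}

# Variant that ignores row order by hashing a sorted copy of each file.
all_identical_sorted() {
    n=$(for f in "$@"; do sort "$f" | sha256sum; done |
        awk '{ print $1 }' | sort -u | wc -l)
    [ "$n" -eq 1 ]
}

# Usage (not run here):
#   all_identical dereplicated_mags_ani_raw_*.txt && echo "runs agree"
```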

@cjain7
Member

cjain7 commented Jul 11, 2020

Good to know. Thanks!
Closing the issue now. I'll create v1.4; please use that going forward.

@apcamargo
Author

Hey @cjain7

Even though the results are now consistent across runs, I noticed that there are still many comparisons missing from the output. I know that fastANI won't report comparisons of genomes with low % of alignment, but some of the missing comparisons were present in previous runs. Is this behaviour expected?

@cjain7
Member

cjain7 commented Jul 15, 2020

Yeah, I think (or at least I hope) that the output will be consistent from now on. The cases you mention are probably borderline cases that cleared the ~80% cutoff by a small margin because of the previous bug.

@apcamargo
Author

The strange thing is that the number of genomes in the output (520) is less than the total number of genomes (522), meaning that two genomes are not being reported against themselves, even though a self-comparison should certainly exceed the 80% cutoff.
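
One way to find which genomes lack a self-comparison is sketched below. The function name `missing_self_hits` is hypothetical, and the sketch assumes that the paths in the output's first two columns are written exactly as they appear in the genome list.

```shell
# Hypothetical helper: print genomes from the list that have no
# query == reference row in the FastANI output.
missing_self_hits() {
    # $1: genome list (one path per line), $2: FastANI output file
    awk -F'\t' -v out="$2" '
        BEGIN {
            # Collect every genome that was reported against itself.
            while ((getline line < out) > 0) {
                split(line, f, "\t")
                if (f[1] == f[2]) self[f[1]] = 1
            }
        }
        !($0 in self) { print }' "$1"
}

# Usage (not run here): missing_self_hits genome_list.txt fastani_output.txt
```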

@cjain7
Member

cjain7 commented Jul 16, 2020

Can you check whether they have the same file names?
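
A quick way to run that check on a genome list, as a sketch only (the function name `duplicate_basenames` is hypothetical): strip the directory part of each path and report any file name that occurs more than once.

```shell
# Hypothetical helper: print file names that occur more than once in a
# genome list once the directory part of each path is removed.
duplicate_basenames() {
    # $1: genome list (one path per line)
    awk -F/ '{ print $NF }' "$1" | sort | uniq -d
}

# Usage (not run here): duplicate_basenames genome_list.txt
```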

@cjain7
Member

cjain7 commented Jul 16, 2020

Please create a new issue with more information (e.g., log files, input command etc.) if you would like me to look further.

@apcamargo
Author

I was just preparing a bug report and I noticed that the bug was in the script I was using to process the output. Sorry for the trouble!

@Valentin-Bio

Hello, I'm having the same inconsistency problem, but I'm not running FastANI via SLURM. I'm running it on an Ubuntu machine, using a build compiled from the master branch.

~/Downloads/FastANI/fastANI --ql filespaths.txt --rl filespaths.txt -t 7 -o ~/Documents/miriam/fastani_results.txt

The output table has different ANI values for the same compared genomes.
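
To pin down which pairs are affected, the rows could be scanned for query/reference pairs that carry more than one ANI value, either within one file or across several runs. A minimal sketch, assuming tab-separated FastANI output with the ANI in column 3; the function name `conflicting_ani` is hypothetical.

```shell
# Hypothetical helper: print query/reference pairs that are reported with
# more than one distinct ANI value across the given files.
conflicting_ani() {
    # "$@": one or more FastANI output files
    awk -F'\t' '
        {
            key = $1 FS $2
            if (!(key in ani)) ani[key] = $3
            else if (ani[key] != $3) conflict[key] = 1
        }
        END { for (key in conflict) print key }' "$@" | sort
}

# Usage (not run here): conflicting_ani run_1.txt run_2.txt
```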
