mmseqs splitdb does not preserve annotation of originating fasta files #376

grst · 2020-11-26T15:40:52Z

Expected Behavior

I use splitdb to run mmseqs search in parallel on an HPC cluster (SGE).
For each match, I would like to retrieve the name of the original fasta file with mmseqs convertalis --format-output "...,qset,tset,...".

Current Behavior

Specifying qset or tset leads to a segmentation fault. Running search and convertalis on the full db works without issues.

Steps to Reproduce (for bugs)

mmseqs createdb test1.faa test2.faa db
mmseqs splitdb db db_split --split 2
for file in db_split_*_2; do
  mmseqs createsubdb ${file}.index db_h ${file}_h
done
mmseqs search db_split_1_2 db resultdb tmp
mmseqs convertalis db_split_1_2 db resultdb results.tsv --format-output "query,target,qset"

MMseqs Output (for bugs)

convertalis db_split_1_2 db resultdb results.tsv --format-output query,target,qset 

MMseqs Version:         45c4de7f1daefa06b45688195305eadedaea4d97
Substitution matrix     nucl:nucleotide.out,aa:blosum62.out
Alignment format        0
Format alignment output query,target,qset
Translation table       1
Gap open cost           nucl:5,aa:11
Gap extension cost      nucl:2,aa:1
Database output         false
Preload mode            0
Search type             0
Threads                 64
Compressed              0
Verbosity               3

repex.sh: line 7: 43190 Segmentation fault      (core dumped) mmseqs convertalis db_split_1_2 db resultdb results.tsv --format-output "query,target,qset"

Your Environment


Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 45c4de7f1daefa06b45688195305eadedaea4d97
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): statically compiled
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz, AVX2 support, 3TB RAM
Operating system and version: CentOS Linux 7 64bit / Linux 3.10.0-1127.13.1.el7.x86_64


milot-mirdita · 2020-11-26T16:05:51Z

You should probably use the MPI support within MMseqs2 to do this:
https://github.com/soedinglab/mmseqs2#how-to-run-mmseqs2-on-multiple-servers-using-mpi
https://github.com/soedinglab/MMseqs2/wiki#how-to-run-mmseqs2-on-multiple-servers-using-mpi

MMseqs2 MPI will automatically split either the query or target database to fit within memory and will produce a single result database. You'll have to compile MMseqs2 with MPI support though (cmake -DHAVE_MPI=1 ...).
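
For reference, an MPI run might be launched roughly as sketched below. The `RUNNER` variable follows the MMseqs2 README section linked above; the `mpi_search` wrapper name, the process count, and the database names are hypothetical, and the right `mpirun` options depend on the cluster's MPI installation.

```shell
# Hypothetical wrapper: run `mmseqs search` through MPI.
# Assumes mmseqs was built with `cmake -DHAVE_MPI=1 ...` and that mpirun is on PATH.
mpi_search() {
  nprocs="$1"   # number of MPI processes to start
  shift         # remaining arguments are passed to `mmseqs search` unchanged
  # The MMseqs2 workflow scripts read RUNNER and prefix it to their MPI-capable steps.
  RUNNER="mpirun -np ${nprocs}" mmseqs search "$@"
}
```

Example call (placeholder names): `mpi_search 4 queryDB targetDB resultDB tmp`.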

splitdb is probably not symlinking the databases right. I'll have to look when I have time.
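
One way to check this suspicion is to compare the companion files of the full database with those of a split. The helper below is only a sketch: `missing_companions` is a made-up name, and it assumes that every companion file of a database shares the database prefix followed by a dot (for example `db.index`).

```shell
# Hypothetical helper: print the extensions that exist for the full database
# but have no counterpart next to the split database.
missing_companions() {
  full="$1"    # prefix of the full database, e.g. db
  split="$2"   # prefix of one split, e.g. db_split_1_2
  for f in "${full}".*; do
    [ -e "$f" ] || continue          # glob matched nothing
    ext="${f#"${full}".}"            # strip "<prefix>." to get the extension
    [ -e "${split}.${ext}" ] || echo "${ext}"
  done
}
```

Running `missing_companions db db_split_1_2` lists the files the split lacks, which are the candidates for what convertalis fails to open.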

grst · 2020-11-26T16:17:12Z

I've been trying to avoid MPI so far, mostly because I run mmseqs as part of a Nextflow pipeline, and I'm not sure whether MPI can even be used from there, since Nextflow usually takes care of the parallelization.

milot-mirdita · 2020-11-26T16:34:59Z

I added the line to create all the necessary symlinks for convertalis to work. You can compile from source yourself or wait for the CI to upload new binaries in about an hour.
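
For builds that predate this fix, a manual workaround along the following lines may help. It is a sketch, not the change made in the commit: the function name `link_ancillary` is hypothetical, and it assumes that the files `qset`/`tset` depend on are the `.lookup` and `.source` files of the full database.

```shell
# Hypothetical workaround: symlink the full database's .lookup and .source
# files next to a split so that convertalis can resolve qset/tset.
# ASSUMPTION: .lookup and .source are the files convertalis needs here.
link_ancillary() {
  src="$1"   # prefix of the full database, e.g. db
  dst="$2"   # prefix of one split, e.g. db_split_1_2
  # Use an absolute target so the link resolves from any working directory.
  src_abs="$(cd "$(dirname "${src}")" && pwd)/$(basename "${src}")"
  for ext in lookup source; do
    if [ -e "${src_abs}.${ext}" ] && [ ! -e "${dst}.${ext}" ]; then
      ln -s "${src_abs}.${ext}" "${dst}.${ext}"
    fi
  done
}
```

With the names from the report, this could be applied to every split with `for split in db_split_*_2; do link_ancillary db "$split"; done`.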

grst · 2020-11-26T19:41:03Z

This works 🎉
Thanks for fixing this so quickly!

milot-mirdita added a commit that referenced this issue Nov 26, 2020

Add symlinks to splitdb #376

f4f3868

grst closed this as completed Nov 26, 2020
