MMSeqs2 DB slimmer #316

genomewalker · 2020-06-04T18:06:55Z

Hi,
this is not an issue but a potential enhancement we discussed with @martin-steinegger.
We have a seed clustering database that is continuously updated with new sequences. The DB is growing quite fast, and eventually we will have problems storing and distributing it, as we have many redundant sequences in each cluster. We thought that a module that takes a DB and filters it based on a criterion similar to --diff from result2msa or result2profile would be very useful for keeping only the informative sequences in the clusters.

Thanks
Antonio

milot-mirdita · 2020-06-10T14:40:56Z

A short update: I started working on this, but found some potential weirdness in result2msa that I want to look at first before pushing the changes. Should be done in the next few days.

genomewalker · 2020-06-16T19:13:30Z

Thank you very much @milot-mirdita. It will be very useful to keep our DBs slim :-)

milot-mirdita · 2020-07-28T23:55:59Z

I added a filterresult module that does basically the same as result2msa but returns a result database. Hope it's useful :)

genomewalker · 2020-07-29T13:29:26Z

Awesome! @ChiaraVanni will test it and we will get back to you in case we find any problems. Many thanks!

milot-mirdita · 2020-10-15T09:15:29Z

We had a bug that martin fixed yesterday in 18588bb.

genomewalker · 2021-03-12T17:11:48Z

Hi @milot-mirdita
we finally got our hands on filterresult, and we have a couple of questions about how to proceed after the filtering.
After running:

mmseqs filterresult seqDB seqDB cluDB cluDB-filt --threads 28

we got a few alignment DBs plus the index and dbtype files. Looking at the alignment DB files, they seem to have the cluster DB format, and the number of entries has decreased substantially. Any suggestions on converting the output of filterresult to a cluster DB we can use for updating? You can get the DB we are trying to slim down here: https://ndownloader.figshare.com/files/23066651

Many thanks!
Antonio

milot-mirdita · 2021-03-12T17:45:15Z

You can use filterdb --trim-to-one-column to discard all alignment columns in this database and leave only the identifiers.
Then you can overwrite the dbtype file to convince MMseqs2 that this database is a cluster result and not an alignment result:

mmseqs filterdb cluDB-filt cluDB-trim --trim-to-one-column
awk 'BEGIN { printf("%c%c%c%c",6,0,0,0); exit; }' > cluDB-trim.dbtype
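For readers who want to double-check the dbtype step: the awk call above writes four bytes (6, 0, 0, 0). A minimal sketch of an equivalent way to write and inspect those bytes, assuming a POSIX shell with printf and od available. The file name simply mirrors the one used above, and the value 6 is taken from the awk command, not looked up independently.

```shell
# Write the same four bytes as the awk one-liner: 6, 0, 0, 0
# (octal escapes in the printf format string, so no awk is needed).
out="cluDB-trim.dbtype"
printf '\006\000\000\000' > "$out"

# Inspect the file as unsigned decimal bytes; four values should be listed.
od -An -tu1 "$out"
```

If od lists anything other than four byte values starting with 6, the dbtype file was not written as intended.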

genomewalker · 2021-03-12T19:05:09Z

Thanks @milot-mirdita, this works perfectly.
One last question: we would like the associated sequence DB to contain only the subset of sequences in the slimmed-down cluster DB. If we use createsubdb with the original sequence DB and the slimmer cluster DB, it will only pick the representative sequences. Any tips?

Many thanks!
Antonio

milot-mirdita · 2021-03-15T00:22:04Z

Yes, if you just pass the clustering db to createsubdb, it would pick only the representatives.

Usually I use something like:

mmseqs createsubdb <(cat clu.[0-9]* | tr -d '\000' | cut -f1) seq seq_sub

Here we use tr to remove the NULL bytes between db entries and then cut to return only the first column (which is not strictly necessary, as createsubdb would ignore all other columns anyway). If your database is not split, you can drop the .[0-9]* part.
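To see what the tr and cut pipeline produces, here is a small sketch on a mock data file. The file mock_clu.0 and its contents are made up for illustration; it only assumes that entries are NUL-terminated blocks of newline-separated lines, as described above.

```shell
# Build a mock split data file with two entries (hypothetical keys 1-3 and 4-5).
# Each entry is a block of lines followed by a single NUL byte.
{
  printf '1\n2\n3\n'; printf '\000'
  printf '4\n5\n';    printf '\000'
} > mock_clu.0

# Same pipeline as above: strip the NUL separators, keep the first column.
cat mock_clu.[0-9]* | tr -d '\000' | cut -f1 > member_keys.txt
cat member_keys.txt

# With a real database the key list would then be handed to createsubdb,
# for example (not run here, requires mmseqs):
#   mmseqs createsubdb member_keys.txt seq seq_sub
```

The resulting list contains every member key, not just the first line of each entry, which is why this route picks up all sequences of the slimmed clusters.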

genomewalker · 2021-03-15T21:43:17Z

Worked beautifully! Thanks @milot-mirdita

RobinEllison · 2024-01-24T03:37:06Z

Following up on this issue, I was confused when trying to select only the 10 most divergent sequences in each cluster with the following command:
mmseqs filterresult db db db_clu db_clu_nr --diff 10
What is the behavior when a cluster has fewer than 10 members (some clusters have only 6 or 7)?

milot-mirdita · 2024-01-26T04:07:21Z

filterresult should behave equivalently to HH-suite's hhfilter, with the additional feature that multiple sequence identity thresholds (--qid) can be used for filtering instead of only one.

It should keep all entries, assuming the other filtering parameters are also fulfilled. Note that by default --qsc (minimum score per column) is also enabled and set to -20.

milot-mirdita added a commit that referenced this issue Jul 28, 2020

Add filterresult for pairwise HHblits filtering to reduce redundancy in a result db #316 (06bd0cf)

genomewalker closed this as completed Jul 29, 2020

genomewalker reopened this Mar 12, 2021

genomewalker closed this as completed Mar 15, 2021
