MMSeqs2 DB slimmer #316

Closed
genomewalker opened this issue Jun 4, 2020 · 12 comments

@genomewalker
Contributor

Hi,
this is not an issue but a potential enhancement that we discussed with @martin-steinegger.
We have a seed clustering database that is continuously updated with new sequences. The DB is growing quite fast, and eventually we will have problems storing and distributing it. Because each cluster contains many redundant sequences, we thought that a module that takes a DB and filters it with a criterion similar to --diff from result2msa or result2profile would be very useful for keeping only the informative sequences in each cluster.

Thanks
Antonio

@milot-mirdita
Member

A short update: I started working on this, but I found some potential weirdness in result2msa that I want to look into first before pushing the changes. It should be done in the next few days.

@genomewalker
Contributor Author

Thank you very much, @milot-mirdita. It will be very useful for keeping our DBs slim :-)

@milot-mirdita
Member

I added a filterresult module that does basically the same as result2msa but returns a result database. Hope it's useful :)

@genomewalker
Contributor Author

Awesome! @ChiaraVanni will test it, and we will get back to you in case we find any problems. Many thanks!

@milot-mirdita
Member

We had a bug that martin fixed yesterday in 18588bb.

@genomewalker
Contributor Author

Hi @milot-mirdita,
we finally got our hands on filterresult, and we have a couple of questions about how to proceed after the filtering.
After running:

mmseqs filterresult seqDB seqDB cluDB cluDB-filt --threads 28

we got a few alignment DB files, plus the index and dbtype files. Looking at the alignment DB files, they seem to have the cluster DB format, and the number of entries has decreased substantially. Any suggestions on how to convert the output of filterresult into a cluster DB that we can use for updating? The DB we are trying to slim down is available here: https://ndownloader.figshare.com/files/23066651

Many thanks!
Antonio

@genomewalker genomewalker reopened this Mar 12, 2021
@milot-mirdita
Member

milot-mirdita commented Mar 12, 2021

You can use filterdb --trim-to-one-column to discard all alignment columns in this database and leave only the identifiers.
Then you can overwrite the dbtype file to convince MMseqs2 that this database is a cluster result and not an alignment result:

mmseqs filterdb cluDB-filt cluDB-trim --trim-to-one-column
awk 'BEGIN { printf("%c%c%c%c",6,0,0,0); exit; }' > cluDB-trim.dbtype
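As a quick sanity check of the dbtype step, here is a minimal sketch that writes the same four bytes with printf instead of awk and then dumps them with od. The file name example.dbtype is a placeholder made up for this illustration; the value 6 is simply taken from the awk command above.

```shell
# Hypothetical file name, used only for this illustration.
out=example.dbtype

# Write the four raw bytes 6,0,0,0 (octal escapes), i.e. the integer 6
# stored as a 32-bit little-endian value, as in the awk command above.
printf '\006\000\000\000' > "$out"

# Dump the bytes as unsigned decimals; word splitting normalizes od's spacing.
set -- $(od -An -tu1 "$out")
bytes="$*"
echo "$bytes"
```

On a real database, running the same od command on cluDB-trim.dbtype should show the same four values if the overwrite worked.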

@genomewalker
Contributor Author

Thanks @milot-mirdita, this works perfectly.
One last question: we would like to have the associated sequence DB containing only the subset of sequences in the slimmed-down cluster DB. If we use createsubdb with the original sequence DB and the slimmer cluster DB, it will only pick the representative sequences. Any tips?

Many thanks!
Antonio

@milot-mirdita
Member

milot-mirdita commented Mar 15, 2021

Yes, if you just pass the clustering DB to createsubdb, it will pick only the representatives.

Usually I use something like:

mmseqs createsubdb <(cat clu.[0-9]* | tr -d '\000' | cut -f1) seq seq_sub

Here we use tr to remove the NUL bytes between DB entries and then cut to return only the first column (which is not strictly necessary, as createsubdb would ignore all other columns anyway). If your database is not split, you can drop the .[0-9]* part.
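To see what the tr/cut part of that pipeline does without touching a real database, here is a small sketch on a mock data file. The file name demo_clu.0 and the IDs are made up for illustration, and the layout (tab-separated lines, each entry terminated by a NUL byte) is an assumption based on the description above.

```shell
# Scratch directory so the glob only matches the mock file.
tmp=$(mktemp -d)

# Mock split data file (hypothetical name and IDs):
# two entries, each terminated by a NUL byte.
printf 'seqA\tx\nseqB\ty\n\000seqC\tz\n\000' > "$tmp/demo_clu.0"

# Same steps as in the command above: concatenate the splits,
# strip the NUL separators, keep only the first column.
ids=$(cat "$tmp"/demo_clu.[0-9]* | tr -d '\000' | cut -f1)
echo "$ids"
```

This prints seqA, seqB and seqC on separate lines. Note that the <( ... ) process substitution used in the createsubdb command needs a shell such as bash or zsh; it is not available in plain POSIX sh.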

@genomewalker
Contributor Author

Worked beautifully! Thanks @milot-mirdita

@RobinEllison

Following up on this issue: I got confused when trying to select only the 10 most divergent sequences in each cluster with the following command:
mmseqs filterresult db db db_clu db_clu_nr --diff 10
What is the behavior when a cluster has fewer than 10 members (some clusters have only 6 or 7)?

@milot-mirdita
Member

filterresult should behave equivalently to HH-suite's hhfilter, with the additional feature that multiple sequence identity thresholds (--qid) can be used for filtering instead of only one.

In such clusters it should keep all entries, assuming the other filtering parameters are also fulfilled. Note that --qsc (minimum score per column) is also enabled by default and set to -20.
