Too loose RegExp interpretation by TaxonomyExpression? #320

Open
elileka opened this issue Jun 20, 2020 · 1 comment

elileka commented Jun 20, 2020

Expected Behavior

Expected the following two commands to result in the same database:

mmseqs filtertaxseqdb swissProtSomeClasses test1 --taxon-list '4891||1075807||147549'

and

mmseqs filtertaxseqdb swissProtSomeClasses test2 --taxon-list '4891,1075807,147549'

Also expected the following command to fail or emit a warning:

mmseqs filtertaxseqdb swissProtSomeClasses test3 --taxon-list '489!@!@1075807****147549'

Current Behavior

The first two commands result in two different databases:

wc -l test1
15447 test1

wc -l test2
32 test2

The third command runs without issuing any warning (it effectively does nothing to the database).

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

  1. Download a small NCBI-like taxonomy database swissProtSomeClasses from here
  2. Run the commands above

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

filtertaxseqdb metaeuk-regression-master/sacc_tax/swissProtSomeClasses test1 --taxon-list 4891||1075807||147549 

MMseqs Version:	e2510e8f6911e4340c62989aa9ba1b9c58e18d60
Compressed   	0
Selected taxa	4891||1075807||147549
Subdb mode   	0
Threads      	8
Verbosity    	3

Loading NCBI taxonomy
Loading nodes file ... Done, got 13938 nodes
Loading merged file ... Done, added 0 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done

and

filtertaxseqdb metaeuk-regression-master/sacc_tax/swissProtSomeClasses test2 --taxon-list 4891,1075807,147549 

MMseqs Version:	e2510e8f6911e4340c62989aa9ba1b9c58e18d60
Compressed   	0
Selected taxa	4891,1075807,147549
Subdb mode   	0
Threads      	8
Verbosity    	3

Loading NCBI taxonomy
Loading nodes file ... Done, got 13938 nodes
Loading merged file ... Done, added 0 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done

Context

Providing context helps us come up with a solution and improve our documentation for the future.

The help text for modules that accept --taxon-list says that comma-separated values are allowed:
--taxon-list STR Taxonomy ID, possibly multiple values separated by ',' []
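
To make the expected equivalence concrete, here is a minimal Python sketch of the interpretation assumed in this report: a comma-separated list is read as the same selection as the IDs joined with `||`. The function name `normalize_taxon_list` is hypothetical and is not part of MMseqs2; this illustrates the expectation only, not how MMseqs2 actually parses the option.

```python
def normalize_taxon_list(taxon_list: str) -> str:
    """Rewrite a comma-separated taxon list as an '||' expression.

    Assumption (from the expected behavior in this report, not from the
    MMseqs2 source): ',' between taxon IDs means the same as '||'.
    """
    # Split on commas, trim surrounding whitespace, and drop empty fields.
    parts = [part.strip() for part in taxon_list.split(",") if part.strip()]
    return "||".join(parts)


if __name__ == "__main__":
    print(normalize_taxon_list("4891,1075807,147549"))
```

Under this reading, test1 and test2 above would be built from the same expression and should therefore contain the same entries.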

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): e2510e8
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: cmake version 3.5.1, c++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
  • Operating system and version: ubuntu1~16.04.12
@milot-mirdita

MMseqs2 does something more sensible with ',' in --taxon-list. No idea what to do about the validation step, though. We could add a regex that checks for only sensible operators and numbers?
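
A rough sketch of what such a regex check could look like, written in Python for brevity rather than as a patch against the MMseqs2 sources. The accepted operator set (`||`, `&&`, `,`, and a leading `!` for negation) is an assumption made for illustration, and `is_valid_taxon_list` is a hypothetical helper name; parentheses are deliberately not handled.

```python
import re

# Assumed grammar (hypothetical, for illustration only): taxon IDs are
# decimal integers, each optionally negated with '!', joined by '||',
# '&&', or ','. Whitespace around IDs and operators is tolerated.
_TAXON_EXPR = re.compile(r"\s*!?\d+(?:\s*(?:\|\||&&|,)\s*!?\d+)*\s*")


def is_valid_taxon_list(expr: str) -> bool:
    """Return True if expr contains only taxon IDs and assumed operators."""
    return _TAXON_EXPR.fullmatch(expr) is not None


if __name__ == "__main__":
    for candidate in (
        "4891||1075807||147549",
        "4891,1075807,147549",
        "489!@!@1075807****147549",
    ):
        print(candidate, is_valid_taxon_list(candidate))
```

With a check along these lines, the first two expressions from the report would be accepted and the third would be rejected before any filtering is attempted.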
