obonet is incapable to parse the definition terms correctly in obo files #107

Open
erikyao opened this issue Jul 20, 2021 · 2 comments

erikyao commented Jul 20, 2021

Version

obonet==0.3.0

Related To

Priority

Low. Currently it's not an issue. Maybe an issue in the future.

Problem

The def field in an OBO ontology has the format <def_string> [<dbxref>]. See GO.format.obo-1_4.html#S.2.2.

obonet does not split such a field; it reads the whole value as a single string. E.g.

'"A ribonucleoprotein complex that contains an RNA molecule ..." [GOC:sgd_curators, PMID:10690410, PMID:14729943, PMID:7510714]'

However, the def fields in the current ChEBI obo file all have empty <dbxref> lists. Our current implementation trims these empty lists from the string values of the def fields. E.g.

 '"A macrocyclic lactone with a ring of twelve or more members derived from a polyketide." []'

will be trimmed to

'A macrocyclic lactone with a ring of twelve or more members derived from a polyketide.'

Note that the enclosing double quotes are also removed.
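The trimming described above could look roughly like the following. This is a minimal sketch for illustration, not the actual code in chebi_parser.py; the function name `trim_empty_xref_def` is hypothetical, and escaped quotes inside the definition text are not handled.

```python
def trim_empty_xref_def(raw_def: str) -> str:
    """Strip a trailing empty dbxref list and the enclosing double quotes.

    Hypothetical helper for illustration. Only the `"<def_string>" []`
    shape is handled; any other input is returned unchanged.
    """
    text = raw_def.strip()
    if not text.endswith("[]"):
        return raw_def  # non-empty dbxref list: not handled here
    # Drop the empty "[]" and the whitespace that precedes it.
    text = text[:-2].rstrip()
    # Drop the double quotes that wrap the definition text.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    return text
```

A value with a non-empty list, such as the GO example above, passes through unchanged.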

Our current implementation cannot handle any def field with a non-empty <dbxref> list.
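One possible way to cover the non-empty case is to split the raw string into the definition text and a list of dbxrefs. The sketch below is an assumption, not part of our implementation: `split_def` is a hypothetical name, and it deliberately ignores parts of the OBO syntax such as quoted dbxref descriptions, escaped commas, and trailing modifiers.

```python
import re
from typing import List, Tuple

# Matches: "<def_string>" [<dbxref>, <dbxref>, ...]
# Backslash escapes (e.g. \") are allowed inside the quoted text.
_DEF_PATTERN = re.compile(
    r'^\s*"(?P<text>(?:[^"\\]|\\.)*)"\s*\[(?P<xrefs>.*)\]\s*$',
    re.DOTALL,
)


def split_def(raw_def: str) -> Tuple[str, List[str]]:
    """Split a raw def value into (definition text, list of dbxref strings).

    Hypothetical helper: dbxrefs are split naively on commas, so quoted
    descriptions or escaped commas inside a dbxref are not supported.
    """
    match = _DEF_PATTERN.match(raw_def)
    if match is None:
        raise ValueError(f"unrecognized def value: {raw_def!r}")
    # Unescape \" inside the definition text.
    text = match.group("text").replace('\\"', '"')
    xrefs = [x.strip() for x in match.group("xrefs").split(",") if x.strip()]
    return text, xrefs
```

An empty `[]` yields an empty list, so the ChEBI case and the GO case would go through the same code path.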

Solution

pronto is another library for reading obo files. It is more heavyweight than obonet, yet lower-level. It has a clear class hierarchy but is not well documented. An alternative implementation to the OntologyReader in chebi_parser.py is ProntoOntologyReader.py.

Performance-wise:

  • pronto is about 4 times slower than obonet
    • E.g. with rel201/chebi_lite.obo, ProntoOntologyReader takes ~150 seconds to load the file and generate all 146,183 documents, while our obonet-based implementation takes only ~30 seconds.
  • pronto uses slightly more memory than obonet
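For anyone who wants to repeat this kind of comparison, a small stdlib-only harness could look like the sketch below. It is not how the numbers above were obtained; `measure` is a hypothetical helper, and `tracemalloc` only sees Python-level allocations and adds its own overhead, so time and memory are better measured in separate runs.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(loader: Callable[[], Any]) -> Tuple[Any, float, int]:
    """Run `loader` once; return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = loader()
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Example usage (assumes obonet is installed and the file exists locally):
#   import obonet
#   graph, seconds, peak_bytes = measure(
#       lambda: obonet.read_obo("rel201/chebi_lite.obo")
#   )
```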

We can also watch for obonet updates that address this issue.

@erikyao erikyao self-assigned this Jul 20, 2021
@erikyao erikyao changed the title from "obonet is incapable to parse the definition terms in obo files" to "obonet is incapable to parse the definition terms correctly in obo files" Jul 20, 2021

newgene commented Aug 21, 2023

@DylanWelzel obonet is now on v1.0.0. Let's re-evaluate whether we still need to keep our local fix for this issue.

@DylanWelzel

obonet v1.0.0 still does not correctly parse the def field. The local fix will stay but I've updated the obonet version to v1.0.0 in the requirements.
