obonet cannot correctly parse the definition terms in obo files #107
Labels
Version
Related To
Priority
Low. Currently it's not an issue. Maybe an issue in the future.
Problem
The def field of an obo ontology has the format <def_string> [<dbxref>]. See GO.format.obo-1_4.html#S.2.2.Library
obonet reads such a field as one whole string instead of separating the definition from its dbxref list. E.g. '"A ribonucleoprotein complex that contains an RNA molecule ..." [GOC:sgd_curators, PMID:10690410, PMID:14729943, PMID:7510714]'
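To make the two parts of the format concrete, here is a minimal sketch of splitting such a raw string into the definition text and the dbxref list. It is not what obonet or our parser does; the name split_def and the regular expression are hypothetical, and the sketch assumes the value has exactly the shape "<def_string>" [<dbxref>, ...] with no trailing modifiers or comments.

```python
import re
from typing import List, Tuple

# Expected shape: "<def_string>" [<dbxref>, <dbxref>, ...]
# The quoted part may contain backslash-escaped characters such as \".
_DEF_PATTERN = re.compile(r'^"((?:[^"\\]|\\.)*)"\s*\[(.*)\]\s*$', re.DOTALL)


def split_def(raw: str) -> Tuple[str, List[str]]:
    """Split a raw OBO def value into (definition text, list of dbxrefs).

    Hypothetical helper for illustration only.
    """
    match = _DEF_PATTERN.match(raw.strip())
    if match is None:
        # Not in the expected shape: return the input unchanged, no xrefs.
        return raw, []
    text, xref_block = match.groups()
    # Simplification: split on plain commas. Escaped commas and quoted
    # dbxref descriptions, which the OBO format allows, are not handled.
    xrefs = [x.strip() for x in xref_block.split(",") if x.strip()]
    return text, xrefs


raw = ('"A ribonucleoprotein complex that contains an RNA molecule ..." '
       '[GOC:sgd_curators, PMID:10690410, PMID:14729943, PMID:7510714]')
text, xrefs = split_def(raw)
```

With the example above, text holds the definition without the surrounding quotes and xrefs holds the four identifiers. An empty list such as [] yields an empty xrefs list, so the same helper would cover the ChEBI case as well.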
However, the def fields within the current ChEBI obo file all have empty <dbxref> lists. Our current implementation trims them from the string values of def fields. E.g. '"A macrocyclic lactone with a ring of twelve or more members derived from a polyketide." []'
will be trimmed to 'A macrocyclic lactone with a ring of twelve or more members derived from a polyketide.'
Note that the quotes inside will also be removed.
Our current implementation cannot handle any def field with a non-empty <dbxref> list.
Solution
pronto is another library to read obo files. It is more heavyweight yet low-level. It has a clear class hierarchy but is not well documented. An alternative implementation to the OntologyReader in chebi_parser.py is ProntoOntologyReader.py.
Performance-wise:
pronto is about 4 times slower than obonet. With rel201/chebi_lite.obo, ProntoOntologyReader uses ~150 seconds to load the file and generate all 146,183 documents, while our implementation with obonet uses only ~30 seconds.
pronto uses slightly more memory than obonet.
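For anyone who wants to repeat the comparison, here is a rough sketch of how load time and peak memory could be measured for each reader. It is not the script used for the numbers above; measure is a hypothetical helper, and the loader passed to it stands for whatever function wraps the obonet-based or pronto-based reader.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(loader: Callable[[str], Any], path: str) -> Tuple[float, int]:
    """Return (elapsed seconds, peak traced bytes) for one call of loader(path).

    Hypothetical helper. tracemalloc only tracks allocations made through
    Python's allocator and slows execution down, so treat the timings as
    relative rather than absolute.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        loader(path)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak


# Stand-in loader so the sketch runs on its own; replace it with a function
# that loads the obo file through the reader under test.
elapsed, peak = measure(lambda p: [0] * 100_000, "rel201/chebi_lite.obo")
```

Running each reader in a fresh process would give cleaner memory figures, since objects left over from a previous run can skew the peak.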
We can also watch for updates to obonet on this issue.