Fix to Issue#83 #106

Merged (12 commits) on Aug 3, 2021

erikyao (Contributor) commented Jul 19, 2021

Summary

This PR mainly updates ChEBI parser and will fix issue#83.

Key agreements we reached on reading the ontology fields:

  1. The ontology network should be built only upon the is_a relationships
  2. ChEBI ids found in the ontology network that have no chemical/compound fields should also be indexed
  3. Set a threshold on the number of successors/predecessors/descendants/ancestors
    • We set it to 2,000 in our code because the 99.9% quantile of the number of descendants is 2,187.12: only 143 nodes have more descendants than that (up to 141,138).
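
The agreements above can be illustrated with a minimal sketch. It is not the code from this PR: the function names and the toy ids are hypothetical, and only the standard library is used. In particular, the summary does not say what happens to a node whose count exceeds the threshold, so the sketch assumes that the count is kept and the id list is omitted. Empty lists are also omitted, which matches the imatinib sample, where `num_children` is 0 and no `children` key appears.

```python
from collections import defaultdict, deque

THRESHOLD = 2000  # value quoted in the PR summary


def build_is_a_graph(edges):
    """Build parent and child adjacency maps from (child, parent) is_a pairs."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def reachable(start, adjacency):
    """Return every node reachable from start via adjacency, start excluded."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # guard in case the input contains a cycle
    return seen


def ontology_fields(node, parents, children, threshold=THRESHOLD):
    """Compute the num_* counts and id lists for one node (hypothetical helper)."""
    groups = {
        "children": children.get(node, set()),
        "parents": parents.get(node, set()),
        "descendants": reachable(node, children),
        "ancestors": reachable(node, parents),
    }
    fields = {}
    for name, ids in groups.items():
        fields[f"num_{name}"] = len(ids)
        # ASSUMPTION: a list is emitted only when it is non-empty and its
        # length does not exceed the threshold; the count is always emitted.
        if 0 < len(ids) <= threshold:
            fields[name] = sorted(ids)
    return fields


# Toy is_a edges (child, parent); these are not real ChEBI ids.
parents, children = build_is_a_graph([("B", "A"), ("C", "B"), ("D", "B")])
fields = ontology_fields("A", parents, children)
```

Only `is_a` edges go into the adjacency maps, so other relationship types (such as `has_role`) never affect the counts.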

Code structure of the new ChEBI parser

The original code was refactored into a CompoundReader (which reads the sdf file for chemical structure fields). A new OntologyReader is added to read the obo file for ontology fields.

A new ChebiParser is created to hold one instance for each of the above two Reader classes and generate ChEBI documents in the following order:

  1. Iterate over the sdf file. For each ChEBI id i, generate a compound document.
    • If the ChEBI id i also exists in the ontology network, generate an ontology document as well and merge it with the compound document to form the final ChEBI document.
    • If not, simply return the compound document as the ChEBI document.
  2. When the iteration is finished, find the leftover ChEBI ids in the ontology network and generate their ontology documents as their ChEBI documents.
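
The generation order described above can be sketched as a single generator. This is an illustration under stated assumptions, not the actual `ChebiParser`: the function name, the argument shapes, and the toy ids are hypothetical, and the two readers are replaced by plain Python containers.

```python
def generate_chebi_documents(compound_docs, ontology_docs):
    """Yield (chebi_id, document) pairs in the two-pass order described above.

    compound_docs: iterable of (chebi_id, dict) pairs, standing in for the
        pass over the sdf file (hypothetical shape).
    ontology_docs: dict mapping chebi_id -> dict, standing in for the
        ontology network (hypothetical shape).
    """
    seen = set()
    # Pass 1: every id in the sdf file yields a compound document,
    # merged with its ontology document when one exists.
    for chebi_id, compound in compound_docs:
        seen.add(chebi_id)
        ontology = ontology_docs.get(chebi_id)
        if ontology is None:
            yield chebi_id, dict(compound)
        else:
            yield chebi_id, {**compound, **ontology}
    # Pass 2: leftover ids exist only in the ontology network.
    for chebi_id, ontology in ontology_docs.items():
        if chebi_id not in seen:
            yield chebi_id, dict(ontology)


# Toy inputs; the ids are placeholders, not real ChEBI ids.
compounds = [("X:1", {"name": "a", "smiles": "C"}), ("X:2", {"name": "b"})]
ontology = {"X:1": {"num_parents": 1}, "X:3": {"num_children": 2}}
documents = list(generate_chebi_documents(compounds, ontology))
```

Using `ontology_docs.get(...)` rather than indexing avoids a `KeyError` for an id that appears in the sdf file but not in the obo file, which is one of the cases mentioned in the commit message further down.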

New entries in mapping

"num_children": {
    "type": "integer"
},
"children": {
    "type": "text"
},
"num_parents": {
    "type": "integer"
},
"parents": {
    "type": "text"
},
"num_descendants": {
    "type": "integer"
},
"descendants": {
    "type": "text"
},
"num_ancestors": {
    "type": "integer"
},
"ancestors": {
    "type": "text"
},
"relationship": {
    "properties": {
        "has_functional_parent": {
            "type": "text"
        },
        "has_parent_hydride": {
            "type": "text"
        },
        "has_part": {
            "type": "text"
        },
        "has_role": {
            "type": "text"
        },
        "is_conjugate_acid_of": {
            "type": "text"
        },
        "is_conjugate_base_of": {
            "type": "text"
        },
        "is_enantiomer_of": {
            "type": "text"
        },
        "is_substituent_group_from": {
            "type": "text"
        },
        "is_tautomer_of": {
            "type": "text"
        }
    }
}

Document Samples

CHEBI:45783, imatinib (chemical + ontology fields)

{
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_20210719_58fd7iko",
        "_type": "chem",
        "_id": "KTUFNOKKBVMGRW-UHFFFAOYSA-N",
        "_score": 1,
        "_source": {
          "chebi": {
            "id": "CHEBI:45783",
            "secondary_chebi_id": [
              "CHEBI:305376",
              "CHEBI:38918",
              "CHEBI:45781"
            ],
            "definition": "A benzamide obtained by formal condensation of the carboxy group of 4-[(4-methylpiperazin-1-yl)methyl]benzoic acid with the primary aromatic amino group of 4-methyl-N(3)-[4-(pyridin-3-yl)pyrimidin-2-yl]benzene-1,3-diamine. Used (as its mesylate salt) for treatment of chronic myelogenous leukemia and gastrointestinal stromal tumours.",
            "name": "imatinib",
            "relationship": {
              "has_role": [
                "CHEBI:68495",
                "CHEBI:38637",
                "CHEBI:35610"
              ],
              "has_functional_parent": "CHEBI:28179"
            },
            "star": 3,
            "num_children": 0,
            "num_parents": 5,
            "num_descendants": 0,
            "num_ancestors": 43,
            "parents": [
              "CHEBI:46920",
              "CHEBI:26421",
              "CHEBI:22702",
              "CHEBI:33860",
              "CHEBI:39447"
            ],
            "ancestors": [
              "CHEBI:39447",
              "CHEBI:22702",
              "CHEBI:23367",
              ......
            ],
            "inchi": "InChI=1S/C29H31N7O/c1-21-5-10-25(18-27(21)34-29-31-13-11-26(33-29)24-4-3-12-30-19-24)32-28(37)23-8-6-22(7-9-23)20-36-16-14-35(2)15-17-36/h3-13,18-19H,14-17,20H2,1-2H3,(H,32,37)(H,31,33,34)",
            "inchikey": "KTUFNOKKBVMGRW-UHFFFAOYSA-N",
            "smiles": "CN1CCN(Cc2ccc(cc2)C(=O)Nc2ccc(C)c(Nc3nccc(n3)-c3cccnc3)c2)CC1",
            "formulae": "C29H31N7O",
            ......
   ......
}

CHEBI:25106, macrolide (ontology fields only)

{
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_20210719_58fd7iko",
        "_type": "chem",
        "_id": "CHEBI:25106",
        "_score": 1,
        "_source": {
          "chebi": {
            "id": "CHEBI:25106",
            "definition": "A macrocyclic lactone with a ring of twelve or more members derived from a polyketide.",
            "name": "macrolide",
            "star": 3,
            "num_children": 217,
            "num_parents": 2,
            "num_descendants": 383,
            "num_ancestors": 27,
            "children": [
              "CHEBI:600103",
              "CHEBI:77757",
              "CHEBI:77759",
              ......
            ],
            "parents": [
              "CHEBI:26188",
              "CHEBI:63944"
            ],
            "descendants": [
              "CHEBI:94551",
              "CHEBI:2697",
              "CHEBI:2841",
              ......
            ],
            "ancestors": [
              "CHEBI:38104",
              "CHEBI:25000",
              "CHEBI:5686",
              ......
            ]
          }
        }
      }
    ]
  }
}
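
In the two samples above, the imatinib document uses its InChIKey as `_id`, while the ontology-only macrolide document uses its ChEBI id. The sketch below captures that pattern as inferred from these two samples only; the helper name is hypothetical and the real indexing logic may differ.

```python
def choose_document_id(chebi_fields):
    """Pick an _id the way the two samples suggest (an inference, not confirmed).

    ASSUMPTION: the InChIKey is used when present, otherwise the ChEBI id.
    """
    inchikey = chebi_fields.get("inchikey")
    return inchikey if inchikey else chebi_fields["id"]


# Values copied from the two samples above.
imatinib_id = choose_document_id(
    {"id": "CHEBI:45783", "inchikey": "KTUFNOKKBVMGRW-UHFFFAOYSA-N"}
)
macrolide_id = choose_document_id({"id": "CHEBI:25106"})
```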

@erikyao erikyao requested review from andrewsu and newgene July 19, 2021 07:46
…; (2) there exists a ChEBI id in the SDF file but not in the obo file, causing KeyError when reading ontology; (3) intermediate compound documents' ids are lists of 1 string, not single strings
@erikyao erikyao changed the title Fix Issue#83 Fix to Issue#83 Jul 19, 2021

newgene (Member) left a comment

Looks good. Just two minor comments, we can then merge.

requirements_hub.txt (review thread, outdated and resolved)

andrewsu (Member) left a comment

My one minor comment aside, this PR looks good to me

@erikyao erikyao requested a review from newgene August 3, 2021 17:33

erikyao (Contributor, Author) commented Aug 3, 2021

Please ignore the failed checks for now. They are related to how Biothings API 0.10.x changes things around.

@erikyao erikyao merged commit 1f1d79a into biothings:master Aug 3, 2021

Successfully merging this pull request may close these issues.

missing ChEBI records?