Data source: repoDB #77

Closed
erikyao opened this issue Jul 29, 2022 · 20 comments
Labels
api deployment done data source Data source pending to create a new API On Test Match https://github.com/biothings/biothings_explorer/labels x-bte

Comments

@erikyao
Contributor

erikyao commented Jul 29, 2022

Requirement originally discussed in: smartAPI - Issue#85

Plugin repo: https://github.com/erikyao/repoDB

Bug description: For the reason explained in this comment, the parser previously (back in 2020) relied on MyChem to map drugbank.id => drugbank.name. However, since 2021, MyChem no longer provides DrugBank data (see https://docs.mychem.info/en/latest/doc/data_source.html#drugbank).

Solution: find another API for drugbank.id => drug_name queries, or pre-process the data file full.csv

@erikyao erikyao self-assigned this Jul 29, 2022
@rjawesome
Collaborator

This CSV Drugbank Vocabulary seems to be open source and contains DrugBank ID to name data. It also contains names for some of the IDs you were not able to find in MyChem (e.g., DB12430).
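A minimal sketch of the drugbank.id => name lookup that such a CSV would enable. The column names ("DrugBank ID", "Common name"), the function name, and the two-row sample are assumptions made for illustration; the real vocabulary file and the parser in the plugin repo may differ.

```python
import csv
import io

# Tiny in-memory stand-in for the vocabulary CSV (illustrative rows only).
# The header names below are assumed, not taken from this thread.
SAMPLE_VOCABULARY = """DrugBank ID,Common name
DB00002,Cetuximab
DB00073,Rituximab
"""


def load_drugbank_names(handle, id_col="DrugBank ID", name_col="Common name"):
    """Build a {drugbank_id: name} dict from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return {row[id_col]: row[name_col] for row in reader}


names = load_drugbank_names(io.StringIO(SAMPLE_VOCABULARY))
print(names.get("DB00002"))
```

With a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`. Using `dict.get` lets the parser decide what to do with IDs that are missing from the vocabulary.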

@erikyao
Contributor Author

erikyao commented Jul 29, 2022

Thank you @rjawesome for the information! That CSV would definitely help!

@rjawesome
Collaborator

I can also make a PR for this on the parser if you want...

@erikyao
Contributor Author

erikyao commented Jul 29, 2022

Sure, @rjawesome, I appreciate your help!

@rjawesome
Collaborator

See this pr

@erikyao
Contributor Author

erikyao commented Jul 29, 2022

Thank you, @rjawesome! Yep I realized that injective relation is enough for "one-to-one"...

@colleenXu

Don't know if this needs SmartAPI annotation...

@erikyao
Contributor Author

erikyao commented Oct 3, 2022

Don't know if this needs SmartAPI annotation...

Hi @colleenXu, this is a bug fix to the old repoDB API. It should have been annotated before.

@colleenXu

If it does, it's likely very old. It's not incorporated into BTE at the moment.

@andrewsu
Member

andrewsu commented Oct 5, 2022

Let's use this as an opportunity to add a SmartAPI annotation for BTE integration. I'm going to reopen the ticket, unassign @erikyao and @rjawesome, and add it to the "Needs SmartAPI / BTE annotation" section of our project tracker...

@andrewsu andrewsu reopened this Oct 5, 2022
@andrewsu andrewsu changed the title from "Data source: repoDB bugfix" to "Data source: repoDB" Aug 23, 2023
@andrewsu
Member

example record https://biothings.ncats.io/repodb/chemical/DB14707 :

{
  "_id": "DB14707",
  "_version": 1,
  "repodb": {
    "drugbank": "DB14707",
    "indications": [
      {
        "NCT": "NA",
        "detailed_status": "NA",
        "name": "Squamous cell carcinoma",
        "phase": "NA",
        "status": "Approved",
        "umls": "C0007137"
      }
    ],
    "name": "Cemiplimab"
  }
}

@colleenXu colleenXu added x-bte data source Data source pending to create a new API labels Nov 1, 2023
@colleenXu

colleenXu commented Dec 8, 2023

Related infores stuff is ready:

@colleenXu

colleenXu commented Dec 22, 2023

Here's the SmartAPI yaml w/ x-bte annotation for BioThings repoDB. This yaml is registered in SmartAPI Registry.

I haven't made a PR to add this to BTE's regular use (for the config file, API_LIST variable): I'm waiting until we're closer to the next release cycle to make a PR with all the KPs we want to add.

Example query

Send a POST request to the API-specific endpoint (BioThings repoDB only), such as http://localhost:3000/v1/smartapi/1138c3297e8e403b6ac10cff5609b319/query. This works even when the KP isn't included in BTE's config.

Put this in the request body. It queries with the drug Cetuximab (aka DRUGBANK:DB00002):

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["DRUGBANK:DB00002"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}
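For scripting the same query, here is a rough Python sketch that builds the request body above and wraps it in a POST request with the standard library. The helper names are hypothetical, and the actual send is left commented out because it assumes a BTE instance running locally at the URL from this comment.

```python
import json
import urllib.request

# Local API-specific endpoint quoted in the comment above.
BTE_URL = "http://localhost:3000/v1/smartapi/1138c3297e8e403b6ac10cff5609b319/query"


def build_trapi_query(drug_curie):
    """Return the one-hop drug -> disease 'treats' query graph shown above."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [drug_curie], "categories": ["biolink:SmallMolecule"]},
                    "n1": {"categories": ["biolink:Disease"]},
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:treats"],
                    }
                },
            }
        }
    }


def build_request(url, query):
    """Wrap the query in a JSON POST request without sending it."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = build_trapi_query("DRUGBANK:DB00002")
request = build_request(BTE_URL, query)

# Sending requires a running local BTE, so it is not executed here:
# with urllib.request.urlopen(request) as resp:
#     response = json.load(resp)
```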

You should get a response with this edge (from this record in the BioThings API, based on this operation's example):

  • subject: Cetuximab (primary ID in SRI NodeNorm PUBCHEM.COMPOUND:14122979, DRUGBANK ID in the BioThings API is DB00002)
  • object: Malignant tumor of colon (primary ID in SRI NodeNorm MONDO:0021063, UMLS ID in BioThings API is C0007102)
                "c50bcf1f5d6c4c55c44535cc3e9c49d2": {
                    "predicate": "biolink:treats",
                    "subject": "PUBCHEM.COMPOUND:14122979",
                    "object": "MONDO:0021063",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:repodb",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:biothings-repodb",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:repodb"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biothings-repodb"
                            ]
                        }
                    ]
                }
            }
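One way to check the provenance on an edge like this is to walk the `sources` list from the primary knowledge source outward through `upstream_resource_ids`. The sketch below is only an illustration of reading that structure; the function name is hypothetical and nothing here is part of BTE.

```python
# The "sources" block copied from the edge above.
EDGE = {
    "predicate": "biolink:treats",
    "subject": "PUBCHEM.COMPOUND:14122979",
    "object": "MONDO:0021063",
    "sources": [
        {"resource_id": "infores:repodb", "resource_role": "primary_knowledge_source"},
        {
            "resource_id": "infores:biothings-repodb",
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:repodb"],
        },
        {
            "resource_id": "infores:service-provider-trapi",
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:biothings-repodb"],
        },
    ],
}


def provenance_chain(edge):
    """Order resource IDs from the primary source to the outermost aggregator."""
    sources = edge.get("sources", [])
    primary = [
        s["resource_id"]
        for s in sources
        if s.get("resource_role") == "primary_knowledge_source"
    ]
    if len(primary) != 1:
        raise ValueError("expected exactly one primary knowledge source")

    # Map each upstream resource to the resource that cites it.
    downstream = {}
    for s in sources:
        for upstream_id in s.get("upstream_resource_ids", []):
            downstream[upstream_id] = s["resource_id"]

    chain = [primary[0]]
    # The length check guards against a malformed, cyclic sources list.
    while chain[-1] in downstream and len(chain) < len(sources):
        chain.append(downstream[chain[-1]])
    return chain


chain = provenance_chain(EDGE)
print(" -> ".join(chain))
```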

@colleenXu

However, I have some observations / possible next steps:

  1. Does this API use the latest data from repoDB? I noticed a data update (v2.1, 2023-06-15) in the version history section of the repoDB website.
  2. What does the field value "NA" mean? If it basically means "not available/applicable", I'd find it helpful if the parser removed the fields with "NA" values. That way, BTE would be able to use these fields without post-processing to remove "NA". "NA" is a common value for these fields.
  3. I think changing the parser to create association-centric data (unique combos of drug-disease-status) rather than drug-centric data (current) would be helpful, particularly for the upcoming "treats" refactor. Currently, there are problems retrieving info when querying with the disease ID ("reverse" operations). This is related to biothings_explorer#316 (for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations) and biothings_explorer#727 (comment) (tune the use of AEOLUS indications from mychem.info).

Mockup of what association-centric data may look like

Right now, there's 1 record for the drug Rituximab.

It'd be transformed into multiple records, 1 for each combo of rituximab + unique disease + unique status.

So for rituximab + "Lymphoma, Non-Hodgkin" C0024305, there'd be 3 records (3 different statuses). I didn't include all the info for the "Terminated" record since there are currently 18 objects/clinical trials in the data.

[
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Approved"
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Terminated",
    "clinical_trial_info": [
      {
        "NCT": "NCT00057343",
        "phase": "Phase 3"
      },
      {
        "NCT": "NCT00057447",
        "detailed_status": "administrative reasons",
        "phase": "Phase 1/Phase 2"
      },
      ....
    ]
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Withdrawn",
    "clinical_trial_info": [
      {
        "NCT": "NCT02408042",
        "phase": "Phase 1/Phase 2"
      }
    ]
  }
]
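To make the mockup concrete, here is a rough sketch of how a drug-centric record could be regrouped into association-centric records while dropping "NA" fields. The input is a partial record reconstructed from the mockup values for illustration (not the full Rituximab record), the output field names follow the mockup, and the function names are hypothetical. It is not the actual parser.

```python
NA = "NA"
TRIAL_FIELDS = ("NCT", "detailed_status", "phase")

# Partial drug-centric record, reconstructed from the mockup for illustration.
DRUG_DOC = {
    "_id": "DB00073",
    "repodb": {
        "drugbank": "DB00073",
        "name": "rituximab",
        "indications": [
            {"NCT": "NA", "detailed_status": "NA", "phase": "NA", "status": "Approved",
             "name": "Lymphoma, Non-Hodgkin", "umls": "C0024305"},
            {"NCT": "NCT00057343", "detailed_status": "NA", "phase": "Phase 3",
             "status": "Terminated", "name": "Lymphoma, Non-Hodgkin", "umls": "C0024305"},
            {"NCT": "NCT00057447", "detailed_status": "administrative reasons",
             "phase": "Phase 1/Phase 2", "status": "Terminated",
             "name": "Lymphoma, Non-Hodgkin", "umls": "C0024305"},
            {"NCT": "NCT02408042", "detailed_status": "NA", "phase": "Phase 1/Phase 2",
             "status": "Withdrawn", "name": "Lymphoma, Non-Hodgkin", "umls": "C0024305"},
        ],
    },
}


def strip_na(fields):
    """Drop keys whose value is the "NA" placeholder."""
    return {key: value for key, value in fields.items() if value != NA}


def to_association_records(doc):
    """Emit one record per unique (drug, disease, status) combination."""
    repodb = doc["repodb"]
    grouped = {}
    for ind in repodb["indications"]:
        key = (ind["umls"], ind["status"])
        record = grouped.setdefault(key, {
            "drug_drugbank_id": repodb["drugbank"],
            "drug_name": repodb["name"],
            "indication_umls": ind["umls"],
            "indication_name": ind["name"],
            "status": ind["status"],
        })
        trial = strip_na({k: ind[k] for k in TRIAL_FIELDS if k in ind})
        if trial:  # skip entries where every trial field was "NA"
            record.setdefault("clinical_trial_info", []).append(trial)
    return list(grouped.values())


records = to_association_records(DRUG_DOC)
for record in records:
    print(record["status"], len(record.get("clinical_trial_info", [])))
```

Grouping on (UMLS ID, status) keeps every constraint a disease-side query cares about inside a single top-level record, which is the property the nested drug-centric structure lacks.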

Problem 1: When I query with a disease ID and a specific indication status, I can't retrieve only the hits where both constraints are true in the same nested object.

Related to biothings/biothings_explorer#727 (comment)

For example, I can try querying for the indication C0032797 (Postpartum Hemorrhage) and I want only drugs where the indication status isn't approved:

curl --location --globoff 'https://biothings.ncats.io/repodb/query?size=1000&fields=repodb.indications%2Crepodb.drugbank%2Crepodb.name&jmespath=repodb.indications%7C[%3F(status%3D%3D%60Terminated%60%7C%7Cstatus%3D%3D%60Withdrawn%60%7C%7Cstatus%3D%3D%60Suspended%60)]' \
--header 'Content-Type: application/json' \
--data '{
    "q": "C0032797",
    "scopes":"repodb.indications.umls"
}'
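The percent-encoded jmespath parameter in that URL is hard to read, so here is a small sketch that decodes it and rebuilds it from a list of statuses. The helper name is hypothetical. As I read the request, the decoded expression filters the nested indications on status alone, while the UMLS ID is matched separately through q and scopes.

```python
from urllib.parse import quote, unquote

# The jmespath parameter value exactly as it appears in the curl URL above.
ENCODED_JMESPATH = (
    "repodb.indications%7C[%3F(status%3D%3D%60Terminated%60"
    "%7C%7Cstatus%3D%3D%60Withdrawn%60"
    "%7C%7Cstatus%3D%3D%60Suspended%60)]"
)


def build_status_filter(statuses):
    """Build the 'field|[?(...)]' expression for a list of status values."""
    clauses = "||".join(f"status==`{status}`" for status in statuses)
    return f"repodb.indications|[?({clauses})]"


decoded = unquote(ENCODED_JMESPATH)
rebuilt = build_status_filter(["Terminated", "Withdrawn", "Suspended"])

# Brackets and parentheses are left unencoded to match the original URL.
encoded_again = quote(rebuilt, safe="[]()")

print(decoded)
```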

I'll get hits like this in the response, which show that the indication matched but the status didn't. At the moment, we don't have BTE post-processing to recognize and remove hits like this: BTE will use them for answer edges even though they didn't actually match what I wanted.

    {
        "query": "C0032797",
        "_id": "DB00353",
        "_score": 8.514726,
        "repodb": {
            "drugbank": "DB00353",
            "indications": [],
            "name": "Methylergometrine"
        }
    },
    {
        "query": "C0032797",
        "_id": "DB00429",
        "_score": 8.514726,
        "repodb": {
            "drugbank": "DB00429",
            "indications": [],
            "name": "Carboprost tromethamine"
        }
    },
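A hypothetical post-processing step that would catch hits like these: keep a hit only if a single nested indication object matches both the disease ID and one of the wanted statuses. BTE does not do this today, per the comment above. The sketch assumes the hits carry their unfiltered indications lists, and the two "DB_EXAMPLE" hits and the "UMLS_OTHER" ID are made up for illustration.

```python
NON_APPROVED = frozenset({"Terminated", "Withdrawn", "Suspended"})


def matches_in_same_object(hit, umls, statuses):
    """True if one nested indication has both the UMLS ID and a wanted status."""
    indications = hit.get("repodb", {}).get("indications", [])
    return any(
        ind.get("umls") == umls and ind.get("status") in statuses
        for ind in indications
    )


def filter_hits(hits, umls, statuses=NON_APPROVED):
    """Drop hits where the disease and the status matched in different objects."""
    return [hit for hit in hits if matches_in_same_object(hit, umls, statuses)]


hits = [
    # Shape taken from the response above: nothing left to match on.
    {"_id": "DB00353", "repodb": {"drugbank": "DB00353", "indications": []}},
    # Made-up hit: disease and status match in the same nested object.
    {"_id": "DB_EXAMPLE_1", "repodb": {"indications": [
        {"umls": "C0032797", "status": "Terminated"},
    ]}},
    # Made-up hit: disease and status match, but in different nested objects.
    {"_id": "DB_EXAMPLE_2", "repodb": {"indications": [
        {"umls": "C0032797", "status": "Approved"},
        {"umls": "UMLS_OTHER", "status": "Withdrawn"},
    ]}},
]

kept = filter_hits(hits, "C0032797")
print([hit["_id"] for hit in kept])
```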

Problem 2: When I query with a disease ID, I don't get back info for only that disease ID. So I have to exclude useful info like the disease-name field from the response-mapping

Related to biothings/biothings_explorer#316 (comment)

I can take the rev-disease-drug operation and try to include the disease-name field:

  • add repodb.indications.name to the parameters.field section
  • add input_name: repodb.indications.name to the drug response-mapping

And then test the operation with a local BTE override and a disease ID that SRI NodeNorm doesn't recognize (C0334634, Malignant lymphoma, lymphocytic, intermediate differentiation, diffuse in BioThings repodb)

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["UMLS:C0334634"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            }
        }
    }
}

In the response repodbC0334634.txt, BTE has given that ID the wrong label "Precocious Puberty"...probably because the subquery's first hit has C0034013 "Precocious Puberty" in the first nested object, rather than the disease I asked for.

            "nodes": {
                "UMLS:C0334634": {
                    "categories": [
                        "biolink:Disease"
                    ],
                    "name": "Precocious Puberty",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:xref",
                            "value": [
                                "UMLS:C0334634"
                            ]
                        },
                        {
                            "attribute_type_id": "biolink:synonym",
                            "value": [
                                "UMLS:C0334634"
                            ]
                        }
                    ]
                },
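The mislabeling can be illustrated with a short sketch that contrasts taking the name from the first nested object with looking up the nested object whose UMLS ID matches the query. This is not how BTE's response-mapping is implemented; the function names and the "DB_EXAMPLE" hit are hypothetical, and only the two disease IDs and names come from the comment above.

```python
def first_indication_name(hit):
    """Naive lookup: the name from the first nested object, whatever it is."""
    indications = hit.get("repodb", {}).get("indications", [])
    return indications[0].get("name") if indications else None


def indication_name_for(hit, umls):
    """Name from the nested object whose UMLS ID matches the queried ID."""
    for ind in hit.get("repodb", {}).get("indications", []):
        if ind.get("umls") == umls:
            return ind.get("name")
    return None


# Made-up hit whose first nested object is a different disease.
hit = {
    "_id": "DB_EXAMPLE",
    "repodb": {
        "indications": [
            {"umls": "C0034013", "name": "Precocious Puberty"},
            {
                "umls": "C0334634",
                "name": "Malignant lymphoma, lymphocytic, intermediate differentiation, diffuse",
            },
        ]
    },
}

naive_label = first_indication_name(hit)
matched_label = indication_name_for(hit, "C0334634")
print(naive_label)
print(matched_label)
```

With association-centric records, each record would hold exactly one disease, so this kind of lookup would not be needed.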

@colleenXu

After discussion with Andrew yesterday, I've opened an issue for the next steps.

However, it should be fine if these next steps aren't done by the time we add this API to BTE's regular use - we can still go forward with deploying.

@colleenXu

Will need to update the x-bte annotation once #169 is addressed for all instances (ncats.io and all ITRB transltr.io instances).

Can create separate operations depending on status, so we can map it to different predicates during the treats refactor/biolink-model update

@colleenXu

repoDB has been updated on all instances (under the hood, the internal routing is now to biothings.transltr.io - ITRB Prod instance...not biothings.ncats.io).

So I'm moving this issue back to a to-do, to update the x-bte annotation.

@colleenXu

Updated the SmartAPI yaml w/ x-bte annotation to match the parser/API updates. The master branch only uses the "approved" treatment operations: NCATS-Tangerine/translator-api-registry@fa1f36e

Also updated the SmartAPI registration. So it's ready to add to BTE's regular use (for the config file, API_LIST variable) - so I added it to the PR linked above.

We'll try to get it into Translator's Lobster release (dev/CI -> Test this Friday).


There's another version in biolink-4-update biothings/biothings_explorer#788 with "clinical trial only" operations available: NCATS-Tangerine/translator-api-registry@50634e7. I've adjusted the PR biothings/bte-server#19 to add an override to this.

@colleenXu colleenXu added the On CI Match https://github.com/biothings/biothings_explorer/labels label Mar 7, 2024
@tokebe tokebe added On Test Match https://github.com/biothings/biothings_explorer/labels and removed On CI Match https://github.com/biothings/biothings_explorer/labels labels Mar 14, 2024
@tokebe
Member

tokebe commented Apr 17, 2024

@colleenXu Should this issue be closed?

@colleenXu

Yep, confirmed that it's live by posting an example query to https://bte.transltr.io/v1/team/Service Provider/query (Prod instance).

Example

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["DRUGBANK:DB00002"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}

There should be edges like this that come from repoDB:

                "7cc54b63aaf016ef67d50252c2323b04": {
                    "predicate": "biolink:treats",
                    "subject": "PUBCHEM.COMPOUND:14122979",
                    "object": "MONDO:0021063",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:repodb",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:biothings-repodb",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:repodb"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biothings-repodb"
                            ]
                        }
                    ]
                },


5 participants