Create creative mode templates for "What chemicals [qualified] affect (increases/decreases) a given protein/gene?" #532

Closed
andrewsu opened this issue Dec 6, 2022 · 27 comments

Comments

@andrewsu
Member

andrewsu commented Dec 6, 2022

Translator consortium target for implementation in Feb 2023. Exact TRAPI query templates will be created soon per Architecture meeting 2022-12-06...

EDIT Implementation dates:

  • Jan 25: Development
  • Feb 13: ITRB CI
@colleenXu
Collaborator

Note:

  • it's not clear to me from this title if increases/decreases is a variable provided by the user or not...
  • I'm unsure right now how to do this "creatively"

Possibly relevant, from Slack:

Andy Crouse (Unsecret Agent, UI team)
1:13 PM
This is the test set of genes and drugs for next creative mode query.
Note there are some ‘destroyer of worlds’ genes and drugs (valproic acid, vitamin A, TP53, P450)
So if the lights dim when testing that is why. But they should be good for pressure testing!
I included some that are not classically druggable, like non-coding genes and transcription factors. So hopefully this will cover the gamut.
https://docs.google.com/document/d/1X3XbHhS_AIGkSyxSaYUuiCqVNh33Wz7f3EEyKZEctL8/edit?usp=sharing

@tokebe
Member

tokebe commented Dec 14, 2022

#534 is now addressed in the add-qualifiers PRs, so integration testing is now possible.

@colleenXu
Collaborator

colleenXu commented Dec 14, 2022

Noting:

  • the two "increases" starting-TRAPI are identical, except for which QNode has an ID at the start. With our current implementation, this means we'll pick the same templateGroup for both starting-TRAPI. I think that's okay.
    • However, the sub-query QEdge traversals will differ depending on the starting TRAPI (which QNode has an ID at the start). So the edge-attributes can differ due to the "reverse operation" issue - and maybe the number of nodes/edges/results...
  • same with the two "decreases" starting-TRAPI
  • Because we don't have qualifier-hierarchy traversal yet...we'll likely include the various qualifier-constraint-combos in our templates using qualifier-sets (OR) (FYI @tokebe )

@tokebe
Member

tokebe commented Dec 14, 2022

@colleenXu Presently, the qualifier-matching only supports one qualifier-set per templateGroup. Would you prefer I change the implementation to support multiple qualifier-sets in a templateGroup for OR matching within a single group (presently, this would be covered by making multiple templateGroups)?

Examples:

Presently:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": {
      "some_qualifier_type": "some_qualifier_value"
    }
  }

Proposed:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": [
      {
        "some_qualifier_type": "some_qualifier_value"
      },
      {
        "some_other_type": "this_would_be_OR"
      }
    ]
  }
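A minimal Python sketch of how both formats above could be matched, assuming a templateGroup matches when at least one of its qualifier-sets contains every qualifier in the query. The function names `normalize_qualifier_sets` and `group_matches` are hypothetical and are not BTE code (BTE is not written in Python).

```python
from typing import Dict, List, Optional, Union

QualifierSet = Dict[str, str]


def normalize_qualifier_sets(
    qualifiers: Optional[Union[QualifierSet, List[QualifierSet]]]
) -> List[QualifierSet]:
    """Accept the present format (one dict) or the proposed one (list of dicts)."""
    if qualifiers is None:
        return []
    if isinstance(qualifiers, dict):
        return [qualifiers]
    return list(qualifiers)


def group_matches(template_group: dict, query_qualifiers: QualifierSet) -> bool:
    """True if ANY qualifier-set in the group contains all query qualifiers (OR)."""
    qualifier_sets = normalize_qualifier_sets(template_group.get("qualifiers"))
    return any(
        all(qset.get(key) == value for key, value in query_qualifiers.items())
        for qset in qualifier_sets
    )


present = {"qualifiers": {"some_qualifier_type": "some_qualifier_value"}}
proposed = {
    "qualifiers": [
        {"some_qualifier_type": "some_qualifier_value"},
        {"some_other_type": "this_would_be_OR"},
    ]
}
```

With the proposed format, a query carrying only `some_other_type` would match the same group; with the present format it would need a second templateGroup.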

@colleenXu
Collaborator

I think using 1 qualifier-set for matching to templateGroup is fine because that's what we see in this issue (and near-future).

In the third bullet point of my post above, I'm saying that for OUR individual templates, I'll need to use multiple qualifier-sets to query for "activity" vs "activity_or_abundance" because we don't do qualifier-hierarchy expansion yet.
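As a rough illustration of what "multiple qualifier-sets" means for an individual template QEdge, the Python sketch below builds one TRAPI qualifier_set per aspect value, so that "activity" and "activity_or_abundance" are both queried. The helper name `build_qualifier_constraints` is hypothetical; the qualifier type IDs are the ones used in the queries in this issue.

```python
from typing import Dict, List


def build_qualifier_constraints(aspects: List[str], direction: str) -> List[Dict]:
    """Return one qualifier_set per aspect value; the sets are alternatives (OR)."""
    return [
        {
            "qualifier_set": [
                {
                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                    "qualifier_value": aspect,
                },
                {
                    "qualifier_type_id": "biolink:object_direction_qualifier",
                    "qualifier_value": direction,
                },
            ]
        }
        for aspect in aspects
    ]


# Stand-in for hierarchy expansion: list the aspect values by hand.
constraints = build_qualifier_constraints(
    ["activity", "activity_or_abundance"], "increased"
)
```

Once qualifier-hierarchy expansion exists, the hand-written aspect list would presumably no longer be needed.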

@colleenXu
Collaborator

Knowledge resources related to this set of creative-mode questions (My analysis)

This is a work-in-progress, so edits will be done over time

Resources that are ready and that BTE already uses

  • DGIdb
    • 12 sets of chem - gene relationships (4 up, 4 down, 4 neither). However, the majority of the data (44557 records) has no relationship assigned (not_applicable)
  • text-mining
    • has chem - gene (up/down)
    • also has gene - gene (up/down)

Note that the Multiomics APIs have some potentially relevant info, but we have some concerns:

  • drug response: has gene-gene relationships for negative and positive correlation, but it's not clear where this data comes from or how to use it
  • Wellness: has chem-chem and chem-gene relationships for correlation and related_to, but it's not clear where this data comes from or how to use it

Resources that BTE already uses but that we want updated

Semmeddb

  • improve with a data update + generating operations that use the ncbigene fields
    • update may be done soon? See internal lab Slack links. Will involve deprecated IDs/pipes/ncbigene handling/only having novelty=1 records
    • When the update happens: we'll want to update the x-bte annotation ASAP-ish (for master + biolink3 branches). It'll help to have the post-API-deployment analysis of metatriples.....
    • If I don't have that post-API-deployment analysis of metatriples...
      • if I know the semmeddb data file, I can quickly generate umls operations
      • generating ncbigene operations without the analysis of metatriples is possible but tricky?
        • check which subject/object semmeddb semantic-types have only the ncbigene field for ID. Kinda quick.
        • modify the final combos df to add rows when the metatriple has those subject/object semmeddb-semantic types

BindingDB

  • looking at the bindingdb website, we may be able to pick a more specific relationship for a chemical-gene pair
    • can we get the assay description and find keywords in that text? For example, the assay description for this pair seems to say this chem is an agonist of this gene...
    • can we use the existence of certain fields + the values of those fields? For example, if the IC50 field exists and its value is small, maybe this chem is an inhibitor of this gene.
      • The existence and value of the Ki field can maybe be used as well. I'm not sure if other fields can be used.
      • the values are tricky because some are integers, some are floats, and some are integers with a comparison prefix, like ">40000"
  • info: see my old note here, listing fields here, bindingDB's info
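A small Python sketch of how the mixed value formats mentioned above (integers, floats, and strings like ">40000") might be normalized before comparing against a cutoff. Everything here is an assumption for illustration: the names `Measurement`, `parse_measurement`, and `is_small`, and the default threshold, which is an arbitrary placeholder in whatever unit the field uses.

```python
import re
from typing import NamedTuple, Optional, Union


class Measurement(NamedTuple):
    relation: str  # one of "=", ">", "<", ">=", "<="
    value: float


_PATTERN = re.compile(r"^\s*(>=|<=|>|<|=)?\s*([0-9]+(?:\.[0-9]+)?)\s*$")


def parse_measurement(raw: Union[int, float, str, None]) -> Optional[Measurement]:
    """Turn 12, 0.5, "0.5" or ">40000" into (relation, value); None if unparseable."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return Measurement("=", float(raw))
    match = _PATTERN.match(raw)
    if match is None:
        return None
    return Measurement(match.group(1) or "=", float(match.group(2)))


def is_small(raw: Union[int, float, str, None], threshold: float = 100.0) -> bool:
    """True only when the value is known to be at or below the threshold.

    A value such as ">40000" is a lower bound, so it never counts as small.
    """
    parsed = parse_measurement(raw)
    if parsed is None:
        return False
    return parsed.relation in ("=", "<", "<=") and parsed.value <= threshold
```

Keeping the comparison prefix separate from the number avoids treating ">40000" as if it were exactly 40000.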

Resources that need a new API (pending)

MyChem chembl.drug_mechanisms data, in subject-association-object format

  • We want 1 association per unique combo of subject/object/action_type.
    • Right now in MyChem, all data is aggregated by unique chemical (chem-centric format). So when a chemical has multiple drug_mechanisms, we can't retrieve only the sections that have a specific action_type.
    • MyChem currently has info for 5262 unique chemicals
  • We want the gene ID to be NCBIGene (or UniProtKB or ENSEMBL ENSG maybe).
    • Right now, it's CHEMBL.TARGET and Translator's Node Normalizer doesn't have cross-mappings/name-retrieval for that ID namespace
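To make the "1 association per unique combo of subject/object/action_type" idea concrete, here is a minimal Python sketch that reshapes one chem-centric record into association records. The record layout and field names (`chem_id`, `drug_mechanisms`, `target_id`, `action_type`) are hypothetical placeholders, not the actual MyChem schema.

```python
from typing import Dict, List


def to_associations(chem_record: Dict) -> List[Dict]:
    """Emit one association per unique (subject, object, action_type) combo."""
    seen = set()
    associations = []
    for mechanism in chem_record.get("drug_mechanisms", []):
        key = (
            chem_record["chem_id"],
            mechanism.get("target_id"),
            mechanism.get("action_type"),
        )
        # Skip incomplete mechanisms and repeats of an already-seen combo.
        if None in key or key in seen:
            continue
        seen.add(key)
        associations.append(
            {"subject": key[0], "object": key[1], "action_type": key[2]}
        )
    return associations


# Made-up record: two distinct mechanisms plus one exact repeat.
example = {
    "chem_id": "CHEM:1",
    "drug_mechanisms": [
        {"target_id": "TARGET:A", "action_type": "INHIBITOR"},
        {"target_id": "TARGET:B", "action_type": "AGONIST"},
        {"target_id": "TARGET:A", "action_type": "INHIBITOR"},
    ],
}
```

In this shape, a query for a specific action_type can return only the matching associations instead of the whole chemical document.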

MyChem drugcentral.bioactivity data, in subject-association-object format

Not promising

expand to read

Bioplanet

  • within a pathway, there can be "activation" and "inhibition" arrows. However....
    • I find these hard to understand (see example)
    • the actions may be done by complexes or done on complexes (not individual genes or chemicals)
    • not clear how many pathways involve exogenous chemicals...
    • I don't see an easy download/way of parsing these pathways to get "chemical-gene" associations (biopax?)

@colleenXu
Collaborator

colleenXu commented Jan 6, 2023

Results for the direct-increases.json template with Gene Inputs

Reasonable number of results

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
|---|---|---|---|---|
| MAPK8IP3 | NCBIGene:23162 | 11 | 4 | but only 1 looks correct; Text-Mining |
| P450 (CYP27B1?) | NCBIGene:1594 | 12 | 48 | semmeddb + text-mining |
| XIST | NCBIGene:7503 | 12 | 4 | semmeddb |
| ADNP | NCBIGene:23394 | 12 | 4 | semmeddb + text-mining |
| MEF2C | NCBIGene:4208 | 12 | 48 | text-mining |
| MALAT1 | NCBIGene:378938 | 11 | 3 | semmeddb |
| DYRK1A | NCBIGene:1859 | 11 | 14 | text-mining |
| PCSK1N | NCBIGene:27344 | 11 | 6 | semmeddb + text-mining |
| TTR | NCBIGene:7276 | 12 | 91 | semmeddb + text-mining |
| GNAS | NCBIGene:2778 | 11 | 9 | semmeddb |
| IAPP | NCBIGene:3375 | 12 | 51 | text-mining |
| SCG5 | NCBIGene:6447 | 11 | 1 | text-mining |
| B2M | NCBIGene:567 | 12 | 66 | text-mining |
| Cytochrome B | NCBIGene:4519 | 12 | 16 | text-mining |
| RBP4 | NCBIGene:5950 | 12 | 16 | text-mining |
| NGLY1 | NCBIGene:55768 | 11 | 2 | text-mining |
| RHOBTB2 | NCBIGene:23221 | 11 | 3 | text-mining |
| WDFY3 | NCBIGene:23001 | 12 | 5 | text-mining |
| KCNT1 | NCBIGene:57582 | 11 | 5 | DGIDB + text-mining |
| NALCN | NCBIGene:259232 | 11 | 1 | semmeddb |
| CACNA1A | NCBIGene:773 | 11 | 2 | text-mining |
| BICD2 | NCBIGene:23299 | 10 | 4 | semmeddb + text-mining |
| FHL1 (CFH) | NCBIGene:3075 | 11 | 4 | semmeddb + text-mining |
| SETBP1 | NCBIGene:26040 | 12 | 3 | text-mining |

Many results (exploding)

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
|---|---|---|---|---|
| TP53 | NCBIGene:7157 | 32 | 1744 | semmeddb + text-mining |
| MMP9 | NCBIGene:4318 | 20 | 753 | semmeddb + text-mining |
| GAPDH | NCBIGene:2597 | 15 | 220 | semmeddb + text-mining |
| Cytochrome oxidase | NCBIGene:4512 | 14 | 177 | semmeddb + text-mining |
| PPP3CA | NCBIGene:5530 | 14 | 182 | semmeddb + text-mining |
| ATF3 | NCBIGene:467 | 14 | 226 | semmeddb + text-mining |

No results

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
|---|---|---|---|---|
| TSIX | NCBIGene:9383 | 13 | 0 | |
| BORCS8-MEF2B | NCBIGene:4207 | 9 | 0 | |
| FAU | NCBIGene:2197 | 10 | 0 | |
| DHX30 | NCBIGene:22907 | 10 | 0 | |
| SAMD9L | NCBIGene:219285 | 10 | 0 | |
| Engase | NCBIGene:64772 | 10 | 0 | |

@colleenXu
Collaborator

colleenXu commented Jan 11, 2023

I found that 2 templateGroup items worked: one for Chem-increases-Gene and one for Chem-decreases-Gene (input ID can be on the Chem QNode or the Gene QNode). I wrote the templateGroups and the templates @andrewsu + I discussed on Monday and did a bit of testing (see next post). These are in this branch biothings/bte_trapi_query_graph_handler#132

template ideas we decided to write up
  • Chem (increases/decreases) gene (direct)
  • Chem increases another gene, then that gene (upregs/downregs) Gene (upregs -> overall increase, downregs -> overall decrease)
    • I modified QEdges to be more specific: removing increased activity_or_abundance for eA and increased/decreased activity_or_abundance for eB. This removed semmeddb and some dgidb edges...
  • Chem decreases another gene, then that gene (downregs/upregs) Gene (downregs -> overall increase, upregs -> overall decrease)
    • I modified QEdges to be more specific (removing increased activity_or_abundance for eA and increased/decreased activity_or_abundance for eB). This removed semmeddb and some dgidb edges...
  • Chem interacts with another gene, then that gene (upreg/downreg) gene
    • I modified QEdges to be more specific: using physically_interacts_with for eA and increased/decreased activity_or_abundance for eB. This removed semmeddb edges and a major dgidb edge...
  • Chem interacts with gene (direct)
    • I modified QEdges to be more specific: using physically_interacts_with for eA. This removed semmeddb edges and a major dgidb edge...
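The two-hop templates above combine directions like signs: matching directions on both hops give an overall increase, and opposite directions give an overall decrease. A minimal Python sketch of that rule, with `overall_direction` as a hypothetical helper name:

```python
INCREASED = "increased"
DECREASED = "decreased"


def overall_direction(chem_to_intermediate: str, intermediate_to_gene: str) -> str:
    """Combine the directions of the two hops Chem -> other gene -> Gene.

    increases + upregulates   -> overall increase
    increases + downregulates -> overall decrease
    decreases + downregulates -> overall increase
    decreases + upregulates   -> overall decrease
    """
    for direction in (chem_to_intermediate, intermediate_to_gene):
        if direction not in (INCREASED, DECREASED):
            raise ValueError(f"unknown direction: {direction!r}")
    if chem_to_intermediate == intermediate_to_gene:
        return INCREASED
    return DECREASED
```

The "interacts with" templates carry no direction on the first hop, so this rule does not apply to them.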

@colleenXu
Collaborator

colleenXu commented Jan 11, 2023

How I tested

Check out fix_issue_532 branch for query-handler and dev branches for all other modules (including bte-trapi-workspace).

For the "user"/"from UI/ARS" query...the main skeleton of the query is this. What varies is (1) which QNode also has an ids field with 1 user-submitted ID and (2) whether the object_direction_qualifier value is increased or decreased (I put X there).

skeleton
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "X"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
So here's an example of ChemX (sumatriptan) increases genes
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
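For repeated testing, the two things that vary in the skeleton (which QNode gets the ID, and the direction value) can be filled in programmatically. The Python sketch below is one way to do that; `build_query` is a hypothetical helper and not part of BTE.

```python
import copy

# Skeleton from above; "X" is replaced by the requested direction.
SKELETON = {
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {"categories": ["biolink:Gene"]},
                "chemical": {"categories": ["biolink:ChemicalEntity"]},
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance",
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "X",
                                },
                            ]
                        }
                    ],
                }
            },
        }
    }
}


def build_query(direction: str, node_key: str, curie: str) -> dict:
    """Return a copy of the skeleton with one ID and one direction filled in."""
    if direction not in ("increased", "decreased"):
        raise ValueError("direction must be 'increased' or 'decreased'")
    if node_key not in ("gene", "chemical"):
        raise ValueError("node_key must be 'gene' or 'chemical'")
    query = copy.deepcopy(SKELETON)
    graph = query["message"]["query_graph"]
    graph["nodes"][node_key]["ids"] = [curie]
    qualifier_set = graph["edges"]["t_edge"]["qualifier_constraints"][0]["qualifier_set"]
    qualifier_set[1]["qualifier_value"] = direction
    return query


# Same query as the sumatriptan example above.
sumatriptan_increases = build_query("increased", "chemical", "PUBCHEM.COMPOUND:5358")
```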

Discussed

EDIT: discussed below + during Wednesday 1/11 meeting. Decisions added to each point

@tokebe

After testing biothings/bte_trapi_query_graph_handler#132 (it's branched off dev, and I tested it with all my other branches as dev)....overall it looks close to done. However, I have the following questions / issues:

  1. Looking at the logs, I'm not sure if some results are "pruned / removed" incorrectly. A confounding / related issue is that I don't see "same results from multiple templates merging" console logs and the "result merging" TRAPI-logs are buggy. -> JC to look into this (aka issues with creative-mode)
    • Using the chem sumatriptan PUBCHEM.COMPOUND:5358 is a good example (runs fairly fast, results from multiple templates and merging happens). See its entries in the Chem tables of the next post.
  2. BTE isn't doing "exact qualifier-set" matching to templateGroups. Is that alright? -> CX asked Translator group (Translator private Slack link). This behavior is fine right now
    • For example, if I send a query with an "inferred" "activity_or_abundance" QEdge (no increased/decreased direction qualifier), all 10 templates would run...
query with no direction qualifier so all (EDIT: 9) templates will be loaded
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
  3. I noticed something odd with the "limiting execution" code: the records accumulated for a QEdge were far above the stated limit. The console log said: bte:call-apis:query QEdge eA obtained 59903 records, exceeding maximum of 30000. Skipping remaining 1 (0 planned/1 paged) queries for this edge. Your query may be too general? +0ms. -> It checks after each individual sub-query. Sometimes the excess is a lot! JC will adjust code to truncate and remove the records over the maximum (he'll decide how to implement).
    • This log happened during the 4th template for Chem decreases GAPDH/NCBIGene:2597. ID resolution then took a very long time (so I killed the execution manually).
    • Something similar happens for increases GAPDH, but there are between 30k and 40k records there (ID resolution still takes too long).
    • See GAPDH's entries in the Gene tables of the next post.
Starting query for Chem decreases GeneY (GAPDH)
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"],
                    "ids": ["NCBIGene:2597"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
  4. Still having issues with the logs summarizing the execution of each QEdge: "eA03 execution: 0 queries (0 success/0 fail) and (0) cached qEdges return (80) records". I mentioned that here (lab's internal Slack) -> JC to look into this (aka issues with creative-mode)
  5. I think we can consider drastically dropping the max number of results to return (maybe 200? 400?). -> we discussed and agreed to drop max number to 500 for all creative-mode (same as ARAX)

@colleenXu
Collaborator

colleenXu commented Jan 11, 2023

Testing

[reverted; this records the testing done 1/10]

Increases

| Chem Name | Chem ID | Final results | Template breakdown and notes |
|---|---|---|---|
| metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
| palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 137 | 37, 4, 2, 85, 32 (160 sum). Results dropping or merging? |

| Gene Name | Gene ID | Final results | Template breakdown and notes |
|---|---|---|---|
| MAPK8IP3 | NCBIGene:23162 | 4 | 4 from first template, none from the rest. Compare to decreases entry. |
| P450 (CYP27B1?) | NCBIGene:1594 | 832 | 48, 0, 0, 770, 18 (836 sum). Results dropping or merging? |
| MEF2C | NCBIGene:4208 | 1000 | 48, 1, 0, 6453 (stop). |
| GAPDH | NCBIGene:2597 | presume 1000 | 220, 88, 112, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

Decreases

| Chem Name | Chem ID | Final results | Template breakdown and notes |
|---|---|---|---|
| metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
| palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 177 | 56, 19, 3, 132, 32 (242 sum). Results dropping or merging? |

| Gene Name | Gene ID | Final results | Template breakdown and notes |
|---|---|---|---|
| MAPK8IP3 | NCBIGene:23162 | 761 | 7, 0, 0, 754, 0 (sum 761, so no dropping/merging) |
| P450 (CYP27B1?) | NCBIGene:1594 | 1000 | 31, 0, 0, 6694 (stop) |
| MEF2C | NCBIGene:4208 | 1000 | 34, 0, 0, 6923 (stop) |
| GAPDH | NCBIGene:2597 | presume 1000 | 290, 120, 24, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

@tokebe
Member

tokebe commented Jan 11, 2023

  1. I'll have to take a closer look at issues with results merging/pruning.
  2. I had previously stated that, currently, templates are matched purely by the query qualifier set being a subset of the templateGroup, rather than an exact match (i.e. both sets are equivalent, templateGroup has no additional qualifiers). I can change this to expecting an exact match instead?
  3. The record cutoff is checked after each query. If a single query returns a very large number of records, the total can exceed the cutoff by a wide margin. Should we instead delete records above the cutoff? I'd like @andrewsu's opinion on this as well.
  4. Looking into it.
  5. I don't have a problem with this -- @andrewsu?
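To illustrate the difference raised in point 2, here is a small Python sketch contrasting subset matching with exact matching of a query's qualifiers against a templateGroup's qualifiers. The function names and the two example groups are hypothetical, written only to mirror the increases/decreases groups described in this issue.

```python
from typing import Dict, List


def matches_subset(query: Dict[str, str], group: Dict[str, str]) -> bool:
    """Current behaviour as described: query qualifiers are a subset of the group's."""
    return set(query.items()) <= set(group.items())


def matches_exact(query: Dict[str, str], group: Dict[str, str]) -> bool:
    """Alternative: both qualifier sets must be identical."""
    return set(query.items()) == set(group.items())


# Hypothetical stand-ins for the two templateGroups.
GROUPS = {
    "chem-increases-gene": {
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "increased",
    },
    "chem-decreases-gene": {
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "decreased",
    },
}


def matching_groups(query: Dict[str, str], exact: bool) -> List[str]:
    """Names of the groups a query would select under either rule."""
    match = matches_exact if exact else matches_subset
    return [name for name, group in GROUPS.items() if match(query, group)]


# A query with no direction qualifier, like the one discussed above.
aspect_only = {"object_aspect_qualifier": "activity_or_abundance"}
```

Under subset matching the aspect-only query selects both groups (so all of their templates run); under exact matching it selects neither.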

@andrewsu
Member Author

andrewsu commented Jan 11, 2023

  2. @colleenXu can you post the question to one of the translator slack channels? I can see arguments both ways, and I think this is something that should be standardized across ARAs.
  3. Originally my impression was that if we have the results beyond our threshold, might as well keep them. But I hadn't thought of the fact that other downstream things need to happen (like ID resolution). If those downstream things are essentially preventing BTE from returning the partial results, then I vote for deleting excess records. If not (and the issue is only the oddity of exceeding our threshold by a significant margin), then I vote for keeping the excess records. EDIT: decision made during 2023-01-11 meeting to delete excess records.

5. I'd suggest 500 to match what ARAX returns.

@tokebe
Member

tokebe commented Jan 12, 2023

@colleenXu I've fixed the results merging count and logging. As a result, I can confirm that no results are being dropped. Running sumatriptan, I get 433 results added across the templates, and 244 results in the final response. Logging now shows 189 results that were combined. 433 - 189 = 244, everything checks out. I was actually running in circles for a while because I thought I had to account for the number of results merged into, but this is simply a count of results from the 244 that were combined from multiple templates.
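One way the accounting above can work out is sketched below in Python: every result produced by a template is counted as "added", and a result whose key already exists is folded into the existing one and counted as "combined", so added minus combined equals the final count. The result key (a plain string here) and the function name `merge_template_results` are assumptions for illustration, not BTE's actual merge logic.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def merge_template_results(
    results_per_template: Iterable[List[Hashable]],
) -> Tuple[Dict[Hashable, int], int, int]:
    """Merge results across templates by key.

    Returns (merged, added, combined) where merged maps each unique key to
    the number of template results folded into it.
    """
    merged: Dict[Hashable, int] = {}
    added = 0
    combined = 0
    for template_results in results_per_template:
        for key in template_results:
            added += 1
            if key in merged:
                combined += 1  # folded into an existing result
                merged[key] += 1
            else:
                merged[key] = 1
    return merged, added, combined


# Toy data: three templates with overlapping results.
merged, added, combined = merge_template_results(
    [["a", "b", "c"], ["b", "d"], ["a", "b"]]
)
```

In the toy data, 7 results are added, 3 are combined, and 4 remain, which is the same shape as 433 - 189 = 244.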

I'm not sure what differs between our local setups to cause the difference in numbers (perhaps I haven't updated specs recently?), but all results are accounted for, and no code is dropping results.

I've also fixed issue 4 -- a piece of code in call-apis was still using pre-qedge-refactor accessors, breaking the counts.

Additionally, I've pushed changes so records are truncated to the cutoff, and changed the creative limit to 500.
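A minimal Python sketch of the truncation behaviour described here, assuming the cutoff is still checked after each sub-query but the accumulated records are then cut back to the maximum. `collect_records` and its arguments are hypothetical names, and whether the real check uses "at or above" or "strictly above" the maximum is an assumption.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def collect_records(sub_query_batches: Iterable[List[T]], max_records: int) -> List[T]:
    """Accumulate records batch by batch, never keeping more than max_records."""
    records: List[T] = []
    for batch in sub_query_batches:
        records.extend(batch)
        # The check runs after each sub-query, so one large batch can overshoot.
        if len(records) >= max_records:
            del records[max_records:]  # drop everything over the cutoff
            break  # skip the remaining sub-queries for this edge
    return records


# Shape of the GAPDH case: one sub-query returns far more than the cutoff.
batches = [[0] * 100, [1] * 59903, [2] * 5]
kept = collect_records(batches, 30000)
```

Truncating right after the check keeps the downstream steps (such as ID resolution) bounded by the cutoff rather than by the largest sub-query response.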

All set to proceed with testing again 👍

@colleenXu
Collaborator

colleenXu commented Jan 20, 2023

@tokebe

Questions I have after my second round of testing:

  • Followup on Point 1 of this post: When I run a decreased query for sumatriptan (PUBCHEM.COMPOUND:5358), I notice that there are 234 final results, but one of the templates reports getting 247 results (in the logs). This confuses me: if we're not truncating results, how can one template get more results than there are in the final count?
TRAPI query for `decreased` sumatriptan + screenshot of logs
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I circled in orange the numbers mentioned above.
[Screenshot of logs: Screen Shot 2023-01-19 at 3 50 51 PM]

  • Another followup on Point 1: I'm still confused by the "merge" logs (TRAPI and console) 😖 since I don't understand how, when merging the results from two templates, the "merged result" and "final result" numbers can be larger than the result count for one of those templates. The chem estrogen's queries are good examples (see the collapsed section below for the TRAPI query / info).
    • related question: I notice only 1 "merge" log at the end of creative-mode execution, even when we execute > 2 templates. However, it kinda looks like we merge previous results with new results after running each template - so it'd make more sense to put logs there (how many were merged / how many results there are so far). But maybe I'm incorrect?
    • related question: it looks like we merge results before we truncate to 500. Just checking if my understanding is correct...
TRAPI query for `increased` estrogen + screenshot of logs

For increases, the estrogen query logs say 271 + 5692 results have 3409 results "merged" to "1001 final results".

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5991"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I circled in orange the numbers mentioned above.
[Screenshot of logs: Screen Shot 2023-01-19 at 11 24 06 PM]

  • Followup on Point 3 of this post: my local BTE instance is still getting stuck on GAPDH queries (see notes in the "second round of testing" collapsed section, gene tables). Maybe we'll address this with optimizations later...but it makes me wonder if BTE will crash during some testing of these creative-modes, and if 30k records is still kinda high...

(Points 4-5 from the post have been addressed by the recent changes (yay), and point 2 is still pending / doesn't seem to be an issue)

@colleenXu
Collaborator

colleenXu commented Jan 20, 2023

Second round of testing

[EDITED 1/23-1/26 after is_set: true added to templates and fixes added to logging. Rearranged list to match original Translator issue curie lists]

I tested more chemicals and genes this time, including all chemicals listed in the Translator posts

Increased

starting with chem
| Chem Name | Chem ID | Final results | Template breakdown and notes | Time |
|---|---|---|---|---|
| Amphetamine | PUBCHEM.COMPOUND:3007 | 500 | 306, 292 (merge 58, add 234), then stopped execution. 540 total and truncated | 33 s |
| Dextroamphetamine | PUBCHEM.COMPOUND:5826 | 230 | 36, 104 (merge 8, add 96), 60 (merge 4, add 56), 54 (merge 36, add 18), 31 (merge 7, add 24) | 28 s |
| (+/-)-Methylphenidate hydrochloride (curie given) | PUBCHEM.COMPOUND:44246724 | 0 | no results | 14 s |
| Methylphenidate (better curie) | PUBCHEM.COMPOUND:4158 | 314 | 131, 139 (merge 29, add 110), 59 (merge 11, add 48), 40 (merge 20, add 20), 8 (merge 3, add 5) | 35 s |
| metformin | PUBCHEM.COMPOUND:4091 | 500 | only 1st template (1184 results) | 27 s |
| Atorvastatin | PUBCHEM.COMPOUND:60823 | 500 | only 1st template (564 results) | 17 s |
| Valproic acid glucuronide (curie given) | PUBCHEM.COMPOUND:88111 | 0 | no results | 14 s |
| Valproic acid (better curie) | PUBCHEM.COMPOUND:3121 | 500 | only 1st template (2553 results) | 1 min 7 s |
| Vitamin A / retinol | PUBCHEM.COMPOUND:445354 | 500 | 487, 359 (merge 86, add 273), then stopped execution. 760 total, and truncated | 32 s |
| Vitamin C (ascorbic acid) | PUBCHEM.COMPOUND:54670067 | 500 | only 1st template (850 results) | 22 s |
| Vitamin D (Cholecalciferol) | PUBCHEM.COMPOUND:5280795 | 500 | only 1st template (537 results) | 16 s |
| Maltodextrin (curie given) | PUBCHEM.COMPOUND:79025 | 3 | 1 from 4th template, 2 from 5th template. None merged. | 17 s |
| Glucose (better curie) | PUBCHEM.COMPOUND:24749 | 54 | only 1st template. But none scored... | 15 s |
| Magnesium ion (curie given) | PUBCHEM.COMPOUND:888 | 46 | only 1st template | 14 s |
| Magnesium (atom, better curie) | PUBCHEM.COMPOUND:5462224 | 500 | 211, 236 (merge 9, add 227), 212 (merge 52, add 160), then stopped execution. 598 total and truncated | 23 s |
| DHEA (Dehydroepiandrosterone) | PUBCHEM.COMPOUND:5881 | 500 | 438, 121 (merge 46, add 75), then stopped execution. 513 total, and truncated | 25 s |
| Testosterone | PUBCHEM.COMPOUND:6013 | 500 | only 1st template (1069 results) | 30 s |
| Ethinylestradiol (curie given) | PUBCHEM.COMPOUND:5991 | 500 | 271, 2390 (merge 107, add 2283), then stopped execution. 2554 total, and truncated | 1 min 19 s |
| Estrogens (another curie) | UMLS:C0014939 | 500 | only 1st template (3531 results) | 2 min 2 s |
| Somatostatin acetate (curie given) | PUBCHEM.COMPOUND:16129681 | 184 | only 1st template | 20 s |
| Somatostatin (better curie) | PUBCHEM.COMPOUND:101826531 | 218 | 183, 16 (merge 10, add 6), 0, 42 (merge 26, add 16), 21 (merge 8, add 13) | 24 s |
| Amitriptyline | PUBCHEM.COMPOUND:2160 | 500 | 172, 216 (merge 31, add 185), 456 (merge 84, add 372), then stopped execution. 729 total, and truncated | 37 s |
| Gabapentin | PUBCHEM.COMPOUND:3446 | 186 | 70, 3 (merge 1, add 2), 107 (merge 15, add 92), 18 (merge 7, add 11), 13 (merge 2, add 11) | 28 s |
| Propranolol | PUBCHEM.COMPOUND:4946 | 500 | 327, 220 (merge 57, add 163), 277 (merge 101, add 176), then stopped execution. 666 total, and truncated | 34 s |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 185 | 38, 4 (merge 1, add 3), 11 (merge 2, add 9), 121 (merge 7, add 114), 37 (merge 16, add 21) | 31 s |
| d4-Palmitic acid (curie given) | PUBCHEM.COMPOUND:12358543 | 0 | no results | 14 s |
| palmitic acid (better curie) | PUBCHEM.COMPOUND:135369651 | 500 | only 1st template (905 results) | 18 s |
starting with gene
| Gene Name | Gene ID | Final results | Template breakdown and notes | Time |
|---|---|---|---|---|
| MAPK8IP3 | NCBIGene:23162 | 4 | only 1st template, none scores. Compare to decreases response below | 19 s |
| TP53 | NCBIGene:7157 | 500 | only 1st template (1762 results) | 31 s |
| CYP27B1 (a P450) | NCBIGene:1594 | 500 | 48, 0, 0, 770 (merge 1, add 769), then stopped execution. 817 total, and truncated | 26 s |
| XIST | NCBIGene:7503 | 6 | only 1st template | 13 s |
| TSIX | NCBIGene:9383 | 0 | no results | 11 s |
| ADNP | NCBIGene:23394 | 297 | from 1st template (4) and 4th template (293), no merging. None scored | 22 s |
| BORCS8-MEF2B | NCBIGene:4207 | 0 | no results | 12 s |
| MMP9 | NCBIGene:4318 | 500 | only 1st template (760 results) | 20 s |
| MEF2C | NCBIGene:4208 | 500 | 48, 1, 0, 6431 (merge 7, add 6424), then stopped execution. 6473 total, and truncates | 1 min 48 s |
| MALAT1 | NCBIGene:378938 | 4 | only 1st template | 12 s |
| DYRK1A | NCBIGene:1859 | 500 | 14, 0, 0, 7982 (merge 1, add 7981), then stopped execution. 7995 total, and truncates | 2 min 1 s |
| PCSK1N | NCBIGene:27344 | 12 | from 1st template (6) and 4th template (6), no merging | 17 s |
| TTR | NCBIGene:7276 | 500 | 92, 30 (merge 1, add 29), 0, 9259 (merge 31, add 9228), then stop execution and truncate. | 2 min 54 s |

Skipped testing GAPDH (NCBIGene:2597). Previously, it ran the first 3 templates in ~1 min, then the last hop of the 4th template got stuck at its end (ID resolution/record intersecting). I stopped it after it had run for > 14 min.

Decreased

starting with chem

| Chem Name | Chem ID | Final results | Template breakdown and notes | Time |
|---|---|---|---|---|
| Amphetamine | PUBCHEM.COMPOUND:3007 | 500 | 207, 330 (merge 34, add 296), then stopped execution. 503 total, and truncated | 25 s |
| Dextroamphetamine | PUBCHEM.COMPOUND:5826 | 237 | 21, 131, 40 (merge 11, add 29), 67 (merge 34, add 33), 31 (merge 8, add 23) | 30 s |
| (+/-)-Methylphenidate hydrochloride (curie given) | PUBCHEM.COMPOUND:44246724 | 0 | no results | 15 s |
| Methylphenidate (better curie) | PUBCHEM.COMPOUND:4158 | 335 | 113, 182 (merge 23, add 159), 42 (merge 9, add 33), 54 (merge 25, add 29), 8 (merge 7, add 1) | 34 s |
| metformin | PUBCHEM.COMPOUND:4091 | 500 | only 1st template (1549 results) | 32 s |
| Atorvastatin | PUBCHEM.COMPOUND:60823 | 500 | only 1st template (788 results) | 18 s |
| Valproic acid glucuronide (curie given) | PUBCHEM.COMPOUND:88111 | 0 | no results | 14 s |
| Valproic acid (better curie) | PUBCHEM.COMPOUND:3121 | 500 | only 1st template (2034 results) | 54 s |
| Vitamin A / retinol | PUBCHEM.COMPOUND:445354 | 500 | 353, 408 (merge 89, add 319), then stopped execution. 672 total, and truncated | 27 s |
| Vitamin C (ascorbic acid) | PUBCHEM.COMPOUND:54670067 | 500 | only 1st template (826 results) | 22 s |
| Vitamin D (Cholecalciferol) | PUBCHEM.COMPOUND:5280795 | 500 | 428, 115 (merge 36, add 79), then stopped execution. 507 total, and truncated | 23 s |
| Maltodextrin (curie given) | PUBCHEM.COMPOUND:79025 | 2 | only from 5th template. None scored | 16 s |
| Glucose (better curie) | PUBCHEM.COMPOUND:24749 | 27 | only 1st template. None scored | 15 s |
| Magnesium ion (curie given) | PUBCHEM.COMPOUND:888 | 17 | only 1st template | 15 s |
| Magnesium (atom, better curie) | PUBCHEM.COMPOUND:5462224 | 497 | 138, 266 (merge 6, add 260), 145 (merge 46, add 99), 0, 0 | 30 s |
| DHEA (Dehydroepiandrosterone) | PUBCHEM.COMPOUND:5881 | 500 | 316, 146 (merge 45, add 101), 122 (merge 54, add 68), 62 (merge 36, add 26), then stopped execution. 511 total, and truncated | 36 s |
| Testosterone | PUBCHEM.COMPOUND:6013 | 500 | only 1st template (846 results) | 24 s |
| Ethinylestradiol (curie given) | PUBCHEM.COMPOUND:5991 | 500 | 217, 2353 (merge 76, add 2277), then stopped execution. 2494 total, and truncated | 1 min 11 s |
| Estrogens (another curie) | UMLS:C0014939 | 500 | only 1st template (3294 results) | 1 min 22 s |
| Somatostatin acetate (curie given) | PUBCHEM.COMPOUND:16129681 | 39 | only 1st template | 16 s |
| Somatostatin (better curie) | PUBCHEM.COMPOUND:101826531 | 276 | 231, 62 (merge 43, add 19), 0, 95 (merge 70, add 25), 21 (merge 20, add 1) | 28 s |
| Amitriptyline | PUBCHEM.COMPOUND:2160 | 500 | 268, 253 (merge 54, add 199), 335 (merge 96, add 239), then stopped execution. 706 total, and truncated | 34 s |
| Gabapentin | PUBCHEM.COMPOUND:3446 | 224 | 131, 3, 60 (merge 11, add 49), 47 (merge 13, add 34), 13 (merge 6, add 7) | 27 s |
| Propranolol | PUBCHEM.COMPOUND:4946 | 500 | only 1st template (618 results) | 18 s |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 234 | 73, 18 (merge 8, add 10), 4 (merge 2, add 2), 166 (merge 27, add 139), 37 (merge 27, add 10) | 31 s |
| d4-Palmitic acid (curie given) | PUBCHEM.COMPOUND:12358543 | 0 | no results | 14 s |
| palmitic acid (better curie) | PUBCHEM.COMPOUND:135369651 | 500 | only 1st template (693 results) | 20 s |
starting with gene

| Gene Name | Gene ID | Final results | Template breakdown and notes | Time |
|---|---|---|---|---|
| MAPK8IP3 | NCBIGene:23162 | 500 | 7, 0, 0, 756, then stopped execution. 763 total, and truncated | 24 s |
| TP53 | NCBIGene:7157 | 500 | only 1st template (1232 results) | 27 s |
| CYP27B1 (a P450) | NCBIGene:1594 | 500 | 31, 0, 0, 6421 (merge 3, add 6418), then stopped execution. 6449 total, and truncated | 1 min 43 s |
| XIST | NCBIGene:7503 | 7 | only 1st template | 13 s |
| TSIX | NCBIGene:9383 | 0 | no results | 12 s |
| ADNP | NCBIGene:23394 | 364 | from 1st template (3) and 4th template (361), no merging. None scored | 22 s |
| BORCS8-MEF2B | NCBIGene:4207 | 0 | no results | 12 s |
| MMP9 | NCBIGene:4318 | 500 | only 1st template (1404 results) | 27 s |
| MEF2C | NCBIGene:4208 | 500 | 34, 0, 0, 6841 (merge 3, add 6838), then stopped execution. 6872 total, and truncated | 1 min 43 s |
| MALAT1 | NCBIGene:378938 | 9 | only 1st template | 13 s |
| DYRK1A | NCBIGene:1859 | 500 | 49, 0, 0, 4023 (merge 14, add 4009), then stopped execution. 4058 total, and truncated | 1 min 3 s |
| PCSK1N | NCBIGene:27344 | 0 | no results | 1 min 31 s |
| TTR | NCBIGene:7276 | 500 | 84, 7, 16, 4003 (merge 19, add 3984), then stopped execution. 4091 total, and truncated | 1 min 8 s |

Skipped testing GAPDH (NCBIGene:2597). Previously, it ran the first 3 templates in ~1 min, then the last hop of the 4th template got stuck at its end (ID resolution/record intersecting). I stopped it after it had run for 15 min.

@tokebe
Member

tokebe commented Jan 20, 2023

@colleenXu Regarding Point 1, I think your confusion comes from assuming that the merge log is supposed to be per-template. It isn't. Records are merged per-template, but the only merge log is a summary at the end of all merged results across all templates. Please pull and run again, and you'll see the log has been updated to show the number of results merged, the number they were merged into, and the actual result count decrease. If you add up the results for each template and then subtract the actual result count decrease, the math checks out (I spent a considerable amount of time verifying this last round...).
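
The arithmetic described here can be checked by hand against the per-template counts reported earlier in this thread. A minimal sketch, assuming each "merge N" entry means N results were folded into results that already existed (so the result count drops by N); the function name and tuple layout are made up for illustration and are not BTE's logging code:

```python
# Hypothetical helper, not part of BTE: checks that the per-template result
# counts minus the results absorbed by merging equal the final result count.

def final_result_count(per_template):
    """per_template: list of (results_returned, results_merged), one per template."""
    total_returned = sum(returned for returned, _ in per_template)
    total_decrease = sum(merged for _, merged in per_template)
    return total_returned - total_decrease

# Sumatriptan ("increased" direction), as reported earlier in this thread:
# 38, 4 (merge 1), 11 (merge 2), 121 (merge 7), 37 (merge 16)
sumatriptan = [(38, 0), (4, 1), (11, 2), (121, 7), (37, 16)]
print(final_result_count(sumatriptan))  # 185, matching the reported final count
```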

We could log per-template as well, but I don't particularly see the need to do so?

Running GAPDH on my local takes 19.9 minutes, which I agree is a little too much. Some of my optimizations may help with this, but I do think there's a case to be made for further decreasing the max records allowed.

@tokebe
Member

tokebe commented Jan 24, 2023

After a meeting with @colleenXu the problem was confirmed. I've investigated the issue and found the reason:

Multiple results from the same template can end up being merged if that template is a multi-hop and the results connect the same subject and object via different intermediate nodes. IIRC, this was an intended behavior to keep results relatively well-organized. Such results would not be merged in non-creative execution (which is another question worth asking somewhere else).

As a side effect of this, merging can show more results merged than one might expect: the maximum number merged in a step is not the smaller of the current result set and the current template's results, but the sum of the two.
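
The within-template merging described above can be sketched as follows. This is an illustration under assumptions, not BTE's merge code: results are modeled as plain (subject, intermediate, object) tuples, and the IDs are placeholders.

```python
from collections import defaultdict

def merge_by_endpoints(results):
    """Group results that share (subject, object); collect their intermediates."""
    merged = defaultdict(set)
    for subject, intermediate, obj in results:
        merged[(subject, obj)].add(intermediate)
    return dict(merged)

# One two-hop template: chemical -> intermediate gene -> target gene.
template_results = [
    ("CHEM:1", "GENE:A", "GENE:X"),
    ("CHEM:1", "GENE:B", "GENE:X"),  # same endpoints, different intermediate
    ("CHEM:1", "GENE:A", "GENE:Y"),
]
merged = merge_by_endpoints(template_results)
print(len(template_results), "->", len(merged))  # 3 -> 2
```

Because results from a single template can merge with each other as well as with earlier results, a per-step merge count can exceed what a comparison against the previous result set alone would suggest.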

@colleenXu I'm working on a fix to change the logging behavior to explicitly point this out when it occurs, and will push that change to this branch (and main) when it's done.

@tokebe
Member

tokebe commented Jan 24, 2023

@colleenXu I've pushed multiple creative mode logging fixes and improvements and the math appears to check out now. Please run a couple tests and let me know if the logging seems better.

@colleenXu
Collaborator

@tokebe Err...I'm not sure if you missed my update 3 yesterday (the internal Slack thread here).

I think the issue was that I didn't set the is_set: true parameter for intermediate QNodes in the templates. I pushed a commit here and tested, and then the "merging" logs looked reasonable...
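
For reference, a rough sketch of what setting is_set: true on intermediate QNodes might look like. The node keys, categories, and helper name below are assumptions for illustration; the actual template files may be structured differently.

```python
import copy

def mark_intermediates_as_sets(query_graph, endpoint_keys):
    """Return a copy of the query graph with is_set=True on non-endpoint nodes."""
    qg = copy.deepcopy(query_graph)
    for key, node in qg["nodes"].items():
        if key not in endpoint_keys:
            node["is_set"] = True
    return qg

# Illustrative two-hop template; node keys and categories are assumptions.
template = {
    "nodes": {
        "chem": {"categories": ["biolink:ChemicalEntity"]},
        "intermediate": {"categories": ["biolink:Gene"]},
        "gene": {"categories": ["biolink:Gene"], "ids": ["NCBIGene:7157"]},
    },
    "edges": {
        "e0": {"subject": "chem", "object": "intermediate"},
        "e1": {"subject": "intermediate", "object": "gene"},
    },
}
fixed = mark_intermediates_as_sets(template, endpoint_keys={"chem", "gene"})
print(fixed["nodes"]["intermediate"])
```

With the intermediate node treated as a set, paths that differ only in that node collapse into one result, which would explain the more reasonable merge counts.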

@colleenXu
Collaborator

colleenXu commented Jan 25, 2023

Feedback:

  • APIs summary log at the end is broken when only the first template is run?
   2023-01-25T03:02:54.891Z INFO:    [Template-1]: Execution Summary: (906) nodes / (942) edges / (905) results; (3/36) queries returned results from (2) unique APIs
   2023-01-25T03:02:54.891Z INFO:    [Template-1]: APIs: BioThings SEMMEDDB API, Text Mining Targeted Association API
   2023-01-25T03:02:54.895Z INFO:    (0) results from Template-1 were merged with other results from the template. (0) results were merged with existing results from previous templates. Current result count is 905 (+905)
   2023-01-25T03:02:54.895Z INFO:    Addition of 905 results from Template 1 exceeds creative result maximum of 500 (reaching 905 merged). Response will be truncated to top-scoring 500 results. Skipping remaining 4 templates.
   2023-01-25T03:02:54.895Z INFO:    Final result count (before truncation): 905
   2023-01-25T03:02:54.897Z INFO:    Execution Summary: (501) nodes / (537) edges / (500) results; (0/36) queries returned results from (0) unique APIs
   2023-01-25T03:02:54.897Z INFO:    APIs:
   2023-01-25T03:02:54.897Z INFO:    Scoring Summary: (273) scored / (227) unscored

Otherwise, new logs look good!
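
For anyone reading the log above without the code at hand, here is a rough sketch of the stop-and-truncate behavior it describes. It is a simplified model under assumptions (results as a key-to-score mapping, the higher score kept on merge, every result scored) and not the actual BTE implementation:

```python
MAX_RESULTS = 500  # creative result maximum quoted in the log above

def run_templates(templates, max_results=MAX_RESULTS):
    """templates: list of callables, each returning {result_key: score}."""
    combined = {}
    for index, run in enumerate(templates):
        for key, score in run().items():
            # A result already present is merged; keeping the higher score
            # is an assumption made for this sketch.
            combined[key] = max(score, combined.get(key, score))
        if len(combined) > max_results:
            skipped = len(templates) - index - 1
            print(f"Exceeded {max_results} with {len(combined)} results; "
                  f"skipping remaining {skipped} templates.")
            break
    # Truncate to the top-scoring results.
    top = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return dict(top[:max_results])

def first_template():
    return {f"result-{i}": float(i) for i in range(905)}

def later_template():
    raise RuntimeError("should have been skipped")

final = run_templates([first_template] + [later_template] * 4)
print(len(final))  # 500
```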

@tokebe
Member

tokebe commented Jan 25, 2023

I didn’t miss your update. Regardless of is_set behavior, the behavior without is_set looked wrong and needed fixing. I confirmed what was wrong with log clarity for those cases and fixed them.

Fix for the API end summary incoming.

@tokebe
Member

tokebe commented Jan 25, 2023

Pushed the fix; yet another fun case of a change somehow not making it into a commit while silently remaining on my local, making me think I'm losing my mind lol.

@colleenXu
Collaborator

colleenXu commented Jan 27, 2023

Sorry for the late reply. I reran a bunch of queries and I think things look good! I like the new logs.

Perhaps we're ready to make a request for ITRB CI?

@colleenXu
Collaborator

Note that the templates were changed: the physically_interacts_with predicate was replaced with the more general interacts_with. This allows us to use dgidb for those templates (and mychem after its edits biothings/pending.api#101 (comment))
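
As a rough illustration of the predicate change, here is a sketch that swaps the predicate on a template's edges. The `biolink:` prefixes, edge layout, and helper name are assumptions; the real change was made directly in the template files (see the linked PR).

```python
import copy

OLD = "biolink:physically_interacts_with"
NEW = "biolink:interacts_with"

def generalize_predicates(template, old=OLD, new=NEW):
    """Return a copy of the template with `old` replaced by `new` on every edge."""
    updated = copy.deepcopy(template)
    for edge in updated["edges"].values():
        if "predicates" in edge:
            edge["predicates"] = [new if p == old else p for p in edge["predicates"]]
    return updated

# Illustrative edge only; node keys are placeholders.
example = {
    "edges": {
        "e0": {"subject": "chem", "object": "gene", "predicates": [OLD]},
    },
}
print(generalize_predicates(example)["edges"]["e0"]["predicates"])
```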

biothings/bte_trapi_query_graph_handler#135

@tokebe
Member

tokebe commented Mar 28, 2023

Deployed to prod 🚀

@colleenXu
Collaborator

Noting here just in case:

Old template ideas that weren't implemented (intended effect is "downregulates"):

  • Chem downregulates (non-human) gene, then Gene is ortholog of human gene
    • not sure that this would work, especially when I can't specify human vs non-human genes right now
  • Chem upregulates a Gene, then that Gene does a BP that negatively-regulates another BP done by another Gene
