Create creative mode templates for "What chemicals [qualified] affect (increases/decreases) a given protein/gene?" #532

andrewsu · 2022-12-06T19:33:44Z

Translator consortium target for implementation in Feb 2023. Exact TRAPI query templates will be created soon per Architecture meeting 2022-12-06...

EDIT Implementation dates:

Jan 25: Development
Feb 13: ITRB CI

colleenXu · 2022-12-06T22:20:49Z

Note:

it's not clear to me from this title if increases/decreases is a variable provided by the user or not...
I'm unsure right now how to do this "creatively"

Possibly relevant, from Slack:

Andy Crouse (Unsecret Agent, UI team)
1:13 PM
This is the test set of genes and drugs for next creative mode query.
Note there are some ‘destroyer of worlds’ genes and drugs (valproic acid, vitamin A, TP53, P450)
So if the lights dim when testing that is why. But they should be good for pressure testing!
I included some that are not classically druggable like non-coding and transcription factors. So hopefully this will cover the gamut.
https://docs.google.com/document/d/1X3XbHhS_AIGkSyxSaYUuiCqVNh33Wz7f3EEyKZEctL8/edit?usp=sharing

andrewsu · 2022-12-07T16:41:08Z

The proposed TRAPI query templates are defined in these issues:

Creative: Implement "What Chemical increases a particular gene's activity or abundance"? NCATSTranslator/TranslatorArchitecture#79
Creative: Implement "What Chemical decreases a particular gene's activity or abundance" NCATSTranslator/TranslatorArchitecture#80
Creative: Implement "What gene's activity or abundance is increased by a particular chemical?" NCATSTranslator/TranslatorArchitecture#81
Creative: Implement "What gene's activity or abundance is decreased by a particular chemical?" NCATSTranslator/TranslatorArchitecture#82

tokebe · 2022-12-14T17:47:25Z

#534 is now addressed in add-qualifiers PRs, so integration testing is now possible.

colleenXu · 2022-12-14T18:12:51Z

Noting:

- the two "increases" starting-TRAPI are identical, except for which QNode has an ID at the start. With our current implementation, this means we'll pick the same templateGroup for both starting-TRAPI. I think that's okay.
  - However, the sub-query QEdge traversals will differ depending on the starting TRAPI (which QNode has an ID at the start). So the edge-attributes can differ due to the "reverse operation" issue - and maybe the number of nodes/edges/results...
- same with the two "decreases" starting-TRAPI
- Because we don't have qualifier-hierarchy traversal yet...we'll likely include the various qualifier-constraint-combos in our templates using qualifier-sets (OR) (FYI @tokebe )

tokebe · 2022-12-14T18:15:45Z

@colleenXu Presently, the qualifier-matching only supports one qualifier-set per templateGroup. Would you prefer I change the implementation to support multiple qualifier-sets in a templateGroup for OR matching within a single group (presently, this would be covered by making multiple templateGroups)?

Examples:

Presently:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": {
      "some_qualifier_type": "some_qualifier_value"
    }
  }

Proposed:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": [
      {
        "some_qualifier_type": "some_qualifier_value"
      },
      {
        "some_other_type": "this_would_be_OR"
      }
    ]
  }

colleenXu · 2022-12-14T18:21:29Z

I think using 1 qualifier-set for matching to templateGroup is fine because that's what we see in this issue (and near-future).

In the third bullet point of my post above, I'm saying that for OUR individual templates, I'll need to use multiple qualifier-sets to query for "activity" vs "activity_or_abundance" because we don't do qualifier-hierarchy expansion yet.
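A rough sketch of what "multiple qualifier-sets" on one of our template QEdges could look like. It relies on the TRAPI convention that the entries of `qualifier_constraints` are alternatives (OR) while the qualifiers inside one `qualifier_set` must all hold (AND). The helper name `qualifier_constraints_for` and the default aspect list are illustrative only, not code from BTE.

```python
def qualifier_constraints_for(direction, aspects=("activity_or_abundance", "activity")):
    """Build one qualifier_set per aspect; the sets are alternatives (OR)."""
    return [
        {
            "qualifier_set": [
                {
                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                    "qualifier_value": aspect,
                },
                {
                    "qualifier_type_id": "biolink:object_direction_qualifier",
                    "qualifier_value": direction,
                },
            ]
        }
        for aspect in aspects
    ]


# Two alternative sets: (activity_or_abundance, increased) OR (activity, increased)
constraints = qualifier_constraints_for("increased")
```

Once qualifier-hierarchy expansion exists, the extra set for "activity" should no longer need to be spelled out by hand.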

colleenXu · 2022-12-20T02:53:20Z

Knowledge resources related to this set of creative-mode questions (My analysis)

This is a work-in-progress, so edits will be done over time

Resources ready and BTE already uses

DGIdb
- 12 sets of chem - gene relationships (4 up, 4 down, 4 neither). However, the majority of the data (44557 records) has no relationship assigned (not_applicable)
text-mining
- has chem - gene (up/down)
- also has gene - gene (up/down)

Note that Multiomics APIs has some potentially relevant info but we have some concerns:

drug response: has gene-gene relationships, for negative and positive correlation. But not clear where this comes from or how to use this

Wellness: has chem-chem and chem-gene relationships, for correlation and related_to. But not clear where this comes from or how to use this

Want update and BTE already uses

Semmeddb

improve with a data update + generating operations that use the ncbigene fields
- update may be done soon? See internal lab Slack links. Will involve deprecated IDs/pipes/ncbigene handling/only having novelty=1 records
- When the update happens: we'll want to update the x-bte annotation ASAP-ish (for master + biolink3 branches). It'll help to have the post-API-deployment analysis of metatriples.....
- If I don't have that post-API-deployment analysis of metatriples...
  - if I know the semmeddb data file, I can quickly generate umls operations
  - generating ncbigene operations without the analysis of metatriples is possible but tricky?
    - check which subject/object semmeddb semantic-types have only the ncbigene field for ID. Kinda quick.
    - modify the final combos df to add rows when the metatriple has those subject/object semmeddb-semantic types

BindingDB

looking at the bindingdb website, we may be able to pick a more specific relationship for a chemical-gene pair
- can we get the assay description and find keywords in that text? For example, the assay description for this pair seems to say this chem is an agonist of this gene...
- can we use the existence of certain fields + the values of those fields? For example, if the IC50 field exists and its value is small, maybe this chem is an inhibitor of this gene.
  - The Ki field existence and value can maybe be used as well. I'm not sure if other fields can be used.
  - the values are tricky because some are integers, some are floats, and some are integers ">40000"
info: see my old note here, listing fields here, bindingDB's info
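To make the "values are tricky" point concrete, here is a small sketch of how the mixed value types (integers, floats, strings like ">40000") could be normalized before applying any "is it small?" rule. The function names, the idea of returning a relation plus a number, and the caller-supplied threshold are all assumptions for illustration; nothing here reflects how BindingDB data is actually parsed today.

```python
def parse_affinity(raw):
    """Return (relation, value) for a raw affinity field, or None if unparseable.

    relation is "=", ">" or "<".
    """
    text = str(raw).strip()
    relation = "="
    if text[:1] in ("<", ">"):
        relation, text = text[0], text[1:].strip()
    try:
        return relation, float(text)
    except ValueError:
        return None


def looks_small(raw, threshold):
    """True only when the value is known to be at or below the threshold."""
    parsed = parse_affinity(raw)
    if parsed is None:
        return False
    relation, value = parsed
    # A ">40000" reading is only a lower bound, so it can never count as small.
    return relation in ("=", "<") and value <= threshold


print(parse_affinity(12))        # ('=', 12.0)
print(parse_affinity("0.5"))     # ('=', 0.5)
print(parse_affinity(">40000"))  # ('>', 40000.0)
```

What threshold (and unit) counts as "small" would still need to be decided per field.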

Want new pending API

MyChem `chembl.drug_mechanisms` data, in subject-association-object format

We want 1 association per unique combo of subject/object/action_type.
- Right now in MyChem, all data is aggregated by unique chemical (chem-centric format). So when a chemical has multiple drug_mechanisms, we can't retrieve only the sections that have a specific action_type.
- MyChem currently has info for 5262 unique chemicals
We want the gene ID to be NCBIGene (or UniProtKB or ENSEMBL ENSG maybe).
- Right now, it's CHEMBL.TARGET and Translator's Node Normalizer doesn't have cross-mappings/name-retrieval for that ID namespace
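A sketch of the reshaping we are asking for, going from one chem-centric record to one association per unique subject/object/action_type. The record layout is an assumption: only `chembl.drug_mechanisms` and `action_type` come from the notes above, while the `target` key and the placeholder IDs are hypothetical and not checked against the real MyChem schema.

```python
def to_associations(chem_doc):
    """Flatten one chem-centric record into unique (subject, object, action_type) rows."""
    mechanisms = chem_doc.get("chembl", {}).get("drug_mechanisms", [])
    if isinstance(mechanisms, dict):
        # Tolerate a single mechanism stored as an object instead of a list.
        mechanisms = [mechanisms]
    seen = set()
    associations = []
    for mech in mechanisms:
        triple = (chem_doc["_id"], mech.get("target"), mech.get("action_type"))
        if None in triple or triple in seen:
            continue  # skip incomplete or duplicate combos
        seen.add(triple)
        associations.append(
            {"subject": triple[0], "object": triple[1], "action_type": triple[2]}
        )
    return associations


example_doc = {
    "_id": "CHEM:1",  # placeholder chemical ID
    "chembl": {
        "drug_mechanisms": [
            {"target": "TARGET:A", "action_type": "INHIBITOR"},
            {"target": "TARGET:A", "action_type": "INHIBITOR"},  # duplicate combo
            {"target": "TARGET:B", "action_type": "AGONIST"},
        ]
    },
}

# With associations as separate rows, filtering on one action_type is trivial.
inhibitors = [a for a in to_associations(example_doc) if a["action_type"] == "INHIBITOR"]
```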

MyChem `drugcentral.bioactivity` data, in subject-association-object format

We want 1 association per unique combo of subject/object/action_type.
- Right now in MyChem, all data is aggregated by unique chemical (chem-centric format). So when a chemical has multiple bioactivity, we can't retrieve only the sections that have a specific action_type.
- MyChem currently has info for 2772 unique chemicals
We want the gene ID to be UniProtKB (or NCBIGene or ENSEMBL ENSG maybe).
- Right now, it's UniProtKB but 119 chemicals lack the UniProtKB record...and their bioactivity info looks odd (not acting on genes?)

Not promising


Bioplanet

within a pathway, there can be "activation" and "inhibition" arrows. However....
- I find these hard to understand (see example)
- the actions may be done by complexes or done on complexes (not individual genes or chemicals)
- not clear how many pathways involve exogenous chemicals...
- I don't see an easy download/way of parsing these pathways to get "chemical-gene" associations (biopax?)

colleenXu · 2023-01-06T21:36:55Z

Results for the direct-increases.json template with Gene Inputs

Reasonable number of results

Gene Name	Gene ID	Time to run (s)	Number of results	Notes
MAPK8IP3	NCBIGene:23162	11	4	but only 1 looks correct; Text-Mining
P450 (CYP27B1?)	NCBIGene:1594	12	48	semmeddb + text-mining
XIST	NCBIGene:7503	12	4	semmeddb
ADNP	NCBIGene:23394	12	4	semmeddb + text-mining
MEF2C	NCBIGene:4208	12	48	text-mining
MALAT1	NCBIGene:378938	11	3	semmeddb
DYRK1A	NCBIGene:1859	11	14	text-mining
PCSK1N	NCBIGene:27344	11	6	semmeddb + text-mining
TTR	NCBIGene:7276	12	91	semmeddb + text-mining
GNAS	NCBIGene:2778	11	9	semmeddb
IAPP	NCBIGene:3375	12	51	text-mining
SCG5	NCBIGene:6447	11	1	text-mining
B2M	NCBIGene:567	12	66	text-mining
Cytochrome B	NCBIGene:4519	12	16	text-mining
RBP4	NCBIGene:5950	12	16	text-mining
NGLY1	NCBIGene:55768	11	2	text-mining
RHOBTB2	NCBIGene:23221	11	3	text-mining
WDFY3	NCBIGene:23001	12	5	text-mining
KCNT1	NCBIGene:57582	11	5	DGIDB + text-mining
NALCN	NCBIGene:259232	11	1	semmeddb
CACNA1A	NCBIGene:773	11	2	text-mining
BICD2	NCBIGene:23299	10	4	semmeddb + text-mining
FHL1 (CFH)	NCBIGene:3075	11	4	semmeddb + text-mining
SETBP1	NCBIGene:26040	12	3	text-mining

Many results (exploding)

Gene Name	Gene ID	Time to run (s)	Number of results	Notes
TP53	NCBIGene:7157	32	1744	semmeddb + text-mining
MMP9	NCBIGene:4318	20	753	semmeddb + text-mining
GAPDH	NCBIGene:2597	15	220	semmeddb + text-mining
Cytochrome oxidase	NCBIGene:4512	14	177	semmeddb + text-mining
PPP3CA	NCBIGene:5530	14	182	semmeddb + text-mining
ATF3	NCBIGene:467	14	226	semmeddb + text-mining

No results

Gene Name	Gene ID	Time to run (s)
TSIX	NCBIGene:9383	13
BORCS8-MEF2B	NCBIGene:4207	9
FAU	NCBIGene:2197	10
DHX30	NCBIGene:22907	10
SAMD9L	NCBIGene:219285	10
Engase	NCBIGene:64772	10

colleenXu · 2023-01-11T06:33:13Z

I found that 2 templateGroup items worked: one for Chem-increases-Gene and one for Chem-decreases-Gene (input ID can be on the Chem QNode or the Gene QNode). I wrote the templateGroups and the templates @andrewsu + I discussed on Monday and did a bit of testing (see next post). These are in this branch biothings/bte_trapi_query_graph_handler#132

template ideas we decided to write up

Chem (increases/decreases) gene (direct)
Chem increases another gene, then that gene (upregs/downregs) Gene (upregs -> overall increase, downregs -> overall decrease)
- I modified QEdges to be more specific: removing increased activity_or_abundance for eA and increased/decreased activity_or_abundance for eB. This removed semmeddb and some dgidb edges...
Chem decreases another gene, then that gene (downregs/upregs) Gene (downregs -> overall increase, upregs -> overall decrease)
- I modified QEdges to be more specific (removing increased activity_or_abundance for eA and increased/decreased activity_or_abundance for eB). This removed semmeddb and some dgidb edges...
Chem interacts with another gene, then that gene (upreg/downreg) gene
- I modified QEdges to be more specific: using physically_interacts_with for eA and increased/decreased activity_or_abundance for eB. This removed semmeddb edges and a major dgidb edge...
Chem interacts with gene (direct)
- I modified QEdges to be more specific: using physically_interacts_with for eA. This removed semmeddb edges and a major dgidb edge...
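The two-hop ideas above follow a sign rule: each hop either keeps or flips the direction, so the overall effect is the product of the hop signs. A minimal sketch of that rule, with illustrative names only (not how the templates are stored):

```python
SIGN = {"increased": 1, "decreased": -1}
DIRECTION = {1: "increased", -1: "decreased"}


def overall_direction(*hop_directions):
    """Combine per-hop directions into the net effect on the final gene."""
    sign = 1
    for direction in hop_directions:
        sign *= SIGN[direction]
    return DIRECTION[sign]


def two_hop_combos(target):
    """List the (eA, eB) direction pairs whose net effect equals target."""
    return [(a, b) for a in SIGN for b in SIGN if overall_direction(a, b) == target]


print(two_hop_combos("increased"))  # [('increased', 'increased'), ('decreased', 'decreased')]
print(two_hop_combos("decreased"))  # [('increased', 'decreased'), ('decreased', 'increased')]
```

So "Chem decreases another gene, then that gene downregs Gene" lands in the increases templateGroup, matching the list above.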

colleenXu · 2023-01-11T07:09:00Z

How I tested

Check out fix_issue_532 branch for query-handler and dev branches for all other modules (including bte-trapi-workspace).

For the "user"/"from UI/ARS" query...the main skeleton of the query is this. What varies is (1) which QNode also has an ids field with 1 user-submitted ID and (2) whether the object_direction_qualifier value is increased or decreased (I put X there).

skeleton

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "X"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
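For anyone scripting these tests, a small helper along these lines could fill in the skeleton. The function name `build_query` and its argument names are made up for this sketch; the structure it returns mirrors the skeleton above.

```python
import json


def build_query(input_curie, input_node, direction):
    """Fill in the skeleton: put one ID on one QNode and set the direction qualifier."""
    if input_node not in ("gene", "chemical"):
        raise ValueError("input_node must be 'gene' or 'chemical'")
    if direction not in ("increased", "decreased"):
        raise ValueError("direction must be 'increased' or 'decreased'")
    nodes = {
        "gene": {"categories": ["biolink:Gene"]},
        "chemical": {"categories": ["biolink:ChemicalEntity"]},
    }
    nodes[input_node]["ids"] = [input_curie]
    edge = {
        "object": "gene",
        "subject": "chemical",
        "predicates": ["biolink:affects"],
        "knowledge_type": "inferred",
        "qualifier_constraints": [
            {
                "qualifier_set": [
                    {
                        "qualifier_type_id": "biolink:object_aspect_qualifier",
                        "qualifier_value": "activity_or_abundance",
                    },
                    {
                        "qualifier_type_id": "biolink:object_direction_qualifier",
                        "qualifier_value": direction,
                    },
                ]
            }
        ],
    }
    return {"message": {"query_graph": {"nodes": nodes, "edges": {"t_edge": edge}}}}


# "What chemicals decrease GAPDH?"
query = build_query("NCBIGene:2597", "gene", "decreased")
print(json.dumps(query, indent=4))
```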

So here's an example of ChemX (sumatriptan) increases genes

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

Discussed

EDIT: discussed below + during Wednesday 1/11 meeting. Decisions added to each point

@tokebe

After testing biothings/bte_trapi_query_graph_handler#132 (it's branched off dev, and I tested it with all my other branches as dev)....overall it looks close to done. However, I have the following questions / issues:

Looking at the logs, I'm not sure if some results are "pruned / removed" incorrectly. A confounding / related issue is that I don't see "same results from multiple templates merging" console logs and the "result merging" TRAPI-logs are buggy. -> JC to look into this (aka issues with creative-mode)
- Using the chem sumatriptan PUBCHEM.COMPOUND:5358 is a good example (runs fairly fast, results from multiple templates and merging happens). See its entries in the Chem tables of the next post.
BTE isn't doing "exact qualifier-set" matching to templateGroups. Is that alright? -> CX asked Translator group (Translator private Slack link). This behavior is fine right now
- For example, if I send a query with an "inferred" "activity_or_abundance" QEdge (no increased/decreased direction qualifier), all 10 templates would run...

query with no direction qualifier so all (EDIT: 9) templates will be loaded

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I noticed something odd with the "limiting execution" code - where the records accumulating for a QEdge seemed a lot higher than the stated limit...The console log said bte:call-apis:query QEdge eA obtained 59903 records, exceeding maximum of 30000. Skipping remaining 1 (0 planned/1 paged) queries for this edge. Your query may be too general? +0ms. -> It checks after each individual sub-query. Sometimes the excess is a lot! JC will adjust code to truncate and remove the records over the maximum (he'll decide how to implement).
- this log happened during the 4th template for Chem decreases GAPDH/NCBIGene:2597. It then was taking a very long time for the ID resolution (so I killed the execution manually).
- Something similar happens for increases GAPDH but there are between 30k-40k records there (still takes too long for ID resolution).
- See GAPDH's entries in the Gene tables of the next post.

Starting query for Chem decreases GeneY (GAPDH)

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"],
                    "ids": ["NCBIGene:2597"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

Still having issues with the logs summarizing the execution of each QEdge: "eA03 execution: 0 queries (0 success/0 fail) and (0) cached qEdges return (80) records". I mentioned that here (lab's internal Slack) -> JC to look into this (aka issues with creative-mode)
I think we can consider drastically dropping the max number of results to return (maybe 200? 400?). -> we discussed and agreed to drop max number to 500 for all creative-mode (same as ARAX)

colleenXu · 2023-01-11T07:20:06Z

Testing

[reverted; this records the testing done 1/10]

Increases

Chem Name	Chem ID	Final results	Template breakdown and notes
metformin	PUBCHEM.COMPOUND:4091	1000	> 1000 on first template
palmitic acid	PUBCHEM.COMPOUND:12358543	0	hmmm...example of no results
sumatriptan	PUBCHEM.COMPOUND:5358	137	37, 4, 2, 85, 32 (160 sum). Results dropping or merging?

Gene Name	Gene ID	Final results	Template breakdown and notes
MAPK8IP3	NCBIGene:23162	4	4 from first template, none from the rest. Compare to `decreases` entry.
P450 (CYP27B1?)	NCBIGene:1594	832	48, 0, 0, 770, 18 (836 sum). Results dropping or merging?
MEF2C	NCBIGene:4208	1000	48, 1, 0, 6453 (stop).
GAPDH	NCBIGene:2597	presume 1000	220, 88, 112, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion.

Decreases

Chem Name	Chem ID	Final results	Template breakdown and notes
metformin	PUBCHEM.COMPOUND:4091	1000	> 1000 on first template
palmitic acid	PUBCHEM.COMPOUND:12358543	0	hmmm...example of no results
sumatriptan	PUBCHEM.COMPOUND:5358	177	56, 19, 3, 132, 32 (242 sum). Results dropping or merging?

Gene Name	Gene ID	Final results	Template breakdown and notes
MAPK8IP3	NCBIGene:23162	761	7, 0, 0, 754, 0 (sum 761, so no dropping/merging)
P450 (CYP27B1?)	NCBIGene:1594	1000	31, 0, 0, 6694 (stop)
MEF2C	NCBIGene:4208	1000	34, 0, 0, 6923 (stop)
GAPDH	NCBIGene:2597	presume 1000	290, 120, 24, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion.

tokebe · 2023-01-11T15:30:03Z

1. I'll have to take a closer look at issues with results merging/pruning.
2. I had previously stated that, currently, templates are matched purely by the query qualifier set being a subset of the templateGroup, rather than an exact match (i.e. both sets are equivalent, templateGroup has no additional qualifiers). I can change this to expecting an exact match instead?
3. The record cutoff is checked after each query. If a query returns an absurd number of records, it's possible to exceed the record cutoff by an absurd amount. Should we instead delete records above the cutoff? I'd like @andrewsu's opinion on this as well.
4. Looking into it.
5. I don't have a problem with this -- @andrewsu?
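To illustrate the subset-vs-exact question, here is a simplified sketch of the two matching modes. The templateGroup names, the un-prefixed qualifier keys, and the functions are assumptions for illustration; the real matching code in the query handler may be organized differently.

```python
def group_matches(query_qualifiers, group_qualifiers, exact=False):
    """Subset mode: every query qualifier appears in the group. Exact mode: sets are equal."""
    if exact:
        return query_qualifiers == group_qualifiers
    return all(
        group_qualifiers.get(key) == value for key, value in query_qualifiers.items()
    )


# Hypothetical templateGroups, one per direction.
TEMPLATE_GROUPS = {
    "Chem increases Gene": {
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "increased",
    },
    "Chem decreases Gene": {
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "decreased",
    },
}


def matching_groups(query_qualifiers, exact=False):
    return [
        name
        for name, qualifiers in TEMPLATE_GROUPS.items()
        if group_matches(query_qualifiers, qualifiers, exact)
    ]


no_direction = {"object_aspect_qualifier": "activity_or_abundance"}
print(matching_groups(no_direction))              # ['Chem increases Gene', 'Chem decreases Gene']
print(matching_groups(no_direction, exact=True))  # []
```

Under subset matching, a query without a direction qualifier pulls in both groups, which is the "all templates run" behavior reported earlier in the thread.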

andrewsu · 2023-01-11T16:28:46Z

@colleenXu can you post the question to one of the translator slack channels? I can see arguments both ways, and I think this is something that should be standardized across ARAs.
Originally my impression was that if we have the results beyond our threshold, might as well keep them. But I hadn't thought of the fact that other downstream things need to happen (like ID resolution). If those downstream things are essentially preventing BTE from returning the partial results, then I vote for deleting excess records. If not (and the issue is only the oddity of exceeding our threshold by a significant margin), then I vote for keeping the excess records. EDIT: decision made during 2023-01-11 meeting to delete excess records.

5. I'd suggest 500 to match what ARAX returns.

tokebe · 2023-01-12T20:38:18Z

@colleenXu I've fixed the results merging count and logging. As a result, I can confirm that no results are being dropped. Running sumatriptan, I get 433 results added across the templates, and 244 results in the final response. Logging now shows 189 results that were combined. 433 - 189 = 244, everything checks out. I was actually running in circles for a while because I thought I had to account for the number of results merged into, but this is simply a count of results from the 244 that were combined from multiple templates.
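One way to picture the accounting (added minus combined equals final), as a toy sketch. Keying results by a single ID and the function name `merge_template_results` are assumptions for illustration; the actual merge logic keys on whatever identifies a result in BTE.

```python
def merge_template_results(results_by_template):
    """Merge per-template results by key, counting additions and combinations."""
    merged = {}
    added = 0      # every result produced by any template
    combined = 0   # results folded into an already-seen key
    for results in results_by_template:
        for key, payload in results:
            added += 1
            if key in merged:
                merged[key].append(payload)
                combined += 1
            else:
                merged[key] = [payload]
    return merged, added, combined


# Toy data: GENE:2 is found by all three templates.
template_results = [
    [("GENE:1", "t1"), ("GENE:2", "t1")],
    [("GENE:2", "t2"), ("GENE:3", "t2")],
    [("GENE:2", "t3")],
]
merged, added, combined = merge_template_results(template_results)
print(added, combined, len(merged))  # 5 2 3
```

The same identity holds for the sumatriptan run: 433 added, 189 combined, 244 final.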

I'm not sure what's different between our locals that causes a difference in numbers (perhaps I haven't updated specs recently?), but all results are accounted for, and no code is dropping results.

I've also fixed issue 4 -- a piece of code in call-apis was still using pre-qedge-refactor accessors, breaking the counts.

Additionally, I've pushed changes so records are truncated to the cutoff, and changed the creative limit to 500.
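For context on why the old behavior overshot so badly, a simplified sketch: the cap is only checked after a sub-query has returned, so one large response can blow past it, and truncation then trims back to the cap. The function, its `truncate` flag, and the batch sizes are invented for illustration (the sizes are chosen to add up to the 59903 figure from the console log quoted earlier in this thread).

```python
def run_sub_queries(sub_queries, max_records, truncate=False):
    """Run sub-queries in order, checking the record cap only after each one returns."""
    records = []
    skipped = 0
    for index, fetch in enumerate(sub_queries):
        records.extend(fetch())
        if len(records) > max_records:
            # The whole batch has already arrived, so the overshoot can be
            # as large as one sub-query's response.
            skipped = len(sub_queries) - index - 1
            break
    if truncate:
        records = records[:max_records]
    return records, skipped


# Made-up batch sizes: 25000 + 34903 = 59903, with one more sub-query pending.
batches = [
    lambda: list(range(25000)),
    lambda: list(range(34903)),
    lambda: list(range(10)),
]

kept, skipped = run_sub_queries(batches, max_records=30000)
print(len(kept), skipped)  # 59903 1

truncated, _ = run_sub_queries(batches, max_records=30000, truncate=True)
print(len(truncated))      # 30000
```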

All set to proceed with testing again 👍

colleenXu · 2023-01-20T02:02:06Z

@tokebe

Questions I have after my second round of testing:

Followup on Point 1 of this post: When I run a decreased query for sumatriptan (PUBCHEM.COMPOUND:5358), I notice that there are 234 final results but one of the templates reports getting 247 results (in the logs). I'm then confused...if we're not truncating results, how can a template get more results than there are in the final count?

TRAPI query for `decreased` sumatriptan + screenshot of logs

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I circled in orange the numbers mentioned above.

Another followup on Point 1: I'm still confused by the "merge" logs (TRAPI and console) 😖 since I don't understand how, when merging the results from two templates, the "merged result" and "final result" numbers can be larger than the result count for one of those templates. The chem estrogen's queries are good examples (see the collapsed section below for the TRAPI query / info).
- related question: I notice only 1 "merge" log at the end of creative-mode execution, even when we execute > 2 templates. However, it kinda looks like we merge previous results with new results after running each template - so it'd make more sense to put logs there (how many were merged / how many results there are so far). But maybe I'm incorrect?
- related question: it looks like we merge results before we truncate to 500. Just checking if my understanding is correct...

TRAPI query for `increased` estrogen + screenshot of logs

For increases, the estrogen query logs say 271 + 5692 results have 3409 results "merged" to "1001 final results".

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5991"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I circled in orange the numbers mentioned above.

Followup on Point 3 of this post: my local BTE instance is still getting stuck on GAPDH queries (see notes in the "second round of testing" collapsed section, gene tables). Maybe we'll address this with optimizations later...but it makes me wonder if BTE will crash during some testing of these creative-modes, and if 30k records is still kinda high...

(Points 4-5 from the post have been addressed by the recent changes (yay), and point 2 is still pending / doesn't seem to be an issue)

colleenXu · 2023-01-20T07:26:37Z

Second round of testing

[EDITED 1/23-1/26 after is_set: true added to templates and fixes added to logging. Rearranged list to match original Translator issue curie lists]

I tested more chemicals and genes this time, including all chemicals listed in the Translator posts

Increased

starting with chem

Chem Name	Chem ID	Final results	Template breakdown and notes	Time
Amphetamine	PUBCHEM.COMPOUND:3007	500	306, 292 (merge 58, add 234), then stopped execution. 540 total and truncated	33 s
Dextroamphetamine	PUBCHEM.COMPOUND:5826	230	36, 104 (merge 8, add 96), 60 (merge 4, add 56), 54 (merge 36, add 18), 31 (merge 7, add 24)	28 s
(+/-)-Methylphenidate hydrochloride (curie given)	PUBCHEM.COMPOUND:44246724	0	no results	14 s
Methylphenidate (better curie)	PUBCHEM.COMPOUND:4158	314	131, 139 (merge 29, add 110), 59 (merge 11, add 48), 40 (merge 20, add 20), 8 (merge 3, add 5)	35 s
metformin	PUBCHEM.COMPOUND:4091	500	only 1st template (1184 results)	27 s
Atorvastatin	PUBCHEM.COMPOUND:60823	500	only 1st template (564 results)	17 s
Valproic acid glucuronide (curie given)	PUBCHEM.COMPOUND:88111	0	no results	14 s
Valproic acid (better curie)	PUBCHEM.COMPOUND:3121	500	only 1st template (2553 results)	1 min 7 s
Vitamin A / retinol	PUBCHEM.COMPOUND:445354	500	487, 359 (merge 86, add 273), then stopped execution. 760 total, and truncated	32 s
Vitamin C (ascorbic acid)	PUBCHEM.COMPOUND:54670067	500	only 1st template (850 results)	22 s
Vitamin D (Cholecalciferol)	PUBCHEM.COMPOUND:5280795	500	only 1st template (537 results)	16 s
Maltodextrin (curie given)	PUBCHEM.COMPOUND:79025	3	1 from 4th template, 2 from 5th template. None merged.	17 s
Glucose (better curie)	PUBCHEM.COMPOUND:24749	54	only 1st template. But none scored...	15 s
Magnesium ion (curie given)	PUBCHEM.COMPOUND:888	46	only 1st template	14 s
Magnesium (atom, better curie)	PUBCHEM.COMPOUND:5462224	500	211, 236 (merge 9, add 227), 212 (merge 52, add 160), then stopped execution. 598 total and truncated	23 s
DHEA (Dehydroepiandrosterone)	PUBCHEM.COMPOUND:5881	500	438, 121 (merge 46, add 75), then stopped execution. 513 total, and truncated	25 s
Testosterone	PUBCHEM.COMPOUND:6013	500	only 1st template (1069 results)	30 s
Ethinylestradiol (curie given)	PUBCHEM.COMPOUND:5991	500	271, 2390 (merge 107, add 2283), then stopped execution. 2554 total, and truncated	1 min 19 s
Estrogens (another curie)	UMLS:C0014939	500	only 1st template (3531 results)	2 min 2 s
Somatostatin acetate (curie given)	PUBCHEM.COMPOUND:16129681	184	only 1st template	20 s
Somatostatin (better curie)	PUBCHEM.COMPOUND:101826531	218	183, 16 (merge 10, add 6), 0, 42 (merge 26, add 16), 21 (merge 8, add 13)	24 s
Amitriptyline	PUBCHEM.COMPOUND:2160	500	172, 216 (merge 31, add 185), 456 (merge 84, add 372), then stopped execution. 729 total, and truncated	37 s
Gabapentin	PUBCHEM.COMPOUND:3446	186	70, 3 (merge 1, add 2), 107 (merge 15, add 92), 18 (merge 7, add 11), 13 (merge 2, add 11)	28 s
Propranolol	PUBCHEM.COMPOUND:4946	500	327, 220 (merge 57, add 163), 277 (merge 101, add 176), then stopped execution. 666 total, and truncated	34 s
sumatriptan	PUBCHEM.COMPOUND:5358	185	38, 4 (merge 1, add 3), 11 (merge 2, add 9), 121 (merge 7, add 114), 37 (merge 16, add 21)	31 s
d4-Palmitic acid (curie given)	PUBCHEM.COMPOUND:12358543	0	no results	14 s
palmitic acid (better curie)	PUBCHEM.COMPOUND:135369651	500	only 1st template (905 results)	18 s
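How to read the "Template breakdown" column in the table above: each template's count splits into results merged into existing ones and results newly added, and the running total is capped at the creative-mode limit of 500. A tiny sketch of that arithmetic, using two rows from the table (the function name `tally` is just for this example):

```python
def tally(templates, limit=500):
    """templates: (result_count, merged_count) per template, in run order.

    Returns (total distinct results, results returned after the cap).
    """
    total = 0
    for count, merged in templates:
        total += count - merged  # only the non-merged results are new
    return total, min(total, limit)


# Amphetamine: 306, then 292 (merge 58, add 234)
print(tally([(306, 0), (292, 58)]))  # (540, 500)

# Dextroamphetamine: 36, 104 (merge 8), 60 (merge 4), 54 (merge 36), 31 (merge 7)
print(tally([(36, 0), (104, 8), (60, 4), (54, 36), (31, 7)]))  # (230, 230)
```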

starting with gene

Gene Name	Gene ID	Final results	Template breakdown and notes	Time
MAPK8IP3	NCBIGene:23162	4	only 1st template, none scored. Compare to `decreases` response below	19 s
TP53	NCBIGene:7157	500	only 1st template (1762 results)	31 s
CYP27B1 (a P450)	NCBIGene:1594	500	48, 0, 0, 770 (merge 1, add 769), then stopped execution. 817 total, and truncated	26 s
XIST	NCBIGene:7503	6	only 1st template	13 s
TSIX	NCBIGene:9383	0	no results	11 s
ADNP	NCBIGene:23394	297	from 1st template (4) and 4th template (293), no merging. None scored	22 s
BORCS8-MEF2B	NCBIGene:4207	0	no results	12 s
MMP9	NCBIGene:4318	500	only 1st template (760 results)	20 s
MEF2C	NCBIGene:4208	500	48, 1, 0, 6431 (merge 7, add 6424), then stopped execution. 6473 total, and truncated	1 min 48 s
MALAT1	NCBIGene:378938	4	only 1st template	12 s
DYRK1A	NCBIGene:1859	500	14, 0, 0, 7982 (merge 1, add 7981), then stopped execution. 7995 total, and truncated	2 min 1 s
PCSK1N	NCBIGene:27344	12	from 1st template (6) and 4th template (6), no merging	17 s
TTR	NCBIGene:7276	500	92, 30 (merge 1, add 29), 0, 9259 (merge 31, add 9228), then stopped execution and truncated	2 min 54 s

Skipped testing GAPDH (NCBIGene:2597). Previously it ran the first 3 templates in ~1 min, then the last hop of the 4th template got stuck at its end (ID resolution/record intersecting). I stopped it after it had run for > 14 min.

Decreased

starting with chem

Chem Name	Chem ID	Final results	Template breakdown and notes	Time
Amphetamine	PUBCHEM.COMPOUND:3007	500	207, 330 (merge 34, add 296), then stopped execution. 503 total and truncated	25 s
Dextroamphetamine	PUBCHEM.COMPOUND:5826	237	21, 131, 40 (merge 11, add 29), 67 (merge 34, add 33), 31 (merge 8, add 23)	30 s
(+/-)-Methylphenidate hydrochloride (curie given)	PUBCHEM.COMPOUND:44246724	0	no results	15 s
Methylphenidate (better curie)	PUBCHEM.COMPOUND:4158	335	113, 182 (merge 23, add 159), 42 (merge 9, add 33), 54 (merge 25, add 29), 8 (merge 7, add 1)	34 s
metformin	PUBCHEM.COMPOUND:4091	500	only 1st template (1549 results)	32 s
Atorvastatin	PUBCHEM.COMPOUND:60823	500	only 1st template (788 results)	18 s
Valproic acid glucuronide (curie given)	PUBCHEM.COMPOUND:88111	0	no results	14 s
Valproic acid (better curie)	PUBCHEM.COMPOUND:3121	500	only 1st template (2034 results)	54 s
Vitamin A / retinol	PUBCHEM.COMPOUND:445354	500	353, 408 (merge 89, add 319), then stopped execution. 672 total, and truncated	27 s
Vitamin C (ascorbic acid)	PUBCHEM.COMPOUND:54670067	500	only 1st template (826 results)	22 s
Vitamin D (Cholecalciferol)	PUBCHEM.COMPOUND:5280795	500	428, 115 (merge 36, add 79), then stopped execution. 507 total, and truncated	23 s
Maltodextrin (curie given)	PUBCHEM.COMPOUND:79025	2	only from 5th template. None scored	16 s
Glucose (better curie)	PUBCHEM.COMPOUND:24749	27	only 1st template. None scored	15 s
Magnesium ion (curie given)	PUBCHEM.COMPOUND:888	17	only 1st template	15 s
Magnesium (atom, better curie)	PUBCHEM.COMPOUND:5462224	497	138, 266 (merge 6, add 260), 145 (merge 46, add 99), 0, 0	30 s
DHEA (Dehydroepiandrosterone)	PUBCHEM.COMPOUND:5881	500	316, 146 (merge 45, add 101), 122 (merge 54, add 68), 62 (merge 36, add 26), then stopped execution. 511 total, and truncated	36 s
Testosterone	PUBCHEM.COMPOUND:6013	500	only 1st template (846 results)	24 s
Ethinylestradiol (curie given)	PUBCHEM.COMPOUND:5991	500	217, 2353 (merge 76, add 2277), then stopped execution. 2494 total, and truncated	1 min 11 s
Estrogens (another curie)	UMLS:C0014939	500	only 1st template (3294 results)	1 min 22 s
Somatostatin acetate (curie given)	PUBCHEM.COMPOUND:16129681	39	only 1st template	16 s
Somatostatin (better curie)	PUBCHEM.COMPOUND:101826531	276	231, 62 (merge 43, add 19), 0, 95 (merge 70, add 25), 21 (merge 20, add 1)	28 s
Amitriptyline	PUBCHEM.COMPOUND:2160	500	268, 253 (merge 54, add 199), 335 (merge 96, add 239), then stopped execution. 706 total, and truncated	34 s
Gabapentin	PUBCHEM.COMPOUND:3446	224	131, 3, 60 (merge 11, add 49), 47 (merge 13, add 34), 13 (merge 6, add 7)	27 s
Propranolol	PUBCHEM.COMPOUND:4946	500	only 1st template (618 results)	18 s
sumatriptan	PUBCHEM.COMPOUND:5358	234	73, 18 (merge 8, add 10), 4 (merge 2, add 2), 166 (merge 27, add 139), 37 (merge 27, add 10)	31 s
d4-Palmitic acid (curie given)	PUBCHEM.COMPOUND:12358543	0	no results	14 s
palmitic acid (better curie)	PUBCHEM.COMPOUND:135369651	500	only 1st template (693 results)	20 s

starting with gene

Gene Name	Gene ID	Final results	Template breakdown and notes	Time
MAPK8IP3	NCBIGene:23162	500	7, 0, 0, 756, then stopped execution. 763 total and truncated	24 s
TP53	NCBIGene:7157	500	only 1st template (1232 results)	27 s
CYP27B1 (a P450)	NCBIGene:1594	500	31, 0, 0, 6421 (merge 3, add 6418), then stopped execution. 6449 total, and truncated	1 min 43 s
XIST	NCBIGene:7503	7	only 1st template	13 s
TSIX	NCBIGene:9383	0	no results	12 s
ADNP	NCBIGene:23394	364	from 1st template (3) and 4th template (361), no merging. None scored	22 s
BORCS8-MEF2B	NCBIGene:4207	0	no results	12 s
MMP9	NCBIGene:4318	500	only 1st template (1404 results)	27 s
MEF2C	NCBIGene:4208	500	34, 0, 0, 6841 (merge 3, add 6838), then stopped execution. 6872 total, and truncated	1 min 43 s
MALAT1	NCBIGene:378938	9	only 1st template	13 s
DYRK1A	NCBIGene:1859	500	49, 0, 0, 4023 (merge 14, add 4009), then stopped execution. 4058 total, and truncated	1 min 3 s
PCSK1N	NCBIGene:27344	0	no results	1 min 31 s
TTR	NCBIGene:7276	500	84, 7, 16, 4003 (merge 19, add 3984), then stopped execution. 4091 total, and truncated	1 min 8 s

Skipped testing GAPDH (NCBIGene:2597). Previously it ran the first 3 templates in ~1 min, then the last hop of the 4th template got stuck at its end (ID resolution/record intersecting). I stopped it after it had run for 15 min.
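
For anyone reading the "Template breakdown" columns above, here is a minimal Python sketch of how I read them: each template contributes its results minus whatever merged into existing results, and execution stops once the running total goes over the 500-result maximum. The function name and the exact stopping condition (strictly greater than the cap) are assumptions based on these tables and logs, not BTE's actual code.

```python
from typing import List, Tuple

CREATIVE_RESULT_MAX = 500  # result maximum quoted in the logs in this thread


def run_templates(per_template: List[Tuple[int, int]],
                  cap: int = CREATIVE_RESULT_MAX) -> Tuple[int, int, int]:
    """per_template holds (results returned, results merged) for each template.

    Returns (templates run, total before truncation, final result count).
    """
    total = 0
    templates_run = 0
    for returned, merged in per_template:
        templates_run += 1
        total += returned - merged  # only unmerged results grow the count
        if total > cap:  # assumption: stop as soon as the cap is exceeded
            break
    return templates_run, total, min(total, cap)


# Magnesium (atom) "increased" row: 211, 236 (merge 9), 212 (merge 52).
# The last two entries are placeholders; those templates were never run.
print(run_templates([(211, 0), (236, 9), (212, 52), (0, 0), (0, 0)]))
```

With those inputs the sketch stops after the 3rd template at 598 results and truncates to 500, which matches that row.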

tokebe · 2023-01-20T17:59:44Z

@colleenXu Regarding Point 1, I think your confusion comes from assuming that the merge log is supposed to be per-template. It isn't. Records are merged per-template, but the only merge log is a summary at the end of all merged results across all templates. Please pull and run again, and you'll see the log has been updated to show the number of results merged, the number they were merged into, and the actual result count decrease. If you add up the results for each template and then subtract the actual result count decrease, the math checks out (I spent a considerable amount of time verifying this last round...).
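
A minimal sketch of that bookkeeping in Python, with made-up numbers. The class and field names are hypothetical, and treating the count decrease as "results merged" minus "results they were merged into" is my reading of the three log values, not a quote of the implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MergeSummary:
    results_merged: int  # results that took part in any merge
    merged_into: int     # distinct results they collapsed into

    @property
    def count_decrease(self) -> int:
        # Each group of merged results leaves exactly one result behind.
        return self.results_merged - self.merged_into


def expected_final_count(per_template_counts: List[int],
                         summary: MergeSummary) -> int:
    """Sum of per-template results minus the overall count decrease."""
    return sum(per_template_counts) - summary.count_decrease


# Illustrative numbers only: 5 results merged into 2, so the count drops by 3.
summary = MergeSummary(results_merged=5, merged_into=2)
print(expected_final_count([10, 4, 3], summary))  # 17 - 3 = 14
```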

We could log per-template as well, but I don't particularly see the need to do so?

Running GAPDH on my local takes 19.9 minutes, which I agree is a little too much. Some of my optimizations may help with this, but I do think there's a case to be made for further decreasing the max records allowed.

tokebe · 2023-01-24T18:05:29Z

After a meeting with @colleenXu the problem was confirmed. I've investigated the issue and found the reason:

Multiple results from the same template can end up being merged if that template is a multi-hop and the results connect the same subject and object via different intermediate nodes. IIRC, this was an intended behavior to keep results relatively well-organized. Such results would not be merged in non-creative execution (which is another question worth asking somewhere else).

As a side effect of this, the log can show more results merged than one might expect: the maximum number merged in a step is not the smaller of the current result set size and the current template's result count, but the sum of the two.
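
To illustrate, here is a small Python sketch under the assumption that results are grouped purely by their (subject, object) endpoints. The tuple layout, node names, and function name are hypothetical and not taken from the BTE code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Result = Tuple[str, str, str]  # (subject, intermediate node, object)


def merge_by_endpoints(results: List[Result]) -> Dict[Tuple[str, str], List[str]]:
    """Group results that share subject and object, whatever lies in between."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for subject, intermediate, obj in results:
        groups[(subject, obj)].append(intermediate)
    return dict(groups)


existing = [("chemA", "geneX", "gene1")]  # 1 result from earlier templates
template = [                              # 3 results from the current template
    ("chemA", "geneY", "gene1"),
    ("chemA", "geneZ", "gene1"),          # same endpoints, different intermediate
    ("chemA", "geneY", "gene2"),
]

groups = merge_by_endpoints(existing + template)
involved = sum(len(g) for g in groups.values() if len(g) > 1)

print(len(groups))  # results left after merging
print(involved)     # results that took part in a merge
```

Here 3 results take part in a merge even though the smaller of the two sets holds only 1, because two results from the same template also merge with each other. The count stays within the sum of the two sets (4).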

@colleenXu I'm working on a fix to change the logging behavior to explicitly point this out when it occurs, and will push that change to this branch (and main) when it's done.

tokebe · 2023-01-24T20:55:53Z

@colleenXu I've pushed multiple creative mode logging fixes and improvements and the math appears to check out now. Please run a couple tests and let me know if the logging seems better.

colleenXu · 2023-01-25T02:58:37Z

@tokebe Err...I'm not sure if you missed my update 3 yesterday (the internal Slack thread here).

I think the issue was that I didn't set the `is_set: true` parameter for intermediate QNodes in the templates. I pushed a commit here and tested, and then the "merging" logs looked reasonable...
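
For reference, a rough Python sketch of what that change amounts to on a TRAPI query graph. The two-hop template, node keys, and predicates below are made up for illustration and are not the real BTE templates; only the `is_set` QNode property comes from TRAPI itself.

```python
import copy
from typing import Any, Dict, Set


def mark_intermediate_sets(query_graph: Dict[str, Any],
                           endpoint_keys: Set[str]) -> Dict[str, Any]:
    """Return a copy of the query graph with is_set: true on every
    QNode that is not one of the named endpoint nodes."""
    qg = copy.deepcopy(query_graph)
    for key, node in qg["nodes"].items():
        if key not in endpoint_keys:
            node["is_set"] = True
    return qg


# Hypothetical two-hop template: chem -> intermediate gene -> queried gene.
template = {
    "nodes": {
        "chem": {"categories": ["biolink:ChemicalEntity"]},
        "mid": {"categories": ["biolink:Gene"]},  # intermediate QNode
        "gene": {"categories": ["biolink:Gene"], "ids": ["NCBIGene:7157"]},
    },
    "edges": {
        "e0": {"subject": "chem", "object": "mid", "predicates": ["biolink:affects"]},
        "e1": {"subject": "mid", "object": "gene", "predicates": ["biolink:affects"]},
    },
}

fixed = mark_intermediate_sets(template, endpoint_keys={"chem", "gene"})
print(fixed["nodes"]["mid"])
```

With `is_set` on the intermediate node, paths that differ only in that node are expected to come back as a single result, so there is nothing left to merge after the fact.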

colleenXu · 2023-01-25T03:05:05Z

Feedback:

APIs summary log at the end is broken when only the first template is run?

   2023-01-25T03:02:54.891Z INFO:    [Template-1]: Execution Summary: (906) nodes / (942) edges / (905) results; (3/36) queries returned results from (2) unique APIs
   2023-01-25T03:02:54.891Z INFO:    [Template-1]: APIs: BioThings SEMMEDDB API, Text Mining Targeted Association API
   2023-01-25T03:02:54.895Z INFO:    (0) results from Template-1 were merged with other results from the template. (0) results were merged with existing results from previous templates. Current result count is 905 (+905)
   2023-01-25T03:02:54.895Z INFO:    Addition of 905 results from Template 1 exceeds creative result maximum of 500 (reaching 905 merged). Response will be truncated to top-scoring 500 results. Skipping remaining 4 templates.
   2023-01-25T03:02:54.895Z INFO:    Final result count (before truncation): 905
   2023-01-25T03:02:54.897Z INFO:    Execution Summary: (501) nodes / (537) edges / (500) results; (0/36) queries returned results from (0) unique APIs
   2023-01-25T03:02:54.897Z INFO:    APIs:
   2023-01-25T03:02:54.897Z INFO:    Scoring Summary: (273) scored / (227) unscored

Otherwise, new logs look good!

tokebe · 2023-01-25T16:35:06Z

I didn’t miss your update. Regardless of is_set behavior, the behavior without is_set looked wrong and needed fixing. I confirmed what was wrong with log clarity for those cases and fixed them.

Fix for the API end summary incoming.

tokebe · 2023-01-25T19:28:44Z

Pushed the fix; yet another fun case of a change somehow not making it into a commit while silently remaining on my local, making me think I'm losing my mind lol.

colleenXu · 2023-01-27T07:19:08Z

Sorry for the late reply. I reran a bunch of queries and I think things look good! I like the new logs.

Perhaps we're ready to make a request for ITRB CI?

colleenXu · 2023-02-06T08:11:59Z

Note that the templates were changed, replacing the physically_interacts_with predicate with the more general interacts_with. This allows us to use dgidb for those templates (and mychem after its edits, see biothings/pending.api#101 (comment)).

biothings/bte_trapi_query_graph_handler#135

tokebe · 2023-03-28T20:35:29Z

Deployed to prod 🚀

colleenXu · 2023-04-19T08:04:08Z

Noting here just in case:

Old template ideas that weren't implemented (intended effect is "downregulates"):

- Chem downregulates (non-human) gene, then Gene is ortholog of human gene
  - not sure that this would work, especially when I can't specify human vs non-human genes right now
- Chem upregulates a Gene, then that Gene does a BP that negatively-regulates another BP done by another Gene

andrewsu mentioned this issue Dec 6, 2022

Create creative mode templates for "What {gene, sequence variant, biological process} [contributes to] [disease]?" #533

Closed

tokebe mentioned this issue Dec 7, 2022

Support qualifiers in creative query template matching #534

Closed

colleenXu mentioned this issue Jan 11, 2023

next creative-mode templates biothings/bte_trapi_query_graph_handler#132

Merged

tokebe closed this as completed Mar 28, 2023

colleenXu mentioned this issue Apr 15, 2023

qualifiers in meta knowledge graph #610

Merged

colleenXu mentioned this issue Aug 2, 2023

update example and test queries #681

Open

This was referenced Aug 17, 2023

Template revisions biothings/bte_trapi_query_graph_handler#167

Merged

Template revised (fixed merge conflicts) biothings/bte_trapi_query_graph_handler#169

Merged

Do we want more dev work on templates, for current creative-modes? #704

Open

colleenXu mentioned this issue Aug 30, 2023

Examples of queries that time out (> 5 mins) #716

Open

colleenXu mentioned this issue Sep 6, 2023

BioThings BindingDB: can the relationship be more specific? #718

Open

colleenXu mentioned this issue Apr 19, 2024

Changing Creative mode threshold from results to time #808

Closed
