-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement "creative/inferred mode" for "what drugs may treat disease X" query #449
Comments
Just adding my general plan here for quick reference:
Main issue before start of implementation: I need that template query library, or a good place to start in generating one. |
At least initially, we can use this query template library created by @Carolina1396 https://github.com/Carolina1396/drug_mechanisms_rare_diseases_BioThingsExplorer/tree/main/src/query_templates, which was based at least in part on an analysis of the most common metapaths in DrugMechDB shown in https://github.com/SuLab/DMDB_Analysis/blob/main/1_code/1_basic_dmdb_analysis.ipynb. |
@andrewsu I'm realizing there are some questions I've failed to ask regarding scope: Should BTE support inferred-mode edges in multi-hop queries? More than one inferred edge? I believe this should still be relatively simple in this implementation: we use our template query to replace the inferred edge, keeping the 'outside' topology the same, for as many templated queries as we have (in the case of multiple inferred edges, one graph for each possible combination of template queries as appropriate). Implementation-wise this isn't very complex -- just a few extra loops to make sure we get all the permutations. Since we're not re-assembling results afterward, just de-duplicating and concatenating them, there's no overt complications with results assembly. I do however see some possible cases where we have a single multi-hop query with multiple inferred edges exploding very quickly. Another simple clarifying question: When encountering an inferred edge, obviously BTE must create every query that can be generated from templates related to that inferred edge. I assume BTE should also still execute the original edge in addition? |
No, we can make the assumption (and enforce the limit) that inferred-mode queries are single-hop queries. If that's not true, then feel free to have BTE fail out with some appropriate error message.
Yes, I think that makes sense. Feel free to implement that in code, or by adding the appropriate template to the query library. Whichever route makes the most sense to you... |
if it's clear how to fix it, go for it. If not, skip it... |
I suggest some changes to the templates for this particular issue...
|
I've taken all the templates and turned them into 7 TRAPI queries, with some mods....#453 (comment) However, these may need further tweaking (predicates, node constraint for the drug regulatory status aka whether it is a drug, EDIT: is_set:True on QNodes...)? EDIT: node constraints not working: #174 (comment) |
And...it's unclear to me how we'll assemble the separate TRAPI queries / QGraphs and their responses into 1 response / thing....is that the goal? |
@colleenXu As for the response assembly: This was discussed in a meeting, the graphs and results get de-duplicated (remove/combine duplicate node/edge references) and then concatenated into one response. Inferred query graphs are dropped, at least until behavior to the contrary is specified. Regarding the templates, I'll attempt to include the suggestions you made to the best of my ability (where appropriate -- I think there are some things that require more discussion, such as adding predicates). My goal is to make a system where we can easily add/change templates, so anything else can be easily fixed after primary implementation. |
Note that we may be discussing more implementation details during this relay. I have a note from current attendance of the "creative mode" session: "grouping answers by drug". I'm not sure exactly what that would look like right now. EDIT: I'm going to adjust my templates to keep the qNodeID naming given in the "initial query" thingy. |
I also noted some question as to predicate/direction -- something along the lines of "does this creative query result specify something that actually helps, or is it the opposite?" I'm not sure I was entirely clear if that reached an answer. Do we want to be incredibly careful with our predicate selection (both matching a creative qEdge to appropriate templates, and possibly controlling additional predicates in our templates, either statically or dynamically), or for this initial implementation should the focus be more on getting a result graph that is definitely related, whether it strictly fits the creative qEdge's predicate or not entirely? |
My previous comment may be safely ignored for now - it has been reasonably answered by discussion in the relay, and my implementation will be trivial to fix if I misunderstood. I'm cataloguing items which need to be fixed/combined when combining/de-duplicating results, which has led me to certain node attributes, namely:
I'm not totally familiar with their purpose -- how should these behave when combining multiple results? Remove source/target qg_nodes which aren't present in the original query_graph and adjust counts accordingly? Should counts be added between responses for overlapping nodes? |
Regarding the node attributes listed above, check to see if they have any internal usage. If not, I think they can be removed from the output for now. I believe these were added as some sort of attempt to look at node degree (so that paths using very common nodes, or "hub nodes", can eventually be down-weighted in scoring). But, I don't think we actually got around to implementing anything using those attributes. |
My current implementation searches for folders matching the This has, however, led to another question: should BTE attempt to expand the search to possible additional appropriate template groups based on the biolink hierarchy? For instance, if the user queries something to the effect of |
I suppose expanding the query classes to more specific subclasses would be useful, i.e., But since we only have one creative mode template right now, I think it's also fine to punt on this class-traversal functionality until later. Up to you to decide based on the effort involved in just tackling it now... |
@andrewsu @colleenXu As a note for the creation of query templates, I'm setting up the handling such that a template is expected to have one node named {
"message": {
"query_graph": {
"nodes": {
"subject": {
"categories": ["biolink:ChemicalEntity"]
},
"n1": {
"categories": ["biolink:Gene"]
},
"object": {
"categories": ["biolink:DiseaseOrPhenotypicFeature"]
}
},
"edges": {
"e01": {
"subject": "subject",
"object": "n1",
"predicates": ["biolink:affects"]
},
"e03": {
"subject": "object",
"object": "n1",
"predicates": ["biolink:caused_by"]
}
}
}
}
} |
@tokebe I've read up on this issue, and:
|
Also, with reversing I think you've misunderstood what I mean? I have to handle reverse cases by nature of how my code searches for templates, something like:
The reason for this is that template groups are searched for by folder name, where the actual name of the folder is something like |
From discussion (first Andrew-I, then Jackson-I) Big picture
Template handling details
|
Some of my notes related to this discussion: For clarity in discussion of templates:
Regarding different qNodeIDs/qEdgeIDs for responses from different templates:
|
(Moving from Slack discussion to here for ease-of-record) It's been proposed that a result cap should be imposed to prevent oversized responses. What should this cap be, and are there any special considerations regarding its behavior/implementation? I imagine the simplest form is something like "If the cumulative results reaches/surpasses , do not proceed with additional template sub-queries; Assemble final results and return." Tagging @colleenXu @andrewsu for discussion. |
I agree that this is the simplest form of adding a result cap, and I think it's where we should start... |
Further testing: with @colleenXu's template revision and now template ordering, it's possible now to stop at a reasonable number of results a fair amount of the time. However, it's still possible at times to get very large responses for one template, which causes BTE to stop after that one, but still maintain possibly very large numbers of results. It might be worth figuring out how to pass a result cap in sub-queries, or even to implement sub-query time-outs. Additionally, I'm still concerned that we'll end up stopping after just the first template, which is going to be the basic one-hop, so I think further discussion is still warranted about the presence (or perhaps just the ordering) of the basic one-hop? |
Additional behavioral question: currently, the user-query subject and object categories and IDs are merged with those of any matching templates, meaning if the user-query specifies Given current template revision efforts, it seems like category merging may not be the best idea, so I've removed it in the latest commit. Should I re-enable this? |
Replying to @tokebe's first post #449 (comment), first paragraph on "too many results / takes too long":
|
Replying to @tokebe's first post #449 (comment), second paragraph on "is it okay to do only the 1-hop and finish":
|
I'm a little confused by @tokebe 's second comment #449 (comment):
|
In response to @colleenXu's first response:I've introduced undue confusion here by used the term "sub-query" instead of something like "template-query" (I should change sub-query to template-query in the new code's logging as well...). Each template is handled as its own independent query, just bypassing our endpoints and with some special handling. What I was proposing was either passing a result limit to each template's execution, or limiting each template's execution time. That said, there are definitely some complications with either of those solutions. Given your recent extended testing in Slack, I think it might be fine to leave the behavior as-is, and instead focus on just how many results is an acceptable amount before stopping? In response to the second:The impression I've gotten is that "creative" is more important than "fast", however this is partially assumption on my part based on the nature of expecting BTE to sort of "expand" the initial query (as well as what it looked like other teams were doing seeming to lean more toward "creative". Additionally, I thought @andrewsu had commented more along the lines of concern that the straightforward one-hop may not be "creative enough" for Translator's taste. In the end, I'm just implementing, but I'd like to double-check with @andrewsu with regards to this. In response to the third:I'm not sure what's confusing (except for my incorrect use of sub-query instead of template-query), especially as your response perfectly addresses why I shouldn't re-enable category merging between the user query and our templates. I'll leave it disabled and remove the commented-out code as it won't be used. |
Regarding how many results to limit creative mode to, I propose a two-number method. The first number, X, dictates when to stop. If the queries added together have X or more results, don't execute any more templates. The second number, Y, can be a sort of hard-coded buffer. If the currently-executed results are less than X, we only add the newest query's results if that number plus X is less than X + Y. (Otherwise, we might reasonably assume that all queries will exceed X + Y, and stop executing templates) This should prevent cases where, say, X = 1000, and we reach 980 results, and then the next query has another 1000 results, meaning our response would have some number less than or equal to 1980, way over our intended limit. I would say that Y could be somewhere between 100 and 500. I don't have a solid idea of what X should be, perhaps something between 1000-2000? |
vocab:
Regarding limits, I've been assuming:
creativity:
"merging UI incoming query with templates":
|
@tokebe also adding a note on what another team is doing to "recognize creative-mode and its qNodes" (Translator Slack) https://ncatstranslator.slack.com/archives/C013Q5TVC87/p1656113934859089?thread_ts=1656022814.576499&cid=C013Q5TVC87 |
This brings up an interesting point -- currently, I have no special behavior for multiple IDs. If the user/ui supplies multiple IDs, they're all merged in an used in template-queries. Should we require that the user/ui query has only 1 ID? |
Slight caveat to the X + Y method -- I can only have that take affect after the first template-query, or else it could return just a little over the proposed 1000 + 200 limit and then nothing gets returned. So there's still no guarantee we're returning 1200 or fewer results in the event the first template gets more, just a decent likelihood... |
Noting some old UI discussion of creative-mode in internal Translator Slack: https://ncatstranslator.slack.com/archives/C02PG1W7HD0/p1651610305586929 |
recording the "Chris Bizon" provided incoming/creative-mode query:
|
Also looks like we decided not to run the literal edge. We previously chose to run the literal edge, but I couldn't find a documented reason why we decided not to. I think this is fine. We basically run the "literal edge" with the first template. And we decided not to do a time-limit on each template-run at the moment (only results capping) |
Recording decisions on what the creative-mode incoming query looks like (from UI / ARS): 1 Disease ID, 1 "treats" predicate What another team was checking:
|
From the 8/10 lab meeting on Translator stuff (added feature): (Jackson's plate of work) "should be technically feasible to "top off" results and trim resulting KG"
Also edit the TRAPI logs to make the cap clearer: "the cap of 1500 is reached because the template just run has 3000 results" |
MVP is complete |
Up to this point, Translator has focused on explicitly-declared query topologies, expressed as a subgraph of nodes and edges. Translator now wants ARAs to accept "inferred edges" in the query, which essentially give the ARA the freedom to return answers that are supported by any query topology. This design has been colloquially referred to as "creative mode".
By the end of June, Translator would like ARAs to be able to respond to one type of "inferred edge" query focusing on the question "what drugs may treat disease X". "Inferred mode" is specified by adding a
"knowledge_type": "inferred"
on the edge of a one-hop predict query (as proposed in this PR).For the output, paths between the disease and the drug should have a maximum of three edges. There is currently no cap on the overall number of nodes or edges in a result.
This doc contains a list of example diseases to be queried: https://docs.google.com/document/d/1cuYrWHv6MzT3w7lZqpHQVc7WbMY5WuLiLOh9qI3TVRo/edit
In terms of our BTE implementation of inferred mode, I think we should build on the idea of using a library of common query topologies / metapaths extracted from DrugMechDB. This would be a relatively straightforward implementation of this feature, and it would build on work that we've already done for #181.
The text was updated successfully, but these errors were encountered: