Graceful exit before full execution for queries that would have very large responses #363
If we went back to the older way that... This would require changes to query_results.js and would require that whatever calls...
Note from the 12/8 lab meeting: Andrew suggests either having strict thresholds (say "no, we can't handle it") or giving the user an option to override (say "no, I want the super large response, give it to me").
Just as an FYI, here's a rough draft paragraph I put together to describe this issue:
Leaving this here as a reminder for a future discussion. We may be able to efficiently detect whether a result set will explode by using bloom filters or cuckoo filters. For each endpoint, normalize the IDs it contains, load each set of IDs into a bloom filter, and then serialize each bloom filter to save it for future use. Whenever a BTE instance is launched, it will load the serialized bloom filters so it can perform set intersections for the IDs from endpoints for each QNode, in order to get an upper limit on the number of results that can be returned. It won't give the actual results, just the number of results. The bloom filters will require less memory than trying to actually save the full sets of IDs.
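To make this a bit more concrete, here's a minimal sketch of building and caching one of these filters. It assumes the npm `bloom-filters` package for the filter itself, and `fetchAllIds()` is a hypothetical helper standing in for the one-time bulk ID query per endpoint:

```js
// Sketch only: assumes the npm "bloom-filters" package; fetchAllIds() is a
// hypothetical helper that performs the one-time bulk ID query for an endpoint.
const fs = require('fs');
const { BloomFilter } = require('bloom-filters');

async function buildAndCacheFilter(endpoint, idType, fetchAllIds) {
  // e.g. all normalized NCBIGene IDs present in this endpoint
  const ids = await fetchAllIds(endpoint, idType);
  // size the filter for the expected number of IDs, with ~1% false-positive rate
  const filter = BloomFilter.create(ids.length, 0.01);
  ids.forEach(id => filter.add(id));
  // serialize so future BTE instances can load it at startup instead of re-querying
  fs.writeFileSync(`${endpoint}.${idType}.bloom.json`, JSON.stringify(filter.saveAsJSON()));
}

function loadFilter(endpoint, idType) {
  const json = JSON.parse(fs.readFileSync(`${endpoint}.${idType}.bloom.json`, 'utf8'));
  return BloomFilter.fromJSON(json);
}
```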
@andrewsu and I had a discussion about this yesterday. A relatively simple solution that we will try, to see how it works, is to limit the results per edge in call-apis. This takes advantage of the behavior I noted in Slack, where the call-apis package buckets the queries into groups to limit it to 3 concurrent calls per API. It then builds up an array of results as these 'query buckets' return results. When the length of this results array reaches a certain threshold, we can stop querying for this edge and send back the partial results. We can then query the next edge using these partial results and continue repeating this process. In theory, if we have a limit of 1000 results per edge, then a query that goes 1 -> 2000 -> 50000 could be turned into 1 -> ~1000 -> ~1000. To give the user control of this, we could let them set this parameter using a query parameter. @ariutta Interesting idea, I don't know how bloom filters work, but do you need to make the queries first to get the number of results, or is this done another way?
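For readers following along, the per-edge limit being described would roughly look like the sketch below. The names (`queryBuckets`, `subQuery.execute()`) are illustrative stand-ins for the call-apis grouping behavior, not its actual internals:

```js
// Illustrative only: "queryBuckets" and "subQuery.execute()" are hypothetical
// stand-ins for the call-apis bucketing described above, not its real internals.
async function queryEdgeWithLimit(queryBuckets, maxRecordsPerEdge = 1000) {
  const records = [];
  for (const bucket of queryBuckets) {
    // each bucket = a group of sub-queries (up to 3 concurrent calls per API)
    const bucketRecords = await Promise.all(bucket.map(subQuery => subQuery.execute()));
    records.push(...bucketRecords.flat());
    if (records.length >= maxRecordsPerEdge) {
      // stop querying this edge; the next edge is expanded from these partial results
      return { records: records.slice(0, maxRecordsPerEdge), truncated: true };
    }
  }
  return { records, truncated: false };
}
```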
For the bloom filter idea, you'd need to do a one-time query somehow to get a list of all IDs from an endpoint. If the endpoint has NCBIGene IDs connected to MONDO IDs, you'd get all the NCBIGene IDs from that endpoint, load them into a bloom filter, serialize that bloom filter, and cache it. You'd do the same for all the MONDO IDs. All that would only need to happen once. The reason for the bloom filter is memory savings. It's not practical to cache everything from every endpoint, but it might be feasible to keep in memory two bloom filters for every endpoint, where each bloom filter could probabilistically tell you whether a given ID is in that endpoint. Here's an example: take a user query like the one below, where we start with 1 specified ID, get 1k records after...
PS: I said the caching would only need to happen once for an endpoint, but more accurately, it would need to happen whenever the endpoint experiences a major change, e.g., if it used to support 2k diseases but now it supports 3k.
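To make the "upper limit" part of the idea concrete, here's a hedged sketch of how the cached filters could be consulted for one QEdge before running any sub-queries; `loadFilter` is the hypothetical helper from the sketch above:

```js
// Sketch: use the cached filters to bound how many of a QNode's candidate IDs
// could possibly appear in an endpoint, without running any sub-queries.
// Bloom filters can return false positives, so this may over-count slightly,
// but it never under-counts.
function upperBoundForEdge(candidateIds, endpoint, idType) {
  const filter = loadFilter(endpoint, idType); // deserialized once at BTE startup
  return candidateIds.filter(id => filter.has(id)).length;
}

// e.g. if only ~40 of 1000 candidate MONDO IDs are (probably) in the endpoint,
// the next hop can expand at most ~40 of them.
```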
I'm not sure whether it'd be worth it, but we could use the bloom filters before running any queries. For example, we could get the maximum number of IDs that are common to each node for each endpoint. Maybe that could be useful for query planning?
If we're using multiple endpoints to get records for a single QEdge, the actual cached bloom filters might not be per endpoint. Rather, we might generate two bloom filters for each predicate we support and/or two bloom filters for every pairwise combination of QNode categories that we support.
@ericz1803 I have tried the linked PRs and discussed with @andrewsu...there are 3 areas that need clarifying / some more discussion. First, to clarify:
Second, I ran the demo queries and this code does not seem to be working as intended.
This is my analysis, with particularly concerning ones marked 🔺:
The * means that there may be fewer results depending on sub-queries failing or APIs timing out during the sync cron job. Third, the demo query D3 hits an error... and it looks like it's related to the restriction and getting 0 results. Logs:
To address some of your points:
Different approach:
Incremental results assembly may be the long-term solution, but I'd be hesitant to try getting that done in the next few days. It could involve quite a few changes to different parts of the codebase.
Is there a way to make the limit on the edge dynamic (e.g., 6000 records per input ID)? The number above comes from what I'm running: I'll post an update soon, but the limit needed to get all 16 results for demo query D3 is ~16k (a one-hop explain with 6 IDs on one node and 3 IDs on the other node).
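A dynamic limit along those lines would mostly be a change to how the threshold is computed before querying each edge; a small sketch, with all numbers purely illustrative:

```js
// Illustrative only: scale the per-edge record cap by the number of input IDs,
// so an edge fanning out from 6 IDs gets a larger budget than one from a single ID.
function recordLimitForEdge(numInputIds, recordsPerInputId = 6000, hardCap = 50000) {
  return Math.min(numInputIds * recordsPerInputId, hardCap);
}

recordLimitForEdge(1); // 6000
recordLimitForEdge(6); // 36000
```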
To me, this dropping/filtering step is the key. I was hoping/assuming that the consolidation of edges to results would be at least somewhat predictable; maybe 1000 edges generally get consolidated to ~900 results. But clearly in practice, that varies quite a bit. In deciding the path forward, I'd like more clarity on two issues:
My guess is that the second item (and possibly the first) will be easiest to discuss live, perhaps at our Wednesday meeting...
I checked a couple of the queries and it looks like you're right: consolidation in the results-building step looks like the main reason for the lower number of results. For B.2b, 1000 records are sent to the query results builder, which turns them into 613 results sent back to the user. For predict queries (the first 5 that @colleenXu listed), this looks like the main cause. For explain queries (the rest), my guess is that it is a combination of intersecting IDs and this results consolidation. For D3 specifically, it is a one-hop explain, so that is probably why that is happening.
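For anyone following along, the consolidation being described is roughly this kind of grouping. This is a simplified sketch, not the actual query_results.js logic, and the record field names are hypothetical:

```js
// Simplified sketch (not the actual query_results.js logic) of why 1000 records
// can become ~613 results: records connecting the same pair of primary IDs
// collapse into a single result with multiple supporting edges.
function consolidateRecords(records) {
  const resultsByKey = new Map();
  for (const record of records) {
    const key = `${record.subjectPrimaryId}--${record.objectPrimaryId}`;
    if (!resultsByKey.has(key)) {
      resultsByKey.set(key, {
        subject: record.subjectPrimaryId,
        object: record.objectPrimaryId,
        edges: [],
      });
    }
    resultsByKey.get(key).edges.push(record.edgeId); // one result, many edges
  }
  return [...resultsByKey.values()];
}
```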
Here are my findings from increasing the limit (see the full google sheet called "limit" here), with key findings marked with this symbol 📗. My impression is that a flexible limit dependent on the number of input IDs for the edge may be helpful.
Also! Setting the current limit to "1000 results out of the call-apis module" for a 1-hop Predict query doesn't result in...
📗 I found it surprising that it didn't lead to 1000 edges... I got this info by checking the knowledge graphs for the one-hop predict queries (B2a / B2b / B2d / B2e) after setting the limit to 1000:
Notes from meeting:
From my perspective...
@ericz1803, @colleenXu and @marcodarko, here's a list of lines that might be useful for logging in query_results.js:
We could add additional logging if desired to indicate which preresults are consolidated based on multiple edges for the same primary IDs, but that could be overwhelming, because there could be a large number of these. It's probably easier to just find them after the fact by looking for results with multiple values in...
Not relevant given the recently implemented limits on time and record counts.
@andrewsu proposes finding a way to detect when a query is going to return a ridiculously high number of results, return an error ("will return too many results, please narrow the scope of the query"), and gracefully exit.
Examples include demo queries B.3a (around 100k results), C3 (45k results), and D6 (61k results). Look here for those queries
Andrew proposed that this could be detected during the sub-querying step: when a sub-query returns a massive amount of records. However, there are some concerns:
@ariutta proposed calculating the potential size of the results right before the results assembly step... and stopping there. The concern there would be that queries may still execute for a while (on the scale of 10-30 min) before hitting this step, so we may want to find where we can intervene earlier...
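One way to picture that intervention point is sketched below. This is purely illustrative: the threshold value and the product-based size estimate are assumptions, not implemented behavior.

```js
// Sketch only: estimate a loose upper bound on result count right before results
// assembly and bail out with a user-facing error if it exceeds a threshold.
// The threshold and the product-based estimate are assumptions, not BTE behavior.
function assertResultSizeManageable(recordCountsPerEdge, maxResults = 50000) {
  const upperBound = recordCountsPerEdge.reduce((product, count) => product * count, 1);
  if (upperBound > maxResults) {
    throw new Error(
      `Query would return too many results (upper bound ~${upperBound}); ` +
        'please narrow the scope of the query.'
    );
  }
}
```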
Builds on the caps issue and promiscuous node issues, but a different angle + intervention point...