Fix large dataset param performance #13
I explored the poor performance of the ID search SQL with larger datasets (e.g. 15k IDs or more). We originally thought it was due to the newish "original order" sorting feature, or to pulling the dataset over the db link, but avoiding both does not fix the performance. The fix is to remove the wildcard translation (sketched below); the change reduces a 60-second query to 2 seconds. I'm not exactly sure what the translation provides: it looks like some whitespace protection, plus letting users use a * wildcard on IDs, but that doesn't seem useful given our site search / solr search. @aurreco-uga has a plan to bring this solution to UX the next time we have hanging queries that impact the sites. Along with the "we shouldn't need this with the new SOLR search" argument, this will hopefully exert enough pressure to persuade them to go with this solution. Until then, it sits.
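For reference, a minimal sketch of the kind of per-ID translation being removed, assuming an Oracle ID query joining against dataset_values. The genes table, gene_id, and data1 names are illustrative, not the actual WDK-generated SQL:

```sql
-- Before (hypothetical): each incoming ID has whitespace stripped and the
-- * wildcard mapped to SQL's %, forcing a LIKE comparison that the
-- optimizer cannot drive with an index on gene_id:
SELECT g.gene_id
FROM   genes g, dataset_values dv
WHERE  g.gene_id LIKE replace(regexp_replace(dv.data1, '\s', ''), '*', '%');

-- After (hypothetical): a plain equality join, which can use the index:
SELECT g.gene_id
FROM   genes g, dataset_values dv
WHERE  g.gene_id = dv.data1;
```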
Sep 9, 2022

Sep 8, 2022: I committed in prod and master, removing the regexp_replace for whitespace and the wildcard replace.
Reason to remove the whitespace handling:
Reasons to remove the wildcard:
1. a) using an RTF original file with 1200 genes:
2. the wildcard is hardly used anymore, now that we have site search
(This issue was triggered by the connection timeout in VB on July 25–26; the cause seems to have been a user running slow geneId searches, each trying to create a new cache table, with some getting stuck in Oracle.)
This issue has been co-opted from the description below to instead rework how dataset param data is imported into a WDK cache table. Reading data over the db link as a subquery has been shown to be consistently slow, and to get disproportionately slower as data size grows. Thus, we may get a speed-up if we create a temp table in appDb containing the dataset_values rows for a particular dataset and use, as the dataset param's internal value, a select * from the_tmp_table. This would not be hard to set up (we would just need to build the table when asked for the dataset param's internal value), but we would have to add a mechanism to tear the table down when complete. Steve suggested reusing tables (one per dataset_id), but that would require additional concurrency logic akin to the WDK cache. Alternatively, we could just let the tables sit out there and add their deletion to the wdkCache clearing logic. See the sketch below.
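A minimal sketch of that flow, assuming Oracle and a db link named acctdb_link; the dataset_tmp_1234 table name, the dataset_id value, and the data1..data3 columns are illustrative:

```sql
-- 1. Materialize the dataset's rows locally in appDb, pulling them over the
--    db link once instead of re-reading them as a subquery on every use:
CREATE TABLE dataset_tmp_1234 AS
SELECT data1, data2, data3
FROM   dataset_values@acctdb_link
WHERE  dataset_id = 1234;

-- 2. Use the local table as the dataset param's internal value:
SELECT * FROM dataset_tmp_1234;

-- 3. Tear the table down when done (or leave it in place and fold its
--    deletion into the wdkCache clearing logic, per the alternatives above):
DROP TABLE dataset_tmp_1234;
```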
First step, however, is to confirm that the ID query's ID SQL is significantly faster when reading from a tmp table in appDb than when reading from dataset_values over the db link, AND to make sure that create table as (select from dataset_values) is not so slow that it offsets any performance gains. In short: check that creating a tmp table via a select over the db link, plus reading from that tmp table, is significantly faster than selecting directly over the db link. A rough timing check is sketched below.
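One way to run that comparison in SQL*Plus (SET TIMING ON is SQL*Plus syntax; the table, link, and column names are the same illustrative ones as above):

```sql
SET TIMING ON;

-- Baseline: read dataset_values directly over the db link as a subquery.
SELECT count(*)
FROM   genes g
WHERE  g.gene_id IN (SELECT data1
                     FROM   dataset_values@acctdb_link
                     WHERE  dataset_id = 1234);

-- Candidate: one-time materialization plus a local read. The sum of these
-- two timings must come in well under the baseline for the change to pay off.
CREATE TABLE dataset_tmp_1234 AS
SELECT data1 FROM dataset_values@acctdb_link WHERE dataset_id = 1234;

SELECT count(*)
FROM   genes g
WHERE  g.gene_id IN (SELECT data1 FROM dataset_tmp_1234);

DROP TABLE dataset_tmp_1234;
```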