
Fix large dataset param performance #13

Closed
ryanrdoherty opened this issue Jul 26, 2022 · 3 comments · May be fixed by #12
Assignees
Labels
enhancement New feature or request

Comments

@ryanrdoherty
Member

ryanrdoherty commented Jul 26, 2022

(This issue was triggered by the connection timeout in VB on July 25-26; the cause seems to have been a user running slow geneId searches, each trying to create a new cache table, with some getting stuck in Oracle.)

This issue has been co-opted from the description below to instead rework how dataset param data is imported into a WDK cache table. Reading data over the DB link as a subquery has been shown to be consistently slow, and it gets dramatically slower as data size grows. We may get a speed-up if we create a temp table in appDb containing the dataset_values rows for a particular dataset and use, as the dataset param's internal value, a select * from the_tmp_table. This would not be hard to set up (we would just need to build the table when asked for the dataset param's internal value), but we will have to add a mechanism to tear the table down when complete. Steve suggested reusing tables (one per dataset_id), but that would require additional concurrency logic akin to the WDK cache. Alternatively, we could let them sit out there and add their deletion to the wdkCache clearing logic.
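A minimal sketch of the proposed approach, assuming a hypothetical local table name and the dataset_values columns that appear later in this thread (the DB link name is also an assumption):

```sql
-- Hypothetical: materialize one dataset's values locally in appDb,
-- paying the DB-link cost once instead of on every query execution.
CREATE TABLE wdk_ds_220700633 AS
  SELECT dataset_value_order, data1 AS gene_source_id
  FROM userlogins5.dataset_values@acctdblink  -- link name is an assumption
  WHERE dataset_id = 220700633;

-- The dataset param's internal value then becomes a plain local read:
SELECT gene_source_id, dataset_value_order FROM wdk_ds_220700633;

-- Teardown when complete (or defer to the wdkCache clearing logic,
-- per the alternative discussed above):
DROP TABLE wdk_ds_220700633;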

First step, however, is to confirm that the ID query's ID SQL is significantly faster when reading from a tmp table in appDb than when reading from dataset_values over the DB link, AND to make sure that the create table as (select from dataset_values) step is not so slow that it offsets any performance gains. In short: confirm that creating a tmp table via a select over the db_link, plus reading from that tmp table, is significantly faster than selecting directly over the db_link.
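One way to run that timing check in SQL*Plus, as a sketch; the join mirrors the ID SQL shown below in this thread (equality form), and the bench table and DB link names are assumptions:

```sql
SET TIMING ON

-- Baseline: the dataset read as a subquery over the DB link.
SELECT COUNT(*)
FROM ApidbTuning.GeneId gi,
     (SELECT data1 AS gene_source_id
      FROM userlogins5.dataset_values@acctdblink  -- link name is an assumption
      WHERE dataset_id = 220700633) ds
WHERE LOWER(gi.id) = LOWER(ds.gene_source_id);

-- Candidate: pay the one-time copy cost...
CREATE TABLE wdk_ds_bench AS
  SELECT data1 AS gene_source_id
  FROM userlogins5.dataset_values@acctdblink
  WHERE dataset_id = 220700633;

-- ...then time the same join against the local copy.
SELECT COUNT(*)
FROM ApidbTuning.GeneId gi, wdk_ds_bench ds
WHERE LOWER(gi.id) = LOWER(ds.gene_source_id);

DROP TABLE wdk_ds_bench;
```

If the second timed select plus the create-table cost beats the baseline, the temp-table approach pays off.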

@ryanrdoherty ryanrdoherty added the enhancement New feature or request label Jul 26, 2022
@ryanrdoherty ryanrdoherty changed the title Improve cache insertion concurrency model Fix large dataset param performance Aug 3, 2022
@ryanrdoherty ryanrdoherty self-assigned this Aug 8, 2022
@ryanrdoherty
Member Author

I explored the poor performance of the ID search SQL with larger datasets (e.g. 15k IDs or more). We originally thought it was due to the newish "original order" sorting feature, or perhaps to pulling the dataset over the db link, but avoiding those does not fix the performance. The actual fix is to remove the wildcard translation. The change below reduces a 60-second query to 2 seconds. I'm not exactly sure what the translation provides; it looks like some whitespace protection, plus letting users use a * wildcard on IDs. But I'm not sure that's useful given our site search / Solr search.

@aurreco-uga has a plan to bring this solution to UX the next time hanging queries impact the sites. Along with the "we shouldn't need this with the new SOLR search" argument, this will hopefully exert enough pressure to persuade them to go with this solution. Until then, it sits.

diff --git a/Model/lib/wdk/model/questions/queries/geneQueries.xml b/Model/lib/wdk/model/questions/queries/geneQueries.xml
index ff86fda0f..cbcfb6876 100644
--- a/Model/lib/wdk/model/questions/queries/geneQueries.xml
+++ b/Model/lib/wdk/model/questions/queries/geneQueries.xml
@@ -1051,7 +1051,7 @@ and ps.na_feature_id in ($$phenotypeScoreFileterParam$$)
             (
               SELECT ds.gene_source_id, gi.gene, ds.dataset_value_order
               FROM ApidbTuning.GeneId gi, ( $$ds_gene_ids$$ ) ds
-              WHERE LOWER(gi.id) LIKE REGEXP_REPLACE(REPLACE(LOWER(ds.gene_source_id), '*', '%'), '[[:space:]]', '')
+              WHERE LOWER(gi.id) = LOWER(ds.gene_source_id)
             ) t
         WHERE t.gene = ta.gene_source_id
         GROUP BY ta.gene_source_id, ta.source_id, ta.project_id

@ryanrdoherty ryanrdoherty transferred this issue from VEuPathDB/WDK Aug 24, 2022
@aurreco-uga
Member

aurreco-uga commented Sep 12, 2022

Sep 9, 2022
I did some testing in my VectorBase (durian); I had to comment out lines 78-81 in ListDatasetParser.java to allow an RTF file.

1. Times using regex and replace, only replace, or nothing, with an RTF original file of 1200 genes: 20 min with regex and replace, 4 min with only replace, and 17 sec with nothing. I should probably try with a larger txt file.
2. About dataset values stored with a space in the middle of the value (we saw we don't have cases with trailing or leading spaces), example:

SQL> select data1 from userlogins5.dataset_values where dataset_id = 220700633;
AAEL000551, AAEL007948, AAEL008489

I tested an input with 3 IDs separated by comma and space and it works well, so no idea how that happened.

@aurreco-uga
Copy link
Member

aurreco-uga commented Oct 6, 2022

Sep 8, 2022: I committed to prod and master, removing the REGEXP_REPLACE on whitespace and the REPLACE for the wildcard.
207c21b

Reason to remove the space handling:
No cases in the database of a space coming with an ID (leading, trailing, or in the middle).
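A quick way to verify that claim, as a sketch (using the dataset_values table and data1 column shown earlier in the thread; the exact schema is an assumption):

```sql
-- Hypothetical check: find any stored dataset values that still
-- contain whitespace anywhere in the ID.
SELECT dataset_id, data1
FROM userlogins5.dataset_values
WHERE REGEXP_LIKE(data1, '[[:space:]]');
```

Zero rows back supports dropping the REGEXP_REPLACE whitespace stripping from the query.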

Reasons to remove the wildcard:

1 --- The wildcard is very slow and breaks the system.
Times are for the caching-table SQL (post-caching may be included); loading into the user DB and the attributes table each stayed under a minute (~30 sec avg each).
Tested in my VectorBase (durian) using N databases (had to comment out lines 78-81 in ListDatasetParser.java to allow an RTF file):

1.a) Using an RTF original file with 1200 genes:
- with no regex replace: about 17 sec
- with wildcard only: about 4 min
- with wildcard and space replace: about 20 min

1.b) 14K genes (tab-delimited file obtained from a genes result):
- with no regex replace: about 40 sec
- with wildcard and space replace: about 6 min

2 --- It is hardly used anymore, now that we have site search.
Wildcard use does seem to have decreased:
2018: 27312
2019: 1336
2020: 5122
2021: 598
2022: 177
