
Fix large dataset param performance #13

Closed
ryanrdoherty opened this issue Jul 26, 2022 · 3 comments · May be fixed by #12
Assignees
Labels
enhancement New feature or request

Comments

@ryanrdoherty
Member

ryanrdoherty commented Jul 26, 2022

(This issue was triggered by the connection timeout in VB on July 25-26; the cause seems to have been a user running slow geneId searches, each trying to create a new cache table, with some getting stuck in Oracle.)

This issue has been co-opted from the description below to instead rework how dataset param data is imported into a WDK cache table. Reading data over the DB link as a subquery has been shown to be consistently slow, and it gets dramatically slower as data size grows. We may get a speed-up if we create a temp table in appDb containing the dataset_values rows for a particular dataset and use, as the dataset param's internal value, a select * from the_tmp_table. This would not be hard to set up (we would just need to build the table when asked for the dataset param's internal value), but we will have to add a mechanism to tear the table down when complete. Steve suggested reusing tables (one per dataset_id), but that would require additional concurrency logic akin to the WDK cache. Alternatively, we could let them sit out there and add their deletion to the wdkCache clearing logic.
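A minimal sketch of the proposed approach, assuming a hypothetical local table name and the dataset_values columns that appear later in this thread (the DB link name is also an assumption):

```sql
-- Hypothetical: materialize one dataset's values locally in appDb,
-- paying the DB-link cost once instead of on every query execution.
CREATE TABLE wdk_ds_220700633 AS
  SELECT dataset_value_order, data1 AS gene_source_id
  FROM userlogins5.dataset_values@acctdblink  -- link name is an assumption
  WHERE dataset_id = 220700633;

-- The dataset param's internal value then becomes a plain local read:
SELECT gene_source_id, dataset_value_order FROM wdk_ds_220700633;

-- Teardown when complete (or defer to the wdkCache clearing logic,
-- per the alternative discussed above):
DROP TABLE wdk_ds_220700633;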

First step, however, is to confirm that the ID query's ID SQL is significantly faster when reading from a tmp table in appDb than when reading from dataset_values over the DB link, AND to make sure that the create table as (select from dataset_values) step is not so slow that it offsets any performance gains. In short: confirm that creating a tmp table via a select over the db_link, plus reading from that tmp table, is significantly faster than selecting directly over the db_link.
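One way to run that timing check in SQL*Plus, as a sketch; the join mirrors the ID SQL shown below in this thread (equality form), and the bench table and DB link names are assumptions:

```sql
SET TIMING ON

-- Baseline: the dataset read as a subquery over the DB link.
SELECT COUNT(*)
FROM ApidbTuning.GeneId gi,
     (SELECT data1 AS gene_source_id
      FROM userlogins5.dataset_values@acctdblink  -- link name is an assumption
      WHERE dataset_id = 220700633) ds
WHERE LOWER(gi.id) = LOWER(ds.gene_source_id);

-- Candidate: pay the one-time copy cost...
CREATE TABLE wdk_ds_bench AS
  SELECT data1 AS gene_source_id
  FROM userlogins5.dataset_values@acctdblink
  WHERE dataset_id = 220700633;

-- ...then time the same join against the local copy.
SELECT COUNT(*)
FROM ApidbTuning.GeneId gi, wdk_ds_bench ds
WHERE LOWER(gi.id) = LOWER(ds.gene_source_id);

DROP TABLE wdk_ds_bench;
```

If the second timed select plus the create-table cost beats the baseline, the temp-table approach pays off.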

@ryanrdoherty ryanrdoherty added the enhancement New feature or request label Jul 26, 2022
@ryanrdoherty ryanrdoherty changed the title Improve cache insertion concurrency model Fix large dataset param performance Aug 3, 2022
@ryanrdoherty ryanrdoherty self-assigned this Aug 8, 2022
@ryanrdoherty
Member Author

I explored the poor performance of the ID search SQL with larger datasets (e.g. 15k IDs or more). We originally thought it was due to the newish "original order" sorting feature, or perhaps to pulling the dataset over the db link, but avoiding those does not fix the performance. The actual fix is to remove the wildcard translation. The change below reduces a 60-second query to 2 seconds. I'm not exactly sure what the translation provides; it looks like some whitespace protection, plus letting users use a * wildcard on IDs. But I'm not sure that's useful given our site search / Solr search.

@aurreco-uga has a plan to bring this solution to UX the next time hanging queries impact the sites. Along with the "we shouldn't need this with the new SOLR search" argument, this will hopefully exert enough pressure to persuade them to go with this solution. Until then, it sits.

diff --git a/Model/lib/wdk/model/questions/queries/geneQueries.xml b/Model/lib/wdk/model/questions/queries/geneQueries.xml
index ff86fda0f..cbcfb6876 100644
--- a/Model/lib/wdk/model/questions/queries/geneQueries.xml
+++ b/Model/lib/wdk/model/questions/queries/geneQueries.xml
@@ -1051,7 +1051,7 @@ and ps.na_feature_id in ($$phenotypeScoreFileterParam$$)
             (
               SELECT ds.gene_source_id, gi.gene, ds.dataset_value_order
               FROM ApidbTuning.GeneId gi, ( $$ds_gene_ids$$ ) ds
-              WHERE LOWER(gi.id) LIKE REGEXP_REPLACE(REPLACE(LOWER(ds.gene_source_id), '*', '%'), '[[:space:]]', '')
+              WHERE LOWER(gi.id) = LOWER(ds.gene_source_id)
             ) t
         WHERE t.gene = ta.gene_source_id
         GROUP BY ta.gene_source_id, ta.source_id, ta.project_id

@ryanrdoherty ryanrdoherty transferred this issue from VEuPathDB/WDK Aug 24, 2022
@aurreco-uga
Member

aurreco-uga commented Sep 12, 2022

Sep 9, 2022
I did some testing in my VectorBase (durian); I had to comment out lines 78-81 in ListDatasetParser.java to allow an RTF file.

1. Times using regex and replace, only replace, or nothing, with an RTF original file of 1200 genes: 20 min with regex and replace, 4 min with only replace, and 17 sec with nothing. I should probably try with a larger txt file.
2. About dataset values stored with a space in the middle of the value (we saw we don't have cases with trailing or leading spaces), example:

SQL> select data1 from userlogins5.dataset_values where dataset_id = 220700633;
AAEL000551, AAEL007948, AAEL008489

I tested an input with 3 IDs separated by comma and space and it works well, so no idea how that happened.

@aurreco-uga
Copy link
Member

aurreco-uga commented Oct 6, 2022

Sep 8, 2022: I committed to prod and master, removing the REGEXP_REPLACE on whitespace and the REPLACE for the wildcard.
207c21b

Reason to remove the space handling:
No cases in the database of a space coming with an ID (leading, trailing, or in the middle).
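A quick way to verify that claim, as a sketch (using the dataset_values table and data1 column shown earlier in the thread; the exact schema is an assumption):

```sql
-- Hypothetical check: find any stored dataset values that still
-- contain whitespace anywhere in the ID.
SELECT dataset_id, data1
FROM userlogins5.dataset_values
WHERE REGEXP_LIKE(data1, '[[:space:]]');
```

Zero rows back supports dropping the REGEXP_REPLACE whitespace stripping from the query.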

Reasons to remove the wildcard:

1 --- The wildcard is very slow and breaks the system.
Times are for the caching-table SQL (post-caching may be included); loading into the user DB and the attributes table each stayed under a minute (~30 sec avg each).
Tested in my VectorBase (durian) using N databases (had to comment out lines 78-81 in ListDatasetParser.java to allow an RTF file):

1.a) Using an RTF original file with 1200 genes:
- with no regex replace: about 17 sec
- with wildcard only: about 4 min
- with wildcard and space replace: about 20 min

1.b) 14K genes (tab-delimited file obtained from a genes result):
- with no regex replace: about 40 sec
- with wildcard and space replace: about 6 min

2 --- It is hardly used anymore, now that we have site search.
Wildcard use does seem to have decreased:
2018: 27312
2019: 1336
2020: 5122
2021: 598
2022: 177
