This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
Optimize presto SQL Lab query performance. (apache#5132)
Stop polling once the Presto query has already finished.

When a user queries Presto via SQL Lab, Presto runs the query and can then return all the data to Superset in one shot. However, Superset's default implementation polls Presto in order to:

- Get the fancy progress bar
- Get the data back when the query has finished

This polling implementation is not right. I profiled it against a table of about 1 billion rows. Here is some data:

- Total number of rows: 1.02 billion
- SQL Lab query limit: 1 million
- Output data: 1.5 GB
- Superset memory consumed: about 10-20 GB
- Time: 7 minutes to finish in Presto, plus an additional 15 minutes for Superset to get and store the data

The problem with the default behavior is that even after Presto has finished the query (7 minutes in the profiling above), Superset still does a lot of wasted polling. In the profiling above, Superset sent about 540 polling requests in total, and about half of them were unnecessary.

Part of a simplified polling response:

```
{
  "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u",
  "id": "20180525_042715_03742_nza9u",
  "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11",
  "stats": {
    "state": "FINISHED",
    "queuedSplits": 21701,
    "progressPercentage": 35.98264191882267,
    "elapsedTimeMillis": 1029,
    "nodes": 116,
    "completedSplits": 15257,
    "scheduled": true,
    "wallTimeMillis": 2571904,
    "peakMemoryBytes": 0,
    "processedBytes": 40825519532,
    "processedRows": 47734066,
    "queuedTimeMillis": 0,
    "queued": false,
    "cpuTimeMillis": 849228,
    "rootStage": {
      "state": "FINISHED",
      "queuedSplits": 0,
      "nodes": 1,
      "totalSplits": 17,
      "processedBytes": 16829644,
      "processedRows": 11495,
      "completedSplits": 17,
      "stageId": "0",
      "done": true,
      "cpuTimeMillis": 69,
      "subStages": [
        {
          "state": "CANCELED",
          "queuedSplits": 21701,
          "nodes": 116,
          "totalSplits": 42384,
          "processedBytes": 40825519532,
          "processedRows": 47734066,
          "completedSplits": 15240,
          "stageId": "1",
          "done": true,
          "cpuTimeMillis": 849159,
          "subStages": [],
          "wallTimeMillis": 2570374,
          "userTimeMillis": 730020,
          "runningSplits": 5443
        }
      ],
      "wallTimeMillis": 1530,
      "userTimeMillis": 50,
      "runningSplits": 0
    },
    "totalSplits": 42401,
    "userTimeMillis": 730070,
    "runningSplits": 5443
  }
}
```

Superset terminates the polling only when it finds that `nextUri` has become none. But when `["stats"]["state"] == "FINISHED"`, Presto has already finished the query, so Superset can stop polling and get the data back.

After this simple optimization, we get a 2-5x performance boost for Presto SQL Lab queries.
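The termination rule described above can be sketched as follows. This is a minimal illustration, not the code changed by this commit: `should_stop_polling`, `poll_until_done`, the `poll` callable, and `max_polls` are hypothetical names, and the only assumption about the response is that it carries the `nextUri` and `stats.state` fields shown in the sample. It models the progress-polling loop only; fetching the result rows afterwards is out of scope here.

```python
from typing import Callable, Optional


def should_stop_polling(response: Optional[dict]) -> bool:
    """Return True when no further progress polling is needed.

    `response` is assumed to be a dict shaped like the sample polling
    response (with optional "nextUri" and "stats" keys), or None.
    """
    if not response:
        return True
    stats = response.get("stats") or {}
    # Early exit described in the commit message: the query is done.
    if stats.get("state") == "FINISHED":
        return True
    # Previous termination condition: there is no next page to poll.
    return not response.get("nextUri")


def poll_until_done(poll: Callable[[], Optional[dict]], max_polls: int = 10_000) -> int:
    """Call the hypothetical `poll` callable until polling can stop.

    Returns the number of polls made. `max_polls` is a safety cap added
    for this sketch only.
    """
    polls = 0
    while polls < max_polls:
        response = poll()
        polls += 1
        if should_stop_polling(response):
            break
    return polls
```

With a response stream that reports `FINISHED` while `nextUri` is still present, this loop stops at the first `FINISHED` response instead of following every remaining `nextUri`.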