Better support for large query results #78

arikfr · 2014-02-05T13:33:36Z

If a query has a large result (~50K rows), it makes the UI get stuck. We need to detect large result sets and handle them differently (server-side pagination?).

natict · 2014-03-10T16:12:35Z

No pagination + Indicator + CSV download
should be a good start :)

arikfr · 2014-03-10T16:16:07Z

yep, another optimization is to mark big data sets when we store the query result object.

arikfr · 2015-07-26T18:58:43Z

Relevant discussion: https://groups.google.com/forum/#!topic/redash-users/UbwvXewsJrQ

ChrisLocus · 2016-08-26T14:59:47Z

Anybody heard anything about any work on this front? Running into this now.

eshubhamgarg · 2017-01-30T10:32:55Z

@arikfr Do you have any updates on this one? For example, when can we expect a feature release?

arikfr · 2017-01-31T09:16:53Z

It's very low priority compared to other stuff, as usually you don't need large result sets in Redash. So far no work was done on this one.

adfel70 · 2017-04-26T12:16:58Z

Hi,
Can you explain what the bottleneck is?
Thanks

bboe · 2017-05-30T23:54:06Z

@arikfr how open would you be to a pull request in this area? I have little knowledge of Redash internals; however, we would like to solve the issue and may be able to throw some resources at it if we can work to get any changes incorporated into the project.

Do you have a ballpark estimate on the amount of effort it would require to detect a large result set, and offer a download?

arikfr · 2017-06-01T10:50:43Z

@bboe how open? very much :) this is low priority for me, but I definitely want to better handle this.

It's hard to give an estimate without looking into this in more detail & understanding what kind of solution you want to achieve. Shoot me an email and let's talk further (arik at redash io).

jesse-osiecki · 2017-08-23T18:24:53Z

@arikfr @bboe was a pull request ever made regarding this?

bboe · 2017-08-23T20:58:55Z

Not from my end. Development time for value ended up not being worth it.

antwan · 2017-09-20T18:17:44Z

The value is in using Redash to export/browse large sets of data. Currently it is only suitable for statistics generation.

A quick workaround would be to add an option to truncate data on the backend (after 1000 entries) so users can still hit the "export" button without a laggy UI caused by massive JSON being parsed.
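
A minimal sketch of the truncation idea, assuming the result is a dict with `columns` and `rows` lists. The function name `truncate_result` and the `truncated` / `total_rows` fields are hypothetical and not part of Redash:

```python
def truncate_result(data, max_rows=1000):
    """Return a copy of `data` holding at most `max_rows` rows.

    `data` is assumed to look like {"columns": [...], "rows": [...]}.
    The extra keys let a UI show an indicator that rows were cut off.
    """
    rows = data["rows"]
    return {
        "columns": data["columns"],
        "rows": rows[:max_rows],          # only the first max_rows rows
        "truncated": len(rows) > max_rows,  # hypothetical indicator flag
        "total_rows": len(rows),
    }
```

The full result would still have to be kept somewhere on the backend for the export button to serve it.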

jezdez · 2018-08-16T15:47:14Z

This has been merged! ✨

arikfr · 2018-08-27T15:46:53Z

@jezdez this issue is about large query results and not a long list of queries :)

jezdez · 2018-08-28T17:53:33Z

Ugh, being able to read would clearly be an advantage 😬

changchichung · 2018-08-30T02:29:13Z

Is this issue solved? I have the same situation when the returned rows exceed 50K.

arikfr · 2018-09-07T14:47:06Z

@changchichung unfortunately not yet. Although if you don't have much more than 50K, maybe just giving more memory to Redash will resolve your issue.

koooge · 2018-10-19T07:18:31Z

+1
Version 5.0.1+b4851 on EC2 t2.small
The Redash server cannot respond while it is processing.

ismailsimsek · 2018-12-13T16:31:05Z

Version 5.0.1+b4851 on EC2 t2.small EC2 m3.large
getting "redash Worker exited prematurely: signal 9 (SIGKILL)."
our main requirement is ability to download large datasets

arikfr · 2018-12-15T17:11:24Z

@ismailsimsek try using a larger instance (depends on the dataset size you're trying to download).

ismailsimsek · 2019-01-17T16:39:35Z

@arikfr what do you think about adding pagination to query_runner? It could use a server-side cursor, so the database handles the large result set and the client application processes the result in batches.
This probably requires a message in the UI when the full result set is not passed to the UI.

Thanks for the great software btw.
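
A rough sketch of batch-wise consumption through the DB-API `fetchmany` call. It uses the standard-library `sqlite3` module only so that it runs anywhere; SQLite has no real server-side cursor, and the helper name `iter_batches` and the `events` table are made up for illustration:

```python
import sqlite3


def iter_batches(cursor, batch_size=1000):
    """Yield lists of at most `batch_size` rows until the cursor is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# Demo data: 2500 rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(2500)])

cursor = conn.execute("SELECT id FROM events ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(cursor)]
```

Only one batch is held in memory at a time by the caller, which is the property the proposal is after.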

arikfr · 2019-01-20T09:10:01Z

@ismailsimsek pagination/server side cursors won't help without changing how we store the data, because we can't stream the data into Postgres (where we currently store results cache). Also it won't help with serving the results to the browser, because we serve the results to the browser from the cache.

It will help once we change how we store the results and will significantly reduce the memory footprint of the workers.

harveyrendell · 2019-08-15T23:53:41Z

@arikfr I've been following this issue and I'm keen to contribute back if possible. We've had to deal with bad queries locking up our whole Redash service and would like a way to limit the maximum response sizes that are returned (either response size in memory or row count).

Could a minimum solution to this simply be adding a configuration option to set a maximum query size, and fail safely if it is exceeded? Some use-cases have been mentioned that include paging the query results into the database and I'm interested to hear how these might be made available for download e.g. as csv.
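
One way such a guard could look, as a sketch only. `ResultTooLargeError` and `collect_rows` are hypothetical names, and the limit here is a row count rather than a memory size:

```python
class ResultTooLargeError(Exception):
    """Raised when a query returns more rows than the configured maximum."""


def collect_rows(row_iter, max_rows):
    """Collect rows from an iterator, failing as soon as the limit is exceeded."""
    rows = []
    for row in row_iter:
        if len(rows) >= max_rows:
            # Stop early instead of loading the rest of the result into memory.
            raise ResultTooLargeError(
                f"Query returned more than {max_rows} rows"
            )
        rows.append(row)
    return rows
```

Because the check happens while iterating, the worker never holds more than `max_rows` rows, provided the underlying driver does not buffer the whole result itself.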

diwu1989 · 2020-06-30T05:42:58Z

Can we set up a config for enforcing limit clause automatically?
Many SQL IDEs do this by default to provide better user experience and prevent users from shooting themselves in the foot.

A default limit of 10k is a reasonable threshold. Nobody actually pages through 10k results line by line anyway, and their UI would stutter.
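
A deliberately naive sketch of what automatic limiting could mean for a simple SQL dialect. `apply_default_limit` is a hypothetical helper; it only handles plain `SELECT` statements and would not be correct for CTEs, subqueries, comments, or dialects that use a different syntax for limiting rows:

```python
import re

# Matches a trailing "LIMIT n", "LIMIT n, m" or "LIMIT n OFFSET m".
_LIMIT_RE = re.compile(r"\blimit\s+\d+(\s*(,|offset)\s*\d+)?\s*$", re.IGNORECASE)


def apply_default_limit(sql, default_limit=10000):
    """Append a LIMIT clause to a plain SELECT that does not already end with one."""
    stripped = sql.strip().rstrip(";").rstrip()
    if not stripped.lower().startswith("select"):
        return sql  # leave anything that is not a plain SELECT untouched
    if _LIMIT_RE.search(stripped):
        return sql  # the user already set a limit
    return f"{stripped} LIMIT {default_limit}"
```

For example, `apply_default_limit("SELECT * FROM t")` gives `SELECT * FROM t LIMIT 10000`, while a query that already ends in `LIMIT 5` is returned unchanged.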

arikfr · 2020-07-01T08:30:25Z

> Can we set up a config for enforcing limit clause automatically?

Yes.. just need to find a way to do it in a "scalable" way for all the data sources (not all of them have to support it though).

syang61-dev · 2021-06-04T00:06:40Z

> Can we set up a config for enforcing limit clause automatically?
>
> Yes.. just need to find a way to do it in a "scalable" way for all the data sources (not all of them have to support it though).

@arikfr

Can we make the result payload ‘paginated’ and have a default page size of 1000 rows?

Major ‘big data’ query engines seem to have such a knob to control this; can we borrow the idea here? The API may look something like the following:

GET queries/{queryID}?page=N&pageSize=1000

The above API would make the backend execute the corresponding SQL statement on top of the cache.

Isn’t this a ‘scalable’ solution? If it is, I’d be happy to see how I could help (I am simply a user running into this situation now).
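
A sketch of one possible reading of this, slicing a page out of the cached result rather than running SQL on it. It assumes the cache holds one JSON string with `columns` and `rows`, and that pages are 1-indexed; `get_page` and the response fields are hypothetical:

```python
import json


def get_page(cached_json, page, page_size=1000):
    """Return one page of rows from a cached result stored as a JSON string."""
    data = json.loads(cached_json)  # note: the whole result is deserialised
    rows = data["rows"]
    start = (page - 1) * page_size
    return {
        "columns": data["columns"],
        "rows": rows[start:start + page_size],
        "page": page,
        "page_size": page_size,
        "total_rows": len(rows),
    }
```

The response to the browser stays small, but each request still pays for parsing the full cached payload.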

susodapop · 2021-07-23T16:09:34Z

I agree pagination makes sense. But we store cached results as serialised JSON today. So even if we fetch 1000 records at a time, each request would deserialise the entire result before plucking some rows and returning them.

This is fine for result sets <50k rows. But if a user runs a query with 1m rows the serialisation overhead would balloon 🤔

syang61-dev · 2021-07-23T17:15:13Z

> I agree pagination makes sense. But we store cached results as serialised JSON today. So even if we fetch 1000 records at a time, each request would deserialise the entire result before plucking some rows and returning them.
>
> This is fine for result sets <50k rows. But if a user runs a query with 1m rows the serialisation overhead would balloon 🤔

For results with 1m rows, maybe we could have the result (BI result cache) chunked and store those chunks. If the community is serious about working out a solution, please let me know and I'd like to see how I could help.
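
A small sketch of the chunking idea, assuming each chunk is serialised and stored separately so that reading one page only deserialises one chunk. `chunk_rows` and `read_chunk` are hypothetical helpers, and a plain list stands in for whatever storage would hold the chunks:

```python
import json


def chunk_rows(rows, chunk_size=1000):
    """Serialise rows into independent JSON strings of at most `chunk_size` rows."""
    return [
        json.dumps(rows[i:i + chunk_size])
        for i in range(0, len(rows), chunk_size)
    ]


def read_chunk(chunks, index):
    """Deserialise a single chunk without touching the others."""
    return json.loads(chunks[index])
```

With a chunk size equal to the page size, page N maps directly to chunk N - 1.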

susodapop · 2021-07-23T18:57:07Z

We're not going to work on this until at least after the V10 release later this summer. Later this year we'll introduce some processes for improving work planning with the OSS community as we don't want to see this work stagnate. I'll ping this issue once that channel is available.

williswyf · 2022-07-04T08:14:47Z

Hi there,
Is there an update on this issue? I still have this issue on v10.1. I just expect that my users are able to run a query and download the big result set, with no need to cache it in PostgreSQL.

susodapop · 2022-07-04T11:02:16Z

No update to share at this time. But we have not forgotten about this use case.

> and no need to cache it in PostgreSQL.

The results are always cached. Because running the query and downloading results are distinct tasks. Postgres is where Redash saves the state (query result) between these tasks. We can't skip the cache without a significant redesign.

lscapim · 2023-10-17T13:43:39Z

Hi,
Is there an update on this issue?

spapas · 2023-11-23T10:11:56Z

Hello friends, this issue is like 10 years old. Will anybody give this any love, or will we wait until it gets a driver's license?

guidopetri · 2023-11-23T12:09:17Z

Since our community-led launch, we're all doing this as a side project, so priorities may be different. I'd happily accept a PR though.

wtfiwtz · 2023-12-11T03:52:40Z

I think replacing the JSON encoder and using Flask streaming are good options.
https://github.com/getredash/redash/blob/master/redash/utils/__init__.py#L107-L121
Flask stream_with_context - https://stackoverflow.com/questions/71991359/how-to-make-flask-stream-output-line-by-line
Fixing the front-end pagination to convert to back-end pagination is a much larger piece of work.
I've been planning to test this theory out but something else got a higher priority. I'll see how I go later this week or next.

See also #6218
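
A sketch of what incremental JSON output could look like, independent of Redash's actual encoder. `stream_result` is a hypothetical generator; in Flask, a generator like this could be wrapped with `stream_with_context` and passed to a `Response`, which is left out here so the example runs without Flask installed:

```python
import json


def stream_result(columns, rows):
    """Yield a JSON document piece by piece instead of building one big string."""
    yield '{"columns":' + json.dumps(columns, separators=(",", ":")) + ',"rows":['
    first = True
    for row in rows:
        # Separate rows with commas, but do not emit one before the first row.
        prefix = "" if first else ","
        first = False
        yield prefix + json.dumps(row, separators=(",", ":"))
    yield "]}"
```

Joining the yielded pieces gives a valid JSON document, so the client side does not need to change. This only avoids building the full response string in memory; the rows still have to be read from wherever they are stored.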

wtfiwtz · 2023-12-12T21:11:02Z

Here's a first attempt: orchestrated-io@7540768

You'll have to excuse my React skills, I haven't got that part correct yet. This just downloads 1000 rows at first, and then attempts to refresh the query result table (visualization) with the additional data when it arrives later.

I attempted to use send_file, stream_with_context, and the Python JSON encoder (ujson didn't work as you need custom encoding for fields such as dates) but they had a negligible impact on a 120k row Google Sheets data source (cached - ~10Mb).

Removing the whitespace from JSON did help a bit - as it reduces the file size by about 1/6th.
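
For illustration, the whitespace saving with the standard-library `json` module comes from the `separators` argument. The sample row and the `encode_default` hook for dates are made up for this sketch:

```python
import datetime
import json

row = {"id": 1, "name": "a", "created_at": datetime.date(2023, 12, 12)}


def encode_default(value):
    """Fallback encoder for types the json module cannot serialise itself."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    raise TypeError(f"Cannot serialise {type(value).__name__}")


# Default separators are ", " and ": ", which add a space after each one.
default_json = json.dumps(row, default=encode_default)
# Compact separators drop that whitespace.
compact_json = json.dumps(row, separators=(",", ":"), default=encode_default)
```

The actual saving depends on how many keys and values each row has relative to the length of the values.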

I'll have another go at the React component later.

Cheers,
Nigel

wtfiwtz · 2024-01-09T04:50:22Z

Here's a better version... it's a bit hacky but it does the job for now.
orchestrated-io#2

wtfiwtz · 2024-03-20T00:34:28Z

If you do apply this, I also recommend upgrading gunicorn, as you might see TCP disconnects when gunicorn shuts down its workers after X requests have been processed. See #5641 (comment)

zachliu · 2024-06-14T20:11:19Z

For query engines such as AWS Athena, I wish there were a way to:

- Show only partial results if the data set is too large
- Pass through the S3 URL as the download link

jezdez mentioned this issue Jun 5, 2018

Fix #78 - Implement server side pagination and sorting for queries #2566

Closed

jezdez mentioned this issue Jul 18, 2018

Implement server side pagination and sorting for queries lists #2686

Merged

1 task

jezdez closed this as completed Aug 16, 2018

arikfr reopened this Aug 27, 2018

arikfr mentioned this issue Aug 27, 2018

Tables/graphs with too many data points cannot render #2728

Closed

arikfr added the backlog label Dec 17, 2018

arikfr mentioned this issue Jan 8, 2019

Memory usage #3241

Closed

arikfr mentioned this issue Sep 16, 2019

Add interface to abstract query result persistence #4147

Merged

1 task

noxdafox mentioned this issue May 10, 2020

Redash OOM with mid-sized queries #4867

Closed

billux mentioned this issue May 25, 2020

"invalid memory alloc request size" from PostgreSQL with large query results #4918

Closed

noxdafox mentioned this issue Nov 23, 2023

Redash crashes on query that returns a lot of results #6608

Open

Ariffirdausazman mentioned this issue Dec 22, 2023

[Snyk] Fix for 46 vulnerabilities Reflektion/redash#4

Open

wtfiwtz mentioned this issue Mar 19, 2024

unexpected EOF on client connection with an open transaction #5641

Open

zachliu mentioned this issue Jul 27, 2024

Very high memory consumption after updating to latest redash version #7048

Open
