Memory usage #3241
Comments
Large result sets are a known issue in Redash (#78), but we never prioritized handling them better because there is usually little use for them in Redash directly: you can't visualize them or review them manually. With Redash, you would usually use your database to slice and dice your data and use the resulting data sets for sharing or visualizations. Can you share a bit more about what you were planning to use those result sets for?
To be honest, my main use case here is to allow people to export to CSV so they can slice up the data in their own way. I have users who are technical enough to write their own queries but who I don't really want to give access to the AWS Console to use Athena directly. I notice that there is a Query Results data source which would potentially be interesting here too, but a CSV download would solve my immediate need.
If they can write their own queries, why not let them slice and dice the data using Athena via Redash? The Query Results data source will stumble on the same memory issue, as it loads all the data into a memory-backed SQLite instance.
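For context on that memory behaviour, here is a minimal sketch of the pattern described: every row of a child query's result set ends up copied into an in-memory SQLite database, so the footprint grows with the full result size. The function and table names below are hypothetical and do not mirror Redash's actual internals.

```python
import sqlite3

def load_results_into_sqlite(table_name, columns, rows):
    """Sketch only: copy an entire result set into an in-memory SQLite table.
    Memory use grows with the full size of the result, because nothing is
    streamed or spilled to disk."""
    connection = sqlite3.connect(":memory:")  # the whole table lives in RAM
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f"CREATE TABLE {table_name} ({column_list})")
    connection.executemany(
        f"INSERT INTO {table_name} VALUES ({placeholders})",
        rows,  # all rows are materialized up front
    )
    return connection

# Example: 1.2 million modest rows already cost hundreds of MB once they are
# held as Python objects and duplicated into SQLite.
conn = load_results_into_sqlite(
    "query_42", ["id", "value"], [(i, "x" * 40) for i in range(1_200_000)]
)
```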
Some people just can't let go of Excel 🤷‍♂️
Based on some of the comments in #78 it seems like most would be happy with a CSV export. I'm happy to spend a little time looking into the feasibility of a PR; I take it you're still open to a contribution in that regard? Given the query seems to be successful on the worker, as the results exist in the `query_results` table…
True 😆
Yes, but considering how critical this code path is, I would like to discuss implementation details first.
If you keep increasing the machine size, eventually it will have enough memory :) But you will have an issue on the frontend with processing this amount of data, so some changes will be necessary regardless.
OK, I'm not a Python dev by any means, so this probably isn't for me. I'll come up with a different method to export raw query results. Thanks for the responses and help though! 👍
Issue Summary
I'm trying Redash to query an Athena data source, but memory usage is very high. I initially tried running it in a Nomad cluster, but I've also tried running it directly on an EC2 instance using the official AMI, with the same results.
I'm intending to run queries on a fairly large dataset (100s to 1000s of GB), but with result sets probably capping out at around 1.5-3 GB. I realise that I'll need to scale the Redash instance accordingly, but even with very small tests I seem to be hitting limits.
Steps to Reproduce
That query takes a couple of seconds in the Athena console and scans ~200MB of data. The result set in CSV is 63MB and contains 1.2 million rows. Running the same query in Redash takes around 4 minutes until it fails with this error in the UI:
`docker logs` on the server container shows a memory error. The query seems to be completing, and I can see a record created in the `query_results` table. The Athena web UI returns CSV while Redash stores JSON, which is larger: `select length(data) from query_results` shows the result set there is about 119MB.

I'm testing on a t3.medium instance which only has 4GB of memory. I'm not expecting to be able to run larger queries on this, but I would expect something with only ~100MB of result data to work fine.
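As a rough illustration of why a ~119MB JSON blob can translate into far more memory once parsed into Python objects, here is a self-contained measurement using only the standard library. The payload below is synthetic and the numbers are only indicative; it is not a measurement taken from Redash itself.

```python
import json
import tracemalloc

# Build a synthetic result set roughly shaped like a stored query result.
blob = json.dumps({
    "columns": [{"name": "id"}, {"name": "value"}],
    "rows": [{"id": i, "value": "x" * 40} for i in range(200_000)],
})

# Parsed dicts and lists usually take several times the size of the
# serialized text, which is where a "small" result can exhaust RAM.
tracemalloc.start()
parsed = json.loads(blob)
current, peak = tracemalloc.get_traced_memory()
print(f"serialized: {len(blob) / 1e6:.1f} MB, peak while parsing: {peak / 1e6:.1f} MB")
```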
Is there any guidance on hardware scaling, or anything I can do to try to troubleshoot this? Surely a t3.medium should be able to handle a query like this?
Technical details: