
Memory usage #3241

Closed
ccakes opened this issue Jan 2, 2019 · 7 comments
ccakes commented Jan 2, 2019

Issue Summary

I'm trying Redash with an Athena data source, but memory usage is very high. I initially tried running it in a Nomad cluster, but I've also tried running it directly on an EC2 instance using the official AMI, with the same results.

I'm intending to run queries on a fairly large dataset (100s to 1000s of GB), but with result sets probably capping out at around 1.5-3 GB. I realise that I'll need to scale the Redash instance accordingly, but even with very small tests I seem to be hitting limits.

Steps to Reproduce

  1. Run this query (data is ingested to AWS using Firehose, with partitions created by a Lambda):
select stamp_inserted, as_src, as_dst, sum(bytes) bytes
from data_analytics.flowdata
where year = '2018' and month = '12' and day = '30' and hour = '20'
group by 1, 2, 3
order by stamp_inserted;

That query in the Athena console takes a couple of seconds and scans ~200MB of data. The result set in CSV is 63MB and contains 1.2 million rows. Trying the same query in Redash takes around four minutes until it fails with this error in the UI:

Error running query: failed communicating with server. Please check your Internet connection and try again.

docker logs on the server container shows a memory error.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python2.7/dist-packages/flask_restful/__init__.py", line 477, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/flask_login/utils.py", line 228, in decorated_view
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "/app/redash/handlers/base.py", line 31, in dispatch_request
    return super(BaseResource, self).dispatch_request(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/flask_restful/__init__.py", line 587, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/app/redash/permissions.py", line 48, in decorated
    return fn(*args, **kwargs)
  File "/app/redash/handlers/query_results.py", line 230, in get
    response = self.make_json_response(query_result)
  File "/app/redash/handlers/query_results.py", line 248, in make_json_response
    data = json.dumps({'query_result': query_result.to_dict()}, cls=utils.JSONEncoder)
  File "/usr/lib/python2.7/json/__init__.py", line 251, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 210, in encode
    return ''.join(chunks)
MemoryError
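
The `''.join(chunks)` frame in the traceback is the tell: `json.dumps` materializes the entire serialized payload as one string, so peak memory is roughly the full JSON text on top of the row objects already held in memory. As a sketch (not Redash's actual code), the stdlib's `JSONEncoder.iterencode` produces the same output in chunks, which a web app could stream to the client instead of buffering:

```python
import json

def stream_json(obj):
    """Yield the JSON serialization of obj chunk by chunk,
    instead of building one giant string like json.dumps does."""
    encoder = json.JSONEncoder()
    for chunk in encoder.iterencode(obj):
        yield chunk

# Illustrative payload shaped loosely like a Redash query result.
payload = {'query_result': {'rows': [{'bytes': i} for i in range(3)]}}
streamed = ''.join(stream_json(payload))
assert streamed == json.dumps(payload)  # same output, produced incrementally
```

Streaming only helps if the framework forwards chunks to the socket as they are produced; joining them back together, as above, would of course reintroduce the same peak.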

The query seems to be completing and I can see a record created in the query_results table. The Athena web UI returns CSV while Redash stores JSON - JSON being larger, a select length(data) from query_results shows the result set there is about 119MB.
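
The roughly 2x blow-up from 63MB of CSV to ~119MB of JSON is about what you'd expect if rows are stored as a list of column-name-to-value objects, since every row then repeats every column name. A small illustration (the column names are made up, not Redash's actual storage schema):

```python
import csv
import io
import json

# Rows shaped like the flow-data query's output; values are placeholders.
rows = [{'stamp_inserted': '2018-12-30 20:00', 'as_src': 65001,
         'as_dst': 65002, 'bytes': 123456} for _ in range(1000)]

# CSV: column names appear exactly once, in the header line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue())

# JSON list-of-objects: every row repeats every column name.
json_size = len(json.dumps(rows))

print(csv_size, json_size)  # JSON comes out substantially larger
```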

  2. Any other info, e.g. why do you consider this to be a bug? What did you expect to happen instead?

I'm testing on a t3.medium instance which only has 4GB of memory. I'm not expecting to be able to run larger queries on this, but I would expect something with only ~100MB of result data to work fine.

Is there any guidance on hardware scaling, or anything I can do to try to troubleshoot this? Surely a t3.medium should be able to handle a query like this?

Technical details:

  • Redash Version: AWS AMI (redash:5.0.2.b5486)
  • Browser/OS: Firefox Nightly
  • How did you install Redash: AWS AMI
arikfr (Member) commented Jan 8, 2019

Large result sets are a known issue in Redash (#78), but we never prioritized handling it better because usually there is little use for it in Redash directly: you can't visualize it or review it manually. With Redash, you would usually use your database to slice and dice your data and use the resulting data sets for sharing or visualizations.

Can you share a bit more about what you were planning to use those result sets for?

ccakes commented Jan 10, 2019

To be honest - my main use case here is to allow people to export to CSV so they can slice up the data in their own way. I have users who are technical enough to write their own queries but who I don't really want to give access to the AWS Console to use Athena directly.

I notice that there is a Query Results data source which would potentially be interesting here too but a CSV download would solve my immediate need.

arikfr commented Jan 10, 2019

If they can write their own queries, why not let them slice and dice it using Athena via Redash?

The Query Results data source will stumble on the same memory issue, as it loads all the data into a memory-backed SQLite instance.
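
To see why that data source scales with result size: every row of the upstream result set has to be copied into an in-memory database before the follow-up query can run. A minimal stdlib sketch of the idea (table and column names here are made up for illustration, not Redash internals):

```python
import sqlite3

# A ':memory:' database lives entirely in the process's RAM, so
# loading a multi-GB upstream result set means holding all of it
# in memory before the secondary query can execute.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE query_1 (as_src INTEGER, as_dst INTEGER, bytes INTEGER)')

rows = [(65001, 65002, i) for i in range(1000)]  # stands in for a full result set
conn.executemany('INSERT INTO query_1 VALUES (?, ?, ?)', rows)

# Only now can the follow-up query run against the copied data.
total, = conn.execute('SELECT sum(bytes) FROM query_1').fetchone()
```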

ccakes commented Jan 10, 2019

Some people just can't let go of Excel 🤷‍♂️

ccakes commented Jan 10, 2019

Based on some of the comments in #78 it seems like most would be happy with a CSV export. I'm happy to spend a little time looking into the feasibility of a PR, I take it you're still open to a contribution in that regard?

Given the query seems to be successful on the worker, as the results exist in the query_results table, I'm assuming (hoping) that just serving that data shouldn't be too tricky.
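
One way a CSV export could serve the stored results without the JSON-response memory spike is to generate the CSV row by row rather than as one big string. A hedged sketch, assuming rows are already available as an iterable (the function name and shapes are hypothetical, not Redash's API):

```python
import csv
import io

def iter_csv(columns, rows):
    """Yield a CSV export line by line instead of building one big string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    yield buf.getvalue()
    for row in rows:
        buf.seek(0)
        buf.truncate(0)
        writer.writerow(row)
        yield buf.getvalue()

# In a Flask app this generator could be passed to
# Response(iter_csv(...), mimetype='text/csv'),
# so only one row is buffered at a time.
chunks = list(iter_csv(['as_src', 'bytes'], [(65001, 10), (65002, 20)]))
```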

arikfr commented Jan 10, 2019

Some people just can't let go of Excel 🤷‍♂️

True 😆

I take it you're still open to a contribution in that regard?

Yes, but considering how critical this code path is, I would like to discuss implementation details first.

Given the query seems to be successful on the worker, as the results exist in the query_results table, I'm assuming (hoping) that just serving that data shouldn't be too tricky.

If you keep increasing the machine size, eventually it will have enough memory :) But you will have an issue on the frontend with processing this amount of data, so some changes will be necessary regardless.

ccakes commented Jan 10, 2019

Yes, but considering how critical this code path is, I would like to discuss implementation details first.

Ok, I'm not a Python dev by any means so this probably isn't for me. I'll come up with a different method to export raw query results.

Thanks for the responses and help though! 👍
