Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fatal Python error: Cannot recover from stack overflow. #493

Closed
lukepalmer opened this issue Sep 30, 2021 · 7 comments
Closed

Fatal Python error: Cannot recover from stack overflow. #493

lukepalmer opened this issue Sep 30, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@lukepalmer
Copy link

It happens here:

    try:
        if rule.get('scroll_id') and self.thread_data.num_hits < self.thread_data.total_hits and should_scrolling_continue(rule):
            if not self.run_query(rule, start, end, scroll=True):
                return False
    except RuntimeError:
        # It's possible to scroll far enough to hit max recursive depth
        pass

This seems like whomever coded this was aware that this could happen, but i don't believe any kind of exception handling can deal with a stack overflow.

I don't have a reproduction unfortunately. This has happened once on an otherwise static setup. Of note is that my elastic instance can be quite slow at times.

@nsano-rururu
Copy link
Collaborator

I'm not familiar with python, so please let me know. What kind of code should I fix?

@lukepalmer
Copy link
Author

I'll start by saying that I'm not particularly familiar with the code either.

At a high level I can see that the function run_query is recursive. While I don't know why the original author chose that, it generally seems like a bad idea. Rewriting it to be iterative instead of recursive seems like an improvement.

I'd welcome comments from anyone more familiar with elastalert.

@nsano-rururu
Copy link
Collaborator

Two fix pull requests have been merged in 2016 and 2019. It was the time of the original yelp / elastalert, and I don't know the details, but it seems that the problem often occurred.

2016
Yelp/elastalert#652

Fixes a bug that would sometimes cause infinite recursion:

Some rule would trigger scrolling and scroll_id would get set.
Because, scroll_id, num_hits, total_hits were all shared between rules, scrolling might trigger on another rule.
The query might use get_hits_terms or get_hits_count which don't take scroll into account at all. They keep making the same query over and over.
Crash with maximum recursion depth
To fix this, scroll_id is set PER rule, and deleted after the recursion ends. We also just end the scroll when maximum recursion depth is reached.

2019
Yelp/elastalert#2271
Yelp/elastalert#2394

@nsano-rururu
Copy link
Collaborator

I found an unmerged pull request on the original yelp / elastalert in 2019.

refactor run_query
Yelp/elastalert#2345

@lukepalmer
Copy link
Author

At a glance, that refactoring is about the right shape. It is of course done on an older version and would need to be merged forward and tested. Nice find.

@nsano-rururu nsano-rururu added the bug Something isn't working label Oct 5, 2021
@nsano-rururu
Copy link
Collaborator

@ferozsalam @jertel

What do you think about this issue? ..

Reason for question

I'm not using elastalert2, I'm just checking the operation to see if the changes work correctly only when I make a pull request. .. So I don't know if this is a frequent issue or if it needs to be addressed urgently.

@jertel
Copy link
Owner

jertel commented Oct 12, 2021

I've not seen this error myself, but it is something that should be corrected. Python's stack limit is 1000 stack frames so it you have enough search results to an ElastAlert query then it is technically possible to hit this bug, depending on the scroll size. There are multiple options, in order of complexity:

  1. Set a default value for max_scrolling_count to 990 to stop the recursion before it hits the stack limit. There's a point of diminishing returns for an alerting framework to actually need to pull tens of thousands of results back anyway.
  2. Refactor the run_query() function to use looping instead of recursion. This solves the stack overflow issue, by continuing to utilize the scroll methodology, but this is not recommended practice for deep paging. Again though, what is an alerting system going to do with tens of thousands of search results?
  3. Replace scrolling with the Elastic recommended search_after params, for deep paging. This is a bigger change and will need to use the point in time indexing.

#1 would be very simple to implement and update the docs.

@jertel jertel closed this as completed Oct 12, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants