Avoid stack overflow by reducing stack usage of BinaryExpr::evaluate in debug builds #1047

Merged: 5 commits merged into apache:master from alamb/reduce_stack_usage on Sep 25, 2021

alamb (Contributor) commented Sep 24, 2021

Which issue does this PR close?

Closes #419

Rationale for this change

Prior to this PR, trying to evaluate BinaryExprs nested more than a few levels deep (e.g. something like a + a + a + a + a + a + a + a ...) would result in a stack overflow in debug builds.

For example, the test included in this PR fails like this with a tree_depth of 10 (in debug builds):

thread 'physical_plan::expressions::binary::tests::relatively_deeply_nested' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass '-p datafusion --lib'

Caused by:
  process didn't exit successfully: `/Users/alamb/Software/arrow-datafusion/target/debug/deps/datafusion-68280be86fef135e deeply` (signal: 6, SIGABRT: process abort signal)

What changes are included in this PR?

  1. Break BinaryExpr::evaluate into a few smaller functions
  2. Remove special case workaround added for Avro Table Provider #910

Are there any user-facing changes?

Not really, other than avoiding stack overflows while evaluating queries in debug builds.

Technical Backstory

I believe the issue is that in debug builds, each local variable gets its own unique space on the stack. Due to the size of BinaryExpr::evaluate (largely hidden by macros), this results in a ludicrous amount of stack being required for each call to BinaryExpr::evaluate. Since BinaryExpr::evaluate is implemented recursively, even a few nesting levels exhaust a 2MB stack.

You can see evidence of the large stack frame by looking at the disassembly. Here is how I did it on a Mac:
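To make the recursion concrete, here is a minimal sketch using a toy expression type (the `Expr`, `left_deep`, and `evaluate` names are hypothetical stand-ins, not DataFusion's real types). It only illustrates that evaluating a left-deep tree keeps roughly one frame of the recursive function live per nesting level, so the per-frame size is multiplied by the tree depth.

```rust
// Toy stand-in for a physical expression tree; NOT DataFusion's real types.
enum Expr {
    Column(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Build the left-deep tree ((((a + a) + a) + a) ...) with `tree_depth` additions.
fn left_deep(a: i64, tree_depth: usize) -> Expr {
    let mut expr = Expr::Column(a);
    for _ in 0..tree_depth {
        expr = Expr::Add(Box::new(expr), Box::new(Expr::Column(a)));
    }
    expr
}

/// Recursively evaluate `expr`. Returns the value and the deepest call level
/// reached, where `level` is the level of the current call.
fn evaluate(expr: &Expr, level: usize) -> (i64, usize) {
    match expr {
        Expr::Column(v) => (*v, level),
        Expr::Add(left, right) => {
            let (l, deepest_left) = evaluate(left, level + 1);
            let (r, deepest_right) = evaluate(right, level + 1);
            (l + r, deepest_left.max(deepest_right))
        }
    }
}

fn main() {
    let expr = left_deep(1, 10);
    let (value, deepest) = evaluate(&expr, 1);
    println!("value = {value}, frames live at the deepest point = {deepest}");
}
```

With 10 additions, 11 calls to `evaluate` are live at the deepest point, so whatever one frame costs is paid 11 times over.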

otool -vt target/debug/deps/datafusion-68280be86fef135e > /tmp/df.asm

The resulting assembly shows the stack frame size to be 0x55a10 (350736 bytes):

__ZN118_$LT$datafusion..physical_plan..expressions..binary..BinaryExpr$u20$as$u20$datafusion..physical_plan..PhysicalExpr$GT$8evaluate17h2877dfaf102fa64eE:
0000000100d316d0	pushq	%rbp
0000000100d316d1	movq	%rsp, %rbp
0000000100d316d4	movl	$0x55a10, %eax                  ## the subq instruction below uses this value
0000000100d316d9	callq	0x1025dc540                     ## to effectively subtract 350736 (350K!) from the stack pointer
0000000100d316de	subq	%rax, %rsp
0000000100d316e1	movq	%rdx, -0x4a998(%rbp)
0000000100d316e8	movq	%rsi, -0x4a990(%rbp)
0000000100d316ef	movq	%rdi, %rax
...

In case you were curious, the same function in a release build only uses 0x398 (920 bytes) of stack space.

__ZN118_$LT$datafusion..physical_plan..expressions..binary..BinaryExpr$u20$as$u20$datafusion..physical_plan..PhysicalExpr$GT$8evaluate17h51f0595db0e2c70cE:
00000001009468a0        pushq   %rbp
00000001009468a1        movq    %rsp, %rbp
00000001009468a4        pushq   %r15
00000001009468a6        pushq   %r14
00000001009468a8        pushq   %r13
00000001009468aa        pushq   %r12
00000001009468ac        pushq   %rbx
00000001009468ad        subq    $0x398, %rsp                    ## imm = 0x398
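As a back-of-the-envelope check, the two frame sizes above can be turned into a rough nesting limit. This sketch assumes the 2MB stack mentioned earlier is exactly 2 MiB and ignores every other function on the stack, so the results are upper bounds rather than measurements; `max_frames` is a hypothetical helper name.

```rust
/// How many frames of `frame_bytes` fit into a stack of `stack_bytes`,
/// ignoring everything else that lives on the stack.
fn max_frames(stack_bytes: u64, frame_bytes: u64) -> u64 {
    stack_bytes / frame_bytes
}

fn main() {
    // Assumption: "2MB" is taken as 2 MiB.
    const STACK_BYTES: u64 = 2 * 1024 * 1024;

    let debug_frame: u64 = 0x55a10; // 350736 bytes, from the debug disassembly
    let release_frame: u64 = 0x398; // 920 bytes, from the release disassembly

    println!("debug:   {} frames", max_frames(STACK_BYTES, debug_frame));
    println!("release: {} frames", max_frames(STACK_BYTES, release_frame));
}
```

Under these assumptions only 5 debug frames fit, versus 2279 release frames, which is consistent with the test overflowing at a tree_depth of 10.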

For those of you who are visually minded, here is an illustration:

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─
                   │
│     Function                      Base Pointer
     Parameters    │     ┌──────       (%rbp)
│                        │
┌──────────────────┤     │
│  Return Address  │     │
├──────────────────┤     │
│Saved Base Pointer│◀────┘
├──────────────────┘
    Local Var 1    │
├ ─ ─ ─ ─ ─ ─ ─ ─ ─
    Local Var 2    │             I theorize that in a debug build, a
├ ─ ─ ─ ─ ─ ─ ─ ─ ─           distinct space is reserved for each local
        ...        │                  variable of the function.
├ ─ ─ ─ ─ ─ ─ ─ ─ ─
    Local Var N    │            Thus a large number of local variables
├ ─ ─ ─ ─ ─ ─ ─ ─ ─            will result in a large stack frame (and
                   │              thus a large amount of stack space
│     Function                         consumed for each call)
     Parameters    │
│
┌──────────────────┤
│  Return Address  │◀────┐
└──────────────────┘     │
                         │
                         │
                         │         Stack Pointer
                         └──────       (%rsp)

The "fix", if you will, is to break BinaryExpr::evaluate into smaller functions that each require less stack space.

The stack frame size of BinaryExpr::evaluate after this PR is 0x770 (1904 bytes):

__ZN118_$LT$datafusion..physical_plan..expressions..binary..BinaryExpr$u20$as$u20$datafusion..physical_plan..PhysicalExpr$GT$8evaluate17h2877dfaf102fa64eE:
0000000100d31a50        pushq   %rbp
0000000100d31a51        movq    %rsp, %rbp
0000000100d31a54        subq    $0x770, %rsp                    ## imm = 0x770
0000000100d31a5b        movq    %rdx, -0x6a8(%rbp)
0000000100d31a62        movq    %rsi, -0x6a0(%rbp)

Note that even though some of the new functions require a non-trivial amount of stack (listed below), a major difference is that they are not called recursively, and thus at most one frame of each is ever on the stack:

evaluate_array_scalar:       0x1DE80 (122496 bytes)
evaluate_scalar_array:       0x162A0 (90784 bytes)
evaluate_with_resolved_args: 0x1F970 (129392 bytes)
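The general shape of this kind of split can be sketched as follows. This is a toy illustration under stated assumptions, not the code from this PR: `Expr`, `left_deep`, and `add_resolved` are hypothetical names, the 8 KB `scratch` array merely stands in for the bulky macro-generated locals, and the `#[inline(never)]` attribute is an addition here to keep an optimizer from folding the helper back into the recursive function.

```rust
use std::hint::black_box;

// Toy expression type; a stand-in, not DataFusion's BinaryExpr.
enum Expr {
    Literal(u64),
    Add(Box<Expr>, Box<Expr>),
}

/// Build ((((1 + 1) + 1) + 1) ...) with `tree_depth` additions.
fn left_deep(tree_depth: usize) -> Expr {
    let mut expr = Expr::Literal(1);
    for _ in 0..tree_depth {
        expr = Expr::Add(Box::new(expr), Box::new(Expr::Literal(1)));
    }
    expr
}

/// Non-recursive helper. Its large locals occupy the stack only while it
/// runs, and it never calls back into `evaluate`.
#[inline(never)]
fn add_resolved(left: u64, right: u64) -> u64 {
    // Stand-in for the large locals of the real kernel dispatch.
    let scratch = [left; 1024];
    black_box(&scratch);
    scratch[0] + right
}

/// Recursive part: only a few small locals per frame. Both children are
/// fully evaluated before the helper with the big frame is entered.
fn evaluate(expr: &Expr) -> u64 {
    match expr {
        Expr::Literal(v) => *v,
        Expr::Add(l, r) => {
            let left = evaluate(l);
            let right = evaluate(r);
            add_resolved(left, right)
        }
    }
}

fn main() {
    let expr = left_deep(100);
    println!("result = {}", evaluate(&expr));
}
```

The point of the design is ordering: the recursive calls finish before the large frame is pushed, so the large frame is paid once at a time instead of once per nesting level.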

github-actions bot added the datafusion (Changes in the datafusion crate) label Sep 24, 2021
@@ -105,8 +105,6 @@ jobs:
run: |
export ARROW_TEST_DATA=$(pwd)/testing/data
export PARQUET_TEST_DATA=$(pwd)/parquet-testing/data
# run tests on all workspace members with default feature list + avro
Contributor Author

This is the workaround added in #910

@@ -543,86 +543,17 @@ impl PhysicalExpr for BinaryExpr {
)));
}

// Attempt to use special kernels if one input is scalar and the other is an array
Contributor Author

No changes to this function's behavior are intended; this simply breaks it up into several smaller functions.

let schema = batch.schema();

// build a left deep tree ((((a + a) + a) + a ....
let tree_depth: i32 = 100;
Contributor Author

On master, this test causes a stack overflow with a tree_depth of 10. After the changes in this PR, it passes with a tree_depth of 100 (I didn't try anything bigger).

Contributor

🎉

NGA-TRAN (Contributor) left a comment

Thanks for the deep-dive evaluation and the fix, @alamb

alamb commented Sep 24, 2021

I plan to merge this tomorrow unless there are objections.

houqp (Member) left a comment

Great job on the deep dive and detailed diagram :)

houqp commented Sep 25, 2021

This kind of "match to many arms with macros" pattern is very common in our code base. If we want to get fancy in the future, we could automate detection of such problems. For example, a linter tool could automatically find all recursive functions and check whether their stack frame size exceeds a certain threshold.

houqp added the performance (Make DataFusion faster) label Sep 25, 2021
houqp merged commit 26399ed into apache:master Sep 25, 2021
alamb deleted the alamb/reduce_stack_usage branch September 26, 2021 10:42

alamb commented Sep 26, 2021

This kind of "match to many arms with macros" pattern is very common in our code base. If we want to get fancy in the future, we could automate detection of such problems. For example, a linter tool could automatically find all recursive functions and check whether their stack frame size exceeds a certain threshold.

I agree -- if we run into this problem again, investing in an automated tool like that sounds like a good idea.
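For what it's worth, the core check of such a tool could look roughly like the sketch below. Everything here is hypothetical: the `FnInfo` record, the function names in `example_graph`, and the 64 KiB threshold are made up for illustration, and a real tool would first have to extract frame sizes and call edges from the compiler output or the disassembly, which this sketch does not attempt.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-function record (frame size plus direct callees).
struct FnInfo {
    frame_bytes: u64,
    callees: Vec<&'static str>,
}

type CallGraph = HashMap<&'static str, FnInfo>;

/// True if `start` can reach itself through the call graph, either
/// directly or through other functions.
fn is_recursive(graph: &CallGraph, start: &'static str) -> bool {
    let mut seen: HashSet<&'static str> = HashSet::new();
    let mut todo: Vec<&'static str> = match graph.get(start) {
        Some(info) => info.callees.clone(),
        None => return false,
    };
    while let Some(name) = todo.pop() {
        if name == start {
            return true;
        }
        // Only expand each function once.
        if seen.insert(name) {
            if let Some(info) = graph.get(name) {
                todo.extend(info.callees.iter().copied());
            }
        }
    }
    false
}

/// Names of recursive functions whose frame exceeds `threshold` bytes, sorted.
fn oversized_recursive(graph: &CallGraph, threshold: u64) -> Vec<&'static str> {
    let mut hits = Vec::new();
    for (name, info) in graph {
        if info.frame_bytes > threshold && is_recursive(graph, *name) {
            hits.push(*name);
        }
    }
    hits.sort();
    hits
}

/// Made-up graph using the frame sizes quoted in this PR.
fn example_graph() -> CallGraph {
    HashMap::from([
        ("evaluate_before", FnInfo { frame_bytes: 350_736, callees: vec!["evaluate_before"] }),
        ("evaluate_after", FnInfo { frame_bytes: 1_904, callees: vec!["evaluate_after", "evaluate_with_resolved_args"] }),
        ("evaluate_with_resolved_args", FnInfo { frame_bytes: 129_392, callees: vec![] }),
    ])
}

fn main() {
    let flagged = oversized_recursive(&example_graph(), 64 * 1024);
    println!("oversized recursive functions: {flagged:?}");
}
```

Only the recursive function with the large frame is flagged; the large but non-recursive helper and the small recursive function both pass.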

Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster
Development

Successfully merging this pull request may close these issues.

parquet::build_row_gropup_predicate hits stackoverflowed
4 participants