Regularly Segfaulting in NOT EXIST query #7394

caseykneale · 2023-08-24T11:45:01Z

Describe the bug

The data I am using is in a normalized parquet format and each "table" is in its own file, each registered to the SessionContext instance, and can be queried with a variety of other queries. The problematic query has maybe 4 JOINs (outer, inner, etc) and it ends with a NOT EXISTS clause on a subquery with another set of the roughly the same 4 JOINs on different tables. The outer join occurs inside of a view which is shared by the other 3 inner joins. The query planning succeeds. The query runs for a while, maybe 15 minutes, and then appears as though it has completed (CPU cores spin down, RAM consumption goes down to baseline). The SegFault happens during the collection of the DataFrame itself(it's awaited on) before the RecordBatches are collected from the Dataframe. For what its worth, the dataframe should be empty at the end of this query as its serving as a control for a unit test.

Then segfault currently occurs on an intel Mac. I saw an open issue about seg faulting in unit tests #5693 and don't know whether or not this could be the same issue.

I see a few blocks of unsafe code in the project, most of which look benign, but I haven't ruled out a stack overflow scenario. Not sure where to poke at. May try adjusting RUST_MIN_STACK to see if that helps? Or memoizing the subquery results before the NOT EXIST call?

Any suggestions appreciated.

To Reproduce

I can't share the data to reproduce this or the code unfortunately but something tells me I could make a MRE as I doubt this behavior is exclusive to this type of query.

Expected behavior

This may sound terse but I mean this in the most polite way possible. Ideally queries do not segfault.

Additional context

No response

The text was updated successfully, but these errors were encountered:

Dandandan · 2023-08-24T11:55:16Z

Just a check, is your project running in release mode? cargo run --release? Can you share the output / example?

caseykneale · 2023-08-24T11:57:21Z

Yes it is running in release. I do cargo build --release then run the binary. I can share the output, but it basically has a println after the query plan occurs and then a segfault which occurs before a println saying the dataframe/query was collected.

I can't share the exact example, maybe I can share the queries if I obfuscate them a bit.

caseykneale · 2023-08-24T16:11:44Z

I have distilled the query down specifically to a NOT EXIST OR a LEFT JOIN by creating in memory tables of the join queries. My segfaulting query looks like this:

    SELECT 
        lhs.uuid, 
        lhs.thing,
        lhs.other_thing,
        lhs.other_other_thing 
    FROM lhs
    WHERE NOT EXISTS (
        SELECT * FROM rhs
        WHERE lhs.uuid = rhs.uuid
            AND lhs.thing = rhs.thing
            AND lhs.other_thing = rhs.other_thing
            AND lhs.other_other_thing = rhs.other_other_thing
     )

where all types are nullable strings, or int64s.

At first I thought it had to do with nesting, now I am wondering if it has to do with nullable comparisons inside of a not exist? The same happens with LEFT OUTER JOIN for what it's worth.

caseykneale · 2023-08-24T18:07:42Z

if I replace the AND's with checks for IS NOT NULL first on the optional fields, I now OOM. The OOM is surprising, the data isn't that large (<500 MB on disk uncompressed) and the query itself isn't that demanding but I'm blowing through 50GB SWAP and 16GB RAM.

So it seems like I am possibly blowing out the stack or there is a memory bug associated with NOT EXISTS with WHERE clauses that compare NULL's. But in general the memory consumption here seems quite high for a query returning zero records.

I might do better JOINing the two tables and looking for NULLs in a for loop post serialization?

caseykneale · 2023-08-25T15:53:22Z

I was able to work around these issues by sorting the data frames and single thread walking the OUTER JOIN/NOT EXISTS in rust without getting anything near an OOM, nor a SO. This is at least 15x faster than the failed runs despite costs to serialization, etc. I recommend this approach as a workaround should anyone else run into this.

I do hope to see the performance and correctness of these queries improved because I really do like this project.

ozankabak · 2023-09-16T19:10:25Z

I think we should understand what is going on here. Can you post some dummy data and an accompanying query to reproduce on our end?

caseykneale added the bug Something isn't working label Aug 24, 2023

caseykneale changed the title ~~Regularly Segfaulting in Nested Query~~ Regularly Segfaulting in NOT EXIST query Aug 24, 2023

alamb mentioned this issue Dec 1, 2023

[Epic] A collection of Join Improvements #8398

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regularly Segfaulting in NOT EXIST query #7394

Regularly Segfaulting in NOT EXIST query #7394

caseykneale commented Aug 24, 2023

Dandandan commented Aug 24, 2023

caseykneale commented Aug 24, 2023 •

edited

Loading

caseykneale commented Aug 24, 2023 •

edited

Loading

caseykneale commented Aug 24, 2023

caseykneale commented Aug 25, 2023

ozankabak commented Sep 16, 2023

Regularly Segfaulting in NOT EXIST query #7394

Regularly Segfaulting in NOT EXIST query #7394

Comments

caseykneale commented Aug 24, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

Dandandan commented Aug 24, 2023

caseykneale commented Aug 24, 2023 • edited Loading

caseykneale commented Aug 24, 2023 • edited Loading

caseykneale commented Aug 24, 2023

caseykneale commented Aug 25, 2023

ozankabak commented Sep 16, 2023

caseykneale commented Aug 24, 2023 •

edited

Loading

caseykneale commented Aug 24, 2023 •

edited

Loading