Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regularly Segfaulting in NOT EXIST query #7394

Open
caseykneale opened this issue Aug 24, 2023 · 6 comments
Open

Regularly Segfaulting in NOT EXIST query #7394

caseykneale opened this issue Aug 24, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@caseykneale
Copy link

Describe the bug

The data I am using is in a normalized parquet format and each "table" is in its own file, each registered to the SessionContext instance, and can be queried with a variety of other queries. The problematic query has maybe 4 JOINs (outer, inner, etc) and it ends with a NOT EXISTS clause on a subquery with another set of the roughly the same 4 JOINs on different tables. The outer join occurs inside of a view which is shared by the other 3 inner joins. The query planning succeeds. The query runs for a while, maybe 15 minutes, and then appears as though it has completed (CPU cores spin down, RAM consumption goes down to baseline). The SegFault happens during the collection of the DataFrame itself(it's awaited on) before the RecordBatches are collected from the Dataframe. For what its worth, the dataframe should be empty at the end of this query as its serving as a control for a unit test.

Then segfault currently occurs on an intel Mac. I saw an open issue about seg faulting in unit tests #5693 and don't know whether or not this could be the same issue.

I see a few blocks of unsafe code in the project, most of which look benign, but I haven't ruled out a stack overflow scenario. Not sure where to poke at. May try adjusting RUST_MIN_STACK to see if that helps? Or memoizing the subquery results before the NOT EXIST call?

Any suggestions appreciated.

To Reproduce

I can't share the data to reproduce this or the code unfortunately but something tells me I could make a MRE as I doubt this behavior is exclusive to this type of query.

Expected behavior

This may sound terse but I mean this in the most polite way possible. Ideally queries do not segfault.

Additional context

No response

@caseykneale caseykneale added the bug Something isn't working label Aug 24, 2023
@Dandandan
Copy link
Contributor

Just a check, is your project running in release mode? cargo run --release? Can you share the output / example?

@caseykneale
Copy link
Author

caseykneale commented Aug 24, 2023

Yes it is running in release. I do cargo build --release then run the binary. I can share the output, but it basically has a println after the query plan occurs and then a segfault which occurs before a println saying the dataframe/query was collected.

I can't share the exact example, maybe I can share the queries if I obfuscate them a bit.

@caseykneale
Copy link
Author

caseykneale commented Aug 24, 2023

I have distilled the query down specifically to a NOT EXIST OR a LEFT JOIN by creating in memory tables of the join queries. My segfaulting query looks like this:

    SELECT 
        lhs.uuid, 
        lhs.thing,
        lhs.other_thing,
        lhs.other_other_thing 
    FROM lhs
    WHERE NOT EXISTS (
        SELECT * FROM rhs
        WHERE lhs.uuid = rhs.uuid
            AND lhs.thing = rhs.thing
            AND lhs.other_thing = rhs.other_thing
            AND lhs.other_other_thing = rhs.other_other_thing
     )

where all types are nullable strings, or int64s.

At first I thought it had to do with nesting, now I am wondering if it has to do with nullable comparisons inside of a not exist? The same happens with LEFT OUTER JOIN for what it's worth.

@caseykneale caseykneale changed the title Regularly Segfaulting in Nested Query Regularly Segfaulting in NOT EXIST query Aug 24, 2023
@caseykneale
Copy link
Author

if I replace the AND's with checks for IS NOT NULL first on the optional fields, I now OOM. The OOM is surprising, the data isn't that large (<500 MB on disk uncompressed) and the query itself isn't that demanding but I'm blowing through 50GB SWAP and 16GB RAM.

So it seems like I am possibly blowing out the stack or there is a memory bug associated with NOT EXISTS with WHERE clauses that compare NULL's. But in general the memory consumption here seems quite high for a query returning zero records.

I might do better JOINing the two tables and looking for NULLs in a for loop post serialization?

@caseykneale
Copy link
Author

I was able to work around these issues by sorting the data frames and single thread walking the OUTER JOIN/NOT EXISTS in rust without getting anything near an OOM, nor a SO. This is at least 15x faster than the failed runs despite costs to serialization, etc. I recommend this approach as a workaround should anyone else run into this.

I do hope to see the performance and correctness of these queries improved because I really do like this project.

@ozankabak
Copy link
Contributor

I think we should understand what is going on here. Can you post some dummy data and an accompanying query to reproduce on our end?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants