Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safe API to replace NullBuffers for Arrays #6528

Open
alamb opened this issue Oct 8, 2024 · 8 comments
Open

Safe API to replace NullBuffers for Arrays #6528

alamb opened this issue Oct 8, 2024 · 8 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While implementing apache/datafusion#12792 and various other things in DataFusion I find myself often wanting to combine a filter and a null mask

The relevant code is like

        // combine existing nulls, if any, and a filter:
        let nulls = filtered_null_mask(opt_filter, input);

        // make a new array with a new null buffer
        let output: ArrayRef = match input.data_type() {
            // TODO it would be nice to have safe apis in arrow-rs to update the null buffers in the arrays
            DataType::Utf8 => {
                let input = input.as_string::<i32>();
                // safety: values / offsets came from a valid string array, so are valid utf8
                // and we checked nulls has the same length as values
                unsafe {
                    Arc::new(StringArray::new_unchecked(
                        input.offsets().clone(),
                        input.values().clone(),
                        nulls,
                    ))
                }
            }

Describe the solution you'd like
I would like an API like with_nulls that returns a new array with the same data but a new null mask so my code would look like

        // combine existing nulls, if any, and a filter:
        let nulls = filtered_null_mask(opt_filter, input);

        // make a new array, with the same data but a new null buffer
        // panics if the nulls length doesn't match the array's length
        let output = input.with_nulls(nulls);

Describe alternatives you've considered
I can keep using the unsafe APIs

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Oct 8, 2024
@tustvold
Copy link
Contributor

tustvold commented Oct 9, 2024

I think this makes sense, the only thing to be careful with is arrays where null values may have undefined contents, e.g. dictionaries. In such cases, allowing users to go from a null to not null, could have safety implications

@tustvold
Copy link
Contributor

tustvold commented Oct 9, 2024

FWIW the nullif kernel is very similar to this, but with the caveat that nulls can only remain null, avoiding the above issue

Edit: in fact the operation you describe is the nullif kernel I think...

@tustvold
Copy link
Contributor

I think the nullif kernel provides this, so perhaps this can be closed?

@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2024

Sounds good -- the current documentation on nullif is pretty sparse (and thus perhaps we can make it easier to discover / more likely people can find it) with some better docs

https://docs.rs/arrow/latest/arrow/compute/kernels/nullif/fn.nullif.html

I'll try and find some time

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2024

PR to improve the docs: #6658

After doing that it is somewhat of the inverse of what I was looking for (it sets the element to null when the mask is true, rather than setting the element to null when the mask is not true).

I will attempt a PR with a proposed API as well for consideration

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2024

I made #6659 to add Array::with_nulls but it has the issues @tustvold describes

The actual usecase I have is not to turn elements unnull, but instead make them null if a boolean array is not true (the inverse of what nullif does).

Maybe a better API would be something like nullifnot or similar.

I will ponder

@tustvold
Copy link
Contributor

You could either negate the boolean array, or construct a BooleanArray with a null buffer of the buffer you want to be null if false, and a values buffer of entirely false. Neither is exactly ideal, but the performance, especially of the latter, is likely going to be hard to beat.

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2024

🤔

this is now we do it today in DataFusion: https://github.com/apache/datafusion/blob/f23360f1e0f90abcf92a067de55a9053da545ed2/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/nulls.rs#L53-L102

(basically call NullBuffer::Union) maybe that is good enough 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants