Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Union columns can never be NULL #11162

Closed
samuelcolvin opened this issue Jun 28, 2024 · 7 comments · Fixed by #11321
Closed

Union columns can never be NULL #11162

samuelcolvin opened this issue Jun 28, 2024 · 7 comments · Fixed by #11321
Labels
bug Something isn't working

Comments

@samuelcolvin
Copy link
Contributor

Describe the bug

Maybe I'm doing something wrong with my union in datafusion-functions-json, but is null expressions never evaluate to true with the JsonUnion column.

They behave as expected when the function in question returns a scalar JsonUnion.

To Reproduce

See datafusion-contrib/datafusion-functions-json#24

Expected behavior

When the value in every column of the union is null:

  • <union column> is null should evaluate to null
  • I guess it would also be nice if the repr of null values in a union was an empty string same as everything else, not {<first column_name>=} (which I what I think it is now)

Additional context

No response

@samuelcolvin samuelcolvin added the bug Something isn't working label Jun 28, 2024
@samuelcolvin
Copy link
Contributor Author

@alamb any idea on where I would start looking to try and fix this?

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

I would suggest writing a standlone test case / reproducer as the first step

Then I suspect we can either help you find the code needed to be fixed (or maybe even someone would be interested in fixing it themselves)

@samuelcolvin samuelcolvin changed the title Union columns can never be NULL (I think?) Union columns can never be NULL Jul 7, 2024
@samuelcolvin
Copy link
Contributor Author

See #11314 as a demonstration of the problem for both dense and sparse unions.

After a bit of investigation, the issues lies in the first instance with

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
let arg = self.arg.evaluate(batch)?;
match arg {
ColumnarValue::Array(array) => Ok(ColumnarValue::Array(Arc::new(
compute::is_null(array.as_ref())?,
))),
ColumnarValue::Scalar(scalar) => Ok(ColumnarValue::Scalar(
ScalarValue::Boolean(Some(scalar.is_null())),
)),
}
}

Then with this code in arrow-rs:

/// Returns a non-null [BooleanArray] with whether each value of the array is null.
/// # Error
/// This function never errors.
/// # Example
/// ...
pub fn is_null(input: &dyn Array) -> Result<BooleanArray, ArrowError> {
    let values = match input.logical_nulls() {
        None => BooleanBuffer::new_unset(input.len()),
        Some(nulls) => !nulls.inner(),
    };

    Ok(BooleanArray::new(values, None))
}

And then with this code

    /// Union types always return non null as there is no validity buffer.
    /// To check validity correctly you must check the underlying vector.
    fn is_null(&self, _index: usize) -> bool {
        false
    }

Ultimately with the spec

Unlike other data types, unions do not have their own validity bitmap. Instead,
the nullness of each slot is determined exclusively by the child arrays which
are composed to create the union.


Basically arrow is saying "we're not going to tell you if a union is null, you need to look in the child arrays", but datafusion isn't listening and is just asking the union if it's null in the naive way.

Two options to move forward as far as I can tell:

  1. Decide unions in DF can never be null — I'll need to abandon unions in datafusion-functions-json and just return strings everywhere
  2. Have custom logic for unions that looks up the child array to determine if the value is null

If (as I hope) we go for the second option, there's also the issue (as demonstrated by #11314) that the representation of "null" union items doesn't match other types, it shows {A=} instead of an empty string.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 7, 2024

Have custom logic for unions that looks up the child array to determine if the value is null

+1 for second option. I think we should check the children's nullability.

@samuelcolvin
Copy link
Contributor Author

I suppose there's a third option of updating arrow-rs to correctly calculate if a UnionArray is null, but I presume that works take much longer

@samuelcolvin
Copy link
Contributor Author

I've proposed a fix in #11321.

@alamb
Copy link
Contributor

alamb commented Jul 7, 2024

I suppose there's a third option of updating arrow-rs to correctly calculate if a UnionArray is null, but I presume that works take much longer

It would likely take longer

Note there is a method that takes into account child nullability that perhaps we could use instead of is_null:
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls

Update: it turns out logical_nulls is incorrect for UnionArray: apache/arrow-rs#6017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants