Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stride is not optional for new array_slice UDF #10424

Closed
Tracked by #670
Michael-J-Ward opened this issue May 8, 2024 · 17 comments · Fixed by #10469
Closed
Tracked by #670

stride is not optional for new array_slice UDF #10424

Michael-J-Ward opened this issue May 8, 2024 · 17 comments · Fixed by #10469
Labels
bug Something isn't working

Comments

@Michael-J-Ward
Copy link
Contributor

Describe the bug

The array_slice UDF takes 4 parameters.

make_udf_function!(
ArraySlice,
array_slice,
array begin end stride,
"returns a slice of the array.",
array_slice_udf
);

Which means that args.len() is always 4 in array_slice_inner, even when called with stride = Expr::Null or stride = Expr::Int64(None)

let stride = if args_len == 4 {
Some(as_int64_array(&args[3])?)
} else {
None
};

To Reproduce

I encountered this while creating the python wrapper for datafusion-python and had to set stride=1 else the function would panic.

https://github.com/apache/datafusion-python/blob/7ad526caf140f7eb76ec541c903027d49693b58f/src/functions.rs#L104-L111

Expected behavior

The stride behavior in the UDF should match what's available in the CLI

CREATE TEMPORARY VIEW data3 AS VALUES ([1.0, 2.0, 3.0, 3.0]), ([4.0, 5.0, 3.0]), ([6.0]);
❯ select * from data3;
+----------------------+
| column1              |
+----------------------+
| [1.0, 2.0, 3.0, 3.0] |
| [4.0, 5.0, 3.0]      |
| [6.0]                |
+----------------------+

> select * from array_slice(data3.column1, -1, 2)
+-----------------------------------------------+
| array_slice(data3.column1,Int64(-1),Int64(2)) |
+-----------------------------------------------+
| []                                            |
| []                                            |
| [6.0]                                         |
+-----------------------------------------------+

Additional context

No response

@alamb
Copy link
Contributor

alamb commented May 8, 2024

Thanks @Michael-J-Ward -- what is your ideal outcome? That the UDF function can take three arguments (and default the fourth value to a constant 1)?

@alamb
Copy link
Contributor

alamb commented May 8, 2024

@Michael-J-Ward
Copy link
Contributor Author

Michael-J-Ward commented May 8, 2024

what is your ideal outcome? That the UDF function can take three arguments (and default the fourth value to a constant 1)?

I would expect that there exists some value for stride = Expr (likely Expr::Null or Expr::Int64(None)) such that the behavior of calling the UDF w/ that value matches the behavior of calling the SQL api without it

Roughly:

# rust
expr_fn::array_slice(array, begin, end, stride)
   
   ==

# sql   
`select * from array_slice(array, begin, end)`.

That way datafusion-python can map stride to that value when a user calls f.array_slice(col, literal(2), literal(4), stride=None)

EDIT:
We would think stride = 1 should be that.

The frustration encountered with #10425 is that we were unable to trigger the code-path that the SQL API uses from the UDF.

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 9, 2024

I agree it is helpful if we support expr_fn::array_slice(array, begin, end).
We can't only provide three args for it now

But I can't think of any good solution to this

What you are expecting is something not supported in rust -- variadic function

fn array_slice(array:Expr, begin:Expr, end:Expr) {
    // with stride 1
}

fn array_slice(array:Expr, begin:Expr, end:Expr, stride:Expr) {
    // with stride
}

I think the issue here should be handled in datafusion-python

I encountered this while creating the python wrapper for datafusion-python and had to set stride=1 else the function would panic.

@Michael-J-Ward Did you successfully create expression without stride=1 before 38? It would be surprising if it works previously.

@jonahgao
Copy link
Member

jonahgao commented May 9, 2024

Is it possible to make the array_slice function accept a vector as its parameter?

fn array_slice(args: Vec<Expr>) {
}

When making its UDF function, we need to do

make_udf_function!(
    ArraySlice,
    array_slice,
    "returns a slice of the array.",
    array_slice_udf
);

@alamb
Copy link
Contributor

alamb commented May 9, 2024

Another potential option is to add a second expr fn with a different signature

fn array_slice_with_stride(array:Expr, begin:Expr, end:Expr, stride:Expr) {
...
}

@jayzhan211
Copy link
Contributor

Is it possible to make the array_slice function accept a vector as its parameter?

fn array_slice(args: Vec<Expr>) {
}

When making its UDF function, we need to do

make_udf_function!(
    ArraySlice,
    array_slice,
    "returns a slice of the array.",
    array_slice_udf
);

The downside of this is we loss the clear argument name and we need to check args length.

fn array_slice_with_stride(array:Expr, begin:Expr, end:Expr, stride:Expr) {
...
}

I prefer this

Michael-J-Ward added a commit to Michael-J-Ward/datafusion that referenced this issue May 10, 2024
All i did was expand the `make_udf_function` macro and add the `if let Some(stride) = stride` conditional.

To me, making the argument `Option<_>` is the natural way to make it optional in rust.

I don't know if this solution violates other datafusion constraints, but `cargo test` all passed.

Ref: apache#10424
@Michael-J-Ward
Copy link
Contributor Author

First, I'd like to emphasize that if passing stride=1 to the UDF worked the same as not providing a stride in the SQL api, then this would be just porcelain / a nicety. Reposting the example from: #10425

CREATE TEMPORARY VIEW data3 AS VALUES ([1.0, 2.0, 3.0, 3.0]), ([4.0, 5.0, 3.0]), ([6.0]);
❯ select * from data3;
+----------------------+
| column1              |
+----------------------+
| [1.0, 2.0, 3.0, 3.0] |
| [4.0, 5.0, 3.0]      |
| [6.0]                |
+----------------------+select array_slice(column1, -1, 2) from data3;
+-----------------------------------------------+
| array_slice(data3.column1,Int64(-1),Int64(2)) |
+-----------------------------------------------+
| []                                            |
| []                                            |
| [6.0]                                         |
+-----------------------------------------------+select array_slice(column1, -1, 2, 1) from data3;

thread 'main' panicked at /Users/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-51.0.0/src/transform/primitive.rs:31:43:
range end index 9 out of range for slice of length 8

Now to this issue, I agree that I like the explicit arguments versus passing a Vec, and I would hope the solution doesn't require keeping two separate UDFs.

I created a draft with what I would think is the natural rust solution in #10450. I don't know if this violates some datafusion constraint, but all the tests pass.

#[doc = "returns a slice of the array."]
pub fn array_slice(array: Expr, begin: Expr, end: Expr, stride: Option<Expr>) -> Expr {
    if let Some(stride) = stride {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end, stride],
        ))
    } else {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end],
        ))
    }
}

If that is not possible, then a second proposal would be to amend array_slice_inner w/ some null check.

    let stride = if args_len == 4 & is_not_null(&args[3]) {
        Some(as_int64_array(&args[3])?)
    } else {
        None
    };

But again, this is just a nice-to-have if passing stride=1 worked the same as not providing it in the SQL API.

@Michael-J-Ward
Copy link
Contributor Author

I believe we encountered the same issue again with regex_replace, so I suspect this issue isn't constrained to array_slice and instead applies to any UDF that previously had an optional argument.

@Omega359
Copy link
Contributor

I recall hitting something like this with the substr/substring function. One would think they would be identical however they were not (since rust doesn't do variadic functions nor does it allow defaults for args ala scala or Kotlin). For the substr it was expanded to two functions that internally used a vec<> and handled the logic of using the optional 3rd argument there.

    #[doc = "substring from the `position` to the end"]
    pub fn substr(string: Expr, position: Expr) -> Expr {
        super::substr().call(vec![string, position])
    }

    #[doc = "substring from the `position` with `length` characters"]
    pub fn substring(string: Expr, position: Expr, length: Expr) -> Expr {
        super::substr().call(vec![string, position, length])
    }
    ```

@andygrove
Copy link
Member

I am +1 for this solution suggested by @Michael-J-Ward

#[doc = "returns a slice of the array."]
pub fn array_slice(array: Expr, begin: Expr, end: Expr, stride: Option<Expr>) -> Expr {
    if let Some(stride) = stride {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end, stride],
        ))
    } else {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end],
        ))
    }
}

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 11, 2024

I think there are two issues here, one is the panic issue in datafusion-cli and another one is the API design for Expr.

  1. array_slice panic
query ?
select array_slice([1, 2, 3], 1, 2, 1);
----
[1, 2]

query ?
select array_slice([1, 2, 3], -1, 2, 1);
----
PANIC!

The reason for panic in the second command is due to the bug in array_slice

if stride.is_none() && from > to {
// return empty array
offsets.push(offsets[row_index]);
continue;
}
let stride = stride.unwrap_or(1);

We should fix this.

Another issue is that given the current API constraint of Expr, we can not provide only 3 arguments.
There are many different approaches like giving vec, or introducing another function or your proposal

#[doc = "returns a slice of the array."]
pub fn array_slice(array: Expr, begin: Expr, end: Expr, stride: Option<Expr>) -> Expr {
    if let Some(stride) = stride {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end, stride],
        ))
    } else {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end],
        ))
    }
}

I think this is more a user experience problem, how should we design it is discussable.

Given the design here, we still need to provide 4 args, where you provide None to stride if you expect stride with 1.

TLDR:
The panic issue should be fixed not by redesigning the Expr API, but by fixing the internal array slice code.

@Michael-J-Ward
Copy link
Contributor Author

Agreed @jayzhan211, these are two separate issues.

The panic issue was filed separately as #10425

@andygrove
Copy link
Member

I think this is more a user experience problem, how should we design it is discussable.

It is more than a user experience issue. The current API is causing test failures in a downstream project (the DataFusion Python bindings) and this cannot be resolved without the API changes being discussed in this issue.

@Michael-J-Ward
Copy link
Contributor Author

Since the regexp_* UDFs have the same problem, I suspect that array_slice was just our first encounter w/ the underlying issue: any "inner" UDF implementation that behaves differently based on the number of arguments passed should expose that via its public API.

Solutions discussed:

  1. Expose multiple public apis for each variant, as @Omega359 pointed out that inner substr exposes public substr and substring.

  2. Change the public APIs to take Option<Expr>, as I demoed for array_slice in DRAFT: example fix to make stride optional in array_slice UDF #10450.

  3. Having the public API just take a Vec<Expr>

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 11, 2024

The current API is causing test failures in a downstream project (the DataFusion Python bindings) and this cannot be resolved without the API changes being discussed in this issue.

I believe the failure occurred because of the bug I mentioned. Can you provide another example where df-python still experiences a panic?

#[pyfunction]
#[pyo3(signature = (array, begin, end, stride = 1))]
fn array_slice(array: PyExpr, begin: PyExpr, end: PyExpr, stride: Option<i64>) -> PyExpr {
    let stride = ScalarValue::Int64(stride);
    let stride = Expr::Literal(stride);
    datafusion_functions_array::expr_fn::array_slice(array.into(), begin.into(), end.into(), stride)
        .into()
}

If it is because of the Expr design.
Given these 3, I would prefer the second one.

@jonahgao
Copy link
Member

I submitted a formal PR #10469 to address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants