Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Change Scalar::CastTo to do safe cast by default and allow to specify cast options ? #35040

Closed
jorisvandenbossche opened this issue Apr 11, 2023 · 6 comments · Fixed by #35395

Comments

@jorisvandenbossche
Copy link
Member

Describe the enhancement requested

See #34901 for a longer discussion, but summarizing: the pyarrow.Scalar object has a cast() method, but in contrast with other cast methods in pyarrow it does an unsafe cast by default. We should probably change this to do a safe cast by default, and at the same time also allow to specify CastOptions (so a user can still choose to do an unsafe cast).

Example:

# scalar behaviour
>>> pa.scalar(1.5)
<pyarrow.DoubleScalar: 1.5>
>>> pa.scalar(1.5).cast(pa.int64())
<pyarrow.Int64Scalar: 1>

# vs array behaviour
>>> pa.array([1.5]).cast(pa.int64())
...
ArrowInvalid: Float value 1.5 was truncated converting to int64

The python cast() method calls the C++ Scalar::ToCast:

// TODO(bkietz) add compute::CastOptions
Result<std::shared_ptr<Scalar>> CastTo(std::shared_ptr<DataType> to) const;

which currently indeed doesn't have the option to pass CastOptions.

In addition, it seems that for casting Scalars, we do have a somewhat custom implementation, and this doesn't use the generic Cast implementation (from the compute kernels), but has a custom CastImpl in scalar.cc. Not fully sure about the reason for this, but maybe historically we wanted to have scalar casting without relying on the optional compute module? (cfr #25025)

Component(s)

C++

@westonpace
Copy link
Member

I think this is a good idea. The result will probably be a slight bit slower but working with scalars is already slightly slower anyways and it would be nice to have only a single implementation.

@westonpace
Copy link
Member

Also, we've since extracted the cast kernels from the rest of the kernels and declared them as "essential kernels" that will always be present (even if ARROW_COMPUTE is OFF) so I think it is safe to rely on them.

@danepitkin
Copy link
Member

FYI I tried swapping the Scalar.cast() implementation to the compute kernel and current (but small) test suite passed while also fixing #35370.

...
    def cast(self, object target_type=None, safe=None, options=None, memory_pool=None):
        return _pc().cast(self, target_type, safe=safe,
                          options=options, memory_pool=memory_pool)
...

danepitkin added a commit to danepitkin/arrow that referenced this issue May 2, 2023
danepitkin added a commit to danepitkin/arrow that referenced this issue May 10, 2023
danepitkin added a commit to danepitkin/arrow that referenced this issue May 10, 2023
danepitkin added a commit to danepitkin/arrow that referenced this issue May 11, 2023
danepitkin added a commit to danepitkin/arrow that referenced this issue May 11, 2023
AlenkaF pushed a commit that referenced this issue May 11, 2023
)

### Rationale for this change

Scalar cast should use the computer kernel just like Arrays, instead of its own custom implementation.

### Are these changes tested?

Added test cases for GH-35370, GH-34901, and GH-35040

### Are there any user-facing changes?

The Scalar.cast() API is enhanced and backwards compatible. 
* Closes: #35040

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
@AlenkaF
Copy link
Member

AlenkaF commented May 11, 2023

Casting of scalar now uses the compute kernel in pyarrow: #35395

Do we want to keep this issue open to also make changes to the Scalar::CastTo in the C++? Or maybe linking the PR for pyarrow with a different issue? @jorisvandenbossche

@danepitkin
Copy link
Member

danepitkin commented May 11, 2023

I created #35560 to track the issue!

@AlenkaF
Copy link
Member

AlenkaF commented May 11, 2023

Great Dane, thanks!

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
apache#35395)

### Rationale for this change

Scalar cast should use the computer kernel just like Arrays, instead of its own custom implementation.

### Are these changes tested?

Added test cases for apacheGH-35370, apacheGH-34901, and apacheGH-35040

### Are there any user-facing changes?

The Scalar.cast() API is enhanced and backwards compatible. 
* Closes: apache#35040

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
apache#35395)

### Rationale for this change

Scalar cast should use the computer kernel just like Arrays, instead of its own custom implementation.

### Are these changes tested?

Added test cases for apacheGH-35370, apacheGH-34901, and apacheGH-35040

### Are there any user-facing changes?

The Scalar.cast() API is enhanced and backwards compatible. 
* Closes: apache#35040

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 24, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jun 6, 2023
jorisvandenbossche added a commit that referenced this issue Jun 6, 2023
…use it requires tz database (#35735)

### Rationale for this change

Fix up of #35395, skipping one of the tests added in that PR on Windows, because the test requires access to a tz database.

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants