GH-40061: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor - add option to cast NULL to NaN #40803

Merged


@AlenkaF AlenkaF commented Mar 26, 2024

Rationale for this change

The conversion from RecordBatch to Tensor already exists, but it does not support record batches that contain nulls (that is, batches with validity bitmaps). This PR adds an option to convert null values to NaN.

What changes are included in this PR?

This PR adds a null_to_nan option to RecordBatch::ToTensor so that null values are converted to NaN in the resulting Tensor. For example, this works:

>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])

but the default would raise:

>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
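
The behaviour shown in the two examples above can be sketched in plain Python. This is only an illustration of the semantics, not pyarrow's implementation. The helper names `columns_to_matrix` and `column_major_strides` are hypothetical, a built-in `TypeError` stands in for `ArrowTypeError`, and the column-major layout is inferred from the `strides: (8, 40)` output for a `(5, 2)` tensor of doubles.

```python
import math


def columns_to_matrix(columns, null_to_nan=False):
    """Illustrative stand-in for the conversion (hypothetical helper).

    `columns` is a list of equal-length Python lists in which None marks a
    null. Returns a list of rows, with every value promoted to float.
    """
    has_nulls = any(value is None for column in columns for value in column)
    if has_nulls and not null_to_nan:
        # Mirrors the default behaviour shown in the traceback above.
        raise TypeError(
            "Can only convert a RecordBatch with no nulls. "
            "Set null_to_nan to true to convert nulls to nan"
        )
    n_rows = len(columns[0]) if columns else 0
    return [
        [math.nan if column[i] is None else float(column[i]) for column in columns]
        for i in range(n_rows)
    ]


def column_major_strides(shape, itemsize):
    """Byte strides of a 2-D column-major layout (hypothetical helper)."""
    n_rows, _n_cols = shape
    # Next row: one element further. Next column: one whole column further.
    return (itemsize, itemsize * n_rows)


a = [1, 2, 3, 4, None]              # plays the role of the int32 column
b = [10.0, 20.0, 30.0, 40.0, None]  # plays the role of the float32 column

matrix = columns_to_matrix([a, b], null_to_nan=True)
strides = column_major_strides((5, 2), itemsize=8)  # 8 bytes per double
```

Under these assumptions the last row of `matrix` holds two NaN values and `strides` matches the `(8, 40)` reported for the tensor, while calling `columns_to_matrix([a, b])` without the flag raises.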

Are these changes tested?

Yes.

Are there any user-facing changes?

No.


⚠️ GitHub issue #40061 has been automatically assigned in GitHub to PR creator.


AlenkaF commented Mar 27, 2024

@bkietz this is a smaller chunk of code as a follow-up of the work done in #40060. Would be happy to get your 👀 on it. Thank you!

cc @jorisvandenbossche


@jorisvandenbossche jorisvandenbossche left a comment

Looks good!

cpp/src/arrow/record_batch.cc Outdated
cpp/src/arrow/record_batch.cc Outdated
cpp/src/arrow/record_batch.cc Outdated
cpp/src/arrow/record_batch.h
cpp/src/arrow/record_batch_test.cc Outdated
python/pyarrow/includes/libarrow.pxd Outdated
python/pyarrow/table.pxi Outdated
python/pyarrow/table.pxi Outdated
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 27, 2024
Co-authored-by: Joris Van den Bossche <[email protected]>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 27, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 27, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 27, 2024
cpp/src/arrow/record_batch.h Outdated
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Mar 27, 2024
Co-authored-by: Joris Van den Bossche <[email protected]>
@AlenkaF AlenkaF requested a review from bkietz March 27, 2024 12:40
@jorisvandenbossche jorisvandenbossche merged commit 96f686b into apache:main Mar 29, 2024
35 checks passed
@jorisvandenbossche

Thanks @AlenkaF !

@AlenkaF AlenkaF deleted the record-batch-to-tensor-null-to-nan branch March 29, 2024 07:44

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 96f686b.

There were 2 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…w Tensor - add option to cast NULL to NaN (apache#40803)

* GitHub Issue: apache#40061

Lead-authored-by: AlenkaF <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024