[C++] Implement cumulative sum compute function #29183

asfimport · 2021-08-02T15:48:40Z

Reporter: Antoine Pitrou / @pitrou
Assignee: Jabari Booker / @JabariBooker

Related issues:

[C++] Implement cumulative product, max, and min compute functions (relates to)

PRs and other links:

GitHub Pull Request #12460

Note: This issue was originally created as ARROW-13530. Please see the migration documentation for further details.

asfimport · 2021-08-02T15:49:03Z

Antoine Pitrou / @pitrou:
cc @ianmcook

asfimport · 2022-03-17T20:09:02Z

Eduardo Ponce / @edponce:
The cumulative sum APIs in other libraries/languages are:

pandas: cumsum(values, skipna=True) - By default skips nulls/NaNs while the sum progresses
R: cumsum(values) - Always propagates null/NaN after the first one is encountered
numpy: cumsum(values) - Always propagates null/NaN after the first one is encountered

Based on the combinations above, I think Arrow's cumulative sum should provide an option to handle nulls/NaNs (skip_nulls). What should be the default behavior: pandas or R/numpy?

asfimport · 2022-03-17T20:37:09Z

Eduardo Ponce / @edponce:
Also, if we plan on adding other cumulative functions, such as cumprod, cummin, cummax (no JIRAs for these), then we should only need generic code that can iterate through data and accumulate while using the existing sum/min/max/multiply kernels to execute the actual calculations.
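
As a rough illustration of the "generic code plus existing binary operations" idea, here is a minimal pure-Python sketch. The names `cumulative`, `cumulative_sum`, `cumulative_prod`, `cumulative_min` and `cumulative_max` are hypothetical helpers made up for this example; they are not Arrow kernels, and null handling is deliberately left out.

```python
import operator
from itertools import accumulate


def cumulative(values, op):
    """Running reduction of `values` under the binary operation `op`.

    Hypothetical helper: the iteration logic is written once and the
    actual arithmetic is delegated to `op`.
    """
    return list(accumulate(values, op))


# Each cumulative function only differs in the binary operation it plugs in.
def cumulative_sum(values):
    return cumulative(values, operator.add)


def cumulative_prod(values):
    return cumulative(values, operator.mul)


def cumulative_min(values):
    return cumulative(values, min)


def cumulative_max(values):
    return cumulative(values, max)
```

With this shape, adding another cumulative function would only require supplying a different binary operation.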

asfimport · 2022-04-02T00:44:52Z

Eduardo Ponce / @edponce:
Different cumsum implementations behave differently with respect to nested inputs.

Which one should Arrow's cumsum follow?
Should a ChunkedArray input output a ChunkedArray of same shape or output a flattened Array?

Pandas
{code:python}

# Performs a "cumulative concatenate" operation for all nested lists

d = pd.Series([[1,2],[3,4]])
d.cumsum()
0 [1, 2]
1 [1, 2, 3, 4]
dtype: object

d = pd.Series([[1,2,[3,4]],[[5,6],7],[8,9]])
d.cumsum(axis=0)
0 [1, 2, [3, 4]]
1 [1, 2, [3, 4], [5, 6], 7]
2 [1, 2, [3, 4], [5, 6], 7, 8, 9]
dtype: object
{code}

Numpy
{code:python}

# Case 1: If all of the array elements have the same nested depth, then it flattens the array and applies cumsum.

d = np.array([[1,2],[3,4]])
d.cumsum()
array([ 1, 3, 6, 10])

# Case 2: If the array contains different nested depths, then cumsum represents a "cumulative concatenate" operation

d = np.array([[1,2,[3,4]],[[5,6],7],[8,9]])
d.cumsum()
array([list([1, 2, [3, 4]]), list([1, 2, [3, 4], [5, 6], 7]),
list([1, 2, [3, 4], [5, 6], 7, 8, 9])], dtype=object)
{code}

R

{code:r}
# Flattens the nested list, then applies a cumsum

cumsum(c(c(1,2),c(3,4)))
[1] 1 3 6 10

cumsum(c(c(1,2,c(3,4)),c(c(5,6),7),c(8,9)))
[1] 1 3 6 10 15 21 28 36 45
{code}
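
To make the ChunkedArray question concrete, here is a minimal pure-Python sketch of the two candidate output shapes. Chunks are modeled as plain Python lists, which is an assumption for illustration only; `cumulative_sum_chunked` and `cumulative_sum_flat` are hypothetical names, not pyarrow functions.

```python
def cumulative_sum_chunked(chunks):
    """Cumulative sum that keeps the input's chunk layout.

    The running total is carried across chunk boundaries, so the values
    are the same as for the flattened input; only the shape differs.
    """
    total = 0
    out = []
    for chunk in chunks:
        out_chunk = []
        for value in chunk:
            total += value
            out_chunk.append(total)
        out.append(out_chunk)
    return out


def cumulative_sum_flat(chunks):
    """Cumulative sum that returns a single flattened list."""
    return [v for chunk in cumulative_sum_chunked(chunks) for v in chunk]
```

For chunks `[[1, 2], [3, 4]]` the first variant gives `[[1, 3], [6, 10]]` and the second gives `[1, 3, 6, 10]`, the same values as the flattened NumPy and R results above.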

asfimport · 2022-04-02T01:24:57Z

Weston Pace / @westonpace:
How about option 3: report it as an invalid operation?

We don't really support nested<->nested arithmetic elsewhere, do we?


>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> a = pa.array([[1, 2, 3], [4, 5, 6]])
>>> b = pa.scalar([1, 1, 1])
>>> c = pa.scalar(5)
>>> pc.add(a, b)
# pyarrow.lib.ArrowNotImplementedError: Function 'add' has no kernel matching input types (array[list<item: int64>], scalar[list<item: int64>])
>>> pc.add(a, a)
# pyarrow.lib.ArrowNotImplementedError: Function 'add' has no kernel matching input types (array[list<item: int64>], array[list<item: int64>])
>>> pc.add(a, c)
# pyarrow.lib.ArrowNotImplementedError: Function 'add' has no kernel matching input types (array[list<item: int64>], scalar[int64])

asfimport · 2022-04-02T02:05:11Z

Eduardo Ponce / @edponce:
Correct, arithmetic functions do not work on nested types, only on scalars, arrays, and chunked arrays.

asfimport · 2022-04-04T15:44:20Z

Eduardo Ponce / @edponce:
Here is the current behavior of cumulative_sum:

# Valid values
>>> pc.cumulative_sum([1,2,3,4,5])
<pyarrow.lib.Int64Array object at 0x12622c7c0>
[
  1,
  3,
  6,
  10,
  15
]

# Nulls and values
>>> pc.cumulative_sum([1,2,None,3,None,5], skip_nulls=True)
<pyarrow.lib.Int64Array object at 0x12622c760>
[
  1,
  3,
  null,
  6,
  null,
  11
]
>>> pc.cumulative_sum([1,2,None,3,None,5], skip_nulls=False)
<pyarrow.lib.Int64Array object at 0x12622c700>
[
  1,
  3,
  null,
  null,
  null,
  null
]

# NaN followed by nulls and values
>>> pc.cumulative_sum([1,np.nan,None,3,None,5], skip_nulls=True)
<pyarrow.lib.DoubleArray object at 0x12622c640>
[
  1,
  nan,
  null,
  nan,
  null,
  nan
]
>>> pc.cumulative_sum([1,np.nan,None,3,None,5], skip_nulls=False)
<pyarrow.lib.DoubleArray object at 0x12622c700>
[
  1,
  nan,
  null,
  null,
  null,
  null
]

Behavior of cumsum with nulls and NaNs:

NumPy does not support None values; NaNs propagate through its cumsum.
Pandas converts None to NaN and then treats it like any other NaN. It has an option (skipna) to toggle NaN propagation.
R propagates both NA and NaNs
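
The null and NaN outputs shown above can be reproduced with a small pure-Python model. This is a sketch of the observed semantics only, not the C++ kernel: `cumulative_sum_model` is a hypothetical name, `None` stands in for an Arrow null, and NaN is treated as an ordinary floating-point value.

```python
def cumulative_sum_model(values, skip_nulls=False):
    """Model of the cumulative_sum outputs listed above.

    - A null input always yields a null output at that position.
    - With skip_nulls=True the running total continues past nulls.
    - With skip_nulls=False every output after the first null is null.
    - NaN is a value, not a null, so it poisons the running total
      through ordinary floating-point addition.
    """
    total = 0
    seen_null = False
    out = []
    for value in values:
        if value is None:
            seen_null = True
            out.append(None)
        elif seen_null and not skip_nulls:
            out.append(None)
        else:
            total += value
            out.append(total)
    return out
```

For example, `cumulative_sum_model([1, 2, None, 3, None, 5], skip_nulls=True)` gives `[1, 3, None, 6, None, 11]`, and the same input with `skip_nulls=False` gives `[1, 3, None, None, None, None]`.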

asfimport · 2022-04-21T18:54:57Z

Krisztian Szucs / @kszucs:
Postponing to 9.0.

asfimport · 2022-05-31T13:23:20Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request 12460
#12460

asfimport closed this as completed May 31, 2022

asfimport mentioned this issue Jan 11, 2023

[C++] Implement cumulative product, max, and min compute functions #32190

Closed
