Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Implement cumulative sum compute function #29183

Closed
asfimport opened this issue Aug 2, 2021 · 9 comments
Closed

[C++] Implement cumulative sum compute function #29183

asfimport opened this issue Aug 2, 2021 · 9 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Aug 2, 2021

Reporter: Antoine Pitrou / @pitrou
Assignee: Jabari Booker / @JabariBooker

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13530. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
cc @ianmcook

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
The cumulative sum API for other libraries/languages are:

  • pandas: cumsum(values, skipna=True) - By default skips nulls/NaNs while the sum progresses

  • R: cumsum(values) - Always propagates null/NaN after the first one is encountered

  • numpy: cumsum(values) - Always propagates null/NaN after the first one is encountered

    Based on the combinations above, I think Arrow's cumulative sum should provide an option to handle nulls/NaNs (skip_nulls). What should be the default behavior: pandas or R/numpy?

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Also, if we plan on adding other cumulative functions, such as cumprod, cummin, cummax (no JIRAs for these), then we should only need generic code that can iterate through data and accumulate while using the existing sum/min/max/multiply kernels to execute the actual calculations.

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Different cumsum implementations have different behaviors wrt to nested inputs.

  1. Which one should Arrow's cumsum follow?
  2. Should a ChunkedArray input output a ChunkedArray of same shape or output a flattened Array?
  • Pandas
    {code:python}
  1. Performs a "cumulative concatenate" operation for all nested lists

    d = pd.Series([[1,2],[3,4]])
    d.cumsum()
    0 [1, 2]
    1 [1, 2, 3, 4]
    dtype: object

    d = pd.Series([[1,2,[3,4]],[[5,6],7],[8,9]])
    d.cumsum(axis=0)
    0 [1, 2, [3, 4]]
    1 [1, 2, [3, 4], [5, 6], 7]
    2 [1, 2, [3, 4], [5, 6], 7, 8, 9]
    dtype: object
    {code}

  • Numpy
    {code:python}
  1. Case 1: If all of the array elements have the same nested depth, then it flattens the array and applies cumsum.

d = np.array([[1,2],[3,4]])
d.cumsum()
array([ 1, 3, 6, 10])

  1. Case 2: If array contains different nested depths, then cumsum represents a "cumulative concatenate" operation

d = np.array([[1,2,[3,4]],[[5,6],7],[8,9]])
d.cumsum()
array([list([1, 2, [3, 4]]), list([1, 2, [3, 4], [5, 6], 7]),
list([1, 2, [3, 4], [5, 6], 7, 8, 9])], dtype=object)
{code}

  • R
  1. {code:r}

  2. Flattens the nested list, then applies a cumsum

    cumsum(c(c(1,2),c(3,4)))
    [1] 1 3 6 10

    cumsum(c(c(1,2,c(3,4)),c(c(5,6),7),c(8,9)))
    [1] 1 3 6 10 15 21 28 36 45
    {code}

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
How about option 3) Report as an invalid operation.

We don't really support nested<->nested arithmetic elsewhere do we?


>>> a = pa.array([[1, 2, 3], [4, 5, 6]])
>>> b = pa.scalar([1, 1, 1])
>>> c = pa.scalar(5)
>>> pc.add(a, b)
# pyarrow.lib.ArrowNotImplementedError: Function 'add' has no kernel matching input types (array[list<item: int64>], scalar[list<item: int64>])
>>> pc.add(a, a)
# pyarrow.lib.ArrowNotImplementedError: Function 'add' has no kernel matching input types (array[list<item: int64>], array[list<item: int64>])
>>> c = pa.scalar(5)
pyarrow.lib.ArrowNotImplementedError: Function 'add' has no kernel matching input types (array[list<item: int64>], scalar[int64])

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Correct, arithmetic functions do not work on nested types, only scalar, arrays, and chunked arrays.

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Here is the current behavior of {}cumulative_sum{}:

# Valid values
>>> pc.cumulative_sum([1,2,3,4,5])
<pyarrow.lib.Int64Array object at 0x12622c7c0>
[
  1,
  3,
  6,
  10,
  15
]

# Nulls and values
>>> pc.cumulative_sum([1,2,None,3,None,5], skip_nulls=True)
<pyarrow.lib.Int64Array object at 0x12622c760>
[
  1,
  3,
  null,
  6,
  null,
  11
]
>>> pc.cumulative_sum([1,2,None,3,None,5], skip_nulls=False)
<pyarrow.lib.Int64Array object at 0x12622c700>
[
  1,
  3,
  null,
  null,
  null,
  null
]

# NaN followed by nulls and values
>>> pc.cumulative_sum([1,np.nan,None,3,None,5], skip_nulls=True)
<pyarrow.lib.DoubleArray object at 0x12622c640>
[
  1,
  nan,
  null,
  nan,
  null,
  nan
]
>>> pc.cumulative_sum([1,np.nan,None,3,None,5], skip_nulls=False)
<pyarrow.lib.DoubleArray object at 0x12622c700>
[
  1,
  nan,
  null,
  null,
  null,
  null
]
 

Behavior of cumsum with nulls and NaNs:

  • Numpy does not supports None values and behaves the same with NaNs
  • Pandas converts None to NaN, and behaves the same with NaNs. Has option to toggle NaN propagation.
  • R propagates both NA and NaNs

@asfimport
Copy link
Collaborator Author

Krisztian Szucs / @kszucs:
Postponing to 9.0.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 12460
#12460

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant