Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Implement cumulative product, max, and min compute functions #32190

Closed
asfimport opened this issue Jun 21, 2022 · 11 comments · Fixed by #36020
Closed

[C++] Implement cumulative product, max, and min compute functions #32190

asfimport opened this issue Jun 21, 2022 · 11 comments · Fixed by #36020

Comments

@asfimport
Copy link
Collaborator

asfimport commented Jun 21, 2022

Other libraries/languages (pandas, R, numpy) have cumprod, cummax and cummin functions that useful to add to Arrow, similar to the now existing cumulative sum function.

Reporter: Jabari Booker / @JabariBooker

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-16865. Please see the migration documentation for further details.

@asfimport

This comment was marked as outdated.

@frosk1
Copy link

frosk1 commented May 7, 2023

Hey, is this coming?

@ianmcook
Copy link
Member

ianmcook commented May 9, 2023

@frosk1 As far as I know there is no work underway currently on cumulative vector functions (beyond the cumulative sum vector function which was implemented in #12460). If you're interested to take a stab at implementing one of these in a PR, following the example of the cumulative sum implementation, we would be happy to review.

@js8544
Copy link
Collaborator

js8544 commented May 26, 2023

Hi @frosk1 are you working on this? I plan to implement cumprod, cummax and cummin if no one else is working on it.

@frosk1
Copy link

frosk1 commented May 26, 2023

Hey @js8544 no unfortunately I do not have time atm. Would be great if you can work on that one.

@js8544
Copy link
Collaborator

js8544 commented May 26, 2023

take

@js8544
Copy link
Collaborator

js8544 commented Jun 1, 2023

I noticed that the functions cumprod, cummin and cummax would require the same option as CumulativeSumOptions (min and max won't need to check_overflow but I plan to remove it in #35790).

I want to avoid having four different option types with identical structure, so I intend to refactor CumulativeSumOptions to CumulativeOptions and reuse it for all cumulative functions.

The current definition is

/// \brief Options for cumulative sum function
class ARROW_EXPORT CumulativeSumOptions : public FunctionOptions {
 public:
  explicit CumulativeSumOptions(double start = 0, bool skip_nulls = false,
                                bool check_overflow = false);
  explicit CumulativeSumOptions(std::shared_ptr<Scalar> start, bool skip_nulls = false,
                                bool check_overflow = false);
  static constexpr char const kTypeName[] = "CumulativeSumOptions";
  static CumulativeSumOptions Defaults() { return CumulativeSumOptions(); }

  /// Optional starting value for cumulative operation computation
  std::shared_ptr<Scalar> start;

  /// If true, nulls in the input are ignored and produce a corresponding null output.
  /// When false, the first null encountered is propagated through the remaining output.
  bool skip_nulls = false;

  /// When true, returns an Invalid Status when overflow is detected
  bool check_overflow = false;
};

I plan to do this:

/// \brief Options for cumulative functions
class ARROW_EXPORT CumulativeOptions : public FunctionOptions {
  // ... (same as above)
};

using CumulativeSumOptions = CumulativeOptions; // for backward compatibility

I checked that with this change all compute tests still passed, both in cpp and python. But I want to make sure there are no better ways to solve this. Do you consider this as an acceptable approach? cc @pitrou @westonpace

@pitrou
Copy link
Member

pitrou commented Jun 1, 2023

@js8544 This seems like a good approach to me.

@westonpace
Copy link
Member

I agree with this approach. We have done something very similar with aggregate functions and ScalarAggregateOptions:

/// \brief Control general scalar aggregate kernel behavior
///
/// By default, null values are ignored (skip_nulls = true).
class ARROW_EXPORT ScalarAggregateOptions : public FunctionOptions {
public:
explicit ScalarAggregateOptions(bool skip_nulls = true, uint32_t min_count = 1);
static constexpr char const kTypeName[] = "ScalarAggregateOptions";
static ScalarAggregateOptions Defaults() { return ScalarAggregateOptions{}; }
/// If true (the default), null values are ignored. Otherwise, if any value is null,
/// emit null.
bool skip_nulls;
/// If less than this many non-null values are observed, emit null.
uint32_t min_count;
};

@westonpace
Copy link
Member

CC @icexelloss / @rtpsw / @ildipo may be interested in these functions when available. They are classic window functions and so could be useful in the work that is being done to add window function support to Acero.

@js8544
Copy link
Collaborator

js8544 commented Jun 10, 2023

CC @icexelloss / @rtpsw / @ildipo may be interested in these functions when available. They are classic window functions and so could be useful in the work that is being done to add window function support to Acero.

Regarding window functions. I also implemented a pairwise_diff function (#35787) you guys may also want to take a look at. pairwise functions would corresponds to LAG/LEAD window functions in SQL.

I also plan to implement a series of rolling_* functions that computes over a fixed length sliding window soon, which corresponds to the window option ROWS BETWEEN n PRECEDING AND CURRENT ROW in sql.

js8544 added a commit to js8544/arrow that referenced this issue Jun 21, 2023
bkietz added a commit that referenced this issue Jun 22, 2023
…ions (#36020)

### Rationale for this change

Implement cumulative prod, max and min compute functions
### What changes are included in this PR?

1. Add implementations, docs and tests for the three functions.
2. Refactor `CumulativeSumOptions` to `CumulativeOptions` for reusability.
3. Fix a bug where `GenericFromScalar(GenericToScalar(std::nullopt))  != std::nullopt`.
4. Remove an unnecessary Cast with the default start value.
5. Add tests to check behavior with `NaN`.

I'll explain some of the changes in comments.

### Are these changes tested?

Yes, in vector_accumulative_ops_test.cc and test_compute.py

### Are there any user-facing changes?

No. The data members of `CumulativeSumOptions` are changed, but the member functions behave as before. And std::optional<T> also can be constructed directly from T. So users should not feel any difference.
* Closes: #32190

Lead-authored-by: Jin Shang <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 13.0.0 milestone Jun 22, 2023
AlenkaF added a commit that referenced this issue Aug 22, 2023
…for independent deprecation (#36977)

**Rationale for this change**
As #36240 says, we refactor CumulativeSumOptions to a separate class.

**What changes are included in this PR?**

- independent CumulativeSumOptions 
- the original simple test before #32190 
- fix a typo in CumulativeOptions 

**Are these changes tested?**
No.
Actually, the PR can't pass the `test_option_class_equality` in test_compute.py ([Error example](https://github.com/apache/arrow/actions/runs/5728571658/job/15523443371?pr=36977)). Cause CumulativeSumOptions's C++ part is also CumulativeOptions.
![image](https://github.com/apache/arrow/assets/18380073/0a173684-47f8-4eb9-b8f4-ba72aa5aab97)

**Are there any user-facing changes?**
No.

Closes: #36240
* Closes: #36240

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Dane Pitkin <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…class for independent deprecation (apache#36977)

**Rationale for this change**
As apache#36240 says, we refactor CumulativeSumOptions to a separate class.

**What changes are included in this PR?**

- independent CumulativeSumOptions 
- the original simple test before apache#32190 
- fix a typo in CumulativeOptions 

**Are these changes tested?**
No.
Actually, the PR can't pass the `test_option_class_equality` in test_compute.py ([Error example](https://github.com/apache/arrow/actions/runs/5728571658/job/15523443371?pr=36977)). Cause CumulativeSumOptions's C++ part is also CumulativeOptions.
![image](https://github.com/apache/arrow/assets/18380073/0a173684-47f8-4eb9-b8f4-ba72aa5aab97)

**Are there any user-facing changes?**
No.

Closes: apache#36240
* Closes: apache#36240

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Dane Pitkin <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants