[FEA] Add API for generating per-list sequences #9424

Closed
wbo4958 opened this issue Oct 13, 2021 · 5 comments · Fixed by #9839 or #9972
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@wbo4958
Contributor

wbo4958 commented Oct 13, 2021

Is your feature request related to a problem? Please describe.
Could we support sequence on columns?

input

| start | end | step |
|-------|-----|------|
| 0     | 5   | 1    |
| -2    | 2   | 2    |
| -3    | 3   | 1    |

output

| start | end | step | sequence(start, end, step) |
|-------|-----|------|----------------------------|
| 0     | 5   | 1    | [0, 1, 2, 3, 4, 5]         |
| -2    | 2   | 2    | [-2, 0, 2]                 |
| -3    | 3   | 1    | [-3, -2, -1, 0, 1, 2, 3]   |

@jrhemstad
Contributor

/**
 * @brief Fills a column with a sequence of values specified by an initial value and a step.
 *
 * Creates a new column and fills with @p size values starting at @p init and
 * incrementing by @p step, generating the sequence
 * [ init, init+step, init+2*step, ... init + (size - 1)*step]
 *
 * ```
 * size = 3
 * init = 0
 * step = 2
 * return = [0, 2, 4]
 * ```
 * @throws cudf::logic_error if @p init and @p step are not the same type.
 * @throws cudf::logic_error if scalar types are not numeric.
 * @throws cudf::logic_error if @p size is < 0.
 *
 * @param size Size of the output column
 * @param init First value in the sequence
 * @param step Increment value
 * @param mr Device memory resource used to allocate the returned column's device memory
 * @return std::unique_ptr<column> The result column containing the sequence
 */
std::unique_ptr<column> sequence(
  size_type size,
  scalar const& init,
  scalar const& step,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
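
For readers who want to check the documented semantics without a GPU, here is a minimal host-side sketch of the same arithmetic. It is not libcudf code: `host_sequence` is a hypothetical name, it works on `std::vector` rather than a device column, and it throws `std::invalid_argument` where the real API is documented to throw `cudf::logic_error`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Host-side reference for the documented semantics only (hypothetical helper,
// not part of libcudf).
template <typename T>
std::vector<T> host_sequence(int size, T init, T step)
{
  if (size < 0) { throw std::invalid_argument("size must be >= 0"); }
  std::vector<T> out;
  out.reserve(static_cast<std::size_t>(size));
  for (int i = 0; i < size; ++i) {
    // Element i is init + i * step, i.e. [init, init+step, ..., init+(size-1)*step].
    out.push_back(init + static_cast<T>(i) * step);
  }
  return out;
}
```

With `size = 3`, `init = 0`, `step = 2` this yields `[0, 2, 4]`, matching the example in the doc comment.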

@revans2
Contributor

revans2 commented Oct 13, 2021

I don't think this was very clear in the request. The desire is to support the Spark sequence function

https://spark.apache.org/docs/latest/api/sql/index.html#sequence

The desire would be to have an API like

std::unique_ptr<column> list_sequence(
    cudf::column_view start,
    cudf::column_view end,
    cudf::column_view step,
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

The returned value would be a list column.

If you want to keep the parameters the same as sequence (size, init, and step), that should be fine: I think we can compute size from start, end, and step. I am not 100% sure about timestamps and durations. I think we can do it if we support only durations (days and microseconds) and not months as part of the supported intervals.

A few details

  1. The output data type is a list of the start data type.
  2. If start/end are Timestamp types, step should be a duration.
  3. If start, end, or step is null, then the output row should be null.
  4. If start == end, there will be one entry in the resulting list.
  5. If step == 0, start must be == end (we can enforce the checks ahead of time).
  6. If step > 0, start must be <= end (again, we can do the checks ahead of time).
  7. If step < 0, start must be >= end (we can enforce this).
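
To make rules 3 through 7 concrete, here is a rough host-side sketch (C++17) of how a per-row size could be derived from start, end, and step. The function name `sequence_size` is hypothetical, `std::nullopt` stands in for a null row, and the choice to throw on invalid input is an assumption for illustration; integer overflow is ignored.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Hypothetical helper, not part of libcudf: derives the list size for one row
// from (start, end, step). std::nullopt models a null input or output row.
std::optional<std::int64_t> sequence_size(std::optional<std::int64_t> start,
                                          std::optional<std::int64_t> end,
                                          std::optional<std::int64_t> step)
{
  // Rule 3: any null input produces a null output row.
  if (!start || !end || !step) { return std::nullopt; }

  if (*step == 0) {
    // Rule 5: a zero step is only valid when start == end.
    if (*start != *end) { throw std::invalid_argument("step == 0 requires start == end"); }
    return 1;  // Rule 4: start == end gives a single entry.
  }
  // Rule 6: a positive step must move from start up to end.
  if (*step > 0 && *start > *end) { throw std::invalid_argument("step > 0 requires start <= end"); }
  // Rule 7: a negative step must move from start down to end.
  if (*step < 0 && *start < *end) { throw std::invalid_argument("step < 0 requires start >= end"); }

  // After the checks, (end - start) and step share a sign (or the difference is 0),
  // so truncating division counts the whole steps that fit; +1 includes start.
  return (*end - *start) / *step + 1;
}
```

For the rows in the original request this gives sizes 6, 3, and 7, which could then feed a (start, step, size) style API.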

I hope that this clarifies things enough for you @jrhemstad

@beckernick
Member

Perhaps #8886 is relevant for the timestamp sequence?

@jrhemstad
Contributor

jrhemstad commented Oct 13, 2021

Got it, I missed that the intent was the output would be a list column.

@revans2 your description was very helpful.

For consistency with our other sequence APIs, I think we'd want to do a start, step, n tuple for generating each list, where n is the number of things in the list.

Furthermore, I'm inclined to make the input a structs_column_view of the constituent column_views in your example API.

> If start/end are Timestamp types step should be a duration

I think we'd enforce that start/step need to be the same type, so a timestamp column would need to be first converted to duration.
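
As an illustration of the "convert the timestamp to a duration first" idea, here is a small host-side sketch using `std::chrono`. The aliases `days_t` and `timestamp_D`, and the helper `timestamp_sequence`, are hypothetical stand-ins chosen for this example; they only mimic day-resolution types and are not the libcudf definitions.

```cpp
#include <chrono>
#include <ratio>
#include <vector>

// Day-resolution duration and timestamp types (illustrative aliases only).
using days_t      = std::chrono::duration<int, std::ratio<86400>>;
using timestamp_D = std::chrono::time_point<std::chrono::system_clock, days_t>;

// Hypothetical helper: builds a timestamp sequence by doing all arithmetic on
// durations, so start and step end up with the same type.
std::vector<timestamp_D> timestamp_sequence(timestamp_D start, days_t step, int size)
{
  // Timestamp -> duration since the epoch; now init and step are both durations.
  days_t const init = start.time_since_epoch();

  std::vector<timestamp_D> out;
  for (int i = 0; i < size; ++i) {
    // Duration arithmetic, then convert back to a timestamp for the result.
    out.push_back(timestamp_D{init + step * i});
  }
  return out;
}
```

The point of the sketch is only that the sequence itself is generated over a single duration type; the conversion to and from timestamps happens at the edges.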

@revans2
Contributor

revans2 commented Oct 13, 2021

That sounds good to me. Months really mess things up, so I am okay with skipping them for now.

@jrhemstad jrhemstad changed the title from "[FEA] Sequence on columns" to "[FEA] Add API for generating per-list sequences" Oct 13, 2021
@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Oct 13, 2021
@ttnghia ttnghia self-assigned this Nov 2, 2021
@rapids-bot rapids-bot bot closed this as completed in #9839 Jan 4, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 4, 2022
This PR adds the `lists::sequences` API, which generates per-list sequences. In particular, it generates a lists column in which each list is a sequence of numbers or durations. Each sequence is generated individually from its own set of (start, step, size) input values.

Closes #9424.

Note: `lists::sequences` supports only numeric types (integer types + floating-point types) and duration types.
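
To picture the result of generating one sequence per (start, step, size) row, here is a host-side sketch that models a lists column as an offsets array plus a flat child array. It is an illustration under assumptions, not the libcudf implementation: `host_lists` and `host_list_sequences` are hypothetical names, and only `int` elements are handled.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a lists column: row i owns child[offsets[i] .. offsets[i+1]).
struct host_lists {
  std::vector<int> offsets;
  std::vector<int> child;
};

// Hypothetical helper, not the libcudf API: builds one sequence per input row.
// Assumes starts, steps, and sizes all have the same length.
host_lists host_list_sequences(std::vector<int> const& starts,
                               std::vector<int> const& steps,
                               std::vector<int> const& sizes)
{
  host_lists out;
  out.offsets.push_back(0);
  for (std::size_t row = 0; row < starts.size(); ++row) {
    for (int i = 0; i < sizes[row]; ++i) {
      // Element i of this row's list is start + i * step.
      out.child.push_back(starts[row] + i * steps[row]);
    }
    // The running child length marks where the next row's list begins.
    out.offsets.push_back(static_cast<int>(out.child.size()));
  }
  return out;
}
```

For the three rows in the original request (starts `0, -2, -3`, steps `1, 2, 1`, sizes `6, 3, 7`), the offsets come out as `[0, 6, 9, 16]` and the child holds the 16 sequence values back to back.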

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - https://github.com/nvdbaranec
  - Karthikeyan (https://github.com/karthikeyann)

URL: #9839
rapids-bot bot pushed a commit that referenced this issue Jan 5, 2022
This PR adds Java bindings for the sequences API and fixes #9424.

Authors:
  - Bobby Wang (https://github.com/wbo4958)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #9972