Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() #8006

Merged
merged 13 commits into from
Jun 30, 2021

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Apr 20, 2021

Adds a method to concatenate the lists in a nested list Series:

In [15]: s
Out[15]:
0    [[1, 2], [3, 4]]
dtype: list

In [16]: s.list.concat()
Out[16]:
0    [1, 2, 3, 4]
dtype: list

@shwina shwina added non-breaking Non-breaking change feature request New feature or request labels Apr 27, 2021
@@ -451,3 +457,56 @@ def sort_values(
sort_lists(self._column, ascending, na_position),
retain_index=not ignore_index,
)

def ravel(self) -> ParentType:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use some help here with naming/docstring here.

Unlike np.ravel, this function only removes one level of nesting from each row. What's a better name for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a better name for that?

unpack, unbox ? (along with a parameter n for how many levels of nesting to be removed? - but maybe this is something we can do in a future PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like unbox, and I also like the suggestion for an n parameter

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about flatten? That's what Spark uses: https://spark.apache.org/docs/latest/api/sql/index.html#flatten

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want the n parameter with flatten?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Spark flatten only flattens one level, whereas numpy flatten does similar to ravel where it flattens down to 1d: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html#numpy.ndarray.flatten

Any thoughts on what makes the most sense for us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with lists.concat() as that's really what we're doing (concatenating the inner lists). If we only have a single inner list, nesting is removed:

In [15]: s
Out[15]:
0    [[1, 2], [3, 4]]
dtype: list

In [16]: s.list.concat()
Out[16]:
0    [1, 2, 3, 4]
dtype: list
In [18]: s
Out[18]:
0    [[1, 2]]
dtype: list

In [19]: s.list.concat()
Out[19]:
0    [1, 2]
dtype: list

1 [6.0, nan, 7.0, 8.0, 9.0]
dtype: list

Null values at the top-level in each row are dropped:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this desirable? If not, what should our behaviour be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me... This may introduce ambiguity of

flatten([[1, 2, 3], None, [4, 5]]
flatten([[1, 2, 3], [], [4, 5]]

Even though, when unwrapping nested lists it sounds reasonable to assume both empty list and null item do not contribute to concrete elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I exposed a dropna param which will drop null values by default. If set to False, the corresponding row in the result is null.

(these are the options available at the libcudf level)

@shwina shwina marked this pull request as ready for review April 27, 2021 12:36
@shwina shwina requested a review from a team as a code owner April 27, 2021 12:36
@codecov
Copy link

codecov bot commented Apr 27, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@8aceab0). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8006   +/-   ##
===============================================
  Coverage                ?   10.62%           
===============================================
  Files                   ?      109           
  Lines                   ?    18635           
  Branches                ?        0           
===============================================
  Hits                    ?     1980           
  Misses                  ?    16655           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8aceab0...6ee35dc. Read the comment docs.

@shwina shwina changed the title Add list ravel Add list flatten May 3, 2021
@@ -11,6 +11,8 @@
if TYPE_CHECKING:
from cudf.core.column import ColumnBase

ParentType = Union["cudf.Series", "cudf.Index"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this become SingleColumnFrame from #8115 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm changing this in #8306.

Comment on lines 327 to 330
[[1, 2], [3, 4, 5]],
[[1, 2, None], [3, 4, 5]],
[[[1, 2], [3, 4]], [[5, 6, 7], [8, 9]]],
[[["a", "c", "de", None], None, ["fg"]], [["abc", "de"], None]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try a few empty list items in here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

1 [6.0, nan, 7.0, 8.0, 9.0]
dtype: list

Null values at the top-level in each row are dropped:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me... This may introduce ambiguity of

flatten([[1, 2, 3], None, [4, 5]]
flatten([[1, 2, 3], [], [4, 5]]

Even though, when unwrapping nested lists it sounds reasonable to assume both empty list and null item do not contribute to concrete elements.

@ttnghia
Copy link
Contributor

ttnghia commented May 5, 2021

Hi @shwina! FYI, I'm working on another list concatenation API (lists::concatenate_by_key), which concatenate lists together and may beneficial for your use case. In particular, given a pair of keys-lists columns, lists having the same key will be concatenated. For example:

keys = [0, 1, 0, 2, 0]
values = [{0, 1}, {2, 3, 4}, {5}, {}, {6, 7}]
r = lists::concatenate_by_key(keys, values)
r is now [{0, 1, 5, 6, 7}, {2, 3, 4}, {}]

I'm not sure if this can be applied to this PR, or more general list flattening usages. I imagine that if we want to flatten every N lists (concatenate every N contiguous lists into one list), then just generate the same key for every N indices and call lists::concatenate_by_key. Of course, concatenating different numbers of lists for each resulting list can be done in the same way.

@shwina
Copy link
Contributor Author

shwina commented May 5, 2021

@ttnghia That does sound useful, although there's the unnecessary overhead of allocating a list column of keys. Could libcudf include a less general API that simply concatenates all the lists in each row?

In any case, what is the behaviour with nulls, i.e., what if an index corresponds to a null element?

@ttnghia
Copy link
Contributor

ttnghia commented May 5, 2021

Yes, of course we can have that API. In strings we already have strings::concatenate_list_elements, so I can implement a similar API for lists---I'll do that.

For a null list element, there is an option to choose: either to ignore the null and continue concatenating the remaining lists, or nullify the entire result (concatenation involving a null list will result in a null list).

Comment on lines 492 to 493
if not isinstance(result_dtype, ListDtype):
return self._return_or_inplace(self._column)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to mention in the docstring that this API is designed to work on Lists of Lists and that if you only give it one level it doesn't return an integral type.

Or we could decide to allow it to return an integral type.

@ttnghia
Copy link
Contributor

ttnghia commented May 12, 2021

@shwina FYI, cudf PR is online (#8231) 😃
Update: It's merged.

@shwina shwina changed the title Add list flatten Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() May 26, 2021
python/cudf/cudf/core/column/lists.py Outdated Show resolved Hide resolved
@shwina shwina changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 22:33
@charlesbluca
Copy link
Member

rerun tests

@shwina
Copy link
Contributor Author

shwina commented Jun 30, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit df45976 into rapidsai:branch-21.08 Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants