Add Python bindings for `lists::concatenate_list_elements` and expose them as `.list.concat()` #8006

shwina · 2021-04-20T16:22:53Z

Adds a method to concatenate the lists in a nested list Series:

In [15]: s
Out[15]:
0    [[1, 2], [3, 4]]
dtype: list

In [16]: s.list.concat()
Out[16]:
0    [1, 2, 3, 4]
dtype: list

shwina · 2021-04-27T12:34:51Z

python/cudf/cudf/core/column/lists.py

@@ -451,3 +457,56 @@ def sort_values(
            sort_lists(self._column, ascending, na_position),
            retain_index=not ignore_index,
        )
+
+    def ravel(self) -> ParentType:


Could use some help with the naming and docstring here.

Unlike np.ravel, this function only removes one level of nesting from each row. What's a better name for that?

What's a better name for that?

unpack or unbox? (Along with a parameter n for how many levels of nesting to remove? But maybe that is something we can do in a future PR.)

I like unbox, and I also like the suggestion for an n parameter

What about flatten? That's what Spark uses: https://spark.apache.org/docs/latest/api/sql/index.html#flatten

Sounds good to me.

Do we still want the n parameter with flatten?

Looks like Spark's flatten only flattens one level, whereas numpy's flatten behaves like ravel and flattens all the way down to 1-D: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html#numpy.ndarray.flatten

Any thoughts on what makes the most sense for us?
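To make the two behaviours being compared concrete, here is a minimal pure-Python sketch. It only models the semantics on plain Python lists; it is not cuDF, Spark, or NumPy code, and the function names `flatten_one_level` and `flatten_fully` are hypothetical names chosen for illustration.

```python
def flatten_one_level(row):
    """Remove exactly one level of nesting (the Spark-style behaviour)."""
    return [item for inner in row for item in inner]


def flatten_fully(row):
    """Recursively flatten to a flat list (closer to numpy's flatten/ravel)."""
    out = []
    for item in row:
        if isinstance(item, list):
            out.extend(flatten_fully(item))
        else:
            out.append(item)
    return out


nested = [[[1, 2], [3, 4]], [[5, 6, 7], [8, 9]]]
print(flatten_one_level(nested))  # [[1, 2], [3, 4], [5, 6, 7], [8, 9]]
print(flatten_fully(nested))      # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

On input that is only two levels deep the two functions agree, which is why the difference only matters for deeper nesting.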

I went with list.concat() as that's really what we're doing (concatenating the inner lists). If we only have a single inner list, one level of nesting is removed:

In [15]: s
Out[15]:
0    [[1, 2], [3, 4]]
dtype: list

In [16]: s.list.concat()
Out[16]:
0    [1, 2, 3, 4]
dtype: list

In [18]: s
Out[18]:
0    [[1, 2]]
dtype: list

In [19]: s.list.concat()
Out[19]:
0    [1, 2]
dtype: list

shwina · 2021-04-27T12:35:12Z

python/cudf/cudf/core/column/lists.py

+        1    [6.0, nan, 7.0, 8.0, 9.0]
+        dtype: list
+
+        Null values at the top-level in each row are dropped:


Is this desirable? If not, what should our behaviour be?

Seems OK to me, although this may make the following two inputs indistinguishable in the output:

flatten([[1, 2, 3], None, [4, 5]])
flatten([[1, 2, 3], [], [4, 5]])

Even so, when unwrapping nested lists it seems reasonable to assume that neither an empty list nor a null item contributes any concrete elements.

I exposed a dropna param which will drop null values by default. If set to False, the corresponding row in the result is null.

(these are the options available at the libcudf level)
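The two null-handling options described above can be modelled in plain Python as follows. This is a sketch of the semantics only, using `None` for nulls; the helper name `concat_row` is hypothetical, and only the `dropna` parameter name comes from the comment above. It is not the cuDF or libcudf implementation.

```python
def concat_row(row, dropna=True):
    """Concatenate the inner lists of one row.

    dropna=True  -> null inner lists are skipped.
    dropna=False -> a row containing a null inner list becomes null.
    """
    if row is None:
        return None
    if any(inner is None for inner in row):
        if not dropna:
            return None
        row = [inner for inner in row if inner is not None]
    return [item for inner in row for item in inner]


with_null = [[1, 2, 3], None, [4, 5]]
with_empty = [[1, 2, 3], [], [4, 5]]

print(concat_row(with_null))                 # [1, 2, 3, 4, 5]
print(concat_row(with_null, dropna=False))   # None
print(concat_row(with_empty, dropna=False))  # [1, 2, 3, 4, 5]
```

In this model, null elements inside an inner list (for example `[1, 2, None]`) are left in place; only null inner lists at the top level of a row are affected by `dropna`.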

codecov · 2021-04-27T15:17:55Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@8aceab0).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8006   +/-   ##
===============================================
  Coverage                ?   10.62%           
===============================================
  Files                   ?      109           
  Lines                   ?    18635           
  Branches                ?        0           
===============================================
  Hits                    ?     1980           
  Misses                  ?    16655           
  Partials                ?        0

python/cudf/cudf/core/column/lists.py

isVoid · 2021-05-04T02:46:27Z

python/cudf/cudf/core/column/methods.py

@@ -11,6 +11,8 @@
 if TYPE_CHECKING:
    from cudf.core.column import ColumnBase

+ParentType = Union["cudf.Series", "cudf.Index"]


Can this become SingleColumnFrame from #8115 ?

I'm changing this in #8306.

isVoid · 2021-05-04T02:49:22Z

python/cudf/cudf/tests/test_list.py

+        [[1, 2], [3, 4, 5]],
+        [[1, 2, None], [3, 4, 5]],
+        [[[1, 2], [3, 4]], [[5, 6, 7], [8, 9]]],
+        [[["a", "c", "de", None], None, ["fg"]], [["abc", "de"], None]],


Try a few empty list items in here as well?

isVoid · 2021-05-04T02:54:23Z

python/cudf/cudf/core/column/lists.py

+        1    [6.0, nan, 7.0, 8.0, 9.0]
+        dtype: list
+
+        Null values at the top-level in each row are dropped:


Seems OK to me, although this may make the following two inputs indistinguishable in the output:

flatten([[1, 2, 3], None, [4, 5]])
flatten([[1, 2, 3], [], [4, 5]])

Even so, when unwrapping nested lists it seems reasonable to assume that neither an empty list nor a null item contributes any concrete elements.

ttnghia · 2021-05-05T17:02:54Z

Hi @shwina! FYI, I'm working on another list concatenation API (lists::concatenate_by_key), which concatenates lists together and may be beneficial for your use case. In particular, given a pair of keys and lists columns, lists that share the same key are concatenated. For example:

keys = [0, 1, 0, 2, 0]
values = [{0, 1}, {2, 3, 4}, {5}, {}, {6, 7}]
r = lists::concatenate_by_key(keys, values)
r is now [{0, 1, 5, 6, 7}, {2, 3, 4}, {}]

I'm not sure if this can be applied to this PR, or more general list flattening usages. I imagine that if we want to flatten every N lists (concatenate every N contiguous lists into one list), then just generate the same key for every N indices and call lists::concatenate_by_key. Of course, concatenating different numbers of lists for each resulting list can be done in the same way.
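The grouping idea above can be sketched in pure Python. This models only the described semantics on plain lists; it is not the libcudf API, and returning the groups in sorted key order is an assumption chosen to match the example. The helper `concat_every_n` is a hypothetical name for the "same key for every N indices" trick.

```python
def concatenate_by_key(keys, values):
    """Concatenate all lists that share a key (pure-Python model)."""
    grouped = {}
    for key, lst in zip(keys, values):
        grouped.setdefault(key, []).extend(lst)
    # Assumption: one output list per distinct key, in sorted key order.
    return [grouped[k] for k in sorted(grouped)]


def concat_every_n(values, n):
    """Concatenate every n contiguous lists by giving them the same key."""
    keys = [i // n for i in range(len(values))]
    return concatenate_by_key(keys, values)


keys = [0, 1, 0, 2, 0]
values = [[0, 1], [2, 3, 4], [5], [], [6, 7]]

print(concatenate_by_key(keys, values))  # [[0, 1, 5, 6, 7], [2, 3, 4], []]
print(concat_every_n(values, 2))         # [[0, 1, 2, 3, 4], [5], [6, 7]]
```

Note that this approach needs a keys sequence as long as the input, which is the allocation overhead raised in the reply below.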

shwina · 2021-05-05T18:16:18Z

@ttnghia That does sound useful, although there's the unnecessary overhead of allocating a list column of keys. Could libcudf include a less general API that simply concatenates all the lists in each row?

In any case, what is the behaviour with nulls, i.e., what if an index corresponds to a null element?

ttnghia · 2021-05-05T18:19:39Z

Yes, of course we can have that API. In strings we already have strings::concatenate_list_elements, so I can implement a similar API for lists. I'll do that.

For a null list element, there are two options: either ignore the null and continue concatenating the remaining lists, or nullify the entire result (any concatenation involving a null list results in a null list).

kkraus14 · 2021-05-05T19:47:55Z

python/cudf/cudf/core/column/lists.py

+        if not isinstance(result_dtype, ListDtype):
+            return self._return_or_inplace(self._column)


May want to mention in the docstring that this API is designed to work on lists of lists, and that if you give it only one level of nesting it doesn't return an integral type.

Or we could decide to allow it to return an integral type.

ttnghia · 2021-05-12T22:59:33Z

@shwina FYI, cudf PR is online (#8231) 😃
Update: It's merged.

charlesbluca · 2021-06-30T19:14:33Z

rerun tests

shwina · 2021-06-30T22:00:45Z

@gpucibot merge

Add initial list ravel

1f188e1

github-actions bot added the Python Affects Python cuDF API. label Apr 20, 2021

This was referenced Apr 20, 2021

[BUG] Groupby collect list fails with Dask #7812

Closed

Add collect list to dask-cudf groupby aggregations #8045

Merged

shwina added 2 commits April 26, 2021 17:00

Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into a…

d9edac7

…dd-list-ravel

Add typing for list methods. Ravel tests.

ddf2907

shwina added non-breaking Non-breaking change feature request New feature or request labels Apr 27, 2021

shwina requested a review from skirui-source April 27, 2021 12:33

shwina commented Apr 27, 2021

shwina marked this pull request as ready for review April 27, 2021 12:36

shwina requested a review from a team as a code owner April 27, 2021 12:36

shwina requested review from galipremsagar and isVoid April 27, 2021 12:36

shwina changed the title ~~Add list ravel~~ Add list flatten May 3, 2021

Rename ravel -> flatten

dfd1719

kkraus14 reviewed May 3, 2021

Don't access children by index

f3ddaba

isVoid reviewed May 4, 2021

ttnghia mentioned this pull request May 5, 2021

[FEA] Support concatenate_list_elements for list type #8164

Closed

kkraus14 reviewed May 5, 2021

ttnghia mentioned this pull request May 12, 2021

Implement lists::concatenate_list_elements #8231

Merged

Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into a…

400e204

…dd-list-ravel

galipremsagar mentioned this pull request May 26, 2021

Bump dask versions rapidsai/integration#281

Merged

shwina added 2 commits May 26, 2021 15:43

Merge branch 'branch-21.06' of https://github.com/rapidsai/cudf into …

4c985e8

…add-list-ravel

Add bindings for lists::concatenate_list_elements

65cfa91

shwina changed the title ~~Add list flatten~~ Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() May 26, 2021

ttnghia approved these changes May 26, 2021

galipremsagar approved these changes May 26, 2021

shwina and others added 3 commits May 26, 2021 18:14

Fix test

48f6949

Test raise case

978807a

Update python/cudf/cudf/core/column/lists.py

cdb5f4b

Co-authored-by: Nghia Truong <[email protected]>

shwina changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 22:33

shwina added 2 commits June 22, 2021 13:57

Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …

c945166

…add-list-ravel

Merge branch 'add-list-ravel' of github.com:shwina/cudf into add-list…

6ee35dc

…-ravel

charlesbluca approved these changes Jun 30, 2021

rapids-bot bot merged commit df45976 into rapidsai:branch-21.08 Jun 30, 2021
