Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement string list concatenation #7929

Merged
merged 58 commits into from
Apr 26, 2021
Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Apr 9, 2021

Given a lists column of strings (each row is a list of strings), this PR facilitates the concatenation of strings within each list.

For example:

s = [ {'aa', 'bb', 'cc'}, null, {null, 'dd'}, {'ee', 'ff'} ]
r = strings::concatenate_list_elements(s, '+++') 
r is ['aa+++bb+++cc', null, null, 'ee+++ff']

This PR is similar to Spark's concat_ws, and closes #7727.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Apr 9, 2021
@harrism
Copy link
Member

harrism commented Apr 21, 2021

Given a lists column of strings (each row is a list of strings), this PR facilitates the concatenation of strings within each list.

For example:

s = [ {'aa', 'bb', 'cc'}, null, {null, 'dd'}, {'ee', 'ff'} ]
concatenate_rows(s, "+++") = ['aa+++bb+++cc', null, null, 'ee+++ff']

This PR also partially supports Spark's concat_ws (closes #7727).

I think the name concatenate_rows is wrong for this API behavior. The behavior demonstrated above is concatenating the strings within each row into a single string. It is not concatenating multiple rows of a single column into one row, or concatenating corresponding rows of multiple columns into one row of an output column. Either of those might be called concatenate_rows.

I would call this concatenate_strings or concatenate_strings_list or something.

@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 22, 2021

I think the name concatenate_rows is wrong for this API behavior. The behavior demonstrated above is concatenating the strings within each row into a single string. It is not concatenating multiple rows of a single column into one row, or concatenating corresponding rows of multiple columns into one row of an output column. Either of those might be called concatenate_rows.

I would call this concatenate_strings or concatenate_strings_list or something.

Thanks Mark. The name concatenate_strings_list sounds good to me. Jake has also suggested the name concatenate_list_elements and I have adopted it (sorry that I didn't update the API name in the example).

So, now we have one more name suggestion. For me, concatenate_strings_list would sound better. Unless somebody objects to it, I'm going to change the API name to it.

@harrism
Copy link
Member

harrism commented Apr 22, 2021

concatenate_list_elements works for me and I like it better than my suggestion. However, will there ever be a type stored in a list other than strings that can be concatenated?

@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 22, 2021

concatenate_list_elements works for me and I like it better than my suggestion. However, will there ever be a type stored in a list other than strings that can be concatenated?

Here the API is part of the strings namespace and it only works for strings type.

However, it seems that concatenate_list_elements is a good fit (so I will keep it), because in the future we may have other concatenate_list_elements APIs in other namespaces, such as an API for concatenating lists within a list (something that is similar to list unravel here https://nvidia.slack.com/archives/CB9GTQZAQ/p1618926053144000).

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good.

cpp/src/strings/combine/concatenate_list_elements.cu Outdated Show resolved Hide resolved
cpp/src/strings/utilities.cuh Show resolved Hide resolved
cpp/src/strings/combine/concatenate_list_elements.cu Outdated Show resolved Hide resolved
…rent range than the number of output strings
@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 26, 2021

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] concatenate array of strings
5 participants