Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Feature: pairwise edit distance for each string on a given nvstrings object #2803

Merged
merged 19 commits into from
Sep 28, 2019

Conversation

kayush2O6
Copy link
Contributor

This PR implements the edit_distance_matrix() function that computes pairwise distance on a list of strings(i.e. nvstrings object) and returns a distance matrix.
Recently, I was working on a strings related use-case and where I felt the need of this function...

Example:

import nvstrings, nvtext
s = nvstrings.to_device(["honda","hyundai","suzuki"])
n = nvtext.edit_distance_matrix(s)
print(n)
Out:    
[[0, 3, 6],
 [3, 0, 5],
 [6, 5, 0]]

cc @davidwendt

@kayush2O6 kayush2O6 requested review from a team as code owners September 14, 2019 03:25
cpp/custrings/text/edit_distance.cu Outdated Show resolved Hide resolved
cpp/custrings/text/edit_distance.cu Outdated Show resolved Hide resolved
python/nvstrings/nvtext.py Outdated Show resolved Hide resolved
cpp/include/nvstrings/NVText.h Outdated Show resolved Hide resolved
@kkraus14 kkraus14 added 2 - In Progress Currently a work in progress strings strings issues (C++ and Python) labels Sep 14, 2019
@codecov
Copy link

codecov bot commented Sep 15, 2019

Codecov Report

Merging #2803 into branch-0.10 will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.10    #2803   +/-   ##
============================================
  Coverage        86.51%   86.51%           
============================================
  Files               48       48           
  Lines             9013     9013           
============================================
  Hits              7798     7798           
  Misses            1215     1215

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b96073c...ff9eced. Read the comment docs.

@kayush2O6 kayush2O6 changed the title [WIP] Feature: pairwise edit distance for each string on a given nvstrings object [REVIEW] Feature: pairwise edit distance for each string on a given nvstrings object Sep 16, 2019
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job Ayush. I like how you minimized the compute buffer too.

@kayush2O6
Copy link
Contributor Author

Nice job Ayush. I like how you minimized the compute buffer too.

Thanks David!

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algorithm looks fine, just adjustments to match rest of cuDF style.

cpp/custrings/tests/test_text.cu Outdated Show resolved Hide resolved
cpp/custrings/text/edit_distance.cu Outdated Show resolved Hide resolved
cpp/custrings/text/edit_distance.cu Show resolved Hide resolved
cpp/custrings/text/edit_distance.cu Outdated Show resolved Hide resolved
cpp/custrings/text/edit_distance.cu Outdated Show resolved Hide resolved
cpp/custrings/text/edit_distance.cu Outdated Show resolved Hide resolved
cpp/include/nvstrings/NVText.h Outdated Show resolved Hide resolved
cpp/include/nvstrings/NVText.h Show resolved Hide resolved
cpp/include/nvstrings/NVText.h Show resolved Hide resolved
cpp/include/nvstrings/NVText.h Outdated Show resolved Hide resolved
@davidwendt
Copy link
Contributor

The strings API is being changed as part of the rewrite for cudf::column support. First PR is #2811.
Changes like this to the existing working code are following the APIs there for consistency.

Introducing broad cudf style changes into deprecating code may be a waste of time in my opinion. Regardless, I agree with most of these suggestions to make the API clearer until deprecation occurs.

@harrism
Copy link
Member

harrism commented Sep 18, 2019

If this code is being deprecated, why have the PR at all?

@davidwendt
Copy link
Contributor

If this code is being deprecated, why have the PR at all?

Because it will not be replaced probably until next release and we need to keep adding features and fixing defects to the existing working functions until the new code is in place.

@harrism
Copy link
Member

harrism commented Sep 18, 2019

I don't think anything I suggested is onerous. If Ayush wants to skip the naming changes that's fine, but all the API, documentation, CUDF_EXPECTS etc. changes are relevant and not tedious.

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, everything else looks good to me.

python/nvstrings/cpp/pytext.cpp Show resolved Hide resolved
python/nvstrings/cpp/pytext.cpp Show resolved Hide resolved
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python looks good

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Sep 28, 2019
@kkraus14 kkraus14 merged commit 7f32395 into rapidsai:branch-0.10 Sep 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants