[FEA] Create gbenchmarks for nvtext APIs #5696

Closed
5 tasks done
davidwendt opened this issue Jul 15, 2020 · 2 comments
Labels: feature request, libcudf, Performance, strings, tests

davidwendt commented Jul 15, 2020

Currently there is only one benchmark for the nvtext APIs.

Propose creating the following gbenchmarks:

This will help measure the performance impact of code changes.

davidwendt added the feature request, Needs Triage, libcudf, and strings labels on Jul 15, 2020
davidwendt changed the title from "Create gbenchmarks for nvtext APIs" to "[FEA] Create gbenchmarks for nvtext APIs" on Jul 15, 2020
harrism added the Performance, tech debt, and tests labels and removed the Needs Triage label on Jul 19, 2020
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@davidwendt davidwendt self-assigned this Mar 22, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 23, 2021
Reference #5696
Creates gbenchmarks for the `nvtext::normalize_spaces()` and `nvtext::normalize_characters()` functions.
The benchmarks measure various string lengths and numbers of rows.
I found that `normalize_spaces()` is used in haproxy parsing along with `extract`, so having this benchmark helps measure possible performance improvements there.
`normalize_characters()` is the same code used as part of the `subword_tokenizer`.

Since each requires a different memory footprint, my initial goal of having them share a common benchmark structure did not work out, so the 2 tests are in separate gbenchmark test files.

I refactored some of this code to use the more efficient `make_strings_children` and this improved the performance of `normalize_spaces` by 2-3x.

The current subword-tokenizer gbenchmark is also incorporated into the TEXT_BENCHMARK gbenchmark.
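
For readers unfamiliar with what is being measured, here is a minimal host-side sketch of the documented `normalize_spaces()` behavior (runs of whitespace collapse to a single space, and leading and trailing whitespace is removed) together with a timing helper that can be swept over row counts and string lengths. It is not the libcudf implementation and not the gbenchmark added in this commit. The names `normalize_spaces_host`, `make_rows`, `SweepResult`, and `time_normalize` are hypothetical, and plain `std::chrono` stands in for Google Benchmark so the example stays self-contained.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CPU stand-in for nvtext::normalize_spaces() on a single string:
// any byte <= ' ' is treated as whitespace, runs collapse to one space,
// and leading/trailing whitespace is dropped.
std::string normalize_spaces_host(const std::string& input) {
  std::string out;
  out.reserve(input.size());
  bool pending_space = false;
  for (unsigned char ch : input) {
    if (ch <= ' ') {
      // Only remember a space if some non-space output already exists.
      pending_space = !out.empty();
    } else {
      if (pending_space) out.push_back(' ');
      pending_space = false;
      out.push_back(static_cast<char>(ch));
    }
  }
  return out;  // a trailing pending space is never flushed
}

// Hypothetical input generator: `rows` identical strings of exactly `length` bytes.
std::vector<std::string> make_rows(std::size_t rows, std::size_t length) {
  std::string s;
  while (s.size() < length) s += "word  \t ";
  s.resize(length);
  return std::vector<std::string>(rows, s);
}

struct SweepResult {
  double seconds;            // wall-clock time for one pass over the data
  std::size_t output_bytes;  // total size of the normalized strings
};

// Times one pass of normalize_spaces_host over every row.
SweepResult time_normalize(const std::vector<std::string>& data) {
  const auto start = std::chrono::steady_clock::now();
  std::size_t output_bytes = 0;
  for (const auto& s : data) output_bytes += normalize_spaces_host(s).size();
  const auto stop = std::chrono::steady_clock::now();
  return {std::chrono::duration<double>(stop - start).count(), output_bytes};
}
```

Calling `time_normalize(make_rows(rows, length))` for each `(rows, length)` pair gives a crude version of the kind of sweep described above.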

Authors:
  - David (@davidwendt)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - Conor Hoekstra (@codereport)
  - Mark Harris (@harrism)

URL: #7668
rapids-bot bot pushed a commit that referenced this issue Mar 24, 2021
Reference #5696
Creates gbenchmarks for `nvtext::tokenize()`, `nvtext::count_tokens()` and `nvtext::ngrams_tokenize()` functions.
The benchmarks measure various string lengths and numbers of rows.

These functions use the `make_strings_column` factory optimized in #7576.
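
As a rough illustration of the three operations being benchmarked, the following host-side sketch tokenizes one string on whitespace, counts the tokens, and joins adjacent tokens into n-grams. It assumes the default whitespace delimiter and is not the libcudf code. The names `tokenize_host`, `count_tokens_host`, and `ngrams_tokenize_host` are hypothetical, and each works on a single `std::string` rather than a strings column.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CPU stand-in for whitespace tokenization of one string:
// any byte <= ' ' is a delimiter and empty tokens are skipped.
std::vector<std::string> tokenize_host(const std::string& input) {
  std::vector<std::string> tokens;
  std::string current;
  for (unsigned char ch : input) {
    if (ch <= ' ') {
      if (!current.empty()) {
        tokens.push_back(current);
        current.clear();
      }
    } else {
      current.push_back(static_cast<char>(ch));
    }
  }
  if (!current.empty()) tokens.push_back(current);
  return tokens;
}

// Number of whitespace-delimited tokens in one string.
std::size_t count_tokens_host(const std::string& input) {
  return tokenize_host(input).size();
}

// Tokenizes one string, then joins each window of `n` adjacent tokens
// with `separator`. Returns nothing if there are fewer than `n` tokens
// (an assumption of this sketch).
std::vector<std::string> ngrams_tokenize_host(const std::string& input,
                                              std::size_t n,
                                              const std::string& separator) {
  const std::vector<std::string> tokens = tokenize_host(input);
  std::vector<std::string> ngrams;
  if (n == 0 || tokens.size() < n) return ngrams;
  for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
    std::string gram = tokens[i];
    for (std::size_t j = 1; j < n; ++j) gram += separator + tokens[i + j];
    ngrams.push_back(gram);
  }
  return ngrams;
}
```

The amount of output grows with both the string length and the number of tokens per string.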

Authors:
  - David (@davidwendt)

Approvers:
  - Conor Hoekstra (@codereport)
  - Nghia Truong (@ttnghia)
  - Mark Harris (@harrism)

URL: #7684
rapids-bot bot pushed a commit that referenced this issue Mar 26, 2021
Reference #5696
Creates a gbenchmark for the `nvtext::replace_tokens()` function.
The benchmark measures various string lengths and numbers of rows with the default whitespace delimiter and 4 hardcoded tokens.

This API already uses the `make_strings_children` utility.

Authors:
  - David (@davidwendt)

Approvers:
  - Karthikeyan (@karthikeyann)
  - Nghia Truong (@ttnghia)
  - @nvdbaranec
  - Keith Kraus (@kkraus14)

URL: #7708
rapids-bot bot pushed a commit that referenced this issue Mar 29, 2021
Reference #5696
Creates gbenchmarks for the `nvtext::generate_ngrams()` and `nvtext::generate_character_ngrams()` functions.
The benchmarks measure various string lengths and numbers of rows.
`nvtext::generate_ngrams()` was refactored to use the more efficient `make_strings_children`, which improved its performance by about 50%.
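
To make the two operations concrete, here is a small host-side sketch. It is not the libcudf implementation. It assumes that row-level n-grams join `n` consecutive rows with a separator and that character n-grams slide a window of `n` bytes over one string. The names `generate_ngrams_host` and `generate_character_ngrams_host` are hypothetical. The character version works on bytes, so it is only meaningful for ASCII input, and returning no n-grams for inputs shorter than `n` is a choice made for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CPU stand-in: treats each row as one token and joins every
// window of `n` consecutive rows with `separator`.
std::vector<std::string> generate_ngrams_host(const std::vector<std::string>& rows,
                                              std::size_t n,
                                              const std::string& separator) {
  std::vector<std::string> ngrams;
  if (n == 0 || rows.size() < n) return ngrams;
  for (std::size_t i = 0; i + n <= rows.size(); ++i) {
    std::string gram = rows[i];
    for (std::size_t j = 1; j < n; ++j) gram += separator + rows[i + j];
    ngrams.push_back(gram);
  }
  return ngrams;
}

// Hypothetical CPU stand-in: slides a window of `n` bytes over one string.
// Byte-based, so ASCII-only; strings shorter than `n` yield no n-grams here.
std::vector<std::string> generate_character_ngrams_host(const std::string& input,
                                                        std::size_t n) {
  std::vector<std::string> ngrams;
  if (n == 0 || input.size() < n) return ngrams;
  for (std::size_t i = 0; i + n <= input.size(); ++i) {
    ngrams.push_back(input.substr(i, n));
  }
  return ngrams;
}
```

In both cases the output size scales with the input size, which is why the benchmarks vary both string length and row count.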

Authors:
  - David (@davidwendt)

Approvers:
  - Nghia Truong (@ttnghia)
  - Mark Harris (@harrism)

URL: #7693
@davidwendt
Contributor Author

These are done now.
