[FEA] Create gbenchmarks for nvtext APIs #5696
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
strings
strings issues (C++ and Python)
tests
Unit testing for project
Comments
davidwendt added the feature request, Needs Triage, libcudf, and strings labels on Jul 15, 2020
davidwendt changed the title from "Create gbenchmarks for nvtext APIs" to "[FEA] Create gbenchmarks for nvtext APIs" on Jul 15, 2020
harrism added the Performance, tech debt, and tests labels and removed the Needs Triage label on Jul 19, 2020
This issue has been labeled |
rapids-bot bot pushed a commit that referenced this issue on Mar 23, 2021
Reference #5696

Creates gbenchmarks for the `nvtext::normalize_spaces()` and `nvtext::normalize_characters()` functions. The benchmarks measure various string lengths and numbers of rows.

I found that `normalize_spaces()` is used in haproxy parsing along with `extract`, so having this benchmark helps measure possible performance improvements there. The `normalize_characters` code is the same code used as part of the `subword_tokenizer`. Since each function requires a different memory footprint, my initial goal of having them share a common benchmark structure did not work out, so the 2 tests are in separate gbenchmark test files.

I refactored some of this code to use the more efficient `make_strings_children`, which improved the performance of `normalize_spaces` by 2-3x.

The current subword-tokenizer gbenchmark is also incorporated into the TEXT_BENCHMARK gbenchmark.

Authors:
- David (@davidwendt)

Approvers:
- Vukasin Milovanovic (@vuule)
- Conor Hoekstra (@codereport)
- Mark Harris (@harrism)

URL: #7668
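To illustrate the shape of such a benchmark (a sweep over row counts and string lengths), here is a minimal host-side sketch. It does not use libcudf or Google Benchmark: `normalize_spaces_host`, `make_rows`, `sweep_result`, and `run_sweep` are hypothetical names, and the whitespace rule (collapse each run of bytes `<= ' '` to one space and trim both ends) is an assumption about what `nvtext::normalize_spaces()` does, not code taken from the commit.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side stand-in (NOT the libcudf kernel): collapse each run
// of whitespace (bytes <= ' ') to a single space and trim both ends.
std::string normalize_spaces_host(const std::string& input)
{
  std::string out;
  out.reserve(input.size());
  bool pending_space = false;
  for (unsigned char ch : input) {
    if (ch <= ' ') {
      // Remember the gap only if some output exists; this drops leading whitespace.
      pending_space = !out.empty();
    } else {
      // Emit one space for the whole preceding gap; trailing gaps are never emitted.
      if (pending_space) { out.push_back(' '); }
      pending_space = false;
      out.push_back(static_cast<char>(ch));
    }
  }
  return out;
}

// Build num_rows identical rows of row_width bytes from a whitespace-heavy pattern.
std::vector<std::string> make_rows(std::size_t num_rows, std::size_t row_width)
{
  static const std::string pattern = " the  quick\tbrown ";
  std::string row;
  row.reserve(row_width);
  for (std::size_t i = 0; i < row_width; ++i) { row.push_back(pattern[i % pattern.size()]); }
  return std::vector<std::string>(num_rows, row);
}

// One timing record per (row count, row width) combination.
struct sweep_result {
  std::size_t num_rows;
  std::size_t row_width;
  std::size_t output_bytes;
  double milliseconds;
};

// Time the stand-in over every combination of row count and row width.
std::vector<sweep_result> run_sweep(const std::vector<std::size_t>& row_counts,
                                    const std::vector<std::size_t>& row_widths)
{
  std::vector<sweep_result> results;
  for (std::size_t num_rows : row_counts) {
    for (std::size_t row_width : row_widths) {
      auto const input = make_rows(num_rows, row_width);
      auto const start = std::chrono::steady_clock::now();
      std::size_t output_bytes = 0;
      for (auto const& row : input) { output_bytes += normalize_spaces_host(row).size(); }
      auto const stop = std::chrono::steady_clock::now();
      std::chrono::duration<double, std::milli> const elapsed = stop - start;
      results.push_back({num_rows, row_width, output_bytes, elapsed.count()});
    }
  }
  return results;
}
```

Calling, for example, `run_sweep({4096, 32768}, {32, 128, 512})` returns one timing record per combination. A real gbenchmark would instead register each combination as benchmark arguments and time the device call.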
rapids-bot bot pushed a commit that referenced this issue on Mar 24, 2021
Reference #5696

Creates gbenchmarks for the `nvtext::tokenize()`, `nvtext::count_tokens()`, and `nvtext::ngrams_tokenize()` functions. The benchmarks measure various string lengths and numbers of rows. These functions use the `make_strings_column` factory optimized in #7576.

Authors:
- David (@davidwendt)

Approvers:
- Conor Hoekstra (@codereport)
- Nghia Truong (@ttnghia)
- Mark Harris (@harrism)

URL: #7684
rapids-bot bot pushed a commit that referenced this issue on Mar 26, 2021
Reference #5696

Creates a gbenchmark for the `nvtext::replace_tokens()` function. The benchmark measures various string lengths and numbers of rows with the default whitespace delimiter and 4 hardcoded tokens. This API already uses the `make_strings_children` utility.

Authors:
- David (@davidwendt)

Approvers:
- Karthikeyan (@karthikeyann)
- Nghia Truong (@ttnghia)
- @nvdbaranec
- Keith Kraus (@kkraus14)

URL: #7708
rapids-bot bot pushed a commit that referenced this issue on Mar 29, 2021
Reference #5696

Creates gbenchmarks for the `nvtext::generate_ngrams()` and `nvtext::generate_character_ngrams()` functions. The benchmarks measure various string lengths and numbers of rows. `nvtext::generate_ngrams()` was refactored to use the more efficient `make_strings_children`, which improved its performance by about 50%.

Authors:
- David (@davidwendt)

Approvers:
- Nghia Truong (@ttnghia)
- Mark Harris (@harrism)

URL: #7693
These are done now.
Currently there is only one benchmark for the nvtext APIs.
Propose creating the following gbenchmarks:
This will help measure performance impact of code changes.