Optimize cudf::make_strings_column for long strings #7576
Conversation
Codecov Report
@@            Coverage Diff             @@
##           branch-0.19    #7576   +/-   ##
===============================================
+ Coverage       81.86%   82.00%   +0.13%
===============================================
  Files             101      101
  Lines           16884    16991     +107
===============================================
+ Hits            13822    13933     +111
+ Misses           3062     3058       -4
Continue to review full report at Codecov.
#include "string_bench_args.hpp"

namespace {
// Each element pairs a device pointer to one string's characters with its
// size in bytes -- the row type consumed by cudf::make_strings_column.
using string_pair = thrust::pair<char const*, cudf::size_type>;
"string_pair" should mean a pair of strings, so should we use some other name?
Most usages of 'pair' indicate two different types.
Well, I don't agree with that statement... but I also don't have a better name suggestion.
The pair contains a data pointer and a data size, similar to a span. How about `string_span`?
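A minimal sketch of the suggested rename — the alias body is unchanged from the reviewed code, only the name differs:

```cpp
#include <cudf/types.hpp>
#include <thrust/pair.h>

// Hypothetical rename from the review suggestion: the alias still holds a
// (data pointer, byte count) pair, but the name reads like a span rather
// than implying a pair of strings.
using string_span = thrust::pair<char const*, cudf::size_type>;
```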
Thanks for working on this, @davidwendt! I have a small comment suggestion, but otherwise this looks good to me. I ran this on the dataset from #7545 and `make_strings_column` is now 28.5 msec instead of 371.2 msec. Not too shabby! The cost is still a lot larger than what is spent decompressing and decoding the strings from Parquet for this dataset, but it's a huge improvement from where it was.

We discussed that maybe a better solution is for the Parquet reader to just build the chars and offsets directly instead of using this factory function. But this improvement will also be good for other places that use the factory.
I discussed this with @devavret and there really isn't anything better they can do in the Parquet reader other than reinventing the implementation of this factory. The string data in the Parquet layout is stored as …
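To make the discussion concrete, here is a hedged sketch of how a caller such as a reader hands (pointer, length) pairs to this factory. `build_strings` and `d_pairs` are illustrative names, and the pair-based overload's exact signature has shifted across cuDF releases, so treat this as a sketch rather than the exact API:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/column/column_factories.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <thrust/pair.h>

using string_pair = thrust::pair<char const*, cudf::size_type>;

// Sketch: each pair points into a device buffer of characters and carries
// that string's size in bytes; a null row is encoded as a {nullptr, 0} pair.
std::unique_ptr<cudf::column> build_strings(
  rmm::device_uvector<string_pair> const& d_pairs, rmm::cuda_stream_view stream)
{
  return cudf::make_strings_column(
    cudf::device_span<string_pair const>{d_pairs.data(), d_pairs.size()}, stream);
}
```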
@gpucibot merge
Reference #5696

Creates gbenchmarks for `nvtext::tokenize()`, `nvtext::count_tokens()`, and `nvtext::ngrams_tokenize()` functions. The benchmarks measure various string lengths and row counts. These functions use the `make_strings_column` factory optimized in #7576.

Authors:
- David (@davidwendt)

Approvers:
- Conor Hoekstra (@codereport)
- Nghia Truong (@ttnghia)
- Mark Harris (@harrism)

URL: #7684
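For context, a minimal sketch of what one of these tokenize benchmarks could look like — this is not the actual code from #7684; the fixed input column here is an assumption, and the real benchmarks sweep row counts and string lengths:

```cpp
#include <benchmark/benchmark.h>

#include <cudf/strings/strings_column_view.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <nvtext/tokenize.hpp>

// Hedged sketch: tokenize a small fixed strings column on every iteration.
static void BM_tokenize(benchmark::State& state)
{
  cudf::test::strings_column_wrapper input(
    {"the quick brown fox", "jumps over the lazy dog", "tokenize me"});
  cudf::strings_column_view view(input);
  for (auto _ : state) {
    auto result = nvtext::tokenize(view);  // default whitespace delimiter
  }
}
BENCHMARK(BM_tokenize);
```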
Reference #7571

This improves the performance of `cudf::make_strings_column` for long strings. It uses a similar approach to `cudf::strings::detail::gather` and also uses thresholding as in the optimized `cudf::strings::replace`. This may not be the right solution for overall optimizing #7571, but it may be helpful in other places where long strings are used to create a strings column in libcudf.
This PR also includes a gbenchmark to help measure the performance of this factory function. The benchmark results show that longer strings (~ >64 bytes on average) get about a 10x improvement. I can post benchmark results here if needed. The character-parallel algorithm was slower for shorter strings, so the existing algorithm is still used below a threshold calculation; a sketch of that dispatch follows below.
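A hypothetical sketch of the thresholding dispatch described above — the constant name, its exact value, and the function are illustrative assumptions; the PR text only states that averages above roughly 64 bytes favor the character-parallel path:

```cpp
#include <cudf/types.hpp>

#include <cstdint>

// Illustrative crossover only: the PR reports ~64 bytes/row as the point
// where the character-parallel copy starts to beat the row-parallel one.
constexpr int64_t AVG_CHAR_BYTES_THRESHOLD = 64;  // assumed name and value

// Hypothetical dispatch: pick the character-parallel algorithm when strings
// are long on average; otherwise keep the existing row-parallel algorithm.
inline bool use_char_parallel(int64_t total_bytes, cudf::size_type num_rows)
{
  return num_rows > 0 && (total_bytes / num_rows) >= AVG_CHAR_BYTES_THRESHOLD;
}
```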
I also added an additional gtest with a mixture of nulls and empty strings to make sure the new algorithm handles these correctly.