Optimize cudf::make_strings_column for long strings #7576
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           branch-0.19    #7576   +/- ##
===============================================
+ Coverage       81.86%   82.00%   +0.13%
===============================================
  Files             101      101
  Lines           16884    16991     +107
===============================================
+ Hits            13822    13933     +111
+ Misses           3062     3058       -4
```

Continue to review the full report at Codecov.
```cpp
#include "string_bench_args.hpp"

namespace {
using string_pair = thrust::pair<char const*, cudf::size_type>;
```
"string_pair" should mean a pair of strings, so should we use some other name?
Most usages of 'pair' indicate two different types.
Well, I don't agree with that statement... but I also don't have a better name suggestion.
The pair contains a data pointer and a data size, similar to a span. How about `string_span`?
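To illustrate that reading (the `d_chars` buffer and the values below are hypothetical, not from the PR): each entry is span-like, a pointer into device character data plus a byte count, and a null row is conventionally `{nullptr, 0}` in libcudf:

```cpp
#include <cudf/types.hpp>
#include <thrust/pair.h>
#include <vector>

using string_pair = thrust::pair<char const*, cudf::size_type>;

// d_chars is assumed to be a device buffer already holding "helloworld".
std::vector<string_pair> make_pairs(char const* d_chars)
{
  return {
    {d_chars, 5},      // "hello": pointer into device memory and size in bytes
    {d_chars + 5, 5},  // "world"
    {nullptr, 0},      // null row, by libcudf convention
  };
}
```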
Thanks for working on this, @davidwendt! I have a small comment suggestion, but otherwise this looks good to me. I ran this on the dataset from #7545 and `make_strings_column` is now 28.5 msec instead of 371.2 msec. Not too shabby! The cost is still a lot larger than what is spent decompressing and decoding the strings from Parquet for this dataset, but it's a huge improvement from where it was.

We discussed that maybe a better solution is for the Parquet reader to just build the chars and offsets directly instead of using this factory function. But this improvement will also be good for other places that use the factory.
I discussed this with @devavret and there really isn't anything better they can do in the Parquet reader other than reinventing the implementation of this factory. The string data in the Parquet layout is stored as …
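For context on the trade-off discussed above, here is a rough sketch of the two construction paths (a best-effort reconstruction; the exact factory signatures vary across libcudf releases, so treat both calls as assumptions):

```cpp
#include <cudf/column/column_factories.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <thrust/pair.h>

using string_pair = thrust::pair<char const*, cudf::size_type>;

// Path 1: pass (pointer, size) pairs and let the factory compute the offsets
// and gather the chars -- the code path this PR optimizes for long strings.
std::unique_ptr<cudf::column> via_factory(rmm::device_uvector<string_pair> const& pairs,
                                          rmm::cuda_stream_view stream)
{
  return cudf::make_strings_column(pairs, stream);
}

// Path 2: build the offsets and chars children directly and let the factory
// merely assemble them -- what the Parquet reader would have to do itself.
std::unique_ptr<cudf::column> via_children(cudf::size_type num_rows,
                                           std::unique_ptr<cudf::column> offsets,
                                           std::unique_ptr<cudf::column> chars,
                                           cudf::size_type null_count,
                                           rmm::device_buffer&& null_mask)
{
  return cudf::make_strings_column(
    num_rows, std::move(offsets), std::move(chars), null_count, std::move(null_mask));
}
```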
@gpucibot merge
Reference #5696

Creates gbenchmarks for the `nvtext::tokenize()`, `nvtext::count_tokens()`, and `nvtext::ngrams_tokenize()` functions. The benchmarks measure various string lengths and numbers of rows. These functions use the `make_strings_column` factory optimized in #7576.

Authors:
- David (@davidwendt)

Approvers:
- Conor Hoekstra (@codereport)
- Nghia Truong (@ttnghia)
- Mark Harris (@harrism)

URL: #7684
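A condensed sketch of the shape of one such gbenchmark (the random-strings helper, argument values, and timing details are placeholders; the real benchmarks use cudf's benchmark fixtures and a CUDA event timer):

```cpp
#include <benchmark/benchmark.h>
#include <cudf/strings/strings_column_view.hpp>
#include <nvtext/tokenize.hpp>

#include <memory>

// Stand-in for cudf's benchmark helpers that generate a strings column
// with a given row count and average row width (hypothetical declaration).
std::unique_ptr<cudf::column> create_random_strings_column(cudf::size_type rows,
                                                           cudf::size_type row_width);

static void BM_tokenize(benchmark::State& state)
{
  auto const column = create_random_strings_column(state.range(0), state.range(1));
  cudf::strings_column_view input(column->view());
  for (auto _ : state) {
    auto result = nvtext::tokenize(input);  // default whitespace delimiter
  }
}
// sweep the number of rows and the average string length, as described above
BENCHMARK(BM_tokenize)->Args({4096, 32})->Args({4096, 2048})->UseRealTime();
```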
Reference #7571

This improves the performance of `cudf::make_strings_column` for long strings. It uses a similar approach to `cudf::strings::detail::gather` and also uses thresholding as in the optimized `cudf::strings::replace`. This may not be the right solution for overall optimizing #7571, but it may be helpful in other places where long strings are used to create a strings column in libcudf.
This PR also includes a gbenchmark to help measure the performance of this factory function. The benchmark results show that longer strings (averaging more than ~64 bytes) see about a 10x improvement. I can post benchmark results here if needed. The character-parallel algorithm was slower for shorter strings, so the existing algorithm is still used below a calculated threshold.
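The dispatch reduces to something like the following sketch (the names and the exact threshold are illustrative assumptions; the real logic lives inside the factory implementation):

```cpp
#include <cudf/column/column.hpp>

#include <memory>

// Hypothetical stand-ins for the two kernels inside the factory.
std::unique_ptr<cudf::column> build_row_parallel();   // one thread per string
std::unique_ptr<cudf::column> build_char_parallel();  // one thread per output byte

std::unique_ptr<cudf::column> build_strings(int64_t total_bytes, cudf::size_type num_rows)
{
  // Cut over near the ~64-byte average mentioned above; the exact value is assumed.
  constexpr int64_t bytes_per_row_threshold = 64;
  auto const avg_bytes = num_rows > 0 ? total_bytes / num_rows : 0;
  return avg_bytes < bytes_per_row_threshold ? build_row_parallel()
                                             : build_char_parallel();
}
```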
I also added a gtest with a mixture of nulls and empty strings to make sure the new algorithm handles these cases correctly.
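In the same spirit, a skeletal version of such a gtest (the fixture name and data are illustrative; the real test builds the column through `cudf::make_strings_column` and compares):

```cpp
#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>

struct MakeStringsColumnTest : public cudf::test::BaseFixture {};

TEST_F(MakeStringsColumnTest, MixedNullsAndEmptyStrings)
{
  // validity: 0 marks a null row; note the mix of null rows and empty-but-valid rows
  cudf::test::strings_column_wrapper expected(
    {"", "a somewhat longer string value", "", "xyz", ""}, {0, 1, 1, 1, 0});
  // the real test constructs the same data via cudf::make_strings_column from
  // device pairs, then verifies:
  // CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view());
}
```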