Improve url_decode performance for long strings #7353

Merged
merged 10 commits into from
Feb 15, 2021

Conversation

@jlowe jlowe commented Feb 9, 2021

Fixes #7348.

This changes the url_decode algorithm from row-level parallelism to character-level parallelism, which improves performance on string columns with longer average string lengths. A benchmark for url_decode has also been added.
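
To make the row-level versus character-level distinction concrete, here is a minimal CPU-side sketch in plain C++. It is an illustration under stated assumptions, not the code in this PR: `StringsColumn`, `url_decode_per_char`, `is_hex`, and `hex_value` are hypothetical names, the offsets are assumed to start at 0 (so an unsliced column), and the loops run sequentially where a GPU version would presumably give each character index its own thread and use a parallel scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical layout: all rows' bytes in one buffer; offsets has rows + 1 entries.
struct StringsColumn {
  std::string chars;
  std::vector<std::size_t> offsets;
};

inline bool is_hex(char c) {
  return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

inline int hex_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return c - 'A' + 10;
}

// Every loop body below depends only on its own index, which is what makes
// the work parallelizable per character instead of per row.
StringsColumn url_decode_per_char(const StringsColumn& in) {
  // Assumption of this sketch: offsets start at 0 and end at chars.size().
  assert(!in.offsets.empty() && in.offsets.front() == 0 &&
         in.offsets.back() == in.chars.size());
  const std::size_t n = in.chars.size();

  // Step 1: mark positions where a "%XX" escape starts and fits inside its row.
  std::vector<char> starts(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    if (in.chars[i] != '%') continue;
    // First offset greater than i is the end of the row containing i.
    const std::size_t row_end = *std::upper_bound(in.offsets.begin(), in.offsets.end(), i);
    starts[i] = i + 2 < row_end && is_hex(in.chars[i + 1]) && is_hex(in.chars[i + 2]);
  }

  // Step 2: exclusive prefix sum, before[i] = number of escapes starting before i.
  std::vector<std::size_t> before(n + 1, 0);
  for (std::size_t i = 0; i < n; ++i) before[i + 1] = before[i] + (starts[i] ? 1 : 0);

  // Step 3: each surviving character writes itself to its shifted position.
  StringsColumn out;
  out.chars.resize(n - 2 * before[n]);
  for (std::size_t i = 0; i < n; ++i) {
    const bool consumed = (i >= 1 && starts[i - 1]) || (i >= 2 && starts[i - 2]);
    if (consumed) continue;  // hex digit already folded into a decoded byte
    const char c = starts[i]
      ? static_cast<char>(hex_value(in.chars[i + 1]) * 16 + hex_value(in.chars[i + 2]))
      : in.chars[i];
    out.chars[i - 2 * before[i]] = c;
  }

  // Step 4: each escape shrinks the output by two bytes, so shift the offsets too.
  out.offsets.reserve(in.offsets.size());
  for (std::size_t off : in.offsets) out.offsets.push_back(off - 2 * before[off]);
  return out;
}
```

In this sketch the per-character binary search in step 1 is the only place where row boundaries matter; a `%` whose two hex digits would spill into the next row is left untouched.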

@jlowe added the libcudf, Performance, improvement, and non-breaking labels Feb 9, 2021
@jlowe self-assigned this Feb 9, 2021
@jlowe requested review from a team as code owners February 9, 2021 17:02
@jlowe requested review from trxcllnt and vuule February 9, 2021 17:02
@github-actions bot added the CMake label Feb 9, 2021

jlowe commented Feb 9, 2021

Using the benchmark, here are the before and after performance numbers on a V100. The two numbers that vary in the benchmark name are the number of rows and the number of characters per row, respectively.
Before:

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100/10/manual_time            0.053 ms        0.071 ms        12289 bytes_per_second=25.0311M/s
UrlDecode<10>/url_decode_10pct/1000/10/manual_time           0.073 ms        0.091 ms         9156 bytes_per_second=181.77M/s
UrlDecode<10>/url_decode_10pct/10000/10/manual_time          0.077 ms        0.095 ms         8789 bytes_per_second=1.6881G/s
UrlDecode<10>/url_decode_10pct/100000/10/manual_time         0.084 ms        0.102 ms         8133 bytes_per_second=15.5132G/s
UrlDecode<10>/url_decode_10pct/100/100/manual_time           0.164 ms        0.182 ms         4217 bytes_per_second=60.2996M/s
UrlDecode<10>/url_decode_10pct/1000/100/manual_time          0.450 ms        0.468 ms         1554 bytes_per_second=220.233M/s
UrlDecode<10>/url_decode_10pct/10000/100/manual_time         0.451 ms        0.468 ms         1552 bytes_per_second=2.14842G/s
UrlDecode<10>/url_decode_10pct/100000/100/manual_time        0.789 ms        0.806 ms          885 bytes_per_second=12.2702G/s
UrlDecode<10>/url_decode_10pct/100/1000/manual_time           1.48 ms         1.50 ms          471 bytes_per_second=64.5066M/s
UrlDecode<10>/url_decode_10pct/1000/1000/manual_time          4.77 ms         4.79 ms          147 bytes_per_second=200.825M/s
UrlDecode<10>/url_decode_10pct/10000/1000/manual_time         4.91 ms         4.92 ms          143 bytes_per_second=1.90587G/s
UrlDecode<10>/url_decode_10pct/100000/1000/manual_time        33.9 ms         33.9 ms           21 bytes_per_second=2.76001G/s
UrlDecode<10>/url_decode_10pct/100/10000/manual_time          14.6 ms         14.6 ms           48 bytes_per_second=65.2909M/s
UrlDecode<10>/url_decode_10pct/1000/10000/manual_time         49.3 ms         49.3 ms           14 bytes_per_second=193.504M/s
UrlDecode<10>/url_decode_10pct/10000/10000/manual_time        49.7 ms         49.7 ms           14 bytes_per_second=1.87403G/s
UrlDecode<10>/url_decode_10pct/100000/10000/manual_time        371 ms          371 ms            2 bytes_per_second=2.5111G/s

After:

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100/10/manual_time            0.070 ms        0.087 ms         9599 bytes_per_second=19.1569M/s
UrlDecode<10>/url_decode_10pct/1000/10/manual_time           0.086 ms        0.104 ms         7839 bytes_per_second=155.34M/s
UrlDecode<10>/url_decode_10pct/10000/10/manual_time          0.103 ms        0.120 ms         6597 bytes_per_second=1.26665G/s
UrlDecode<10>/url_decode_10pct/100000/10/manual_time         0.208 ms        0.224 ms         3342 bytes_per_second=6.27197G/s
UrlDecode<10>/url_decode_10pct/100/100/manual_time           0.078 ms        0.095 ms         8643 bytes_per_second=127.449M/s
UrlDecode<10>/url_decode_10pct/1000/100/manual_time          0.094 ms        0.112 ms         7130 bytes_per_second=1051.62M/s
UrlDecode<10>/url_decode_10pct/10000/100/manual_time         0.190 ms        0.206 ms         3685 bytes_per_second=5.10589G/s
UrlDecode<10>/url_decode_10pct/100000/100/manual_time         1.27 ms         1.28 ms          553 bytes_per_second=7.65637G/s
UrlDecode<10>/url_decode_10pct/100/1000/manual_time          0.086 ms        0.103 ms         7868 bytes_per_second=1112.93M/s
UrlDecode<10>/url_decode_10pct/1000/1000/manual_time         0.173 ms        0.189 ms         4024 bytes_per_second=5.4189G/s
UrlDecode<10>/url_decode_10pct/10000/1000/manual_time         1.12 ms         1.14 ms          626 bytes_per_second=8.34177G/s
UrlDecode<10>/url_decode_10pct/100000/1000/manual_time        11.8 ms         11.8 ms           59 bytes_per_second=7.95281G/s
UrlDecode<10>/url_decode_10pct/100/10000/manual_time         0.153 ms        0.169 ms         4564 bytes_per_second=6.10873G/s
UrlDecode<10>/url_decode_10pct/1000/10000/manual_time         1.01 ms         1.02 ms          696 bytes_per_second=9.26655G/s
UrlDecode<10>/url_decode_10pct/10000/10000/manual_time        10.7 ms         10.7 ms           65 bytes_per_second=8.72123G/s
UrlDecode<10>/url_decode_10pct/100000/10000/manual_time        121 ms          121 ms            6 bytes_per_second=7.67838G/s
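
One way to compare the two tables is to take the ratio of the Time columns row by row. The following is a minimal sketch of that arithmetic; `BenchRow`, `speedup`, and `sample_rows` are hypothetical names, and the three rows are copied from the Before and After tables above.

```cpp
#include <string>
#include <vector>

// One benchmark configuration, labelled "rows/chars per row", with the
// manual_time values (in ms) taken from the Before and After tables.
struct BenchRow {
  std::string name;
  double before_ms;
  double after_ms;
};

// Values above 1 mean the new algorithm is faster; below 1 means slower.
inline double speedup(const BenchRow& r) { return r.before_ms / r.after_ms; }

inline std::vector<BenchRow> sample_rows() {
  return {
    {"100000/10", 0.084, 0.208},     // many short strings
    {"1000/1000", 4.77, 0.173},      // medium strings
    {"100000/10000", 371.0, 121.0},  // many long strings
  };
}
```

For these three rows the ratio is below 1 for 100000/10 (the short-string case became slower), about 27x for 1000/1000, and about 3x for 100000/10000.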

@jlowe added the strings label Feb 9, 2021
@davidwendt davidwendt self-requested a review February 9, 2021 19:21

@davidwendt davidwendt left a comment

This looks like a good solution. I don't think it is handling sliced columns correctly though.

@vuule vuule left a comment

A few minor comments. Will take another look once the sliced input tests are in place.

codecov bot commented Feb 9, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@da3ab29). Click here to learn what that means.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.19    #7353   +/-   ##
==============================================
  Coverage               ?   82.22%           
==============================================
  Files                  ?      100           
  Lines                  ?    16969           
  Branches               ?        0           
==============================================
  Hits                   ?    13953           
  Misses                 ?     3016           
  Partials               ?        0           


jlowe commented Feb 9, 2021

Thanks a ton for the quick reviews! I have yet to address the excellent feedback, but I wanted to post a slightly modified version of the algorithm that I found was significantly faster in practice, to see if there were any suggestions or feedback on the approach.

Rather than performing the expensive binary search from each character index three times (during count_if, copy_if and for_each_n), this new version performs a much quicker but sloppier escape detection that ignores string boundaries for the count_if and copy_if. As a result, it may compute some escape sequence positions that aren't "real" because they cross a string boundary, so an extra pass is needed to compute the "real" escape positions by filtering the candidates against the string boundaries. That binary search runs once per escape sequence rather than once per character, which should be far fewer operations in practice.
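
The two-pass idea described here can be sketched on the CPU in plain C++. This is only an illustration under stated assumptions, not the code in this PR: `candidate_escapes`, `real_escapes`, and `is_hex` are hypothetical names, the offsets are assumed to start at 0 and end at the size of the character buffer, and the sequential loops stand in for what would be parallel primitives on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

inline bool is_hex(char c) {
  return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

// Pass 1 ("sloppy"): a cheap per-character test that ignores row boundaries.
// It may report a "%XX" whose hex digits belong to the next row.
std::vector<std::size_t> candidate_escapes(const std::string& chars) {
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i + 2 < chars.size(); ++i) {
    if (chars[i] == '%' && is_hex(chars[i + 1]) && is_hex(chars[i + 2])) out.push_back(i);
  }
  return out;
}

// Pass 2: keep only candidates that fit inside their own row. The binary
// search runs once per candidate instead of once per character.
std::vector<std::size_t> real_escapes(const std::vector<std::size_t>& candidates,
                                      const std::vector<std::size_t>& offsets) {
  std::vector<std::size_t> out;
  for (std::size_t pos : candidates) {
    // Assumption of this sketch: every candidate lies before the final offset.
    assert(!offsets.empty() && pos < offsets.back());
    // First offset greater than pos is the end of the row containing pos.
    const std::size_t row_end = *std::upper_bound(offsets.begin(), offsets.end(), pos);
    if (pos + 2 < row_end) out.push_back(pos);
  }
  return out;
}
```

For example, with rows `"a%20b"`, `"%41%4"`, `""`, and `"100%"` stored back to back, pass 1 also flags the `%4` at the end of the second row (its bytes run into the `1` of the fourth row), and pass 2 discards it.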

This bears out in the new performance numbers. I also added a variant with a 50% chance of an escape sequence, to compare how performance changes as escape sequences become significantly more common.

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100/10/manual_time            0.091 ms        0.108 ms         7344 bytes_per_second=14.7376M/s
UrlDecode<10>/url_decode_10pct/1000/10/manual_time           0.102 ms        0.120 ms         6616 bytes_per_second=131.022M/s
UrlDecode<10>/url_decode_10pct/10000/10/manual_time          0.118 ms        0.136 ms         5770 bytes_per_second=1.10266G/s
UrlDecode<10>/url_decode_10pct/100000/10/manual_time         0.170 ms        0.187 ms         4053 bytes_per_second=7.65321G/s
UrlDecode<10>/url_decode_10pct/100/100/manual_time           0.097 ms        0.114 ms         6914 bytes_per_second=102.522M/s
UrlDecode<10>/url_decode_10pct/1000/100/manual_time          0.112 ms        0.130 ms         6046 bytes_per_second=882.095M/s
UrlDecode<10>/url_decode_10pct/10000/100/manual_time         0.168 ms        0.184 ms         4131 bytes_per_second=5.77379G/s
UrlDecode<10>/url_decode_10pct/100000/100/manual_time        0.772 ms        0.790 ms          903 bytes_per_second=12.5472G/s
UrlDecode<10>/url_decode_10pct/100/1000/manual_time          0.105 ms        0.123 ms         6427 bytes_per_second=907.985M/s
UrlDecode<10>/url_decode_10pct/1000/1000/manual_time         0.163 ms        0.179 ms         4250 bytes_per_second=5.73304G/s
UrlDecode<10>/url_decode_10pct/10000/1000/manual_time        0.762 ms        0.780 ms          918 bytes_per_second=12.273G/s
UrlDecode<10>/url_decode_10pct/100000/1000/manual_time        7.23 ms         7.25 ms           97 bytes_per_second=12.9317G/s
UrlDecode<10>/url_decode_10pct/100/10000/manual_time         0.155 ms        0.171 ms         4452 bytes_per_second=6.00531G/s
UrlDecode<10>/url_decode_10pct/1000/10000/manual_time        0.749 ms        0.766 ms          933 bytes_per_second=12.4459G/s
UrlDecode<10>/url_decode_10pct/10000/10000/manual_time        7.11 ms         7.12 ms           99 bytes_per_second=13.112G/s
UrlDecode<10>/url_decode_10pct/100000/10000/manual_time       76.5 ms         76.5 ms            9 bytes_per_second=12.182G/s
UrlDecode<50>/url_decode_50pct/100/10/manual_time            0.092 ms        0.110 ms         7237 bytes_per_second=14.4914M/s
UrlDecode<50>/url_decode_50pct/1000/10/manual_time           0.104 ms        0.122 ms         6476 bytes_per_second=128.316M/s
UrlDecode<50>/url_decode_50pct/10000/10/manual_time          0.120 ms        0.137 ms         5659 bytes_per_second=1113.28M/s
UrlDecode<50>/url_decode_50pct/100000/10/manual_time         0.180 ms        0.197 ms         3857 bytes_per_second=7.26268G/s
UrlDecode<50>/url_decode_50pct/100/100/manual_time           0.099 ms        0.116 ms         6786 bytes_per_second=100.444M/s
UrlDecode<50>/url_decode_50pct/1000/100/manual_time          0.115 ms        0.133 ms         5940 bytes_per_second=860.877M/s
UrlDecode<50>/url_decode_50pct/10000/100/manual_time         0.175 ms        0.193 ms         3959 bytes_per_second=5.52089G/s
UrlDecode<50>/url_decode_50pct/100000/100/manual_time        0.930 ms        0.947 ms          752 bytes_per_second=10.417G/s
UrlDecode<50>/url_decode_50pct/100/1000/manual_time          0.107 ms        0.125 ms         6320 bytes_per_second=892.783M/s
UrlDecode<50>/url_decode_50pct/1000/1000/manual_time         0.173 ms        0.191 ms         4023 bytes_per_second=5.40132G/s
UrlDecode<50>/url_decode_50pct/10000/1000/manual_time        0.898 ms        0.915 ms          778 bytes_per_second=10.4122G/s
UrlDecode<50>/url_decode_50pct/100000/1000/manual_time        8.50 ms         8.52 ms           83 bytes_per_second=10.9996G/s
UrlDecode<50>/url_decode_50pct/100/10000/manual_time         0.163 ms        0.180 ms         4244 bytes_per_second=5.73172G/s
UrlDecode<50>/url_decode_50pct/1000/10000/manual_time        0.873 ms        0.890 ms          800 bytes_per_second=10.6696G/s
UrlDecode<50>/url_decode_50pct/10000/10000/manual_time        8.21 ms         8.23 ms           85 bytes_per_second=11.3474G/s
UrlDecode<50>/url_decode_50pct/100000/10000/manual_time       86.5 ms         86.5 ms            8 bytes_per_second=10.7726G/s

@jlowe jlowe requested a review from vuule February 10, 2021 20:59

@vuule vuule left a comment

Looks great. Just a couple of very minor comments.

@harrism harrism left a comment

CMake approval

harrism commented Feb 15, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 92c4b26 into rapidsai:branch-0.19 Feb 15, 2021
@gaohao95 gaohao95 mentioned this pull request Jul 27, 2021
@jlowe jlowe deleted the url_decode_perf branch September 10, 2021 15:43