Fast table-driven parsing for upb (2+GB/s) #310

haberman · 2020-10-05T18:46:21Z

When fasttable is enabled (bazel build --//:fasttable_enabled=true), this PR is a significant perf win, but it also grows code size significantly (though the growth is exaggerated by the fact that the test proto intentionally uses every kind of field):

name                                 old time/op  new time/op  delta
ArenaOneAlloc                        21.2ns ± 0%  20.6ns ± 1%   -3.04%  (p=0.000 n=10+12)
ArenaInitialBlockOneAlloc            6.03ns ± 0%  6.04ns ± 0%   +0.16%  (p=0.001 n=9+11)
LoadDescriptor_Upb                   55.4µs ± 1%  43.2µs ± 0%  -22.11%  (p=0.000 n=12+12)
LoadAdsDescriptor_Upb                3.09ms ± 0%  2.77ms ± 1%  -10.21%  (p=0.000 n=10+12)
LoadDescriptor_Proto2                 262µs ± 0%   261µs ± 1%   -0.33%  (p=0.001 n=12+12)
LoadAdsDescriptor_Proto2             13.9ms ± 1%  13.9ms ± 0%   +0.35%  (p=0.016 n=12+11)
Parse_Upb_FileDesc_WithArena         11.7µs ± 0%   3.4µs ± 0%  -71.00%  (p=0.000 n=12+12)
Parse_Upb_FileDesc_WithInitialBlock  11.4µs ± 0%   3.1µs ± 0%  -73.13%  (p=0.000 n=12+12)
SerializeDescriptor_Proto2           5.34µs ± 3%  5.36µs ± 3%     ~     (p=0.443 n=12+12)
SerializeDescriptor_Upb              12.6µs ± 0%  11.5µs ± 0%   -8.96%  (p=0.000 n=11+12)

---------------------------------------------------------------------------------------------
Benchmark                                                      Time           CPU Iterations
---------------------------------------------------------------------------------------------
BM_ArenaOneAlloc                                              21 ns         21 ns   32864133
BM_ArenaInitialBlockOneAlloc                                   6 ns          6 ns  116211422
BM_LoadDescriptor_Upb                                      44329 ns      44326 ns      15790   161.533MB/s
BM_LoadAdsDescriptor_Upb                                 2810075 ns    2809879 ns        248   180.978MB/s
BM_LoadDescriptor_Proto2                                  240616 ns     240614 ns       2904    29.758MB/s
BM_LoadAdsDescriptor_Proto2                             12928128 ns   12926435 ns         54   39.3401MB/s
BM_Parse_Upb_FileDesc_WithArena                             3448 ns       3448 ns     202873   2.02787GB/s
BM_Parse_Upb_FileDesc_WithInitialBlock                      3071 ns       3071 ns     227945   2.27677GB/s
BM_Parse_Proto2<FileDesc, NoArena>                         31836 ns      31833 ns      22039   224.928MB/s
BM_Parse_Proto2<FileDesc, WithArena>                       22218 ns      22215 ns      31604   322.313MB/s
BM_Parse_Proto2<FileDesc, WithInitialBlock>                17818 ns      17818 ns      39283   401.859MB/s
BM_Parse_Proto2<FileDescSV, WithInitialBlock>              17704 ns      17703 ns      39454   404.467MB/s
BM_Parse_Proto2<FileDescSV, WithInitialBlock, kAlias>      17822 ns      17822 ns      39578   401.771MB/s
BM_SerializeDescriptor_Proto2                               5188 ns       5187 ns     136123   1.34794GB/s
BM_SerializeDescriptor_Upb                                 12524 ns      12523 ns      55846    571.75MB/s

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +47.2Ki  [NEW] +45.1Ki    upb/decode_fast.c
   +96% +6.59Ki  +275% +6.59Ki    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/descriptor.upb.c
   +41% +3.56Ki  +122% +3.56Ki    bazel-out/k8-opt/bin/external/com_google_protobuf/src/google/protobuf/test_messages_proto2.upb.c
   +23% +1.78Ki   +57% +1.78Ki    bazel-out/k8-opt/bin/external/com_google_protobuf/src/google/protobuf/test_messages_proto3.upb.c
   +17%   +1004   +18%    +968    upb/decode.c
   +63%    +576  +162%    +576    bazel-out/k8-opt/bin/external/com_google_protobuf/conformance/conformance.upb.c
  +119%    +521  +168%    +521    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/struct.upbdefs.c
   +20%    +288   +80%    +288    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/wrappers.upb.c
   +23%    +224   +85%    +224    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/struct.upb.c
  +0.9%    +190  +0.7%    +136    upb/def.c
   +42%     +64  +133%     +64    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/any.upb.c
   +40%     +64  +133%     +64    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/duration.upb.c
   +39%     +64  +133%     +64    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/timestamp.upb.c
   +21%     +32   +80%     +32    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/field_mask.upb.c
  +0.4%     +16  +0.4%     +16    tests/conformance_upb.c
  +1.1%     +14  [ = ]       0    upb/msg.c
  -0.1%      -8  [ = ]       0    upb/table.c
  -0.3%     -14  [ = ]       0    upb/reflection.c
  -0.1%     -20  [ = ]       0    [4 Others]
  -4.7%    -521  -4.9%    -521    upb/json_encode.c
 -89.7% -3.38Ki  [ = ]       0    [Unmapped]
   +36% +58.2Ki   +47% +59.4Ki    TOTAL

When fasttable is disabled (default build or bazel build --//:fasttable_enabled=false), this PR is perf-neutral but does have a slight code size regression:

name                                 old time/op  new time/op  delta
ArenaOneAlloc                        21.3ns ± 0%  20.7ns ± 0%  -2.84%  (p=0.000 n=12+12)
ArenaInitialBlockOneAlloc            6.03ns ± 0%  6.03ns ± 0%  +0.04%  (p=0.025 n=10+12)
LoadDescriptor_Upb                   55.4µs ± 1%  52.9µs ± 0%  -4.39%  (p=0.000 n=11+11)
LoadAdsDescriptor_Upb                3.07ms ± 1%  3.07ms ± 0%    ~     (p=0.525 n=12+11)
LoadDescriptor_Proto2                 264µs ± 0%   260µs ± 0%  -1.41%  (p=0.000 n=10+10)
LoadAdsDescriptor_Proto2             14.0ms ± 0%  13.6ms ± 1%  -2.97%  (p=0.000 n=11+12)
Parse_Upb_FileDesc_WithArena         11.7µs ± 0%  11.9µs ± 0%  +0.99%  (p=0.000 n=11+11)
Parse_Upb_FileDesc_WithInitialBlock  11.4µs ± 0%  11.4µs ± 0%  +0.68%  (p=0.000 n=11+12)
SerializeDescriptor_Proto2           5.40µs ± 6%  5.45µs ± 4%    ~     (p=0.443 n=12+12)
SerializeDescriptor_Upb              12.6µs ± 0%  12.1µs ± 1%  -4.00%  (p=0.000 n=11+12)

    FILE SIZE        VM SIZE    
 --------------  -------------- 
   +10%    +702   +11%    +672    upb/decode.c
     +14%    +560   +15%    +560    decode_msg
    [NEW]    +297  [NEW]    +256    upb_utf8_offsets
    [NEW]     +91  [NEW]     +48    fastdecode_generic
     +37%     +64   +50%     +64    decode_isdonefallback
    [DEL]    -310  [DEL]    -256    decode_verifyutf8.utf8_offset
  +1.1%    +198  +0.9%    +136    upb/def.c
    [NEW]    +128  [NEW]     +88    upb_symtab_free
    +0.8%     +32  +0.8%     +32    _upb_symtab_addfile
    +1.1%     +16  +1.2%     +16    resolve_fielddef
    +8.2%      +8  [ = ]       0    upb_fielddef_defaultint64
    +5.5%      +7  [ = ]       0    upb_fielddef_defaultstr
    +2.7%      +7  [ = ]       0    upb_symtab_new
  +0.5%     +16  +0.6%     +16    tests/conformance_upb.c
    +7.6%     +16  +9.1%     +16    main
  +0.4%     +10  [ = ]       0    upb/upb.c
    +3.3%      +7  [ = ]       0    upb_arena_fuse
     +16%      +3  [ = ]       0    _start
  -0.0%      -3  [ = ]       0    upb/table.c
    +2.4%     +13  [ = ]       0    upb_strtable_insert3
    -0.4%      -3  [ = ]       0    upb_strtable_resize
   -10.6%     -13  [ = ]       0    upb_strtable_iter_key
  -0.1%      -7  [ = ]       0    upb/text_encode.c
    -3.4%      -7  [ = ]       0    upb_text_encode
  -4.4%     -22  [ = ]       0    [section .strtab]
 -21.5%    -830  [ = ]       0    [Unmapped]
  +0.0%     +64  +0.6%    +824    TOTAL

gerben-s

I only reviewed fast decode, which looks good.

gerben-s · 2020-10-07T20:24:26Z

upb/decode_fast.c

+
+again:
+  if (card == CARD_r) {
+    if (UPB_UNLIKELY((uint32_t)data == 0)) {


If you store data = ((elem_avail - 1) << 16) | tag
then this can be: if ((int64)data < 0) return fastdecode_generic

and below: data -= 65536;
Checking the tag is then a trivial byte or word compare.

haberman · Oct 11, 2020

Now I am using an explicit end pointer. Let me know if you think it's better to change.

gerben-s · Oct 12, 2020

Yes, I think this is good.
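For readers following the suggestion above, here is a minimal sketch of the packed-counter idea. It is not the code in upb/decode_fast.c, and the helper names (pack_state, state_exhausted, state_consume, state_tag_matches) are hypothetical. It assumes a 16-bit tag and uses unsigned arithmetic for the packing so that no negative value is ever shifted. The signed test relies on the usual two's-complement conversion, which ISO C leaves implementation-defined but mainstream compilers implement as a wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack (remaining capacity - 1) above a 16-bit tag.
 * With elem_avail == 0 the subtraction wraps, which sets the top bit. */
static uint64_t pack_state(uint64_t elem_avail, uint16_t tag) {
  assert(elem_avail < ((uint64_t)1 << 47)); /* keep the sign bit free */
  return ((elem_avail - 1) << 16) | tag;
}

/* No slots remain once the count field has underflowed into the sign bit. */
static int state_exhausted(uint64_t data) {
  return (int64_t)data < 0;
}

/* Consume one slot: subtract 1 from the count stored in bits 16 and up. */
static uint64_t state_consume(uint64_t data) {
  return data - 65536;
}

/* The tag sits in the low 16 bits, so matching it is a plain word compare. */
static int state_tag_matches(uint64_t data, uint16_t wire_tag) {
  return (uint16_t)data == wire_tag;
}
```

Starting from pack_state(2, tag), two calls to state_consume leave the state exhausted while the tag bits stay untouched, so a single signed comparison covers the "array is full" case.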

gerben-s · 2020-10-07T21:00:26Z

upb/decode_fast.c

+  {
+    int64_t len = ptr[tagbytes];
+    if (UPB_UNLIKELY(len < 0)) {
+      return fastdecode_generic(UPB_PARSE_ARGS);


I think having full inlined varint here is necessary. You have a frame anyway as you are recursing

haberman · Oct 11, 2020

Punt for now since it's not needed for the experiment?
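As an illustration of what a full varint length decode involves, compared with the single-byte check in the diff above, here is a minimal sketch. The function names decode_varint64 and decode_len are hypothetical and this is not upb's decoder; it assumes the standard protobuf base-128 varint encoding (7 payload bits per byte, high bit set on every byte except the last, at most 10 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes a base-128 varint of at most 10 bytes from [ptr, end).
 * Returns the position just past the varint, or NULL if the input is
 * truncated or the varint runs longer than 10 bytes. */
static const char *decode_varint64(const char *ptr, const char *end,
                                   uint64_t *out) {
  uint64_t val = 0;
  int shift;
  for (shift = 0; shift < 70 && ptr < end; shift += 7) {
    uint8_t byte = (uint8_t)*ptr++;
    val |= (uint64_t)(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = val;
      return ptr;
    }
  }
  return NULL;
}

/* Fast path for the common one-byte length; otherwise fall back to the
 * general loop instead of bailing out to a generic decoder. */
static const char *decode_len(const char *ptr, const char *end,
                              uint64_t *len) {
  if (ptr < end && ((uint8_t)*ptr & 0x80) == 0) {
    *len = (uint8_t)*ptr;
    return ptr + 1;
  }
  return decode_varint64(ptr, end, len);
}
```

For example, the two bytes 0x96 0x01 decode to 150, while a lone 0x80 byte is rejected as truncated.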

gerben-s · 2020-10-07T21:13:55Z

upb/decode_fast.c

+      } else {
+        arr = *arr_p;
+        field = _upb_array_ptr(arr);
+        elem_avail = arr->size - arr->len;


Is there a reason why we can't resize if elem_avail is 0?

haberman · Oct 11, 2020

We speculate that elem_avail isn't 0. I now have inline resizing inside the loop when it is 0.
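A rough sketch of the pattern described in this reply: the loop speculates that capacity is available and grows the array inline only when it runs out. The int_array type and array_append_all function are hypothetical stand-ins, and realloc stands in for upb's arena allocation, so treat this as an illustration under those assumptions and not as the code in decode_fast.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable array: len elements in use out of size allocated. */
typedef struct {
  uint32_t *data;
  size_t len;
  size_t size;
} int_array;

/* Appends n values. Capacity is expected to be available most of the time;
 * the resize branch runs inline only when the array is full.
 * Returns 1 on success, 0 if allocation fails. */
static int array_append_all(int_array *arr, const uint32_t *src, size_t n) {
  size_t len = arr->len; /* kept in a local, written back once at the end */
  size_t i;
  for (i = 0; i < n; i++) {
    if (len == arr->size) {
      size_t new_size = arr->size ? arr->size * 2 : 4;
      uint32_t *p = (uint32_t *)realloc(arr->data, new_size * sizeof(*p));
      if (!p) {
        arr->len = len;
        return 0;
      }
      arr->data = p;
      arr->size = new_size;
    }
    arr->data[len++] = src[i];
  }
  arr->len = len;
  return 1;
}
```

Keeping the length in a local and storing it once after the loop mirrors the commit in this PR titled "A small optimization: don't increment array length every iteration."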

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_ArenaOneAlloc                          21 ns         21 ns   32994231
BM_ArenaInitialBlockOneAlloc               6 ns          6 ns  116318005
BM_ParseDescriptorNoHeap                3028 ns       3028 ns     231138   2.34354GB/s
BM_ParseDescriptor                      3557 ns       3557 ns     196583   1.99498GB/s
BM_ParseDescriptorProto2NoArena        33228 ns      33226 ns      21196   218.688MB/s
BM_ParseDescriptorProto2WithArena      22863 ns      22861 ns      30666   317.831MB/s
BM_SerializeDescriptorProto2            5444 ns       5444 ns     127368   1.30348GB/s
BM_SerializeDescriptor                 12509 ns      12508 ns      55816   580.914MB/s

$ perf stat bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap
2020-10-08 14:07:06
Running bazel-bin/benchmark
Run on (72 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x36)
  L1 Instruction 32K (x36)
  L2 Unified 1024K (x36)
  L3 Unified 25344K (x2)
----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
BM_ParseDescriptorNoHeap       3071 ns       3071 ns     227743   2.31094GB/s

 Performance counter stats for 'bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap':

          1,050.22 msec task-clock                #    0.978 CPUs utilized
                 4      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               179      page-faults               #    0.170 K/sec
     3,875,796,334      cycles                    #    3.690 GHz
    13,282,835,967      instructions              #    3.43  insn per cycle
     2,887,725,848      branches                  # 2749.627 M/sec
         8,324,912      branch-misses             #    0.29% of all branches

       1.073924364 seconds time elapsed

       1.042806000 seconds user
       0.008021000 seconds sys

Profile:

  23.96%  benchmark  benchmark          [.] upb_prm_1bt_max192b
  22.44%  benchmark  benchmark          [.] fastdecode_dispatch
  18.96%  benchmark  benchmark          [.] upb_pss_1bt
  14.20%  benchmark  benchmark          [.] upb_psv4_1bt
   8.33%  benchmark  benchmark          [.] upb_prm_1bt_max64b
   6.66%  benchmark  benchmark          [.] upb_prm_1bt_max128b
   1.29%  benchmark  benchmark          [.] upb_psm_1bt_max64b
   0.77%  benchmark  benchmark          [.] fastdecode_generic
   0.55%  benchmark  [kernel.kallsyms]  [k] smp_call_function_single
   0.42%  benchmark  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   0.42%  benchmark  benchmark          [.] upb_psm_1bt_max256b
   0.31%  benchmark  benchmark          [.] upb_psb1_1bt
   0.21%  benchmark  benchmark          [.] upb_plv4_5bv
   0.14%  benchmark  benchmark          [.] upb_psb1_2bt
   0.12%  benchmark  benchmark          [.] decode_longvarint64
   0.08%  benchmark  [kernel.kallsyms]  [k] vsnprintf
   0.07%  benchmark  [kernel.kallsyms]  [k] _raw_spin_lock
   0.07%  benchmark  benchmark          [.] _upb_msg_new
   0.06%  benchmark  ld-2.31.so         [.] check_match

gerben-s · 2020-10-09T18:03:13Z

upb/decode.c

+UPB_NOINLINE
+static const char *decode_msg(upb_decstate *d, const char *ptr, upb_msg *msg,
+                              const upb_msglayout *layout) {
+  if (msg) {


How can this not always be true? I would expect layout

haberman · Oct 11, 2020

Unknown fields. It's true that layout would be null there too.

davidbolvansky · 2021-04-25T17:39:17Z

Related to your article and "jne followed by jmp" issue - LLVM does it intentionally this way:
https://llvm.org/doxygen/BranchFolding_8cpp_source.html

line 1521

haberman added 8 commits September 22, 2020 14:11

WIP.

763a3f6

Avoid passing too many params to fallback.

34b98bc

WIP.

26abaa2

WIP.

383ae52

Give all field parsers a generic table entry.

438ecae

We have a properly structured algorithm, but perf regresses by 20%.

3937874

Cleanup for showing.

fac992d

Revert test changes.

d43ccfa

gerben-s approved these changes Oct 5, 2020


haberman added 4 commits October 6, 2020 22:11

Donate/steal from arena to accelerate decoding.

7ec2c52

Merge branch 'decode-arena' into fast-table

e219a2d

Handle non-repeated submessages.

f173642

Table-driven supports repeated sub-messages.

88b1ec7

gerben-s reviewed Oct 7, 2020


haberman added 10 commits October 7, 2020 14:39

Handle 2-byte submessage lengths.

405e793

Added benchmarks for proto2.

e46e94e

A bunch more optimization.

8dd7b5a

Optimized memset() with cutoff and fixed group & unknown message bugs.

9e5c5ce

A small optimization: don't increment array length every iteration.

388b6f6

Fixed a bug with tag number 15.

52a0ed3

Hoisted updates to limits and depth out of the loop.

e39ec95

Handle long varints, now 2GB/s!

4c65b25

Replicated dispatch and implemented array resizing logic. Up to 2.67GB/s.

0dcc564

gerben-s reviewed Oct 9, 2020


haberman added 2 commits October 9, 2020 17:00

A few updates to the benchmark and minor implementation changes.

537b6f4

Fixed C89 compat issues.

ff957b9

haberman added 6 commits October 24, 2020 23:42

Fastdecode support for packed fields.

86d9908

This is not very optimized yet. There is a lot of room to optimize it further.

Allow larger tags into the table if they are unique mod 31.

021db6f

Also fixed a bug with fixed packed in decode_fast.c.

Allocate hasbits and table slots in "hotness" order.

3eba479

Without a profile, we assume that fields with smaller numbers are hotter.

Fixed a few bugs with the fast decoder.

bd9f8f5

1. For long tags we were putting table entries in the wrong slot. 2. For repeated strings, when the buffer flipped to no longer alias we were failing to notice and kept aliasing anyway.

Added comment to decode_fast.h.

46eb824

Merge branch 'fastest-table' into fast-table

8b38e8f

gerben-s approved these changes Oct 26, 2020


haberman added 9 commits October 26, 2020 20:52

A few minor fixes and more assertions.

55f3569

A few more fixes, and test fastdecode under Kokoro.

b928696

Added -std=gnu99 for fastdecode and ran Buildifier.

efd576b

Specify C99 explicitly until/unless we stop using bool.

2c8bb6d

Plumbed copts (including the crucial -std=c99) to upb_proto_library() aspect.

a274ad7

Merge branch 'master' into fastest-table

1cd0cb1

Fixed the build after the merge.

e86541a

Merge branch 'fastest-table' into fast-table

dc64613

Added #defines UPB_ENABLE_FASTTABLE and UPB_TRY_ENABLE_FASTTABLE.

e8f9eac

These control whether fasttable decoding is on.

gerben-s approved these changes Oct 29, 2020


haberman changed the title ~~Table-driven parsing for upb~~ Fast table-driven parsing for upb (2+GB/s) Oct 29, 2020

haberman added 5 commits October 29, 2020 09:55

Added UTF-8 validation for proto3 string fields.

154f2c2

Merge branch 'master' into fast-table

a799361

Updated Kokoro build script.

9d87055

Some formatting fixes.

1eb7bd3

Removed excess/redundant tests from Kokoro script.

baab25b

gerben-s approved these changes Nov 5, 2020


haberman added 3 commits November 5, 2020 12:36

Tried to slim down the tests a bit more.

73fcfe9

Exclude Clang tests from MacOS to avoid Kokoro timeouts.

a83d55e

The real solution is to have each Kokoro build as part of a separate job that runs in parallel.

Fixes for google3 build, and exclude even more tests from macOS to avoid timeout.

a01f3e2

haberman merged commit c9d2e58 into protocolbuffers:master Nov 5, 2020

haberman deleted the fast-table branch December 6, 2020 20:26
