Fast table-driven parsing for upb (2+GB/s) #310

Merged: 118 commits merged into protocolbuffers:master from the fast-table branch on Nov 5, 2020

@haberman (Member) commented Oct 5, 2020

When fasttable is enabled (bazel build --//:fasttable_enabled=true), this PR is a significant perf win, but it also causes significant code size growth (though the growth is exaggerated by the fact that the test proto intentionally uses every kind of field):

name                                 old time/op  new time/op  delta
ArenaOneAlloc                        21.2ns ± 0%  20.6ns ± 1%   -3.04%  (p=0.000 n=10+12)
ArenaInitialBlockOneAlloc            6.03ns ± 0%  6.04ns ± 0%   +0.16%  (p=0.001 n=9+11)
LoadDescriptor_Upb                   55.4µs ± 1%  43.2µs ± 0%  -22.11%  (p=0.000 n=12+12)
LoadAdsDescriptor_Upb                3.09ms ± 0%  2.77ms ± 1%  -10.21%  (p=0.000 n=10+12)
LoadDescriptor_Proto2                 262µs ± 0%   261µs ± 1%   -0.33%  (p=0.001 n=12+12)
LoadAdsDescriptor_Proto2             13.9ms ± 1%  13.9ms ± 0%   +0.35%  (p=0.016 n=12+11)
Parse_Upb_FileDesc_WithArena         11.7µs ± 0%   3.4µs ± 0%  -71.00%  (p=0.000 n=12+12)
Parse_Upb_FileDesc_WithInitialBlock  11.4µs ± 0%   3.1µs ± 0%  -73.13%  (p=0.000 n=12+12)
SerializeDescriptor_Proto2           5.34µs ± 3%  5.36µs ± 3%     ~     (p=0.443 n=12+12)
SerializeDescriptor_Upb              12.6µs ± 0%  11.5µs ± 0%   -8.96%  (p=0.000 n=11+12)

---------------------------------------------------------------------------------------------
Benchmark                                                      Time           CPU Iterations
---------------------------------------------------------------------------------------------
BM_ArenaOneAlloc                                              21 ns         21 ns   32864133
BM_ArenaInitialBlockOneAlloc                                   6 ns          6 ns  116211422
BM_LoadDescriptor_Upb                                      44329 ns      44326 ns      15790   161.533MB/s
BM_LoadAdsDescriptor_Upb                                 2810075 ns    2809879 ns        248   180.978MB/s
BM_LoadDescriptor_Proto2                                  240616 ns     240614 ns       2904    29.758MB/s
BM_LoadAdsDescriptor_Proto2                             12928128 ns   12926435 ns         54   39.3401MB/s
BM_Parse_Upb_FileDesc_WithArena                             3448 ns       3448 ns     202873   2.02787GB/s
BM_Parse_Upb_FileDesc_WithInitialBlock                      3071 ns       3071 ns     227945   2.27677GB/s
BM_Parse_Proto2<FileDesc, NoArena>                         31836 ns      31833 ns      22039   224.928MB/s
BM_Parse_Proto2<FileDesc, WithArena>                       22218 ns      22215 ns      31604   322.313MB/s
BM_Parse_Proto2<FileDesc, WithInitialBlock>                17818 ns      17818 ns      39283   401.859MB/s
BM_Parse_Proto2<FileDescSV, WithInitialBlock>              17704 ns      17703 ns      39454   404.467MB/s
BM_Parse_Proto2<FileDescSV, WithInitialBlock, kAlias>      17822 ns      17822 ns      39578   401.771MB/s
BM_SerializeDescriptor_Proto2                               5188 ns       5187 ns     136123   1.34794GB/s
BM_SerializeDescriptor_Upb                                 12524 ns      12523 ns      55846    571.75MB/s

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +47.2Ki  [NEW] +45.1Ki    upb/decode_fast.c
   +96% +6.59Ki  +275% +6.59Ki    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/descriptor.upb.c
   +41% +3.56Ki  +122% +3.56Ki    bazel-out/k8-opt/bin/external/com_google_protobuf/src/google/protobuf/test_messages_proto2.upb.c
   +23% +1.78Ki   +57% +1.78Ki    bazel-out/k8-opt/bin/external/com_google_protobuf/src/google/protobuf/test_messages_proto3.upb.c
   +17%   +1004   +18%    +968    upb/decode.c
   +63%    +576  +162%    +576    bazel-out/k8-opt/bin/external/com_google_protobuf/conformance/conformance.upb.c
  +119%    +521  +168%    +521    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/struct.upbdefs.c
   +20%    +288   +80%    +288    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/wrappers.upb.c
   +23%    +224   +85%    +224    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/struct.upb.c
  +0.9%    +190  +0.7%    +136    upb/def.c
   +42%     +64  +133%     +64    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/any.upb.c
   +40%     +64  +133%     +64    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/duration.upb.c
   +39%     +64  +133%     +64    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/timestamp.upb.c
   +21%     +32   +80%     +32    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/field_mask.upb.c
  +0.4%     +16  +0.4%     +16    tests/conformance_upb.c
  +1.1%     +14  [ = ]       0    upb/msg.c
  -0.1%      -8  [ = ]       0    upb/table.c
  -0.3%     -14  [ = ]       0    upb/reflection.c
  -0.1%     -20  [ = ]       0    [4 Others]
  -4.7%    -521  -4.9%    -521    upb/json_encode.c
 -89.7% -3.38Ki  [ = ]       0    [Unmapped]
   +36% +58.2Ki   +47% +59.4Ki    TOTAL                                                            

When fasttable is disabled (default build or bazel build --//:fasttable_enabled=false), this PR is perf-neutral but does have a slight code size regression:

name                                 old time/op  new time/op  delta
ArenaOneAlloc                        21.3ns ± 0%  20.7ns ± 0%  -2.84%  (p=0.000 n=12+12)
ArenaInitialBlockOneAlloc            6.03ns ± 0%  6.03ns ± 0%  +0.04%  (p=0.025 n=10+12)
LoadDescriptor_Upb                   55.4µs ± 1%  52.9µs ± 0%  -4.39%  (p=0.000 n=11+11)
LoadAdsDescriptor_Upb                3.07ms ± 1%  3.07ms ± 0%    ~     (p=0.525 n=12+11)
LoadDescriptor_Proto2                 264µs ± 0%   260µs ± 0%  -1.41%  (p=0.000 n=10+10)
LoadAdsDescriptor_Proto2             14.0ms ± 0%  13.6ms ± 1%  -2.97%  (p=0.000 n=11+12)
Parse_Upb_FileDesc_WithArena         11.7µs ± 0%  11.9µs ± 0%  +0.99%  (p=0.000 n=11+11)
Parse_Upb_FileDesc_WithInitialBlock  11.4µs ± 0%  11.4µs ± 0%  +0.68%  (p=0.000 n=11+12)
SerializeDescriptor_Proto2           5.40µs ± 6%  5.45µs ± 4%    ~     (p=0.443 n=12+12)
SerializeDescriptor_Upb              12.6µs ± 0%  12.1µs ± 1%  -4.00%  (p=0.000 n=11+12)

    FILE SIZE        VM SIZE    
 --------------  -------------- 
   +10%    +702   +11%    +672    upb/decode.c
     +14%    +560   +15%    +560    decode_msg
    [NEW]    +297  [NEW]    +256    upb_utf8_offsets
    [NEW]     +91  [NEW]     +48    fastdecode_generic
     +37%     +64   +50%     +64    decode_isdonefallback
    [DEL]    -310  [DEL]    -256    decode_verifyutf8.utf8_offset
  +1.1%    +198  +0.9%    +136    upb/def.c
    [NEW]    +128  [NEW]     +88    upb_symtab_free
    +0.8%     +32  +0.8%     +32    _upb_symtab_addfile
    +1.1%     +16  +1.2%     +16    resolve_fielddef
    +8.2%      +8  [ = ]       0    upb_fielddef_defaultint64
    +5.5%      +7  [ = ]       0    upb_fielddef_defaultstr
    +2.7%      +7  [ = ]       0    upb_symtab_new
  +0.5%     +16  +0.6%     +16    tests/conformance_upb.c
    +7.6%     +16  +9.1%     +16    main
  +0.4%     +10  [ = ]       0    upb/upb.c
    +3.3%      +7  [ = ]       0    upb_arena_fuse
     +16%      +3  [ = ]       0    _start
  -0.0%      -3  [ = ]       0    upb/table.c
    +2.4%     +13  [ = ]       0    upb_strtable_insert3
    -0.4%      -3  [ = ]       0    upb_strtable_resize
   -10.6%     -13  [ = ]       0    upb_strtable_iter_key
  -0.1%      -7  [ = ]       0    upb/text_encode.c
    -3.4%      -7  [ = ]       0    upb_text_encode
  -4.4%     -22  [ = ]       0    [section .strtab]
 -21.5%    -830  [ = ]       0    [Unmapped]
  +0.0%     +64  +0.6%    +824    TOTAL

@gerben-s (Contributor) left a comment

I only reviewed fast decode, which looks good.


again:
  if (card == CARD_r) {
    if (UPB_UNLIKELY((uint32_t)data == 0)) {

Contributor

If you store data = ((elem_avail - 1) << 16) | tag,
then this check can be: if ((int64_t)data < 0) return fastdecode_generic

and below, data -= 65536;
Checking the tag is then a trivial byte or word compare.

Member Author

Now I am using an explicit end pointer. lmk if you think it's better to change.

Contributor

Yes I think this is good
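
The packing trick suggested at the top of this thread can be sketched as below. This is only an illustration of the suggestion, not what the PR ended up doing (the author switched to an explicit end pointer instead). The helper names pack_data, no_room, consume_slot and tag_matches are hypothetical and are not part of upb.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of upb. */

/* Packs (elem_avail - 1) into the bits above a 16-bit tag. */
static uint64_t pack_data(uint32_t elem_avail, uint16_t tag) {
  /* With elem_avail == 0 the subtraction wraps, which sets the top bit. */
  return (((uint64_t)elem_avail - 1) << 16) | tag;
}

/* True when no slot is left: one sign test replaces "elem_avail == 0".
 * Assumes the usual two's-complement conversion to int64_t. */
static bool no_room(uint64_t data) { return (int64_t)data < 0; }

/* Consumes one slot: subtracting 1 << 16 leaves the tag bits untouched. */
static uint64_t consume_slot(uint64_t data) { return data - 65536; }

/* The expected tag stays in the low 16 bits, so matching is a word compare. */
static bool tag_matches(uint64_t data, uint16_t wire_tag) {
  return (uint16_t)data == wire_tag;
}
```

With two free slots, two calls to consume_slot make no_room report true while the tag bits stay intact, which is what lets the capacity check and the tag check share one register.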

{
  int64_t len = ptr[tagbytes];
  if (UPB_UNLIKELY(len < 0)) {
    return fastdecode_generic(UPB_PARSE_ARGS);

Contributor

I think having a fully inlined varint decode here is necessary. You have a frame anyway, since you are recursing.

Member Author

Punt for now since it's not needed for the experiment?
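
For context, the snippet under review appears to handle only a one-byte length and to fall back to fastdecode_generic when the continuation bit is set. A minimal sketch of what decoding a multi-byte length inline could look like follows. The function decode_length is a hypothetical helper written for this illustration, not upb's API, and it is simplified: a strict decoder would also reject excess bits in the fifth byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not upb's API. Decodes a length prefix stored as a
 * base-128 varint of at most 5 bytes. Returns the position after the varint,
 * or NULL if the input is truncated or the varint is too long. */
static const char *decode_length(const char *ptr, const char *end,
                                 uint32_t *len) {
  uint32_t result = 0;
  for (int shift = 0; shift < 35 && ptr < end; shift += 7) {
    uint8_t byte = (uint8_t)*ptr++;
    result |= (uint32_t)(byte & 0x7f) << shift; /* low 7 bits are payload */
    if ((byte & 0x80) == 0) { /* no continuation bit: this was the last byte */
      *len = result;
      return ptr;
    }
  }
  return NULL;
}
```

For example, the two bytes 0xAC 0x02 decode to a length of 300.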

} else {
  arr = *arr_p;
  field = _upb_array_ptr(arr);
  elem_avail = arr->size - arr->len;

Contributor

Is there a reason why we can't resize if elem_avail is 0?

Member Author

We speculate that elem_avail isn't 0. I now have inline resizing inside the loop when it is 0.
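
The idea of speculating that capacity is available and resizing inline only when it runs out can be sketched as follows. This is a simplified illustration under stated assumptions: the int_array type and append_all function are hypothetical, and realloc stands in for whatever allocator the real code uses (upb allocates from an arena).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable array; field names mirror the arr->size and
 * arr->len fields seen in the snippet above. */
typedef struct {
  uint32_t *data;
  size_t len;  /* elements in use */
  size_t size; /* elements allocated */
} int_array;

/* Appends n elements. The common case is a single capacity compare per
 * element; the resize branch sits inside the loop and is rarely taken. */
static bool append_all(int_array *arr, const uint32_t *src, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (arr->len == arr->size) { /* elem_avail == 0: grow, then continue */
      size_t new_size = arr->size ? arr->size * 2 : 4;
      uint32_t *p = realloc(arr->data, new_size * sizeof *p);
      if (!p) return false; /* out of memory: caller still owns arr->data */
      arr->data = p;
      arr->size = new_size;
    }
    arr->data[arr->len++] = src[i];
  }
  return true;
}
```

Doubling the capacity keeps the number of resizes logarithmic in the final length, so the slow branch stays cold.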

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_ArenaOneAlloc                          21 ns         21 ns   32994231
BM_ArenaInitialBlockOneAlloc               6 ns          6 ns  116318005
BM_ParseDescriptorNoHeap                3028 ns       3028 ns     231138   2.34354GB/s
BM_ParseDescriptor                      3557 ns       3557 ns     196583   1.99498GB/s
BM_ParseDescriptorProto2NoArena        33228 ns      33226 ns      21196   218.688MB/s
BM_ParseDescriptorProto2WithArena      22863 ns      22861 ns      30666   317.831MB/s
BM_SerializeDescriptorProto2            5444 ns       5444 ns     127368   1.30348GB/s
BM_SerializeDescriptor                 12509 ns      12508 ns      55816   580.914MB/s

$ perf stat bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap
2020-10-08 14:07:06
Running bazel-bin/benchmark
Run on (72 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x36)
  L1 Instruction 32K (x36)
  L2 Unified 1024K (x36)
  L3 Unified 25344K (x2)
----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
BM_ParseDescriptorNoHeap       3071 ns       3071 ns     227743   2.31094GB/s

 Performance counter stats for 'bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap':

          1,050.22 msec task-clock                #    0.978 CPUs utilized
                 4      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               179      page-faults               #    0.170 K/sec
     3,875,796,334      cycles                    #    3.690 GHz
    13,282,835,967      instructions              #    3.43  insn per cycle
     2,887,725,848      branches                  # 2749.627 M/sec
         8,324,912      branch-misses             #    0.29% of all branches

       1.073924364 seconds time elapsed

       1.042806000 seconds user
       0.008021000 seconds sys

Profile:
  23.96%  benchmark  benchmark          [.] upb_prm_1bt_max192b
  22.44%  benchmark  benchmark          [.] fastdecode_dispatch
  18.96%  benchmark  benchmark          [.] upb_pss_1bt
  14.20%  benchmark  benchmark          [.] upb_psv4_1bt
   8.33%  benchmark  benchmark          [.] upb_prm_1bt_max64b
   6.66%  benchmark  benchmark          [.] upb_prm_1bt_max128b
   1.29%  benchmark  benchmark          [.] upb_psm_1bt_max64b
   0.77%  benchmark  benchmark          [.] fastdecode_generic
   0.55%  benchmark  [kernel.kallsyms]  [k] smp_call_function_single
   0.42%  benchmark  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   0.42%  benchmark  benchmark          [.] upb_psm_1bt_max256b
   0.31%  benchmark  benchmark          [.] upb_psb1_1bt
   0.21%  benchmark  benchmark          [.] upb_plv4_5bv
   0.14%  benchmark  benchmark          [.] upb_psb1_2bt
   0.12%  benchmark  benchmark          [.] decode_longvarint64
   0.08%  benchmark  [kernel.kallsyms]  [k] vsnprintf
   0.07%  benchmark  [kernel.kallsyms]  [k] _raw_spin_lock
   0.07%  benchmark  benchmark          [.] _upb_msg_new
   0.06%  benchmark  ld-2.31.so         [.] check_match
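
The profile above is dominated by specialized parse functions reached through fastdecode_dispatch, with fastdecode_generic as a rarely hit fallback. A minimal sketch of that general shape, under the assumption of one-byte tags and a 32-entry table, is shown below. All names here (stats, entry, dispatch, fast_varint, generic) are hypothetical and the real upb code is considerably more involved.

```c
#include <stdint.h>

/* Counters so the sketch has an observable effect. */
typedef struct {
  int fast_hits;
  int generic_hits;
} stats;

typedef void handler_fn(stats *s, uint8_t tag, uint8_t expected);

typedef struct {
  handler_fn *fn;       /* specialized parser for this slot */
  uint8_t expected_tag; /* tag the specialization was built for */
} entry;

/* Slow path: handles anything the table does not cover. */
static void generic(stats *s, uint8_t tag, uint8_t expected) {
  (void)tag;
  (void)expected;
  s->generic_hits++;
}

/* Fast path: verifies the tag, otherwise defers to the generic parser. */
static void fast_varint(stats *s, uint8_t tag, uint8_t expected) {
  if (tag != expected) {
    generic(s, tag, expected);
    return;
  }
  s->fast_hits++;
}

/* A one-byte tag is (field_number << 3) | wire_type, so tag >> 3 is a
 * value in 0..31 that indexes the table directly. */
static void dispatch(const entry table[32], stats *s, uint8_t tag) {
  const entry *e = &table[tag >> 3];
  e->fn(s, tag, e->expected_tag);
}
```

Because each handler re-checks the full tag, a slot can be shared by tags that differ only in wire type: mismatches simply take the generic path.
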
upb/decode.c Outdated
UPB_NOINLINE
static const char *decode_msg(upb_decstate *d, const char *ptr, upb_msg *msg,
                              const upb_msglayout *layout) {
  if (msg) {

Contributor

how can this not always be true? I would expect layout

Member Author

Unknown fields. It's true that layout would be null there too.

This is not very optimized yet. There is a lot of room to optimize it further.

Also fixed a bug with fixed packed in decode_fast.c.

Without a profile, we assume that fields with smaller numbers are hotter.

1. For long tags we were putting table entries in the wrong slot.
2. For repeated strings, when the buffer flipped to no longer alias we
   were failing to notice and kept aliasing anyway.
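
The notes above mention placing table entries in slots and assuming that smaller field numbers are hotter. One way such a policy could look is sketched below. This is an assumption-laden illustration, not the PR's code: the table size, the slot_for hash and the names slot, build_table and has_fast_path are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 32 /* hypothetical table size; must be a power of two */

typedef struct {
  uint32_t field_number; /* 0 means empty; 0 is not a valid field number */
} slot;

/* Maps a field number to a slot by masking its low bits. */
static size_t slot_for(uint32_t field_number) {
  return field_number & (SLOTS - 1);
}

/* Fills the table. When two fields want the same slot, the smaller field
 * number wins, on the assumption that it is the hotter field. */
static void build_table(slot table[SLOTS], const uint32_t *fields, size_t n) {
  for (size_t i = 0; i < SLOTS; i++) table[i].field_number = 0;
  for (size_t i = 0; i < n; i++) {
    slot *s = &table[slot_for(fields[i])];
    if (s->field_number == 0 || fields[i] < s->field_number) {
      s->field_number = fields[i];
    }
  }
}

/* True if the field owns its slot; otherwise it needs the generic path. */
static bool has_fast_path(const slot table[SLOTS], uint32_t field_number) {
  return field_number != 0 &&
         table[slot_for(field_number)].field_number == field_number;
}
```

With fields 33, 1, 2 and 40, fields 1 and 33 collide in slot 1, so field 1 keeps the fast path and field 33 falls back to the generic parser.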
@haberman changed the title from "Table-driven parsing for upb" to "Fast table-driven parsing for upb (2+GB/s)" on Oct 29, 2020
@haberman merged commit c9d2e58 into protocolbuffers:master on Nov 5, 2020
@haberman deleted the fast-table branch on December 6, 2020 20:26
@davidbolvansky

Related to your article and the "jne followed by jmp" issue: LLVM does it this way intentionally. See line 1521 of
https://llvm.org/doxygen/BranchFolding_8cpp_source.html
