-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast table-driven parsing for upb (2+GB/s) #310
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed fast decode, which looks good.
upb/decode_fast.c
Outdated
|
||
again: | ||
if (card == CARD_r) { | ||
if (UPB_UNLIKELY((uint32_t)data == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you store data = ((elem_avail - 1) << 16) | tag
then this can be if (int64)data < 0 return fastdecode_generic
and below data -= 65536;
checking tag is than trivial byte or word compare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I am using an explicit end pointer. lmk if you think it's better to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think this is good
upb/decode_fast.c
Outdated
{ | ||
int64_t len = ptr[tagbytes]; | ||
if (UPB_UNLIKELY(len < 0)) { | ||
return fastdecode_generic(UPB_PARSE_ARGS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having full inlined varint here is necessary. You have a frame anyway as you are recursing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Punt for now since it's not needed for the experiment?
upb/decode_fast.c
Outdated
} else { | ||
arr = *arr_p; | ||
field = _upb_array_ptr(arr); | ||
elem_avail = arr->size - arr->len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there are a reason why we can't resize if elem_avail is 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We speculate that elem_avail isn't 0. I now have inline resizing inside the loop when it is 0.
------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------- BM_ArenaOneAlloc 21 ns 21 ns 32994231 BM_ArenaInitialBlockOneAlloc 6 ns 6 ns 116318005 BM_ParseDescriptorNoHeap 3028 ns 3028 ns 231138 2.34354GB/s BM_ParseDescriptor 3557 ns 3557 ns 196583 1.99498GB/s BM_ParseDescriptorProto2NoArena 33228 ns 33226 ns 21196 218.688MB/s BM_ParseDescriptorProto2WithArena 22863 ns 22861 ns 30666 317.831MB/s BM_SerializeDescriptorProto2 5444 ns 5444 ns 127368 1.30348GB/s BM_SerializeDescriptor 12509 ns 12508 ns 55816 580.914MB/s $ perf stat bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap 2020-10-08 14:07:06 Running bazel-bin/benchmark Run on (72 X 3700 MHz CPU s) CPU Caches: L1 Data 32K (x36) L1 Instruction 32K (x36) L2 Unified 1024K (x36) L3 Unified 25344K (x2) ---------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------- BM_ParseDescriptorNoHeap 3071 ns 3071 ns 227743 2.31094GB/s Performance counter stats for 'bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap': 1,050.22 msec task-clock # 0.978 CPUs utilized 4 context-switches # 0.004 K/sec 0 cpu-migrations # 0.000 K/sec 179 page-faults # 0.170 K/sec 3,875,796,334 cycles # 3.690 GHz 13,282,835,967 instructions # 3.43 insn per cycle 2,887,725,848 branches # 2749.627 M/sec 8,324,912 branch-misses # 0.29% of all branches 1.073924364 seconds time elapsed 1.042806000 seconds user 0.008021000 seconds sys Profile: 23.96% benchmark benchmark [.] upb_prm_1bt_max192b 22.44% benchmark benchmark [.] fastdecode_dispatch 18.96% benchmark benchmark [.] upb_pss_1bt 14.20% benchmark benchmark [.] upb_psv4_1bt 8.33% benchmark benchmark [.] upb_prm_1bt_max64b 6.66% benchmark benchmark [.] upb_prm_1bt_max128b 1.29% benchmark benchmark [.] upb_psm_1bt_max64b 0.77% benchmark benchmark [.] fastdecode_generic 0.55% benchmark [kernel.kallsyms] [k] smp_call_function_single 0.42% benchmark [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0.42% benchmark benchmark [.] upb_psm_1bt_max256b 0.31% benchmark benchmark [.] upb_psb1_1bt 0.21% benchmark benchmark [.] upb_plv4_5bv 0.14% benchmark benchmark [.] upb_psb1_2bt 0.12% benchmark benchmark [.] decode_longvarint64 0.08% benchmark [kernel.kallsyms] [k] vsnprintf 0.07% benchmark [kernel.kallsyms] [k] _raw_spin_lock 0.07% benchmark benchmark [.] _upb_msg_new 0.06% benchmark ld-2.31.so [.] check_match
upb/decode.c
Outdated
UPB_NOINLINE | ||
static const char *decode_msg(upb_decstate *d, const char *ptr, upb_msg *msg, | ||
const upb_msglayout *layout) { | ||
if (msg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can this not always be true? I would expect layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unknown fields. It's true that layout would be null there too.
This is not very optimized yet. There is a lot of room to optimize it further.
Also fixed a bug with fixed packed in decode_fast.c.
Without a profile, we assume that fields with smaller numbers are hotter.
1. For long tags we were putting table entries in the wrong slot. 2. For repeated strings, when the buffer flipped to no longer alias we were failing to notice and kept aliasing anyway.
These control whether fasttable decoding is on.
The real solution is to have each Kokoro build as part of a separate job that runs in parallel.
Related to your article and "jne followed by jmp" issue - LLVM does it intentionally this way: line 1521 |
When fasttable is enabled (
bazel build --//:fasttable_enabled=true
), this PR is a significant perf win, but also a significant code size growth (though the code size growth is exaggerated by the fact that the test proto intentionally uses every kind of field:When fasttable is disabled (default build or
bazel build --//:fasttable_enabled=false
), this PR is perf-neutral but does have a slight code size regression: