Optimize CBOR representation of data in Cadence storage layer #738
I'd like to provide a PR for this unless there are any concerns or objections.
This looks great, thank you for writing this up Faye, and thanks for the great call discussing it. We agreed to:
Here are the next steps:
I updated this issue's text and benchmark comparisons to include more recent info.
For reference, dapperlabs/flow-go#5422 will implement the state migration function.
@turbolent @fxamacker, is this epic good to close?
@studio1vc yes, all issues are done now 🎉
Issue To Be Solved
The Cadence storage layer's use of CBOR can be optimized to improve speed, memory use, and stored data size.
Suggestion
Performance gains are possible by tweaking the CBOR representation of Cadence `Value` types, for example by replacing CBOR maps with CBOR arrays when feasible. In addition to performance gains, replacing a CBOR map with a CBOR array preserves order (e.g. `DictionaryValue.Entries`).

UPDATE: Additional performance gains (especially for decoding) are possible by removing backwards-compatible decoding (thanks @turbolent) and not saving key strings twice (thanks @SupunS).
No changes to fxamacker/cbor are required to support these optimizations.
The current CBOR representation in Cadence is very nice — so this optimization is kind of like taking a nicely normalized RDBMS schema and denormalizing some of it to gain speed.
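To make the trade concrete, here is a minimal standalone sketch (not the actual Cadence encoder) using fxamacker/cbor's `toarray` struct tag to encode the same fields as a CBOR array instead of a CBOR map. The `dictAsMap`/`dictAsArray` types and their fields are hypothetical stand-ins for a `Value` encoding:

```go
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

// Map-based shape: every field name is encoded as a CBOR text-string key,
// and CBOR map key order is not significant.
type dictAsMap struct {
	Keys    []string `cbor:"keys"`
	Entries []string `cbor:"entries"`
}

// Array-based shape: the special `toarray` tag makes fxamacker/cbor encode
// the struct as a fixed-order CBOR array, so the key strings disappear from
// the output and element order is inherently preserved.
type dictAsArray struct {
	_       struct{} `cbor:",toarray"`
	Keys    []string
	Entries []string
}

func main() {
	m, _ := cbor.Marshal(dictAsMap{Keys: []string{"a", "b"}, Entries: []string{"1", "2"}})
	a, _ := cbor.Marshal(dictAsArray{Keys: []string{"a", "b"}, Entries: []string{"1", "2"}})
	fmt.Printf("map:   %d bytes (%x)\n", len(m), m)
	fmt.Printf("array: %d bytes (%x)\n", len(a), a)
}
```

The array form drops every field-name string from the output and fixes element order by position, which is where both the size and speed savings come from.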
Performance Comparisons
A preliminary comparison (from PR #752) with non-production data showed potential performance improvements. Although faster execution speed was the primary objective, memory use and stored data size also improved.
Comparisons using production data (in PR #761) show the performance gains are cumulative. On a very large value (776,155 bytes), optimization of composite + dictionary types showed gains across encoding and decoding. Results were compared with `benchstat`, which reports time deltas, so "50% faster" here means a 2x speedup (the operation takes half the time).

Additional performance gains are possible by optimizing the remaining `Value` types. Smaller data will probably begin to catch up in percent improved as more of the remaining types are optimized, while larger data with big composites and dictionaries will show more modest gains from the remaining types. Cumulative performance gains from PR #752, PR #761, PR #778, and PR #788 confirmed expectations about small data catching up in performance gains.
For a 167-byte LinkValue (extracted from a mainnet snapshot):
For a 776,155-byte CompositeValue (extracted from a mainnet snapshot):
Additional performance gains are possible by:
Preliminary Comparisons (non-production dictionary data)
Stored CBOR Data Size Comparisons (for dictionaries, non-production data)
Size reduction is data dependent. Comparisons using production data are recommended.
Encoding Speed & Memory Comparisons (for dictionaries, non-production data)
Decoding Speed & Memory Comparisons (for dictionaries, non-production data)
Benchmarks used linux_amd64 with Go 1.15.10. See Caveats.
Performance of encoding and decoding is data dependent. Comparisons using production data are recommended.
Comparisons using production data (cumulative gains from PR #752, PR #761, PR #778, and PR #788):
LinkValue 🚀
Encoding Comparisons
Decoding Comparisons
Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.
CompositeValue without dictionaries 🚀
Encoding Comparisons
Decoding Comparisons
Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.
Large CompositeValue with dictionaries, etc. 🚀
Encoding Comparisons
Decoding Comparisons
Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.
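For anyone reproducing these comparisons, a typical `benchstat` workflow looks like the sketch below; the package path, branch name, and file names are illustrative rather than taken from the PRs:

```sh
# Collect >=10 samples per benchmark so benchstat can compute statistics.
git checkout master
go test -bench=. -benchmem -count=10 ./runtime/interpreter/ > old.txt
git checkout optimize-cbor   # hypothetical branch with the optimizations
go test -bench=. -benchmem -count=10 ./runtime/interpreter/ > new.txt
benchstat old.txt new.txt    # reports deltas in time/op, B/op, allocs/op
```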
Proposed Changes
Encoding and decoding CBOR arrays is faster than encoding and decoding CBOR maps, so the proposed change replaces CBOR maps with CBOR arrays wherever feasible.
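As a rough way to verify that claim on your own data, here is a minimal micro-benchmark sketch (hypothetical, not from the PRs) round-tripping equivalent payloads through a CBOR map and a CBOR array with fxamacker/cbor/v2; run it with `go test -bench=CBOR -benchmem`:

```go
package cborshape

import (
	"testing"

	"github.com/fxamacker/cbor/v2"
)

// Equivalent payloads: the map repeats key strings in the output; the array
// carries the same strings positionally.
var mapPayload = map[string]string{"key1": "value1", "key2": "value2"}
var arrayPayload = []string{"key1", "value1", "key2", "value2"}

func BenchmarkCBORMap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		data, err := cbor.Marshal(mapPayload)
		if err != nil {
			b.Fatal(err)
		}
		var out map[string]string
		if err := cbor.Unmarshal(data, &out); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkCBORArray(b *testing.B) {
	for i := 0; i < b.N; i++ {
		data, err := cbor.Marshal(arrayPayload)
		if err != nil {
			b.Fatal(err)
		}
		var out []string
		if err := cbor.Unmarshal(data, &out); err != nil {
			b.Fatal(err)
		}
	}
}
```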
As mentioned earlier, this change has the added benefit of preserving the order of `DictionaryValue.Entries`.

The changes are very simple, and the minimally modified tests pass. Modified files are limited to encoding.go, encoding_test.go, and decoding.go.
For example, changes to `func (e *Encoder) prepareDictionaryValue` might include:
- declaring `entries` as an array here: cadence/runtime/interpreter/encode.go, line 686 in 3bbc8c8
- replacing `Content: cborMap{` with `Content: cborArray{` here: cadence/runtime/interpreter/encode.go, lines 762 to 768 in 3bbc8c8 (see the sketch below)
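Here is a minimal before/after sketch of that shape change, assuming simplified stand-ins for the internal `cborMap` and `cborArray` helpers (the real encode.go types wrap tagged content and more fields, so this is illustrative only):

```go
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

// Simplified stand-ins for the internal helpers used in encode.go; the real
// types wrap tagged CBOR content, which is omitted here for brevity.
type cborMap = map[string]interface{}
type cborArray = []interface{}

// Before: entries keyed by string. Key strings are encoded into the output
// and entry order is not preserved by a CBOR map.
func prepareEntriesAsMap(keys []string, values []interface{}) cborMap {
	entries := cborMap{}
	for i, key := range keys {
		entries[key] = values[i]
	}
	return entries
}

// After: entries as a CBOR array positionally aligned with the keys array.
// Order is preserved and each key string is stored only once.
func prepareEntriesAsArray(keys []string, values []interface{}) cborArray {
	entries := make(cborArray, len(values))
	copy(entries, values)
	return entries
}

func main() {
	keys := []string{"a", "b"}
	values := []interface{}{"1", "2"}
	m, _ := cbor.Marshal(prepareEntriesAsMap(keys, values))
	a, _ := cbor.Marshal(prepareEntriesAsArray(keys, values))
	fmt.Printf("map entries:   %x\n", m)
	fmt.Printf("array entries: %x\n", a)
}
```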
As mentioned earlier, additional speed gains (especially for decoding) are possible by removing backwards-compatible decoding (thanks @turbolent) and not saving key strings twice (thanks @SupunS).
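A small sketch of why dropping the backwards-compatible path helps: a dual-format decoder must sniff the incoming shape and keep a map code path alive, while a single-format decoder decodes directly. The function names and the major-type sniffing strategy below are assumptions for illustration, not Cadence's actual mechanism:

```go
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

// decodeEntriesCompat accepts both the legacy CBOR map shape and the new
// CBOR array shape by inspecting the major type in the first byte
// (4 = array, 5 = map). The extra branch and the map path cost time.
func decodeEntriesCompat(data []byte) ([]string, error) {
	if len(data) > 0 && data[0]>>5 == 5 { // legacy map shape
		var m map[string]string
		if err := cbor.Unmarshal(data, &m); err != nil {
			return nil, err
		}
		out := make([]string, 0, len(m))
		for _, v := range m {
			out = append(out, v)
		}
		return out, nil
	}
	var a []string // new array shape
	err := cbor.Unmarshal(data, &a)
	return a, err
}

// decodeEntries handles only the new array shape; no sniffing, no map path.
func decodeEntries(data []byte) ([]string, error) {
	var a []string
	err := cbor.Unmarshal(data, &a)
	return a, err
}

func main() {
	data, _ := cbor.Marshal([]string{"1", "2"})
	v, _ := decodeEntries(data)
	fmt.Println(v)
}
```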
Caveats
Changing the CBOR representation of data is a breaking change; hopefully it's just an internal implementation detail within the storage layer. Guidance from @turbolent would be appreciated to make sure I didn't miss anything.
I don't know how closely the speed gains will translate to other platforms (other CPU architectures, compilation to WASM, etc.)
Completing this change to the CBOR representation before I proceed with API changes to my CBOR library can save time.
Performance comparisons are data dependent. Comparisons should use realistic Cadence data (the most common mix of data types, values, and sizes encountered in production).
Updates (not an exhaustive list)
April 9, 2021 at 2:30 PM PT - Include cumulative benchmark comparisons to master (commit 7c7dd55) with optimizations from PR #752, PR #761, PR #778, and PR #788 using production data (values extracted from mainnet snapshot).
April 5, 2021 at 6:03 AM PT - Include benchmark comparisons using production data. Remove section "Data Used for Benchmark Comparisons" because it described non-production data.