Problem: memiavl snapshot format is not optimal (#890)
* Problem: (memiavl) offset table is not used to reduce node size

Solution:
- store offset table inside keys/values file, so we can reference it with leaf index in nodes.
- leverage the property of post-order traversal to further reduce some offsets.
- specialize `Get` operation for persisted node to improve query performance.
- add native-endian implementation to improve performance.

In the end, the on-disk IAVL tree query performance is on par with the tidwall/btree library,
which is used for the cache store.

* cleanup and fix lint

* Apply suggestions from code review

Signed-off-by: yihuang <[email protected]>

* use simple delta encoding for key offsets

* new tricks to get back performance

* cleanup

* build recsplit index

* fix lint

* cleanup

* Apply suggestions from code review

Signed-off-by: yihuang <[email protected]>

* go mod tidy

---------

Signed-off-by: yihuang <[email protected]>
Co-authored-by: mmsqe <[email protected]>
yihuang and mmsqe authored Mar 1, 2023
1 parent ed5462d commit 3000067
Showing 18 changed files with 804 additions and 245 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,10 @@

- [#833](https://github.com/crypto-org-chain/cronos/pull/833) Fix rollback command.

### Improvements

- [#890](https://github.com/crypto-org-chain/cronos/pull/890) optimize memiavl snapshot format.

*Feb 09, 2022*

## v1.0.4
8 changes: 8 additions & 0 deletions go.mod
@@ -44,6 +44,7 @@ require (
github.com/ChainSafe/go-schnorrkel v0.0.0-20200405005733-88cbf1b4c40d // indirect
github.com/StackExchange/wmi v1.2.1 // indirect
github.com/VictoriaMetrics/fastcache v1.6.0 // indirect
github.com/VictoriaMetrics/metrics v1.23.1 // indirect
github.com/Workiva/go-datastructures v1.0.53 // indirect
github.com/alitto/pond v1.8.2 // indirect
github.com/allegro/bigcache v1.2.1 // indirect
@@ -55,6 +56,7 @@ require (
github.com/btcsuite/btcd/btcec/v2 v2.3.2 // indirect
github.com/btcsuite/btcd/btcutil v1.1.3 // indirect
github.com/btcsuite/btcd/chaincfg/chainhash v1.0.1 // indirect
github.com/c2h5oh/datasize v0.0.0-20220606134207-859f65c6625b // indirect
github.com/cenkalti/backoff/v4 v4.1.3 // indirect
github.com/cespare/xxhash v1.1.0 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
@@ -129,6 +131,7 @@ require (
github.com/jmhodges/levigo v1.0.0 // indirect
github.com/klauspost/compress v1.15.11 // indirect
github.com/ledgerwatch/erigon-lib v0.0.0-20230210071639-db0e7ed11263 // indirect
github.com/ledgerwatch/log/v3 v3.7.0 // indirect
github.com/lib/pq v1.10.6 // indirect
github.com/libp2p/go-buffer-pool v0.1.0 // indirect
github.com/magiconair/properties v1.8.6 // indirect
@@ -161,6 +164,7 @@ require (
github.com/rs/zerolog v1.27.0 // indirect
github.com/sasha-s/go-deadlock v0.3.1 // indirect
github.com/shirou/gopsutil v3.21.4-0.20210419000835-c7a38de76ee5+incompatible // indirect
github.com/spaolacci/murmur3 v1.1.0 // indirect
github.com/spf13/afero v1.9.2 // indirect
github.com/spf13/jwalterweatherman v1.1.0 // indirect
github.com/spf13/viper v1.14.0 // indirect
@@ -171,12 +175,16 @@ require (
github.com/tidwall/btree v1.5.0 // indirect
github.com/tklauser/go-sysconf v0.3.10 // indirect
github.com/tklauser/numcpus v0.4.0 // indirect
github.com/torquem-ch/mdbx-go v0.27.5 // indirect
github.com/tyler-smith/go-bip39 v1.1.0 // indirect
github.com/ulikunitz/xz v0.5.10 // indirect
github.com/valyala/fastrand v1.1.0 // indirect
github.com/valyala/histogram v1.2.0 // indirect
github.com/zondax/hid v0.9.1 // indirect
github.com/zondax/ledger-go v0.14.1 // indirect
go.etcd.io/bbolt v1.3.6 // indirect
go.opencensus.io v0.24.0 // indirect
go.uber.org/atomic v1.10.0 // indirect
golang.org/x/crypto v0.6.0 // indirect
golang.org/x/net v0.7.0 // indirect
golang.org/x/oauth2 v0.4.0 // indirect
16 changes: 16 additions & 0 deletions go.sum
@@ -227,6 +227,8 @@ github.com/StackExchange/wmi v1.2.1 h1:VIkavFPXSjcnS+O8yTq7NI32k0R5Aj+v39y29VYDO
github.com/StackExchange/wmi v1.2.1/go.mod h1:rcmrprowKIVzvc+NUiLncP2uuArMWLCbu9SBzvHz7e8=
github.com/VictoriaMetrics/fastcache v1.6.0 h1:C/3Oi3EiBCqufydp1neRZkqcwmEiuRT9c3fqvvgKm5o=
github.com/VictoriaMetrics/fastcache v1.6.0/go.mod h1:0qHz5QP0GMX4pfmMA/zt5RgfNuXJrTP0zS7DqpHGGTw=
github.com/VictoriaMetrics/metrics v1.23.1 h1:/j8DzeJBxSpL2qSIdqnRFLvQQhbJyJbbEi22yMm7oL0=
github.com/VictoriaMetrics/metrics v1.23.1/go.mod h1:rAr/llLpEnAdTehiNlUxKgnjcOuROSzpw0GvjpEbvFc=
github.com/VividCortex/gohistogram v1.0.0 h1:6+hBz+qvs0JOrrNhhmR7lFxo5sINxBCGXrdtl/UvroE=
github.com/VividCortex/gohistogram v1.0.0/go.mod h1:Pf5mBqqDxYaXu3hDrrU+w6nw50o/4+TcAqDqk/vUH7g=
github.com/Workiva/go-datastructures v1.0.53 h1:J6Y/52yX10Xc5JjXmGtWoSSxs3mZnGSaq37xZZh7Yig=
@@ -317,6 +319,8 @@ github.com/btcsuite/websocket v0.0.0-20150119174127-31079b680792/go.mod h1:ghJtE
github.com/btcsuite/winsvc v1.0.0/go.mod h1:jsenWakMcC0zFBFurPLEAyrnc/teJEM1O46fmI40EZs=
github.com/bwesterb/go-ristretto v1.2.0/go.mod h1:fUIoIZaG73pV5biE2Blr2xEzDoMj7NFEuV9ekS419A0=
github.com/c-bata/go-prompt v0.2.2/go.mod h1:VzqtzE2ksDBcdln8G7mk2RX9QyGjH+OVqOCSiVIqS34=
github.com/c2h5oh/datasize v0.0.0-20220606134207-859f65c6625b h1:6+ZFm0flnudZzdSE0JxlhR2hKnGPcNB35BjQf4RYQDY=
github.com/c2h5oh/datasize v0.0.0-20220606134207-859f65c6625b/go.mod h1:S/7n9copUssQ56c7aAgHqftWO4LTf4xY6CGWt8Bc+3M=
github.com/casbin/casbin/v2 v2.1.2/go.mod h1:YcPU1XXisHhLzuxH9coDNf2FbKpjGlbCg3n9yuLkIJQ=
github.com/cenkalti/backoff v2.2.1+incompatible h1:tNowT99t7UNflLxfYYSlKYsBpXdEet03Pg2g16Swow4=
github.com/cenkalti/backoff v2.2.1+incompatible/go.mod h1:90ReRw6GdpyfrHakVjL/QHaoyV4aDUVVkXQJJJ3NXXM=
@@ -840,6 +844,8 @@ github.com/labstack/gommon v0.3.0/go.mod h1:MULnywXg0yavhxWKc+lOruYdAhDwPK9wf0OL
github.com/leanovate/gopter v0.2.9/go.mod h1:U2L/78B+KVFIx2VmW6onHJQzXtFb+p5y3y2Sh+Jxxv8=
github.com/ledgerwatch/erigon-lib v0.0.0-20230210071639-db0e7ed11263 h1:LGEzZvf33Y1NhuP5+jI/ni9l1TFS6oYPDilgy74NusM=
github.com/ledgerwatch/erigon-lib v0.0.0-20230210071639-db0e7ed11263/go.mod h1:OXgMDuUo2lZ3NpH29ZvMYbk+LxFd5ffDl2Z2mGMuY/I=
github.com/ledgerwatch/log/v3 v3.7.0 h1:aFPEZdwZx4jzA3+/Pf8wNDN5tCI0cIolq/kfvgcM+og=
github.com/ledgerwatch/log/v3 v3.7.0/go.mod h1:J2Jl6zV/58LeA6LTaVVnCGyf1/cYYSEOOLHY4ZN8S2A=
github.com/leodido/go-urn v1.2.0 h1:hpXL4XnriNwQ/ABnpepYM/1vCLWNDfUNts8dX3xTG6Y=
github.com/leodido/go-urn v1.2.0/go.mod h1:+8+nEpDfqqsY+g338gtMEUOtuK+4dEMhiQEgxpxOKII=
github.com/lib/pq v1.0.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
@@ -976,6 +982,7 @@ github.com/pascaldekloe/goe v0.0.0-20180627143212-57f6aae5913c/go.mod h1:lzWF7FI
github.com/pascaldekloe/goe v0.1.0 h1:cBOtyMzM9HTpWjXfbbunk26uA6nG3a8n06Wieeh0MwY=
github.com/pascaldekloe/goe v0.1.0/go.mod h1:lzWF7FIEvWOWxwDKqyGYQf6ZUaNfKdP144TG7ZOy1lc=
github.com/paulbellamy/ratecounter v0.2.0/go.mod h1:Hfx1hDpSGoqxkVVpBi/IlYD7kChlfo5C6hzIHwPqfFE=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58 h1:onHthvaw9LFnH4t2DcNVpwGmV9E1BkGknEliJkfwQj0=
github.com/pborman/uuid v1.2.0/go.mod h1:X/NO0urCmaxf9VXbdlT7C2Yzkj2IKimNn4k+gtPdI/k=
github.com/pelletier/go-toml v1.2.0/go.mod h1:5z9KED0ma1S8pY6P1sdut58dfprrGBbd/94hg7ilaic=
github.com/pelletier/go-toml v1.9.5 h1:4yBQzkHv+7BHq2PQUZF3Mx0IYxG7LsP222s7Agd3ve8=
@@ -1157,6 +1164,8 @@ github.com/tklauser/numcpus v0.2.2/go.mod h1:x3qojaO3uyYt0i56EW/VUYs7uBvdl2fkfZF
github.com/tklauser/numcpus v0.4.0 h1:E53Dm1HjH1/R2/aoCtXtPgzmElmn51aOkhCFSuZq//o=
github.com/tklauser/numcpus v0.4.0/go.mod h1:1+UI3pD8NW14VMwdgJNJ1ESk2UnwhAnz5hMwiKKqXCQ=
github.com/tmc/grpc-websocket-proxy v0.0.0-20170815181823-89b8d40f7ca8/go.mod h1:ncp9v5uamzpCO7NfCPTXjqaC+bZgJeR0sMTm6dMHP7U=
github.com/torquem-ch/mdbx-go v0.27.5 h1:bbhXQGFCmoxbRDXKYEJwxSOOTeBKwoD4pFBUpK9+V1g=
github.com/torquem-ch/mdbx-go v0.27.5/go.mod h1:T2fsoJDVppxfAPTLd1svUgH1kpPmeXdPESmroSHcL1E=
github.com/ttacon/chalk v0.0.0-20160626202418-22c06c80ed31/go.mod h1:onvgF043R+lC5RZ8IT9rBXDaEDnpnw/Cl+HFiw+v/7Q=
github.com/tv42/httpunix v0.0.0-20150427012821-b75d8614f926/go.mod h1:9ESjWnEqriFuLhtthL60Sar/7RFoluCcXsuvEwTV5KM=
github.com/tyler-smith/go-bip39 v1.0.1-0.20181017060643-dbb3b84ba2ef/go.mod h1:sJ5fKU0s6JVwZjjcUEX2zFOnvq0ASQ2K9Zr6cf67kNs=
@@ -1174,8 +1183,12 @@ github.com/urfave/cli v1.20.0/go.mod h1:70zkFmudgCuE/ngEzBv17Jvp/497gISqfk5gWijb
github.com/urfave/cli v1.22.1/go.mod h1:Gos4lmkARVdJ6EkW0WaNv/tZAAMe9V7XWyB60NtXRu0=
github.com/urfave/cli/v2 v2.3.0/go.mod h1:LJmUH05zAU44vOAcrfzZQKsZbVcdbOG8rtL3/XcUArI=
github.com/valyala/bytebufferpool v1.0.0/go.mod h1:6bBcMArwyJ5K/AmCkWv1jt77kVWyCJ6HpOuEn7z0Csc=
github.com/valyala/fastrand v1.1.0 h1:f+5HkLW4rsgzdNoleUOB69hyT9IlD2ZQh9GyDMfb5G8=
github.com/valyala/fastrand v1.1.0/go.mod h1:HWqCzkrkg6QXT8V2EXWvXCoow7vLwOFN002oeRzjapQ=
github.com/valyala/fasttemplate v1.0.1/go.mod h1:UQGH1tvbgY+Nz5t2n7tXsz52dQxojPUpymEIMZ47gx8=
github.com/valyala/fasttemplate v1.2.1/go.mod h1:KHLXt3tVN2HBp8eijSv/kGJopbvo7S+qRAEEKiv+SiQ=
github.com/valyala/histogram v1.2.0 h1:wyYGAZZt3CpwUiIb9AU/Zbllg1llXyrtApRS815OLoQ=
github.com/valyala/histogram v1.2.0/go.mod h1:Hb4kBwb4UxsaNbbbh+RRz8ZR6pdodR57tzWUS3BUzXY=
github.com/vmihailenco/msgpack/v5 v5.3.5/go.mod h1:7xyJ9e+0+9SaZT0Wt1RGleJXzli6Q/V5KbhBonMG9jc=
github.com/vmihailenco/tagparser/v2 v2.0.0/go.mod h1:Wri+At7QHww0WTrCBeu4J6bNtoV6mEfg5OIWRZA9qds=
github.com/willf/bitset v1.1.3/go.mod h1:RjeCKbqT1RxIR/KWY6phxZiaY1IyutSBfGjNPySAYV4=
@@ -1213,6 +1226,8 @@ go.opentelemetry.io/proto/otlp v0.7.0/go.mod h1:PqfVotwruBrMGOCsRd/89rSnXhoiJIqe
go.uber.org/atomic v1.3.2/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/atomic v1.5.0/go.mod h1:sABNBOSYdrvTF6hTgEIbc7YasKWGhgEQZyfxyTvoXHQ=
go.uber.org/atomic v1.10.0 h1:9qC72Qh0+3MqyJbAn8YU5xVq1frD8bn3JtD2oXtafVQ=
go.uber.org/atomic v1.10.0/go.mod h1:LUxbIzbOniOlMKjJjyPfpl4v+PKK2cNJn91OQbhoJI0=
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
go.uber.org/multierr v1.3.0/go.mod h1:VgVr7evmIr6uPjLBxg28wmKNXyqE9akIJ5XnfpiKl+4=
go.uber.org/tools v0.0.0-20190618225709-2cfd321de3ee/go.mod h1:vJERXedbb3MVM5f9Ejo0C68/HhF8uaILCdgjnY+goOA=
@@ -1509,6 +1524,7 @@ golang.org/x/sys v0.0.0-20220728004956-3c1f35247d10/go.mod h1:oPkhp1MJrh7nUepCBc
golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220908164124-27713097b956/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.1.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.3.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.5.0 h1:MUK/U/4lj1t1oPg0HfuXDN/Z1wv31ZJ/YcPiGccS4DU=
golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/term v0.0.0-20201117132131-f5c789dd3221/go.mod h1:Nr5EML6q2oocZ2LXRh80K7BxOlk5/8JxuGnuhpl+muw=
36 changes: 24 additions & 12 deletions gomod2nix.toml
@@ -40,6 +40,9 @@ schema = 3
[mod."github.com/VictoriaMetrics/fastcache"]
version = "v1.6.0"
hash = "sha256-u1dkRJ2Y5+hnYlkyMPm14HxKkAv999bjN622nZDjaBo="
[mod."github.com/VictoriaMetrics/metrics"]
version = "v1.23.1"
hash = "sha256-z2X1DkTPXhI1eI9AJJ8d6LDeapiaKmd2l/2orxKsxi0="
[mod."github.com/Workiva/go-datastructures"]
version = "v1.0.53"
hash = "sha256-W6qOvqu8sokMlZrpOF1SWG138H0/BotywKNLlDF8Zug="
@@ -76,6 +79,9 @@ schema = 3
[mod."github.com/btcsuite/btcd/chaincfg/chainhash"]
version = "v1.0.1"
hash = "sha256-vix0j/KGNvoKjhlKgVeSLY6un2FHeIEoZWMC4z3yvZ4="
[mod."github.com/c2h5oh/datasize"]
version = "v0.0.0-20220606134207-859f65c6625b"
hash = "sha256-1uH+D3w0Y/B3poXm545XGrT4S4c+msTbj7gKgu9pbPM="
[mod."github.com/cenkalti/backoff/v4"]
version = "v4.1.3"
hash = "sha256-u6MEDopHoTWAZoVvvXOKnAg++xre53YgQx0gmf6t2KU="
@@ -327,6 +333,9 @@ schema = 3
[mod."github.com/ledgerwatch/erigon-lib"]
version = "v0.0.0-20230210071639-db0e7ed11263"
hash = "sha256-SKFGLsJV6G4xIQ5IU+qq9EY3v0/I8B/CTMcDOhXGfc4="
[mod."github.com/ledgerwatch/log/v3"]
version = "v3.7.0"
hash = "sha256-o0tOdlRL0LSU3BRJxK15LkBEh/RDGnsTWWVU/vgmARk="
[mod."github.com/lib/pq"]
version = "v1.10.6"
hash = "sha256-8EhFwY/9YH5L/fd6l2beOnC3VvpegRAmCCsnDVJBqBM="
@@ -337,18 +346,6 @@ schema = 3
version = "v1.7.15-0.20230222024938-b61261a9193b"
hash = "sha256-Hry5mpO8WqCuYZ0zCnOt6kopDGprc7/nI318A2D+Kk0="
replaced = "github.com/linxGnu/grocksdb"
[mod."github.com/lispad/go-generics-tools"]
version = "v1.1.0"
hash = "sha256-MPeqIrDVGv6d/mBK4o48eaRc8Qga5l9i1tpUyrgMoII="
[mod."github.com/lucasjones/reggen"]
version = "v0.0.0-20180717132126-cdb49ff09d77"
hash = "sha256-gX55KvGzJkyghKIhmQFnwCI9mXYt/FtDgI/sQ4xNfrE="
[mod."github.com/lufeee/execinquery"]
version = "v1.2.1"
hash = "sha256-Hg+/0StXgoflSxwiw96IYhYybYy26o1QA66a5pMsswo="
[mod."github.com/lyft/protoc-gen-validate"]
version = "v0.0.13"
hash = "sha256-JhFMmEaP1amtJJBLWFVqjjHeHuAHRP0qwLMMFX2b3FM="
[mod."github.com/magiconair/properties"]
version = "v1.8.6"
hash = "sha256-xToSfpuePctkTdhJtsuKIEkXwfMZbnkFT98ahIfd4wY="
@@ -446,6 +443,9 @@ schema = 3
[mod."github.com/shirou/gopsutil"]
version = "v3.21.4-0.20210419000835-c7a38de76ee5+incompatible"
hash = "sha256-oqIqyFquWabIE6DID6uTEc8oFEmM1rVu2ATn3toiCEg="
[mod."github.com/spaolacci/murmur3"]
version = "v1.1.0"
hash = "sha256-RWD4PPrlAsZZ8Xy356MBxpj+/NZI7w2XOU14Ob7/Y9M="
[mod."github.com/spf13/afero"]
version = "v1.9.2"
hash = "sha256-R1mir7Fu95QK+YL99U14RGbLJzxqWRH5rSFpssgJvzA="
@@ -496,12 +496,21 @@ schema = 3
[mod."github.com/tklauser/numcpus"]
version = "v0.4.0"
hash = "sha256-ndE82nOb3agubhEV7aRzEqqTlN4DPbKFHEm2+XZLn8k="
[mod."github.com/torquem-ch/mdbx-go"]
version = "v0.27.5"
hash = "sha256-GSDtoGSdX8E4QLnBgLzWVW/9qYMuvcsogoo2LC8T3eU="
[mod."github.com/tyler-smith/go-bip39"]
version = "v1.1.0"
hash = "sha256-3YhWBtSwRLGwm7vNwqumphZG3uLBW1vwT9QkQ8JuSjU="
[mod."github.com/ulikunitz/xz"]
version = "v0.5.10"
hash = "sha256-bogOwQNmQVS7W+C7wci7XEUeYm9TB7PnxnyBIXKYbm0="
[mod."github.com/valyala/fastrand"]
version = "v1.1.0"
hash = "sha256-+tvsaq1TJGA/bCLXztr0iIZ08CatTG9x7ooNTPIKSZY="
[mod."github.com/valyala/histogram"]
version = "v1.2.0"
hash = "sha256-zmCr5jZHdbOes9XiAA8HdXBHBeDiaOYVWHeW8tQIbUQ="
[mod."github.com/zondax/hid"]
version = "v0.9.0"
hash = "sha256-PvXtxXo/3C+DS9ZeGBlr4zXbIpaYNtMqLzxYhusFXNY="
@@ -515,6 +524,9 @@ schema = 3
[mod."go.opencensus.io"]
version = "v0.24.0"
hash = "sha256-4H+mGZgG2c9I1y0m8avF4qmt8LUKxxVsTqR8mKgP4yo="
[mod."go.uber.org/atomic"]
version = "v1.10.0"
hash = "sha256-E6UEDc1eh/cLUFd+J86cDesQ0B8wEv/DdaAVKb+x2t8="
[mod."golang.org/x/crypto"]
version = "v0.6.0"
hash = "sha256-rtJfWOcpCk+DUskDsOZBt9BU/pHjHJ60LkF4VOKCAI8="
72 changes: 42 additions & 30 deletions memiavl/README.md
@@ -10,6 +10,10 @@
* 27 Jan 2023:
* Update metadata file format
* Encode key length with 4 bytes instead of 2.
* 24 Feb 2023:
* Reduce node size (without hash) from 32 bytes to 16 bytes by leveraging properties of post-order traversal.
* Merge keys and values into a single `kvs` file and build an MPHF hash table to index it.


## The Journey

@@ -38,10 +42,10 @@ It also integrates well with versiondb, because versiondb can also be derived fr
### Change Set File

```
version: int64
size: int64 // size of whole payload
version: 8
size: 8 // size of whole payload
payload:
delete: int8
delete: 1
keyLen: varint-uint64
key
[ // if delete is false
@@ -63,49 +67,55 @@ IAVL snapshot is composed by four files:
- `metadata`, 16bytes:

```
magic: uint32
format: uint32
version: uint64
root node index: uint32
magic: 4
format: 4
version: 4
root node index: 4
```
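
As a rough illustration of the new 16-byte `metadata` layout (four 4-byte fields), the sketch below round-trips it through a struct. The `Metadata` type and both helper functions are hypothetical names, and little-endian byte order is an assumption; the commit message mentions a native-endian implementation, so the real byte order may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// metadataSize is the documented size of the metadata file: four 4-byte fields.
const metadataSize = 16

// Metadata is a hypothetical in-memory view of the metadata file.
type Metadata struct {
	Magic         uint32
	Format        uint32
	Version       uint32
	RootNodeIndex uint32
}

// encodeMetadata serializes m into exactly 16 bytes.
// Assumption: little-endian byte order.
func encodeMetadata(m Metadata) []byte {
	buf := make([]byte, metadataSize)
	binary.LittleEndian.PutUint32(buf[0:4], m.Magic)
	binary.LittleEndian.PutUint32(buf[4:8], m.Format)
	binary.LittleEndian.PutUint32(buf[8:12], m.Version)
	binary.LittleEndian.PutUint32(buf[12:16], m.RootNodeIndex)
	return buf
}

// decodeMetadata parses a 16-byte buffer and rejects any other length.
func decodeMetadata(buf []byte) (Metadata, error) {
	if len(buf) != metadataSize {
		return Metadata{}, errors.New("metadata must be exactly 16 bytes")
	}
	return Metadata{
		Magic:         binary.LittleEndian.Uint32(buf[0:4]),
		Format:        binary.LittleEndian.Uint32(buf[4:8]),
		Version:       binary.LittleEndian.Uint32(buf[8:12]),
		RootNodeIndex: binary.LittleEndian.Uint32(buf[12:16]),
	}, nil
}

func main() {
	// The magic value here is arbitrary, not the one used by memiavl.
	buf := encodeMetadata(Metadata{Magic: 0xCAFE, Format: 0, Version: 7, RootNodeIndex: 42})
	m, err := decodeMetadata(buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", m)
}
```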

- `nodes`, array of fixed size(64bytes) nodes, the node format is like this:
- `nodes`, an array of fixed-size (16+32 bytes) nodes; the node format is:

```
height : uint8 // padded to 4bytes
version : uint32
size : uint64
key : uint64 // offset in keys file
left : uint32 // inner node only
right : uint32 // inner node only
value : uint64 offset // offset in values file, leaf node only
hash : [32]byte
# branch
height : 1
_padding : 3
version : 4
size : 4
key node : 4
hash : [32]byte
# leaf
height : 1
_padding : 3
version : 4
key offset : 8
```
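
A minimal sketch of reading one fixed-size node from the new `nodes` layout follows. It rests on several assumptions that the text does not state: little-endian byte order, a 32-byte hash following the 16-byte header for both branch and leaf nodes, and leaf nodes being identified by height 0. All type and function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	headerSize = 16
	hashSize   = 32
	nodeSize   = headerSize + hashSize // 48 bytes per node
)

// Node is a hypothetical decoded view of one 48-byte record.
type Node struct {
	Height    uint8
	Version   uint32
	Size      uint32 // branch only
	KeyNode   uint32 // branch only
	KeyOffset uint64 // leaf only
	Hash      []byte // sub-slice of the input, not a copy
}

// appendBranch writes a branch record: height, 3 padding bytes, version, size, key node, hash.
func appendBranch(nodes []byte, height uint8, version, size, keyNode uint32, hash [hashSize]byte) []byte {
	var hdr [headerSize]byte
	hdr[0] = height // bytes 1..3 stay zero as padding
	binary.LittleEndian.PutUint32(hdr[4:8], version)
	binary.LittleEndian.PutUint32(hdr[8:12], size)
	binary.LittleEndian.PutUint32(hdr[12:16], keyNode)
	nodes = append(nodes, hdr[:]...)
	return append(nodes, hash[:]...)
}

// appendLeaf writes a leaf record: height 0, 3 padding bytes, version, 8-byte key offset, hash.
func appendLeaf(nodes []byte, version uint32, keyOffset uint64, hash [hashSize]byte) []byte {
	var hdr [headerSize]byte
	binary.LittleEndian.PutUint32(hdr[4:8], version)
	binary.LittleEndian.PutUint64(hdr[8:16], keyOffset)
	nodes = append(nodes, hdr[:]...)
	return append(nodes, hash[:]...)
}

// nodeAt decodes the node at the given index; fixed-size records make this a direct offset lookup.
func nodeAt(nodes []byte, index uint32) Node {
	start := int(index) * nodeSize
	raw := nodes[start : start+nodeSize]
	n := Node{
		Height:  raw[0],
		Version: binary.LittleEndian.Uint32(raw[4:8]),
		Hash:    raw[headerSize:],
	}
	if n.Height == 0 { // assumption: height 0 marks a leaf
		n.KeyOffset = binary.LittleEndian.Uint64(raw[8:16])
	} else {
		n.Size = binary.LittleEndian.Uint32(raw[8:12])
		n.KeyNode = binary.LittleEndian.Uint32(raw[12:16])
	}
	return n
}

func main() {
	var h [hashSize]byte
	nodes := appendLeaf(nil, 1, 0, h)
	nodes = appendLeaf(nodes, 1, 10, h)
	nodes = appendBranch(nodes, 1, 1, 2, 1, h)
	fmt.Printf("%+v\n", nodeAt(nodes, 2))
}
```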
The node has fixed length, can be indexed directly. The nodes reference each other with the index, nodes are written in post-order, so the root node is always placed at the end.

Some integers are using `uint32`, should be enough in forseeable future, but could be changed to `uint64` to be safer.
Nodes have a fixed length, so they can be indexed directly. Nodes reference each other by node index and are written in post-order, so the root node is always placed at the end.

The implementation will read the mmap-ed content in a zero-copy way, won't use extra node cache, it will only rely on the OS page cache.
For a branch node, the `key node` field references the smallest leaf node in the right branch, and the key slice is fetched from there indirectly. The leaf nodes store the key slice and value index information, but the version field is stored in the `keys` file instead.

- `keys`, sequence of length prefixed leaf node keys, ordered and no duplication.
The branch node's left and right child node indexes are inferred from existing information and the properties of post-order traversal:

```
size: uint32
payload
*repeat*
right child index = self index - 1
left child index = key node - 1
```
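
To see why the two formulas hold, the sketch below writes a small in-memory tree in post-order and recovers the child indexes from `key node` alone. In post-order the right subtree is written immediately before its parent, and the left subtree is written immediately before the right subtree's first node, which is its leftmost (smallest) leaf. The `tree`, `record`, `writePostOrder`, and `children` names are hypothetical and not taken from the memiavl code.

```go
package main

import "fmt"

// tree is a hypothetical in-memory binary tree; a node with nil children is a leaf.
type tree struct {
	key         string
	left, right *tree
}

// record is a simplified on-disk node: branches keep only the key node index.
type record struct {
	leaf    bool
	key     string // leaf only
	keyNode uint32 // branch only: index of the smallest leaf in the right subtree
}

// writePostOrder appends t's nodes in post-order (left, right, self) and
// returns the updated slice plus the index of the leftmost leaf under t.
func writePostOrder(t *tree, out []record) ([]record, uint32) {
	if t.left == nil {
		out = append(out, record{leaf: true, key: t.key})
		return out, uint32(len(out) - 1)
	}
	var first, keyNode uint32
	out, first = writePostOrder(t.left, out)
	out, keyNode = writePostOrder(t.right, out)
	out = append(out, record{keyNode: keyNode})
	return out, first
}

// children applies the two formulas from the text to a branch record.
func children(self uint32, r record) (left, right uint32) {
	return r.keyNode - 1, self - 1
}

func main() {
	// root(branch(a, b), c) is written as: a(0) b(1) branch(2) c(3) root(4)
	root := &tree{
		left:  &tree{left: &tree{key: "a"}, right: &tree{key: "b"}},
		right: &tree{key: "c"},
	}
	recs, _ := writePostOrder(root, nil)
	for i, r := range recs {
		if r.leaf {
			fmt.Printf("%d: leaf %q\n", i, r.key)
			continue
		}
		l, rt := children(uint32(i), r)
		fmt.Printf("%d: branch keyNode=%d left=%d right=%d\n", i, r.keyNode, l, rt)
	}
}
```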

Key size is encoded in `uint32`, so the maximum key length supported is `1<<32-1`, around 4G.
The version, size, and node indexes are encoded with 4 bytes each, which should be enough for the foreseeable future but could be widened later.

- `values`, sequence of length prefixed leaf node values.
The implementation reads the mmap-ed content in a zero-copy way; it does not use an extra node cache and relies only on the OS page cache.

- `kvs`, a sequence of leaf node key-value pairs; the keys are ordered, with no duplicates.

```
size: uint32
payload
keyLen: varint-uint64
key
valueLen: varint-uint64
value
*repeat*
```
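
The `kvs` record layout can be sketched as below, assuming that `varint-uint64` means the unsigned varint encoding implemented by Go's `encoding/binary` package (an assumption, not confirmed by the text). The reader returns sub-slices of the input buffer rather than copies, in the spirit of the zero-copy mmap access described above. `appendKV`, `readChunk`, and `readKV` are hypothetical helper names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendKV appends one record: keyLen, key, valueLen, value.
func appendKV(buf, key, value []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(len(key)))
	buf = append(buf, key...)
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	return append(buf, value...)
}

// readChunk reads one length-prefixed byte string starting at offset and
// returns it as a sub-slice of buf, along with the offset just past it.
func readChunk(buf []byte, offset uint64) ([]byte, uint64, error) {
	if offset > uint64(len(buf)) {
		return nil, 0, errors.New("offset out of range")
	}
	n, w := binary.Uvarint(buf[offset:])
	if w <= 0 {
		return nil, 0, errors.New("invalid varint length")
	}
	start := offset + uint64(w)
	end := start + n
	if end < start || end > uint64(len(buf)) {
		return nil, 0, errors.New("truncated record")
	}
	return buf[start:end], end, nil
}

// readKV reads the key-value pair stored at offset and returns the offset of the next pair.
func readKV(buf []byte, offset uint64) (key, value []byte, next uint64, err error) {
	key, next, err = readChunk(buf, offset)
	if err != nil {
		return nil, nil, 0, err
	}
	value, next, err = readChunk(buf, next)
	if err != nil {
		return nil, nil, 0, err
	}
	return key, value, next, nil
}

func main() {
	buf := appendKV(nil, []byte("alice"), []byte("100"))
	buf = appendKV(buf, []byte("bob"), []byte("250"))

	// Walk the file sequentially; a leaf node's key offset would instead jump straight to a pair.
	for off := uint64(0); off < uint64(len(buf)); {
		k, v, next, err := readKV(buf, off)
		if err != nil {
			panic(err)
		}
		fmt.Printf("offset %d: %s = %s\n", off, k, v)
		off = next
	}
}
```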

Value size is encoded in `uint32`, so maximum value length supported is `1<<32-1`, around 4G.
- `kvs.index`, a minimal perfect hash function (MPHF) built from `kvs`, which supports queries as a hash map.

#### Compression

@@ -115,4 +125,6 @@ The items in snapshot reference with each other by file offsets, we can apply so

[VersionDB](../README.md) supports querying and iterating historical versions of key-value pairs by version. It is currently implemented with rocksdb's experimental user-defined timestamp feature. It is an alternative way to support the grpc query service, and it is much more compact than IAVL trees, similar in size to the compressed change set files.

[^1]: https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
After versiondb is fully integrated, the IAVL tree won't need to serve queries at all, so it won't need to store the values; storing just the value hashes would be enough.

[^1]: https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md