Profile Horizon #4022
Comments
I think we can really improve decoding speed by:
I took a quick look at decoding object allocation and performance, and I found that most (if not all) of the intermediary buffers used by the xdr decoding library end up allocated on the heap. Here are the top-most object allocating functions of Horizon: If we ignore the calls from the Let's take a peek at
All allocations come from line 105, which corresponds to a
```go
func (d *Decoder) DecodeInt() (int32, int, error) {
	var buf [4]byte
	n, err := io.ReadFull(d.r, buf[:])
```
One would think that
The same applies to the other decoding functions. The problem is that the reader passed to This means that, in order to dramatically reduce allocations, we should change the Decoder's design so that it reads from a raw buffer (not requiring external IO calls that cause stack escapes).
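To make the escape problem concrete, here is a minimal sketch contrasting the two designs. The type and function names (`readerDecoder`, `sliceDecoder`, `newReaderDecoder`, `errShortBuffer`) are hypothetical and are not part of go-xdr. The first decoder passes a local buffer through an `io.Reader` interface call, which is the pattern described above as forcing the buffer onto the heap. The second reads from a raw byte slice and needs no temporary buffer at all.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"io"
)

// readerDecoder mimics the design discussed above: bytes are pulled through
// an io.Reader, so buf is handed to an interface method call.
type readerDecoder struct {
	r io.Reader
}

// newReaderDecoder is a hypothetical helper wrapping a byte slice in a reader.
func newReaderDecoder(b []byte) *readerDecoder {
	return &readerDecoder{r: bytes.NewReader(b)}
}

func (d *readerDecoder) DecodeInt() (int32, int, error) {
	var buf [4]byte // passed through an interface call below, so it can escape
	n, err := io.ReadFull(d.r, buf[:])
	if err != nil {
		return 0, n, err
	}
	return int32(binary.BigEndian.Uint32(buf[:])), n, nil
}

// sliceDecoder reads directly from a raw buffer: no reader and no temporary.
type sliceDecoder struct {
	data []byte
	off  int
}

// errShortBuffer is a hypothetical error for truncated input.
var errShortBuffer = errors.New("short buffer")

func (d *sliceDecoder) DecodeInt() (int32, int, error) {
	if len(d.data)-d.off < 4 {
		return 0, 0, errShortBuffer
	}
	v := int32(binary.BigEndian.Uint32(d.data[d.off:]))
	d.off += 4
	return v, 4, nil
}
```

Building with `go build -gcflags=-m` prints the compiler's escape analysis decisions, which is one way to check whether a given buffer is moved to the heap.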
This is a really interesting finding. 👏🏻 Combined with that finding, we could also rewrite the xdrgen generated Go code to use the same pattern as stellar/xdrgen#65 does for encoding, and apply it to decoding so that no reflection is necessary.
I reduced decoding allocations by:

1. Adding a scratch buffer to the Encoder/Decoder. The scratch buffer is used as a temporary buffer. Before that, temporary buffers were allocated (on the heap, since they escaped the stack) for basic type encoding/decoding (e.g. `DecodeInt()`, `EncodeInt()`).
2. Making `DecodeFixedOpaque()` decode in-place instead of allocating the result. Apart from reducing allocations, this removes the need for copying (as shown by `decodeFixedArray()`).

The pre-existing (and admittedly very limited) benchmarks show a sizeable improvement:

Before:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8    3000    1039 ns/op     15.40 MB/s    152 B/op    11 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8      3000    806.7 ns/op    19.83 MB/s    104 B/op    10 allocs/op
PASS
```

After:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8    3000    679.6 ns/op    23.54 MB/s    144 B/op    8 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8      3000    609.1 ns/op    26.27 MB/s    104 B/op    8 allocs/op
PASS
```

Decoding and encoding both go down to 8 allocations per operation, from 11 (decoding) and 10 (encoding).

More context at stellar/go#4022 (comment)
The problems I identified above are fixed by stellar/go-xdr#16
I reduced the allocation count by:

1. Adding a scratch buffer to the Encoder/Decoder. The scratch buffer is used as a temporary buffer. Before that, temporary buffers were allocated (on the heap, since they escaped the stack) for basic type encoding/decoding (e.g. `DecodeInt()`, `EncodeInt()`).
2. Making `DecodeFixedOpaque()` decode in-place instead of allocating the result. Apart from reducing allocations, this removes the need for copying (as shown by `decodeFixedArray()`).

The pre-existing (and admittedly very limited) benchmarks show a sizeable improvement:

Before:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8    1445005    853.5 ns/op    18.75 MB/s    152 B/op    11 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8      1591068    643.0 ns/op    24.88 MB/s    104 B/op    10 allocs/op
PASS
```

After:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8    1753954    661.4 ns/op    24.19 MB/s    144 B/op    8 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8      2103310    537.1 ns/op    29.79 MB/s    104 B/op    8 allocs/op
PASS
```

Decoding and encoding both go down to 8 allocations per operation, from 11 (decoding) and 10 (encoding).

More context at stellar/go#4022 (comment)
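The two changes described above can be sketched roughly as follows. This is an illustration under assumptions, not the go-xdr implementation: `scratchDecoder`, `decodeInt`, `decodeFixedOpaqueInto`, and `decodeRecord` are hypothetical names. The decoder owns a small scratch array that primitive decodes reuse, and the fixed-opaque decode writes into a caller-supplied slice instead of returning a freshly allocated one.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"io"
)

// scratchDecoder is a hypothetical decoder (not the go-xdr type) that owns a
// small reusable buffer, so decodeInt does not need a temporary per call.
type scratchDecoder struct {
	r       io.Reader
	scratch [8]byte // wide enough for 64-bit primitives
}

func (d *scratchDecoder) decodeInt() (int32, int, error) {
	n, err := io.ReadFull(d.r, d.scratch[:4])
	if err != nil {
		return 0, n, err
	}
	return int32(binary.BigEndian.Uint32(d.scratch[:4])), n, nil
}

// decodeFixedOpaqueInto fills out in place instead of returning a new slice.
// XDR pads fixed-length opaque data to a multiple of four bytes, so any
// padding is read into the scratch buffer and discarded.
func (d *scratchDecoder) decodeFixedOpaqueInto(out []byte) (int, error) {
	n, err := io.ReadFull(d.r, out)
	if err != nil {
		return n, err
	}
	if pad := (4 - len(out)%4) % 4; pad > 0 {
		pn, err := io.ReadFull(d.r, d.scratch[:pad])
		n += pn
		if err != nil {
			return n, err
		}
	}
	return n, nil
}

// decodeRecord decodes an int followed by a 3-byte fixed opaque value.
func decodeRecord(b []byte) (int32, [3]byte, error) {
	d := &scratchDecoder{r: bytes.NewReader(b)}
	var tag [3]byte
	v, _, err := d.decodeInt()
	if err != nil {
		return 0, tag, err
	}
	_, err = d.decodeFixedOpaqueInto(tag[:])
	return v, tag, err
}
```

The trade-off in this sketch is that the scratch array is allocated once with the decoder rather than once per primitive decode, which only pays off when a decoder is reused across many values.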
I think that after all the recent PRs we can consider XDR encoding and decoding sufficiently optimized for now. If we decide to do more work in that regard, I think we should try to reduce decoding allocations, which will be hard but not impossible. For instance, we could try to intern the Assets, much like string interning.
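For reference, string interning itself can be sketched in a few lines. The `interner` type below is hypothetical and works on plain strings, not on XDR Asset values. The idea is that repeated values share one canonical copy, so decoding the same value many times allocates only on first sight.

```go
package main

// interner is a hypothetical helper that keeps one canonical string per
// distinct value seen so far.
type interner struct {
	seen map[string]string
}

func newInterner() *interner {
	return &interner{seen: make(map[string]string)}
}

// intern returns the canonical string equal to b, allocating a new string
// only the first time a value is seen. The gc compiler avoids allocating for
// the string(b) conversion when it is used directly as a map lookup key.
func (in *interner) intern(b []byte) string {
	if s, ok := in.seen[string(b)]; ok {
		return s
	}
	s := string(b)
	in.seen[s] = s
	return s
}
```

An interner like this grows without bound unless entries are evicted, so it fits best when the set of distinct values is small compared to the number of decodes.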
While the CPU execution is dominated by the state verifier, Go's profiler only measures CPU time and not wallclock time. Go's profiler also provides tracing, which AFAIU can be used to infer IO wait time. Finally, https://github.com/felixge/fgprof combines both, providing wallclock profiling. I am going to evaluate both in staging and see what I get. See #4079
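As a rough illustration of the tracing option, the standard library's `runtime/trace` package can record an execution trace around a piece of work. The helper names below (`traceWork`, `simulatedIOWait`) are hypothetical, and the sleep stands in for time spent blocked on IO, which a CPU profile would not show.

```go
package main

import (
	"bytes"
	"context"
	"runtime/trace"
	"time"
)

// traceWork records an execution trace while work runs and returns the raw
// trace bytes. It fails if tracing is already enabled in the process.
func traceWork(work func(ctx context.Context)) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	ctx := context.Background()
	// The region makes this span easy to find in the trace viewer.
	trace.WithRegion(ctx, "work", func() { work(ctx) })
	trace.Stop()
	return buf.Bytes(), nil
}

// simulatedIOWait blocks without using CPU, like a slow DB round trip.
func simulatedIOWait(ctx context.Context) {
	time.Sleep(10 * time.Millisecond)
}
```

Writing the returned bytes to a file and opening it with `go tool trace` shows when goroutines were running and when they were blocked.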
Here is the fgprof time profile: pprof.samples.time.001.pb.gz Unfortunately, although it looks very different at the top level: When focusing on So I am not sure fgprof is working well. It seems that, for some reason, it's missing DB queries.
I added metrics to specific batch inserts in #4080 and it turns out that
I think it would also be useful to add tracing to db queries in general. This golang issue includes some alternatives: golang/go#18080
It looks like all the XDR improvements and #4087 made
Move it to the backlog. I think we will need to keep working on this. Also (without downplaying the improvements) the transaction's processor is still the bottleneck, right? I still think there is a lot to be gained by improving the performance of the batch insert builder in general.
What do you think about closing this issue and creating specific issues connected to performance improvements? This one is very general and I guess could live in the backlog forever.
I don't think we have profiled the Time (CPU and Wallclock) and Memory performance of Horizon in a while.
We should try doing that and improve any low-hanging fruit we see, particularly in Horizon's ingestion pipeline.
For starters, this is a CPU profile I took from Horizon's staging pubnet deployment:
pprof.stellar-horizon.samples.cpu.001.pb.gz