
Snapshot acceleration seems to have no impact on eth_call performance in both ethspam | versus and my application #21772

Closed
elliottdehn opened this issue Nov 1, 2020 · 16 comments


elliottdehn commented Nov 1, 2020

System information

Geth version: 1.9.23-stable-8c2f2715
OS & Version: Ubuntu 20.04.1 LTS
Commit hash : (see above)

Expected behaviour

After Geth creates the snapshot, the performance of eth_call should increase substantially, per the blog post and other reports online. This should show up as increased performance when testing with ethspam and versus.

Actual behaviour

The snapshot seems to have no impact on performance. ethspam | versus --stop-after=10000 --concurrency=30 still shows roughly 200 calls/s, as does my application. This is well within the margin of error of the performance measured before the snapshot existed, which, statistically speaking, suggests that something is probably not right.

Steps to reproduce the behaviour

  1. Run geth as usual without --snapshot, perform an ethspam | versus --stop-after=10000 --concurrency=30 test, and note the requests-per-second performance.

  2. Run geth with the following flags:

geth --http --cache 5315 --datadir /some/path/here --snapshot

  3. Wait for the snapshot to finish building. I believe you can tell that it is finished when blocks start getting imported with no "aborting snapshot" and "resuming snapshot" messages alongside them, but I'm not sure of this. (Instructions on how to empirically confirm that the snapshot is complete would be nice; see the sketch after this list.) For reference, I continued to run it for about 12 hours after it had reached ~85m accounts, with a listed ETA of about 3 hours from that point.

Edit: can confirm that, through a possible screwup on my part, the snapshot wipe deleted about 106_695_498 accounts

  4. Restart Geth for good measure using Ctrl+C and wait for it to gracefully exit; then run the same command line to start it back up. Give it time to build up internal caches and such.

  5. Run ethspam | versus --stop-after=10000 --concurrency=30 again.

  6. Observe that the performance has not changed, as if the snapshot either doesn't exist, doesn't work, or isn't being used.
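
Regarding step 3: a rough sketch of one way to confirm that generation has finished, assuming geth's console output is being redirected to a log file (geth.log below is a placeholder path). It simply waits for the "Generated state snapshot" message that geth prints when the snapshot is complete (an example of that message appears later in this thread):

# Block until geth reports that the snapshot has been fully generated.
# Assumes geth's output is being written to geth.log.
tail -f geth.log | grep --line-buffered -m 1 "Generated state snapshot"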

For reference, my hardware:

Intel Xeon E3-1200 v6 (4 cores, 4.2 GHz)
2 Nvidia 1080 Ti GPUs
16 GB RAM
1 TB NVMe drive

It's possible that I have something misconfigured or that I'm misunderstanding something, but this seems like a usability problem at the very least. Help would be appreciated! Thank you.

Edit 2: It may also be worth noting that my application executes only view (non-pure) functions, which themselves are heavy on state reads.


elliottdehn commented Nov 3, 2020

Some (perhaps too many) updates:

There is an ERROR log towards the bottom of this comment that might be of interest (it appeared upon restarting the node for the first time after an uninterrupted snapshot build), but here is my best effort at a somewhat comprehensive info-dump.

I managed to regenerate the entire snapshot, this time without ever interrupting the generation by pressing Ctrl+C on the process. See the following output:

INFO [11-02|21:07:46.636] Generating state snapshot                root="803bb3…f7a0fa" at="fff705…0f4ade" accounts=106741949 slots=377400657 storage=31.83GiB   elapsed=30h50m39.455s eta=15.213s
INFO [11-02|21:07:50.858] Generated state snapshot                 accounts=106756770 slots=377408583 storage=31.83GiB   elapsed=30h50m43.677s

Once the snapshot was generated, I ran the following without stopping/restarting the process and got the listed output:

$ ethspam | versus --stop-after=10000 --concurrency=30 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   122.80 per second
   Timing:     0.2443s avg, 0.0002s min, 28.3649s max
               1.6686s standard deviation

   Percentiles:
     25% in 0.0006s
     50% in 0.0018s
     75% in 0.0171s
     90% in 0.0538s
     95% in 0.2050s
     99% in 6.7202s

   Errors: 0.00%

** Summary for 1 endpoints:
   Completed:  10000 results with 10000 total requests
   Timing:     244.307134ms request avg, 1m26.035735167s total run time
   Errors:     0 (0.00%)
   Mismatched: 0

This is well below what I've seen the node push without the snapshot; the highest I've seen using ethspam is ~250 req/s. We observe similar throughput with lower (but still some) concurrency, though we do see some of the expected improvement in latency:

$ ethspam | versus --stop-after=10000 --concurrency=5 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   127.63 per second
   Timing:     0.0392s avg, 0.0001s min, 5.7916s max
               0.3083s standard deviation

   Percentiles:
     25% in 0.0004s
     50% in 0.0009s
     75% in 0.0022s
     90% in 0.0058s
     95% in 0.0264s
     99% in 1.0629s

   Errors: 0.00%

** Summary for 1 endpoints:
   Completed:  10000 results with 10000 total requests
   Timing:     39.175919ms request avg, 1m18.939723154s total run time
   Errors:     0 (0.00%)
   Mismatched: 0

Throughput is far worse with no concurrency, though latency improves even further, as expected:

$ ethspam | versus --stop-after=10000 --concurrency=1 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   61.82 per second
   Timing:     0.0162s avg, 0.0001s min, 2.8866s max
               0.1414s standard deviation

   Percentiles:
     25% in 0.0002s
     50% in 0.0004s
     75% in 0.0008s
     90% in 0.0017s
     95% in 0.0060s
     99% in 0.4536s

   Errors: 0.00%

** Summary for 1 endpoints:
   Completed:  10000 results with 10000 total requests
   Timing:     16.176506ms request avg, 2m42.121228219s total run time
   Errors:     0 (0.00%)
   Mismatched: 0

Here is the output of top, showing that geth isn't being starved of resources by other processes. Note that my CPU has 8 logical cores, so the %CPU figure should be divided by 8 to get the overall share of CPU usage:

top - 22:03:48 up 5 days, 20:56,  1 user,  load average: 1.06, 1.17, 1.58
Tasks: 290 total,   2 running, 287 sleeping,   1 stopped,   0 zombie
%Cpu(s):  9.7 us,  2.1 sy,  0.0 ni, 86.1 id,  0.3 wa,  0.0 hi,  1.8 si,  0.0 st
MiB Mem :  15945.9 total,    174.0 free,  12977.8 used,   2794.1 buff/cache
MiB Swap:  30518.0 total,  28804.6 free,   1713.4 used.   4032.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                              
 121459 elliott+   0 -20   14.5g  10.8g  74276 S  75.7  69.5   2731:41 geth                                 
 121835 elliott+  20   0  840092  39568  22040 S   6.6   0.2  24:29.53 gnome-system-mo                      
   1875 elliott+  20   0 1025492  44628  18620 R   2.9   0.3 139:11.18 gnome-terminal-                      
   1105 elliott+  20   0  879412  27960   8572 S   2.2   0.2  77:09.56 Xorg                                 
   1459 elliott+  20   0 4818408 178144  26484 S   2.2   1.1  92:02.98 gnome-shell                          
 137156 root      20   0       0      0      0 I   2.2   0.0   0:00.09 kworker/u16:2-events_unbound         
    690 systemd+  20   0   24308   1760   1216 S   1.5   0.0  13:02.09 systemd-resolve                      
   1315 elliott+  20   0  397964   5056   2812 S   1.5   0.0   2:33.03 ibus-daemon                          
 137111 elliott+  20   0   20600   3820   3052 R   1.5   0.0   0:01.41 top                                  

I suppose it is worth noting that geth is occupying almost an entire logical core just passively. Under the heavy load of ethspam's eth_call deluge, it jumps up to 650-750%, which indicates that it is in fact using all 4 cores as well as the CPU's hyper-threading capabilities (though the CPU is not literally pegged). My own application doesn't seem to push it as hard: geth tops out around 550% CPU when consuming its payloads.

For posterity, I tried restarting the node, which produced the following output during shutdown:

^CINFO [11-02|22:30:51.989] Got interrupt, shutting down... 
INFO [11-02|22:30:52.492] HTTP server stopped                      endpoint=127.0.0.1:8545
INFO [11-02|22:30:52.495] IPC endpoint closed                      url=/home/elliottdehn/Data/ethereum/geth.ipc
INFO [11-02|22:30:52.498] Ethereum protocol stopped 
INFO [11-02|22:30:52.502] Transaction pool stopped 
INFO [11-02|22:30:52.557] Writing cached state to disk             block=11181773 hash="4fe7aa…2d1894" root="9b06b6…5d6829"
INFO [11-02|22:31:02.512] Looking for peers                        peercount=0 tried=3 static=0
INFO [11-02|22:31:12.513] Looking for peers                        peercount=0 tried=4 static=0
INFO [11-02|22:31:16.921] Persisted trie from memory database      nodes=4780385 size=1.08GiB time=54.925728819s gcnodes=25512538 gcsize=9.43GiB gctime=1m31.751305368s livenodes=398958 livesize=146.41MiB
INFO [11-02|22:31:16.928] Writing cached state to disk             block=11181772 hash="3d699a…ee50e8" root="027219…b28955"
INFO [11-02|22:31:16.962] Persisted trie from memory database      nodes=3679    size=1.39MiB time=34.113483ms   gcnodes=0        gcsize=0.00B   gctime=0s              livenodes=395279 livesize=145.01MiB
INFO [11-02|22:31:16.962] Writing cached state to disk             block=11181646 hash="25276a…6d3a5d" root="f761f8…863d5b"
INFO [11-02|22:31:17.838] Persisted trie from memory database      nodes=116765  size=40.71MiB time=875.148851ms  gcnodes=0        gcsize=0.00B   gctime=0s              livenodes=278514 livesize=104.30MiB
INFO [11-02|22:31:17.838] Writing snapshot state to disk           root="42e5d9…8a2c81"
INFO [11-02|22:31:17.838] Persisted trie from memory database      nodes=0       size=0.00B    time="2.654µs"     gcnodes=0        gcsize=0.00B   gctime=0s              livenodes=278514 livesize=104.30MiB
ERROR[11-02|22:31:18.468] Dangling trie nodes after full cleanup 
INFO [11-02|22:31:18.476] Writing clean trie cache to disk         path=/home/elliottdehn/Data/ethereum/geth/triecache threads=8
INFO [11-02|22:31:19.334] Persisted the clean trie cache           path=/home/elliottdehn/Data/ethereum/geth/triecache elapsed=851.369ms
INFO [11-02|22:31:19.334] Blockchain stopped

The ERROR log does jump out at me, but I'm not sure what it means. Starting geth back up again produced the following:

$ geth --http --cache 5315 --datadir /home/elliottdehn/Data/ethereum --verbosity 3 --snapshot
INFO [11-02|22:33:10.082] Starting Geth on Ethereum mainnet... 
INFO [11-02|22:33:10.084] Maximum peer count                       ETH=50 LES=0 total=50
INFO [11-02|22:33:10.084] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
INFO [11-02|22:33:10.086] Set global gas cap                       cap=25000000
INFO [11-02|22:33:10.086] Allocated trie memory caches             clean=797.00MiB dirty=1.30GiB
INFO [11-02|22:33:10.086] Allocated cache and file handles         database=/home/elliottdehn/Data/ethereum/geth/chaindata cache=2.59GiB handles=524288
INFO [11-02|22:33:14.127] Opened ancient database                  database=/home/elliottdehn/Data/ethereum/geth/chaindata/ancient
INFO [11-02|22:33:14.179] Initialised chain configuration          config="{ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Byzantium: 4370000 Constantinople: 7280000 Petersburg: 7280000 Istanbul: 9069000, Muir Glacier: 9200000, YOLO v1: <nil>, Engine: ethash}"
INFO [11-02|22:33:14.179] Disk storage enabled for ethash caches   dir=/home/elliottdehn/Data/ethereum/geth/ethash count=3
INFO [11-02|22:33:14.179] Disk storage enabled for ethash DAGs     dir=/home/elliottdehn/.ethash count=2
INFO [11-02|22:33:14.182] Initialising Ethereum protocol           versions="[65 64 63]" network=1 dbversion=8
INFO [11-02|22:33:14.345] Loaded most recent local header          number=11181773 hash="4fe7aa…2d1894" td=18452213634300358842116 age=3m1s
INFO [11-02|22:33:14.345] Loaded most recent local full block      number=11181773 hash="4fe7aa…2d1894" td=18452213634300358842116 age=3m1s
INFO [11-02|22:33:14.345] Loaded most recent local fast block      number=11181773 hash="4fe7aa…2d1894" td=18452213634300358842116 age=3m1s
INFO [11-02|22:33:14.347] Loaded last fast-sync pivot marker       number=11149515
INFO [11-02|22:33:14.420] Loaded local transaction journal         transactions=0 dropped=0
INFO [11-02|22:33:14.420] Regenerated local transaction journal    transactions=0 accounts=0
WARN [11-02|22:33:14.420] Switch sync mode from fast sync to full sync 
INFO [11-02|22:33:14.422] Starting peer-to-peer node               instance=Geth/v1.9.23-stable-8c2f2715/linux-amd64/go1.15
INFO [11-02|22:33:14.479] New local node record                    seq=896 id=f0aac1b069f6cae6 ip=127.0.0.1 udp=30303 tcp=30303
INFO [11-02|22:33:14.479] Started P2P networking                   self=enode://d1d044642181d4e06585efbd0fb4020d5ae26204ec6e566a52630d7bc5d52c04aa0110075b75994165eab1dab5eb8ec39db9b5493b86ad29fd104d2c85f05ba5@127.0.0.1:30303
INFO [11-02|22:33:14.479] IPC endpoint opened                      url=/home/elliottdehn/Data/ethereum/geth.ipc
INFO [11-02|22:33:14.480] HTTP server started                      endpoint=127.0.0.1:8545 cors= vhosts=localhost
INFO [11-02|22:33:24.479] Looking for peers                        peercount=0 tried=3 static=0
INFO [11-02|22:33:27.710] New local node record                    seq=897 id=f0aac1b069f6cae6 ip=74.215.2.95 udp=30303 tcp=30303
INFO [11-02|22:33:34.480] Looking for peers                        peercount=0 tried=4 static=0

Output once it found a peer and caught up:

INFO [11-02|22:59:52.869] Downloader queue stats                   receiptTasks=0 blockTasks=135 itemSize=8.03KiB throttle=8164
INFO [11-02|22:59:53.825] Imported new chain segment               blocks=2 txs=433 mgas=24.933 elapsed=956.141ms mgasps=26.076 number=11181775 hash="857854…79abfe" age=28m58s dirty=3.76MiB
INFO [11-02|22:59:54.659] Imported new chain segment               blocks=3 txs=533 mgas=37.112 elapsed=833.289ms mgasps=44.537 number=11181778 hash="458a42…bd48f1" age=28m29s dirty=8.55MiB
INFO [11-02|22:59:56.690] Looking for peers                        peercount=1 tried=4 static=0
INFO [11-02|22:59:56.936] Imported new chain segment               blocks=9 txs=1446 mgas=111.495 elapsed=2.268s    mgasps=49.160 number=11181787 hash="2ed86a…76220d" age=26m34s dirty=22.55MiB
INFO [11-02|23:00:01.882] Imported new chain segment               blocks=25 txs=3948 mgas=298.951 elapsed=4.946s    mgasps=60.443 number=11181812 hash="084ce0…6cf938" age=22m30s dirty=59.65MiB
INFO [11-02|23:00:06.692] Looking for peers                        peercount=1 tried=4 static=0
INFO [11-02|23:00:09.912] Imported new chain segment               blocks=45 txs=7784 mgas=535.211 elapsed=8.029s    mgasps=66.659 number=11181857 hash="cb2171…c30450" age=13m19s dirty=130.50MiB
INFO [11-02|23:00:14.255] Deep froze chain segment                 blocks=106 elapsed=69.100ms  number=11091879 hash="cdbb21…b72a4b"
INFO [11-02|23:00:16.692] Looking for peers                        peercount=1 tried=2 static=0
INFO [11-02|23:00:17.994] Imported new chain segment               blocks=47  txs=8295 mgas=572.845 elapsed=8.082s    mgasps=70.878 number=11181904 hash="45847f…aba65a" age=3m47s  dirty=206.30MiB
INFO [11-02|23:00:19.746] Imported new chain segment               blocks=9   txs=1937 mgas=111.938 elapsed=1.751s    mgasps=63.906 number=11181913 hash="ed7407…a46792" age=1m22s  dirty=213.50MiB
INFO [11-02|23:00:20.394] Imported new chain segment               blocks=4   txs=949  mgas=49.583  elapsed=647.860ms mgasps=76.534 number=11181917 hash="203f67…eb8da4" dirty=217.39MiB
INFO [11-02|23:00:23.084] Imported new chain segment               blocks=1   txs=189  mgas=12.438  elapsed=162.020ms mgasps=76.769 number=11181918 hash="792c57…bdc41c" dirty=217.96MiB
INFO [11-02|23:00:24.146] Imported new chain segment               blocks=1   txs=191  mgas=12.426  elapsed=141.084ms mgasps=88.072 number=11181919 hash="6fa38d…8769fa" dirty=218.94MiB
INFO [11-02|23:00:26.692] Looking for peers                        peercount=1 tried=4 static=0
INFO [11-02|23:00:36.693] Looking for peers                        peercount=1 tried=5 static=0
INFO [11-02|23:00:46.693] Looking for peers                        peercount=1 tried=4 static=0
INFO [11-02|23:00:46.718] Imported new chain segment               blocks=1   txs=281  mgas=12.437  elapsed=141.378ms mgasps=87.969 number=11181920 hash="d19584…044d7f" dirty=220.07MiB
INFO [11-02|23:00:56.694] Looking for peers                        peercount=1 tried=6 static=0
INFO [11-02|23:01:01.371] Imported new chain segment               blocks=1   txs=201  mgas=12.413  elapsed=122.446ms mgasps=101.379 number=11181921 hash="3efd68…53ce0a" dirty=221.07MiB
INFO [11-02|23:01:02.315] Imported new chain segment               blocks=1   txs=147  mgas=12.432  elapsed=132.431ms mgasps=93.872  number=11181922 hash="275612…aa4641" dirty=221.60MiB

Resulting ethspam output:

ethspam | versus --stop-after=10000 --concurrency=5 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   193.23 per second
   Timing:     0.0259s avg, 0.0001s min, 3.3065s max
               0.1884s standard deviation

   Percentiles:
     25% in 0.0003s
     50% in 0.0005s
     75% in 0.0012s
     90% in 0.0033s
     95% in 0.0132s
     99% in 0.7462s

   Errors: 0.00%

** Summary for 1 endpoints:
   Completed:  10000 results with 10000 total requests
   Timing:     25.876313ms request avg, 52.193531869s total run time
   Errors:     0 (0.00%)
   Mismatched: 0

So performance increased a bit (~50%), and I was able to reproduce this result. There was also a slight gain of about 15% in the throughput of my application. Both of these figures represent a return to virtually the same throughput I was getting before building the snapshot.

Some additional notes:

  • The geth process is the highest-priority process on my machine (see the renice sketch after this list). Setting it to a "normal" priority (on par with all other processes) causes a throughput loss of about 5-20%.
  • ethspam executes exclusively eth_call RPC requests in a huge variety, so we should probably expect to see some semblance of throughput improvement here, rather than the same throughput or perhaps even a loss.
  • geth is consuming about 10.5 GiB of RAM with room to spare, so this is likely not due to a lack of memory utilization.
  • Not sure if this will help, but it might: here is the pmap (process memory maps) output for my geth process:
    https://gist.githubusercontent.com/elliottdehn/ab8dc5f80aa01d4607247d8baeeb310d/raw/413bc6bf17dbdb5c261f0820d8dec00e87c4e084/geth_memory_map.txt
  • I've triple-checked to make sure there isn't some mystical vanilla geth process living on port 8545 that I've somehow lost or forgotten about.
  • My application uses 8 round-robinned IPC "connections", each of which has its own thread in a Python ThreadPool, so the performance problems are likely unrelated to the marginal overhead of the HTTP transport used by ethspam. Also, the actual runtimes of my application's queries are between 5-8 ms, as reported at the highest verbosity, so (irreducible) marginal overheads are likely overshadowed by the payload runtimes.
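
A minimal sketch of how the priority bump in the first bullet can be applied (assuming geth is already running and you have the required privileges; the -20 niceness mirrors the NI column for the geth process in the top output above):

# Give the running geth process the highest scheduling priority (niceness -20).
sudo renice -n -20 -p "$(pgrep -x geth)"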

Let me know if there is any more diagnostic visibility I might be able to provide here; getting these numbers up is a functional requirement for me right now :-) I'm going to do my best to keep looking into it, but I'm obviously not an expert on the internals of geth.


holiman commented Nov 3, 2020

ethspam generates an infinite stream of realistic read-only Ethereum JSONRPC queries, anchored around the latest block with some amount of random jitter. The latest state is updated every 15 seconds, so it can run continuously without becoming stale.

So, I think we need to dive into what this means. But essentially, if it queries accounts/slots that have been modified in the last 127 blocks or so, geth is almost guaranteed to have the data in memory, regardless of whether --snapshot is used or not.

So the performance increase, in that case, would be:

  • With the snapshot, we can access the data in O(1), in memory.
  • Without the snapshot, we need to traverse the trie, but still in memory.

If the queries were instead focused on 'random' data that has not been touched recently, the difference between these two approaches would be a lot larger in absolute terms. As it is, I think the differences are relatively minor compared with everything else going on.
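
A rough way to probe the 'cold' case described above, i.e. state that has not been touched in recent blocks, is to query an arbitrary account directly over JSON-RPC and time it; the address below is a hypothetical placeholder, not one taken from this thread:

# Query the balance of an arbitrary (likely not recently touched) account.
# The address is a placeholder; any old, inactive account would do.
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","id":1,"method":"eth_getBalance","params":["0x00000000000000000000000000000000000000ff","latest"]}' \
  http://localhost:8545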


elliottdehn commented Nov 3, 2020

(shorter) Update 2: I did an scp of geth's entire datadir from my Ubuntu machine onto my OSX laptop just to see if there would be any improvements on the different hardware. Unfortunately, this doesn't seem to be the case:

geth --cache=5315 --http --snapshot --datadir /Users/dehnbostele/Documents/Data/ethereum
ethspam | versus --concurrency=5 --stop-after=10000 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   95.91 per second
   Timing:     0.0521s avg, 0.0002s min, 8.0390s max
               0.4243s standard deviation

   Percentiles:
     25% in 0.0004s
     50% in 0.0009s
     75% in 0.0019s
     90% in 0.0054s
     95% in 0.0235s
     99% in 2.0004s

   Errors: 0.00%

** Summary for 1 endpoints:
   Completed:  10000 results with 10000 total requests
   Timing:     52.133123ms request avg, 1m44.687028468s total run time
   Errors:     0 (0.00%)
   Mismatched: 0

My laptop has weaker specs than my desktop, so a performance hit is to be expected. However, if the acceleration were actually "doing the thing," we should probably still see some increase in throughput (or at least the same) regardless. By my estimate, this number is more reflective of the performance I would expect without acceleration.

Here are my laptop's specs, for reference:

[Screenshot of laptop specs]

Just for the sake of it, I tried messing around with some of the cache allocations (though I'll be the first to admit I don't entirely understand exactly what each one means). Here's me giving almost the entire cache to the snapshot:

geth --cache=5315 --http --snapshot --datadir /Users/dehnbostele/Documents/Data/ethereum --cache.database 5 --cache.gc 5 --cache.snapshot 85 --cache.trie 5
ethspam | versus --concurrency=5 --stop-after=10000 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   95.38 per second
   Timing:     0.0524s avg, 0.0002s min, 6.9531s max
               0.3953s standard deviation

   Percentiles:
     25% in 0.0006s
     50% in 0.0011s
     75% in 0.0023s
     90% in 0.0065s
     95% in 0.0297s
     99% in 1.8049s

   Errors: 0.00%

** Summary for 1 endpoints:
   Completed:  10000 results with 10000 total requests
   Timing:     52.421388ms request avg, 1m45.963429514s total run time
   Errors:     0 (0.00%)
   Mismatched: 0

Wow, surprising result! Performance was essentially identical. Given the near non-existence of the "vanilla" caches here, I had expected some godawful performance.

Things to consider:

  • Despite allocating up to ~4.5 GB of memory to the snapshot cache, geth maxed out and stayed pegged at a ~1015 MB memory footprint while ethspam was running. This makes it seem like the snapshot simply isn't being loaded into memory or used.
  • The laptop is/was plugged in and charging, so any invisible throttling attributable to not being plugged in is unlikely.

I think that about covers it. If anyone would like, I'd be willing to scp/ftp/whatever the chain data directory to you for your own perusal. For now, I think I'm going to try experimenting with Turbo-Geth and see if it can crank out some improved performance 🤞


holiman commented Nov 3, 2020

I was just digging into ethspam a bit. These seem to be the queries: https://github.com/shazow/ethspam/blob/master/queries.go

These would not be affected, as they access block data rather than state data:

  • genEthGetTransactionReceipt
  • genEthGetBlockByNumber
  • genEthBlockNumber
  • genEthGetTransactionByHash
  • genEthGetLogs

These should be improved by snapshot:

  • genEthCall
  • genEthGetBalance
  • genEthGetTransactionCount
  • genEthGetCode

These depend on RandomAddress: https://github.com/shazow/ethspam/blob/master/state.go#L63

func (s *liveState) RandomAddress() string {
	if len(s.transactions) == 0 {
		return ""
	}
	idx := int(s.randSrc.Int63()) % len(s.transactions)
	return s.transactions[idx].From.String()
}

As far as I understand it, it selects an account from a list of transactions, and that list is re-filled from the latest block from time to time. The same set is also used to populate contract addresses.

So my theory of why you don't experience any orders-of-magnitude performance improvement is:

  • Only 4 out of 9 call types are state-related (the snapshotter only improves state access, not block-data access).
  • The state accesses are all focused on 'recently touched' state, which already resides in memory.

It would definitely be interesting to see how these numbers compare with other clients, like turbo-geth, but I would also be very interested to see the stats broken down by query type, so we could better pinpoint which query types are slowest or have the highest variance.

Anyway, thanks for all the work so far!


elliottdehn commented Nov 3, 2020

@holiman just seeing your comment now. Indeed, my application is accessing recent data rather than random data. I'm not sure about the specifics of ethspam, but I suspect it either has a mix of old and new data or is exclusively new data; I would have to look.

I think the identical performance of geth despite essentially non-existent caches outside of the snapshot cache points to the latter scenario.

Edit: looks like you are way ahead of me, thanks for the insight. Once I have the chance, I'll see if I can't set up some (ideally more shareable) scripts focusing on the specific cases you listed.


holiman commented Nov 3, 2020

I seem to have a knack for posting right before you do, so you don't see it until later :)

@elliottdehn

😂 Every time.


holiman commented Nov 3, 2020

I forked ethspam: https://github.com/holiman/ethspam/tree/fixes and fixed it so weights can be set to 0 without panicking (edit: that fix is now merged).
It can be run like this to issue only state-access types of queries:

ethspam -m=eth_getCode:100 -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:400 -m=eth_getBlockByNumber:0 -m=eth_getBalance:550  -m=eth_getTransactionReceipt:0  -m=eth_call:2000

or

[user@work ethspam]$ go run . -m=eth_getCode:100 -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:400 -m=eth_getBlockByNumber:0 -m=eth_getBalance:550  -m=eth_getTransactionReceipt:0  -m=eth_call:2000

By using this mode, one can more easily benchmark individual query types.
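
For instance, a sketch that isolates eth_call (same flags and endpoint as used elsewhere in this thread, with every other query type weighted to 0):

ethspam -m=eth_getCode:0 -m=eth_getLogs:0 -m=eth_getTransactionByHash:0 -m=eth_blockNumber:0 -m=eth_getTransactionCount:0 -m=eth_getBlockByNumber:0 -m=eth_getBalance:0 -m=eth_getTransactionReceipt:0 -m=eth_call:100 | versus --concurrency=5 --stop-after=10000 'http://localhost:8545'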


holiman commented Nov 3, 2020

Here are some results from running against a node which doesn't have the snapshot enabled.

State only

[user@work ethspam]$ ./ethspam  -m=eth_getCode:100 -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:100 -m=eth_getBlockByNumber:0 -m=eth_getBalance:100  -m=eth_getTransactionReceipt:0  -m=eth_call:100| versus --concurrency=5 --stop-after=500 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   146.45 per second
   Timing:     0.0341s avg, 0.0306s min, 0.0686s max
               0.0048s standard deviation

Blockdata only

user@work ethspam]$ ./ethspam  -m=eth_getCode:0 -m=eth_getLogs:100  -m=eth_getTransactionByHash:100  -m=eth_blockNumber:100  -m=eth_getTransactionCount:0 -m=eth_getBlockByNumber:100 -m=eth_getBalance:0  -m=eth_getTransactionReceipt:100  -m=eth_call:0| versus --concurrency=5 --stop-after=500 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   22.70 per second
   Timing:     0.2203s avg, 0.0307s min, 6.0633s max
               0.6400s standard deviation

GetLogs only

[user@work ethspam]$ ./ethspam  -m=eth_getCode:0 -m=eth_getLogs:100  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:0 -m=eth_getBlockByNumber:0 -m=eth_getBalance:0  -m=eth_getTransactionReceipt:0  -m=eth_call:0| versus --concurrency=5 --stop-after=500 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   3.93 per second
   Timing:     1.2718s avg, 0.0325s min, 10.2221s max
               1.8706s standard deviation

Block data without getLogs

[user@work ethspam]$ ./ethspam  -m=eth_getCode:0 -m=eth_getLogs:0  -m=eth_getTransactionByHash:100  -m=eth_blockNumber:100  -m=eth_getTransactionCount:0 -m=eth_getBlockByNumber:100 -m=eth_getBalance:0  -m=eth_getTransactionReceipt:100  -m=eth_call:0| versus --concurrency=5 --stop-after=500 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   71.00 per second
   Timing:     0.0704s avg, 0.0314s min, 0.2424s max
               0.0507s standard deviation

So apparently getLogs is the slowest thing we have, and in general the state-reader queries are faster than the block-reader queries, even without the snapshot.


holiman commented Nov 3, 2020

And if I run it in default mode, I get

[user@work ethspam]$ ./ethspam  | versus --concurrency=5 --stop-after=500 'http://localhost:8545'
Endpoints:

0. "http://localhost:8545"

   Requests:   46.90 per second
   Timing:     0.1066s avg, 0.0308s min, 6.9147s max
               0.4940s standard deviation

Which indicates that even though my state accesses are at roughly 146 requests per second, the heavy, slow, multi-block-spanning getLogs (and the other block-data-accessing queries) drag the mean rate down quite heavily.


holiman commented Nov 4, 2020

So, I made some measurements on the machine (mon02); here are the results from 10K requests on each query type (all.txt is all the state-related queries, equally weighted):

all.txt:   Requests:   134.03 per second
balance.txt:   Requests:   146.75 per second
call.txt:   Requests:   139.80 per second
code.txt:   Requests:   125.73 per second
nonce.txt:   Requests:   141.03 per second

I'm now generating the snapshot on it and will do another run in a few days.


holiman commented Nov 5, 2020

It might also be a good idea to add a reference speed test. I opted to use web3_clientVersion, since it has more or less zero actual processing time, so the speed depends only on network speed, latency, and the node's RPC framework. That way we can measure the overall overhead.

#!/bin/bash
# Reference + per-query-type benchmark using ethspam and versus.

# Path to the ethspam binary and the common versus options.
spam=/home/user/go/src/github.com/shazow/ethspam/ethspam
v="--concurrency=5 --stop-after=10000 http://localhost:8545"

# Reference: web3_clientVersion does essentially no processing, so this
# measures the RPC/transport overhead only.
yes "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"web3_clientVersion\",\"params\":[]}" | versus $v > version.txt
echo "Reference speed:" && cat version.txt | grep Requests

# State-reading query types, combined and individually (a weight of 0 disables a query type).
$spam  -m=eth_getCode:100 -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:100 -m=eth_getBlockByNumber:0 -m=eth_getBalance:100 -m=eth_getTransactionReceipt:0  -m=eth_call:100 | versus $v > all.txt
echo "State (all) speed:" && cat all.txt | grep Requests
$spam  -m=eth_getCode:100 -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:0   -m=eth_getBlockByNumber:0 -m=eth_getBalance:0   -m=eth_getTransactionReceipt:0  -m=eth_call:0  | versus $v > code.txt
echo "State (code) speed:" && cat code.txt | grep Requests
$spam  -m=eth_getCode:0   -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:100 -m=eth_getBlockByNumber:0 -m=eth_getBalance:0   -m=eth_getTransactionReceipt:0  -m=eth_call:0  | versus $v > nonce.txt
echo "State(nonce) speed:" && cat nonce.txt | grep Requests
$spam  -m=eth_getCode:0   -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:0   -m=eth_getBlockByNumber:0 -m=eth_getBalance:100 -m=eth_getTransactionReceipt:0  -m=eth_call:0  | versus $v > balance.txt
echo "State (balance) speed:" && cat balance.txt | grep Requests
$spam  -m=eth_getCode:0   -m=eth_getLogs:0  -m=eth_getTransactionByHash:0  -m=eth_blockNumber:0  -m=eth_getTransactionCount:0   -m=eth_getBlockByNumber:0 -m=eth_getBalance:0   -m=eth_getTransactionReceipt:0  -m=eth_call:100| versus $v > call.txt
echo "State (call) speed:" && cat call.txt | grep Requests

So if I plug that into my earlier numbers:

Reference speed:
   Requests:   142.73 per second
State (all) speed:
   Requests:   134.03 per second
State (code) speed:
   Requests:   125.73 per second
State(nonce) speed:
   Requests:   141.03 per second
State (balance) speed:
   Requests:   146.75 per second
State (call) speed:
   Requests:   139.80 per second

I find that the requests I made were basically instantaneous, even for the non-snapshot-accelerated queries, so I'm not expecting any major gains from --snapshot. The queries all resolve in memory so quickly that the bottleneck (for me) seems to be the network (I performed these tests over wifi, using an ssh tunnel to an AWS host).

Could you please try the bash script above?

@elliottdehn

Finally got around to trying your script out. Will update this with numbers once my laptop is synced up again. Appreciate your work here, thank you!


elliottdehn commented Nov 7, 2020

These are my numbers on my laptop:

Reference speed:
   Requests:   2797.23 per second, 0.70 per second for errors
State (all) speed:
   Requests:   4354.61 per second
State (code) speed:
   Requests:   1460.58 per second
State(nonce) speed:
   Requests:   2606.11 per second, 1.44 per second for errors
State (balance) speed:
   Requests:   6767.52 per second
State (call) speed:
   Requests:   1459.57 per second

Which is, of course, wildly interesting. I'll have to write up a script to run the particular calls I'm interested in and see how that last number changes. The similarity between the second number (code) and the last number (call) makes me think that fetching a contract's code is actually some kind of bottleneck for eth_call, though this may be a coincidence. Hard to tell at the moment.
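
A minimal sketch of such a targeted benchmark, reusing the yes | versus pattern from the reference test above; the contract address and calldata below are hypothetical placeholders that would be replaced with the specific calls of interest:

# Repeat a single eth_call payload and measure throughput with versus.
# The "to" address and "data" values are placeholders, not real values.
yes '{"jsonrpc":"2.0","id":1,"method":"eth_call","params":[{"to":"0x00000000000000000000000000000000000000ff","data":"0x"},"latest"]}' \
  | versus --concurrency=5 --stop-after=10000 'http://localhost:8545'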

Edit: I ran it a few more times and the numbers between code and call seem to have decoupled a bit, with call becoming a bit slower:

dehnbostele@Elliotts-MacBook-Pro ~ % sh performance.sh                                                          
Reference speed:
   Requests:   7171.92 per second
State (all) speed:
   Requests:   3700.63 per second
State (code) speed:
   Requests:   1402.48 per second
State(nonce) speed:
   Requests:   6253.87 per second
State (balance) speed:
   Requests:   4537.84 per second
State (call) speed:
   Requests:   1255.44 per second

----------------

Reference speed:
   Requests:   2264.38 per second, 6395.19 per second for errors
State (all) speed:
   Requests:   858.61 per second, 11.84 per second for errors
State (code) speed:
   Requests:   1715.10 per second
State(nonce) speed:
   Requests:   5903.79 per second
State (balance) speed:
   Requests:   4951.06 per second
State (call) speed:
   Requests:   1320.30 per second


holiman commented Nov 9, 2020

Yeah, I don't know what to make of those numbers, tbh. Like, 0.70 errors per second on a simple web3_clientVersion? What error is that, and why would it error at all?
It also seems to vary pretty wildly: the reference speed is sometimes 7171.92 requests per second, and sometimes there are 6395.19 errors per second. And it also makes no sense that, for example, balance is faster than the reference test.

All in all, I'm starting to wonder whether other factors are making a large contribution to the speeds here. For instance, the RPC endpoint may be a lot slower if it's actively processing a block or reorganizing transactions; if it's importing a block, it uses several threads to derive transaction signatures. If you are doing this against a remote node, are you tunneling over ssh? Using wifi?


holiman commented Oct 25, 2021

This isn't really an issue report about a bug -- based on the discussion here, we don't have any reason to believe the snapshots are broken. It's a more nuanced question about what types of data are requested: the speeds vary a lot for different query types, some are accelerated, some not; some are cached, some not.

holiman closed this as completed Oct 25, 2021