
execute SQL updates as chunks in _set_spent function for tx_removals #11613

Merged
4 commits merged into Chia-Network:main on Jul 13, 2022

Conversation

Contributor

@neurosis69 neurosis69 commented May 23, 2022

Currently, one SQL UPDATE command is executed per coin removal in a transaction block, one after the other.
This can lead to up to 900 UPDATE executions per tx block.

-- list blocks with most tx_removals
sqlite> select spent_index, count(1) from coin_record 
   ...> group by spent_index order by 2 desc limit 10;
0|172132232
226075|917
225800|894
225987|893
947021|890
789903|889
1678819|887
1129746|886
1383510|885
1678869|884

This PR changes the issued UPDATE statement by

  • splitting the given coin_names into chunks (coin_names_chunk) using a given constant
  • executing only one SQL UPDATE per chunk, using coin_name IN ( ... ) instead of coin_name = x'...' (see the sketch below)
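In code, the chunking amounts to something like the following sketch (illustrative only: MAX_SQLITE_PARAMETERS, chunks() and the function name are assumed names, not necessarily the identifiers used in the PR, and the real method also handles the v1 schema):

from typing import Collection, Iterator, List

MAX_SQLITE_PARAMETERS = 900  # assumed chunk size; SQLite's default host-parameter limit is 999

def chunks(items: List[bytes], size: int) -> Iterator[List[bytes]]:
    # split the given coin_names into coin_names_chunk pieces of at most `size` entries
    for i in range(0, len(items), size):
        yield items[i : i + size]

async def set_spent_chunked(conn, coin_names: Collection[bytes], index: int) -> None:
    updated = 0
    for coin_names_chunk in chunks(list(coin_names), MAX_SQLITE_PARAMETERS):
        name_params = ",".join(["?"] * len(coin_names_chunk))
        ret = await conn.execute(
            f"UPDATE OR FAIL coin_record SET spent_index=? "
            f"WHERE spent_index=0 AND coin_name IN ({name_params})",
            [index, *coin_names_chunk],
        )
        updated += ret.rowcount
    # keep the previous behavior: fail loudly if not every coin was marked spent
    if updated != len(coin_names):
        raise ValueError(f"Invalid operation to set spent: only {updated} of {len(coin_names)} updated")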

SQL log from the current implementation:

16:35:20.558244 RELEASE s1609854
16:35:20.564141 SAVEPOINT s1609855
16:35:20.565120 UPDATE OR FAIL coin_record SET spent_index=283422 WHERE coin_name=x'32dd4b18693a838dff8dacc37a4ba64cd4ea27a53cebac67d137a80cf18feb63' AND spent_index=0
16:35:20.565926 UPDATE OR FAIL coin_record SET spent_index=283422 WHERE coin_name=x'ccd5035f9067dd7b87f0f472f4361674976faf4b3b43b241776a2151471179f2' AND spent_index=0
...
16:35:20.623934 UPDATE OR FAIL coin_record SET spent_index=283422 WHERE coin_name=x'd6a8b8efee10ef8771d921d3ac68530414cd5a3072cf4d61176840d525a76c04' AND spent_index=0
16:35:20.624644 UPDATE OR FAIL coin_record SET spent_index=283422 WHERE coin_name=x'f02d940775d2873c195a50c260bbd3334b1ee1a053083159df90f9309dc20405' AND spent_index=0
16:35:20.625800 RELEASE s1609855
16:35:20.626943 SAVEPOINT s1609856
16:35:20.627797 UPDATE OR FAIL full_blocks SET in_main_chain=0 WHERE height>283421 AND in_main_chain=1

tx time from SAVEPOINT to RELEASE: 0.061659s for 103 removals on spent_index 283422

SQL log after this PR's implementation:

22:29:39.108772 RELEASE s5958
22:29:39.109872 SAVEPOINT s5959
22:29:39.111652 UPDATE OR FAIL coin_record SET spent_index=283422 WHERE spent_index=0 
AND coin_name in (x'32dd4b18693a838dff8dacc37a4ba64cd4ea27a53cebac67d137a80cf18feb63',x'ccd5035f9067dd7b87f0f472f4361674976faf4b3b43b241776a2151471179f2',
...
x'f02d940775d2873c195a50c260bbd3334b1ee1a053083159df90f9309dc20405') 
22:29:39.115791 RELEASE s5959
22:29:39.116985 SAVEPOINT s5960
22:29:39.118238 UPDATE OR FAIL full_blocks SET in_main_chain=0 WHERE height>283421 AND in_main_chain=1

tx time from SAVEPOINT to RELEASE: 0.005919s for 103 removals on spent_index 283422

@mariano54
Contributor

wow that is a major speedup! This must be one of the main bottlenecks of syncing, since there are probably ~1000 coins per transaction block during the dust storm periods.

@neurosis69
Contributor Author

I'll need to rewrite my script, as the chosen f""" formatting makes it much harder to do log/trace analysis.

This is how an SQL log looks with this PR:

23:45:53.959293 SAVEPOINT s162960
23:45:53.959428
                            UPDATE OR FAIL coin_record INDEXED BY sqlite_autoindex_coin_record_1
                            SET spent_index=1556983
                            WHERE spent_index=0
                            AND coin_name IN (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)

23:45:53.959953 RELEASE s162960

@neurosis69 neurosis69 marked this pull request as draft May 24, 2022 07:28
@neurosis69
Contributor Author

neurosis69 commented May 24, 2022

wow that is a major speedup! This must be one of the main bottlenecks of syncing, since there are probably ~1000 coins per transaction block during the dust storm periods.

Earle answered my question about the SQLite concerns in the last AMA mainly by pointing to the high amount of I/O that is done right now. I see potential to reduce that by optimizing the SQL and some parameters around it.

For instance, I did full sync tests from height 0 to 1945145 on my i7-11700k recently and tested some of these changes.

branch    sync runtime    GB written
main      27h36m          4351.769
test1     25h07m          2443.256
test2     18h24m          1009.895

I'd expect that on RPi the reduced runtime percentage would be even higher.

While I fully understand the reasoning behind the runtime/write reduction from my test1, I am still not sure why test2 brings that much of an improvement. So I'm picking some of the kind-of-standalone changes and submitting them as PRs, like the last few I did.

@neurosis69 neurosis69 marked this pull request as ready for review May 24, 2022 12:03
@mariano54
Contributor

The tests are failing here, they need to pass.
Make sure to throw the exception like the previous code did:
if ret.rowcount != len(coin_names):

Those are some impressive improvements! What are test1 and test2?

@neurosis69
Contributor Author

The tests are failing here, they need to pass.

I have trouble understanding how these tests get triggered and what to check before making a PR to avoid these errors.
I didn't find anything about that in CONTRIBUTING.md. Can you give me a hint or point me to such dev principles so I can read up on it?

Make sure to throw the exception like the previous code did: if ret.rowcount != len(coin_names):

I think rowcount is only available for executemany, but I need to look up how to check that.
The chunks approach was only used in SELECT statements until now, I think, so that's different terrain of course.

Those are some impressive improvements! What are test1 and test2?

test1
simply starts the sync with journal_mode=off, synchronous=off and drops all indexes not needed for syncing, like the ones on coin_record, the hints table and the one on full_blocks used for blueboxes (similar to my main_fastsync branch).

The runtime improvement and I/O reduction were expected; it matches what I already experienced with dbv1 during my tests last winter, but I wasn't sure how to convince you to implement it and didn't have the time to follow up on it later.

I was thinking of a simple implementation like chia start node --fastsync: as long as the node is in long-sync mode it operates without journaling/synchronous and without the indexes. Then it either stops and you need to restart without fastsync, or, when switching to short sync, it automatically recreates the indexes and re-enables journaling and synchronous.
As I/O is such a concern and performance factor, especially on low-end hardware, I think that would give a solid improvement for syncing-only cases.
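Roughly, the test1 setup boils down to something like this before starting the sync (a sketch assuming aiosqlite; the index names are placeholders, not the actual schema names):

import aiosqlite

async def prepare_fastsync(db_path: str) -> None:
    async with aiosqlite.connect(db_path) as conn:
        # run without a rollback journal and without fsync while long-syncing
        await conn.execute("PRAGMA journal_mode=OFF")
        await conn.execute("PRAGMA synchronous=OFF")
        # drop indexes that aren't needed while syncing; recreate them afterwards
        for index_name in ("coin_spent_index", "coin_puzzle_hash_index", "hint_index"):
            await conn.execute(f"DROP INDEX IF EXISTS {index_name}")
        await conn.commit()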

test2
on top of the changes from test1, there are the PRs #11613, #11594, #11613 and a few other changes,
which you can find in the main_scenario0 branch of my repo.
I also tested different recommended SQLite compile options, which I thought could help in this project,
but I understood that you won't touch anything in that direction.
Currently I am running tests on my RPi to understand the improvement better, and I think it is just the combination of
everything mentioned above plus the increased buffer cache.
But it is still too much guessing on my side right now.

@mariano54
Contributor

You can get rowcount from the Cursor object which is returned from execute I think.

We need to look at the improvements one by one to see which one has an impact.

The tests get triggered every time you push a new commit. You can also close and re-open the PR to trigger the tests.
The main thing is that you need to replicate the exact behavior of the previous code. For example, we use hex in the previous code, and not bytes, for the coin names. Also, we throw a ValueError if we weren't able to set everything to spent. These things must be in the new PR.
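For reference, the v1/v2 difference mentioned here could be handled along these lines (a sketch; the function and variable names are made up for illustration):

def bind_coin_names(coin_names, db_version: int) -> list:
    # DB v2 stores coin_name as a 32-byte blob, DB v1 as a hex string,
    # so the IN (...) parameters have to be bound accordingly
    if db_version == 2:
        return [bytes(name) for name in coin_names]
    return [name.hex() for name in coin_names]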

Thanks for helping out btw, this is a very important change

@neurosis69 neurosis69 marked this pull request as draft May 24, 2022 16:31
@emlowe
Contributor

emlowe commented May 24, 2022

You can also run the tests locally. If you run sh install.sh -d, it will install the dev dependencies (pytest, etc.).

You can then find the test invocation by looking into the workflow files; for example, here is the invocation for the core-full_node test:

pytest --durations=10  -n 4 -m "not benchmark" tests/core/full_node/test_address_manager.py tests/core/full_node/test_block_height_map.py tests/core/full_node/test_conditions.py tests/core/full_node/test_full_node.py tests/core/full_node/test_generator_tools.py tests/core/full_node/test_hint_management.py tests/core/full_node/test_mempool.py tests/core/full_node/test_mempool_performance.py tests/core/full_node/test_node_load.py tests/core/full_node/test_peer_store_resolver.py tests/core/full_node/test_performance.py tests/core/full_node/test_transactions.py

@xearl4
Contributor

xearl4 commented May 24, 2022

(Don't mind me, I'm just helping @neurosis69 a bit with putting his ideas into code :)

While `executemany` is convenient for executing the same SQL for many
different sets of parameters, it still executes one SQL for each
parameter set. For updating coin spends, this means potentially running
hundreds of UPDATEs per transaction block.

With this change, these hundreds of single coin UPDATEs are batched into
larger chunks where a single UPDATE marks many coins spent.
@neurosis69 neurosis69 marked this pull request as ready for review May 24, 2022 21:59
@emlowe emlowe requested review from arvidn and mariano54 May 26, 2022 16:56
arvidn previously approved these changes May 26, 2022
Contributor

@arvidn arvidn left a comment


I think this looks great! I've spent some time recently trying to pin down exactly where we're spending so much time. My first impression was that we were I/O bound, but I couldn't find any smoking guns when profiling the aiosqlite thread. I suspect we spend a lot of time ping-ponging with the aiosqlite thread.

So, I suspect this speed-up could also be gained by not waiting for one execute() to complete before issuing the next. Until recently I assumed async functions would return a future and you could just remove the await and wait for them all at the end of the function. However, that doesn't quite work; you have to create_task() to get a future back.

In summary, I think every single await in a loop is potentially problematic, and we should probably generally try to avoid that by creating short-lived tasks (just to get futures for the async operations). Even with this change, ideally you wouldn't use await inside the loop. Although, in practice, you would most likely end up with two or fewer turns through the loop.
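A minimal sketch of the pattern described above (kick off the statements first, await them together); it is illustrative only, and whether it helps in practice depends on how aiosqlite schedules work on its worker thread:

import asyncio
from typing import Iterable, Tuple

async def run_updates_concurrently(conn, statements: Iterable[Tuple[str, list]], expected_rows: int) -> None:
    # create_task() starts each execute() immediately and hands back a future,
    # so we don't await (and ping-pong with the aiosqlite thread) once per statement
    tasks = [asyncio.create_task(conn.execute(sql, params)) for sql, params in statements]
    cursors = await asyncio.gather(*tasks)
    if sum(cursor.rowcount for cursor in cursors) != expected_rows:
        raise ValueError("not all rows were updated")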

name_params = ",".join(["?"] * len(coin_names_chunk))
if self.db_wrapper.db_version == 2:
    ret: Cursor = await conn.execute(
        f"UPDATE OR FAIL coin_record INDEXED BY sqlite_autoindex_coin_record_1 "
Contributor


specifying the index here, I assume, is to make sure it's not using the spent_index index, right?
The condition spent_index=0 isn't really supposed to be required. All these coins are supposed to be unspent (iiuc). Perhaps we could remove that condition and also drop the INDEXED BY, would that work or make sense?

Contributor Author


Yes, I first ran this without that index clause and it started to create a huge amount of I/O; then I figured out it was using the wrong index.
I'm sure that if we remove the spent_index condition, we won't need INDEXED BY any more.
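One way to double-check which index SQLite picks for the chunked UPDATE is EXPLAIN QUERY PLAN; a rough sketch (the database path is a placeholder):

import asyncio
import aiosqlite

async def show_plan(db_path: str) -> None:
    async with aiosqlite.connect(db_path) as conn:
        cursor = await conn.execute(
            "EXPLAIN QUERY PLAN "
            "UPDATE coin_record SET spent_index=1 WHERE spent_index=0 AND coin_name IN (?,?,?)",
            [b"\x00" * 32, b"\x01" * 32, b"\x02" * 32],
        )
        async for row in cursor:
            print(row)  # shows whether the primary key or the spent_index index is used

asyncio.run(show_plan("blockchain_v2_mainnet.sqlite"))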

@arvidn
Contributor

arvidn commented May 26, 2022

I'm interested in running benchmarks on this myself.

@arvidn
Contributor

arvidn commented May 26, 2022

running the tools/test_full_sync.py, I don't see any speed-up with this patch. I wonder why.

Running the stress test with 500 full blocks, I see:

before after
154.56s 155.57s

@arvidn
Contributor

arvidn commented May 26, 2022

My first run was with db-sync off; running it with full also didn't show an improvement. I didn't measure bytes written, though. I should probably add that to tools/run_benchmark.sh as well.

before after
163.15s 163.21s

@neurosis69
Contributor Author

On which hardware are you running your benchmarks? My timings in the first post for this patch were taken on an RPi.

@neurosis69
Contributor Author

I've started a few test syncs (regular, from peers) over the last few days, and when I've sorted the data out I will share it in my repo.

How can I start the benchmark you mentioned?
Just python3 tools/test_full_sync.py?

@arvidn
Contributor

arvidn commented May 26, 2022

I'm running it on an AMD threadripper with SSD, so quite powerful compared to the RPi.

Unfortunately it's not so simple to run the benchmark. I'm using an artificial chain, created to max out all blocks, i.e. as a stress test. You can create this chain by running:

python ./tools/generate_chain.py --length 500 --fill-rate 100

And when running test_full_sync.py you have to specify --test-constants, because this chain doesn't use mainnet constants (that would be too expensive to create).

@neurosis69
Contributor Author

chia-sync-data is my repo where I'll upload the test data.

These uploads include the debug.log and automatically generated sar reports for every uploaded test.
Currently there are two test runs uploaded, one using the main branch and one with most of the optimizations I use (main_scenario0 branch).

The Ansible repo I use for these tests is chia-sync-test.
Info about scenarios/test cases is also there.

@neurosis69
Contributor Author

For both test runs mentioned above, here is an additional Grafana chart.

The left side uses my branch with optimizations, the right side uses the main branch.

row 1: blocks synced over 5m
row 2: bytes written to db device
row 3: cpu utilization

(Grafana chart image)

@neurosis69
Contributor Author

neurosis69 commented May 26, 2022

running the tools/test_full_sync.py, I don't see any speed-up with this patch. I wonder why.

I'm not sure what this test actually does. Is this a block validation test?
Because there is await in_db.execute("pragma query_only") for the only db connection.
When tracing the session, only one SQL SELECT statement was issued.

I don't think this is a valid test for my proposed change in the _set_spent function, or is it?

EDIT: Or did I execute the wrong script?


@arvidn
Contributor

arvidn commented May 27, 2022

I'm not sure what this test actually does. Is this a block validation test?
Because there is await in_db.execute("pragma query_only") for the only db connection.
When tracing the session, only one SQL SELECT statement was issued.

That's the connection that reads the test blockchain database. It feeds those blocks into a full node instance, that's the actual test/benchmark.

@neurosis69
Contributor Author

Finally I was able to run your benchmark on my i7-11700k with SQL logging enabled.

before after
148.28s 144.21s

sql.log output, taken from the last updates issued at block height 479:

before:

13:38:03.322184 SAVEPOINT s2808
13:38:03.322345 UPDATE OR FAIL coin_record SET spent_index=? WHERE coin_name=? AND spent_index=0
...
13:38:03.348058 UPDATE OR FAIL coin_record SET spent_index=? WHERE coin_name=? AND spent_index=0
13:38:03.348209 RELEASE s2808

13:38:03.348209 - 13:38:03.322184 = 0.026025s

after:

13:42:32.806548 SAVEPOINT s2808
13:42:32.807098 UPDATE OR FAIL coin_record INDEXED BY sqlite_autoindex_coin_record_1 SET spent_index=479 WHERE spent_index=0 AND coin_name IN (?,?,?,?,?,
?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
...
,?,?,?,?,?,?,?,?,?,?)
13:42:32.814143 RELEASE s2808

13:42:32.814143 - 13:42:32.806548 = 0.007595s

I compared the main branch with the change in the _set_spent function.
I wouldn't expect much I/O-related improvement from this PR; it is only about UPDATE execution time.

The I/O-related improvements would come from disabling WAL/synchronous and dropping indexes during sync
(although I'm currently testing the impact of leaving WAL on, as I had a few annoying DB locking situations during my tests).

@arvidn
Contributor

arvidn commented May 27, 2022

I was expecting WAL would improve write performance, since updates are written sequentially to the WAL, and incorporated into the DB later, asynchronously.

@neurosis69
Contributor Author

I was expecting WAL would improve write performance, since updates are written sequentially to the WAL, and incorporated into the DB later, asynchronously.

Compared to other journal modes, I'd agree. But compared to no journaling at all, I'd expect it to be slower: at minimum you need an additional checkpointing operation, which slows down transactions. Probably a different file format in WAL as well, but I don't know SQLite that well.

Here is the hyperfine benchmark result of 10 runs for the above tests; the slight improvement remains.
I'd expect a higher improvement percentage on an RPi.

before:

Benchmark 1: python ./tools/test_full_sync.py run --test-constants /chia_temp1/test-chain/stress-test-blockchain-500-100-refs.sqlite
  Time (mean ± σ):     146.391 s ±  0.508 s    [User: 368.267 s, System: 147.680 s]
  Range (min … max):   145.375 s … 147.050 s    10 runs

after:

Benchmark 1: python ./tools/test_full_sync.py run --test-constants /chia_temp1/test-chain/stress-test-blockchain-500-100-refs.sqlite
  Time (mean ± σ):     143.878 s ±  0.536 s    [User: 364.189 s, System: 149.680 s]
  Range (min … max):   142.985 s … 144.855 s    10 runs

@neurosis69
Contributor Author

neurosis69 commented May 27, 2022

I have another PR planned, based on an idea of @xchdata1, using batch INSERT for the _add_coin_records function.

Running the current benchmark with only this change brings the runtime down to 138.88s.
Running the benchmark with both changes, in _add_coin_records and the _set_spent function, the runtime is 135.54s.

But for this INSERT PR I'd need additional assistance from @xchdata1 :)
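The batch-INSERT idea could look roughly like this (a sketch; the column list is an assumption for illustration, not the exact coin_record schema):

from typing import Iterator, List, Sequence, Tuple

def build_batch_inserts(rows: List[Sequence], chunk_size: int = 100) -> Iterator[Tuple[str, list]]:
    # one multi-row INSERT per chunk instead of one INSERT per row via executemany()
    columns = "(coin_name, confirmed_index, spent_index, coinbase, puzzle_hash, coin_parent, amount, timestamp)"
    row_placeholder = "(" + ",".join(["?"] * 8) + ")"
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i : i + chunk_size]
        sql = f"INSERT INTO coin_record {columns} VALUES " + ",".join([row_placeholder] * len(chunk))
        params = [value for row in chunk for value in row]
        yield sql, params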

@neurosis69
Contributor Author

neurosis69 commented May 29, 2022

Benchmark tests
I integrated your benchmark tool into my Ansible roles and did a couple of runs against my repo's branches.

For now I'll simply update the readme in my repo.

At the top, the table contains the runtimes from my i7-11700k, the average timing over 5 executions on a 500-block test.
Below I added runtimes from my RPi4, the average timing over 2 executions on a 500-block test.
At the bottom there are runtimes from my old sync test using mainnet peers.

I'll probably add timings for other tests and also plan to create a bigger chain, as 500 blocks is a bit too small for my i7, I think.


Synchronous
What I don't understand is that when running your benchmark tool, the I/O already seems reduced; it only peaked when I explicitly set synchronous to NORMAL, which is strange because by default it should be set to FULL on my setup.
(chart image)

It looks like it is always off. Is this part of your simulation?
Because that would probably invalidate some of my test runs.


Regarding this PR, I'd conclude:

  • This change leads to less DB time, measured via SQL logging from SAVEPOINT to RELEASE (the whole transaction).
    In theory I'd explain that by less parsing and execution time, but as mentioned, I'm no SQLite expert and lack deep insight there.
  • A full-sync improvement, measured with your benchmark tool, is not really there: a slight 0.6% improvement on the RPi, but only 2 tests were done.
  • improvements found:
    • using an increased cache_size combined with this change to the _set_spent updates
    • dropping the coin_record indexes combined with this change to the _set_spent updates
  • maybe aiosqlite has a throttling effect here, using such big statements with bind variables (mem?)

EDIT: Is there a built-in profiling feature to see only the time spent in the _set_spent function call? Because that's what we actually want to measure.

@arvidn
Contributor

arvidn commented May 29, 2022

It looks like it is always off. Is this part of your simulation?

The original purpose of test_full_sync.py was to run a sync-from-scratch of a mainnet chain, to ensure no consensus rules had accidentally changed affecting the ability to sync. So setting it to "OFF" was an attempt to make that run faster.

I've manually set it to "FULL" from time to time to be more authentic for benchmark purposes. I also have an outstanding PR to make this configurable on the command line: #11647

@neurosis69
Contributor Author

I've manually set it to "FULL" from time to time to be more authentic for benchmark purposes. I also have an outstanding PR to make this configurable on the command line: #11647

Looking forward to #11647 being merged so I can rerun the benchmarks.

@wallentx wallentx closed this Jul 5, 2022
@wallentx wallentx reopened this Jul 5, 2022
Contributor

@wjblanke wjblanke left a comment


aok

@wallentx wallentx merged commit 37ab8fc into Chia-Network:main Jul 13, 2022