[ADP-634] Run SMASH metadata fetching in batches of 15 concurrently #2432
Conversation
@piotr-iohk maybe you want to do some mainnet testing?
I have done some testing against mainnet on Linux and Windows (on a synced network).

Linux (Ubuntu, i7 6-core, 32GB RAM) fetching time:

Windows (VM, 4 virtual CPUs, 4GB RAM) fetching time:

In both cases (and systems) I was wondering if it'd be worth adding such a benchmark into the benchmark suite for future reference?
So it seems no one is really sure what http-client does under the hood when sharing a `Manager`.
Use Wireshark to check what's happening (before and after). You don't need to decrypt the HTTPS, just look at how many TCP connections are being opened. Running HTTP requests in parallel is faster than in sequence, because the connection setups are concurrent.
From looking at the source, I think the connection idle timeout for the http-client `Manager` connection pool is 30 seconds. I could be mistaken however.
Following the http-client ticket:

Given the improvement with

I just did some testing with Wireshark and ghci - if you set the header
I think we have a pretty clear vision now of how to move this forward:
I noticed this while syncing an empty wallet for looking at stake pools. On a node which is also syncing, the wallet was spamming the reward account endpoint with an empty reward list. Interestingly, each request would take between 300ms and 500ms (which is quite a lot for an empty request...) and, I believe, would also require quite some time / context-switching on the node. I think we ought to have a sort of debouncer here (though arguably, once synced, this only gets triggered every 20s or so...); yet, a low-hanging fruit is to simply short-circuit everything when there's no account.
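A minimal sketch of that short-circuit, assuming a hypothetical `fetchRewardBalances` wrapper where `queryNode` stands in for the real node query (neither name is the wallet's actual API):

```hs
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Skip the network round trip entirely when there is nothing to ask for.
-- 'queryNode' is a stand-in for the actual reward-balance query against
-- the node; all names here are illustrative, not the wallet's real API.
fetchRewardBalances
    :: ([acct] -> IO (Map acct balance))  -- ^ the underlying node query
    -> [acct]                             -- ^ reward accounts to look up
    -> IO (Map acct balance)
fetchRewardBalances queryNode accounts
    | null accounts = pure Map.empty  -- short-circuit: no accounts, no request
    | otherwise     = queryNode accounts
```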
Force-pushed from 910740f to 6fc3208.
@rvl implementation revised to use a bounded queue. Testing on mainnet, it took a bit less than a minute to fetch all metadata on my machine. I think that's "okay-ish" / sufficient at this stage. What I imagine could be useful as follow-up work is to provide some progress stats to Daedalus (e.g. number of pending attempts, number of known registrations...).
There's one seemingly important yet subtle change in this PR with regards to the fetching strategy. From:

```hs
when (null refs || null successes) $ do
    traceWith tr $ MsgFetchTakeBreak blockFrequency
    threadDelay blockFrequency
```

To:

```hs
when (null refs) $ do
    traceWith tr $ MsgFetchTakeBreak blockFrequency
    threadDelay blockFrequency
```

So, as of now (i.e. before this change), the wallet pauses for a whole 20s when all requests to fetch metadata have failed. I think the rationale was originally to prevent the wallet from just eating up CPU by trying to fetch only invalid pools... What this creates in practice is simply more unnecessary delays. There are about 8.5K pool registrations currently on mainnet, and after a full sync, there are still ~2254 fetch attempts still to proceed... so that's a lot of occasions to get a `null successes` on a particular loop. There's already a backoff which comes with the fetch attempts and prevents the wallet from doing needless work anyway. With this little change, we should massively improve the time needed for updating metadata upon new registrations, such as the problem described in: input-output-hk/daedalus#2359
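As an aside, a hedged sketch of what such an exponential backoff gate can look like; the real check lives in a raw SQL query, and the base delay and growth factor below are assumptions for illustration:

```hs
import Data.Time.Clock (UTCTime, addUTCTime)

-- A ref becomes fetchable again only once its backoff window has elapsed;
-- the window grows exponentially with the number of failed attempts. The
-- base delay (60s) and growth factor (2) are illustrative assumptions.
retryAfter :: Int -> UTCTime -> UTCTime
retryAfter retryCount lastAttempt =
    addUTCTime (60 * 2 ^ retryCount) lastAttempt

isDue :: UTCTime -> Int -> UTCTime -> Bool
isDue now retryCount lastAttempt =
    now >= retryAfter retryCount lastAttempt
```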
👍 Nice.
The time taken to fetch all metadata is highly dependent on your ping to the smash server.
So a 1 minute fetch in Europe could be longer in Australia.
How long did it take before this change to fetch all metadata on your connection?
```hs
enqueueMetadata
    :: TBQueue (PoolId, StakePoolMetadataUrl, StakePoolMetadataHash)
    -> IO ()
enqueueMetadata queue = forever $ do
    refs <- atomically (unfetchedPoolMetadataRefs 100)
```
Perhaps set this number of refs to be smaller than `queueSize` - maybe the same as `numInFlight`.
Whoops, I should've removed the magic number here and used `queueSize` instead. Thing is, I don't want the queue to be empty if there's something to fetch. So ideally, this should be > `numInFlight`. The bounded queue will block on writes anyway if the queue is full.
```hs
h <- readTBQueue queue
q <- catMaybes <$> replicateM (numInFlight - 1) (tryReadTBQueue queue)
pure (h:q)
forConcurrently_ refs $ \(pid, url, hash) -> do
```
This will still have uneven bursts of requests, won't it?
But if the fetch time is down to one minute, then it's not really a problem.
I see what you mean; we could make the consumer send queries one-by-one (asynchronously) and actually use the queue's size to limit the number of requests in flight 🤔 ...
Done 👍
Whoops. Actually no, it's not that simple, because now `unfetchedPoolMetadataRefs` gets called multiple times in a row before it's correctly updated.
Okay, now done properly. After a full run, there are no attempts with a `retry_count` greater than 3, which sounds right.
This works by using a bounded queue as a 'seat allocator'. Requests are only emitted in an asynchronous thread when there's a seat at the table. On my machine, the total time is cut from ~320s to ~50s. It also guarantees that there are never more than N requests in flight, and does a best effort at keeping the number of requests in flight close to N at any time (this works because the time needed to fetch the requests to send from the database is much smaller than the time needed to actually send and process them).
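For reference, a minimal sketch of how such a 'seat allocator' can be built on a `TBQueue`; the signature and body are assumptions based on the snippets in this thread, not the exact implementation:

```hs
import Control.Concurrent (forkFinally)
import Control.Concurrent.STM (TBQueue, atomically, readTBQueue, writeTBQueue)
import Control.Monad (void)

-- Take a seat (blocking while the queue is full), then run the action in
-- a short-lived thread that frees the seat when it finishes, whether it
-- returned normally or threw an exception.
withAvailableSeat :: TBQueue () -> IO a -> IO ()
withAvailableSeat seats action = do
    atomically $ writeTBQueue seats ()  -- blocks while all seats are taken
    void $ action `forkFinally` \_ ->
        atomically (readTBQueue seats)  -- give the seat back
```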
Force-pushed from 6fc3208 to a6c05d1.
About 320s.
For me on
```hs
inFlights <- STM.atomically $ newTBQueue maxInFlight
forever $ do
    refs <- atomically (unfetchedPoolMetadataRefs $ fromIntegral maxInFlight)
    forM_ refs $ \(pid, url, hash) -> withAvailableSeat inFlights $ do
```
This is a pretty neat pattern. Unrelatedly, I'm wondering if this is achievable with streamly or conduit.
I actually thought of using a conduit here, to stream things directly from the database and then process them at the right pace. Yet, that's much more work on the db side, and the persistent API is quite unsatisfactory here (we use a custom raw SQL query behind the scenes, so most of the persistent API is just unusable). Plus, we would need a "refreshable" stream, because after requests are processed, the set of metadata refs to fetch actually changes.
```hs
    threadDelay blockFrequency
  where
    maxInFlight :: Natural
    maxInFlight = 20
```
Probably doesn't matter for merging the PR, but did someone check other values? I'm not questioning that 20 is a reasonable default, but maybe there's more room to optimize? At which point will performance degrade or SMASH give up on us?
I've tried: 10, 20 and 50. There seems to be almost no difference; they all oscillate between 40s and 60s for an entire refresh 🤷♂️ .. I went for 10 in the end.
I noticed that after running the new implementation, some metadata would have a retry_count way too high (internally, we use an exponential backoff, so it'd already take several days to reach ~6 or 7... yet I found some with 20+ retries). Turns out it was because the refs to fetch are only updated _after_ a successful or unsuccessful fetch, and since fetching is asynchronous, it could happen that a single metadata ref was simply enqueued over and over until a response finally came out of the server to change the state. Yet, all the other queued requests would also count towards the retry_count... and would simply eat up all the retries in one go.
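To picture a guard against this, here is an illustrative sketch (not the actual fix): track which refs are already in flight and filter each new batch against that set.

```hs
import Control.Concurrent.STM (STM, TVar, modifyTVar', readTVar)
import Data.Set (Set)
import qualified Data.Set as Set

-- Drop refs that are already being fetched and mark the remainder as in
-- flight. The caller is expected to remove a ref from the set once its
-- fetch completes, successfully or not.
takeFreshRefs :: Ord ref => TVar (Set ref) -> [ref] -> STM [ref]
takeFreshRefs inFlight refs = do
    seen <- readTVar inFlight
    let fresh = filter (`Set.notMember` seen) refs
    modifyTVar' inFlight (Set.union (Set.fromList fresh))
    pure fresh
```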
Thanks 👋 bors r+
Build succeeded:
Since #2432 (concurrent pool metadata fetching), metadata is now fetched in new short-lived threads forked from the 'fetchMetadata' worker thread. Killing the parent 'fetchMetadata' thread does not actually kill the child threads, which may still complete afterwards. As a consequence, when changing settings from smash (or direct) to none, in-flight requests spawned in short-lived threads may still resolve even though the 'fetchMetadata' loop has been terminated. This commit makes sure to write the result of a request if and only if the settings haven't changed since the parent thread was launched. This should prevent short-lived threads from updating the database after the parent is killed when changing settings.
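A hedged sketch of that guard; `readCurrentSettings`, `runFetch`, and `persistResult` are hypothetical stand-ins, not the wallet's real internals:

```hs
import Control.Monad (when)

-- Capture the settings under which this fetch was started, run the
-- request, and persist the result only if the settings are unchanged.
-- All names below are illustrative stand-ins for the wallet's internals.
fetchAndPersist
    :: Eq settings
    => IO settings        -- ^ read the current settings
    -> IO result          -- ^ perform the metadata fetch
    -> (result -> IO ())  -- ^ write the fetch result to the database
    -> IO ()
fetchAndPersist readCurrentSettings runFetch persistResult = do
    settingsAtLaunch <- readCurrentSettings
    result <- runFetch
    settingsNow <- readCurrentSettings
    -- Settings changed mid-flight (e.g. smash -> none): drop the result.
    when (settingsNow == settingsAtLaunch) $ persistResult result
```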
2567: Guard pool metadata fetch result persistence with settings status r=rvl a=KtorZ

Issue Number

ADP-634

Overview

- Guard pool metadata fetch result persistence with settings status (see the commit message above for the full rationale)

Co-authored-by: KtorZ <[email protected]>
Co-authored-by: Piotr Stachyra <[email protected]>
Issue Number

ADP-634

Overview

All requests already share the same `Manager`. I'm not sure if that implicitly leads to proper HTTP pipelining when used with `forConcurrently`, but it seems to work: the sync time on testnet goes from 3 minutes down to ~30 seconds.

Comments

Open questions
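For context, a minimal self-contained sketch of the pattern the overview describes: one shared `Manager` reused across requests issued with `forConcurrently` (placeholder URLs, not the wallet's actual fetch code):

```hs
import Control.Concurrent.Async (forConcurrently)
import Network.HTTP.Client (httpLbs, newManager, parseRequest, responseStatus)
import Network.HTTP.Client.TLS (tlsManagerSettings)

-- One Manager shared across all concurrent requests, so http-client can
-- pool and reuse connections between them. The URLs are placeholders,
-- not real SMASH endpoints.
main :: IO ()
main = do
    manager <- newManager tlsManagerSettings
    let urls = ["https://example.com/a", "https://example.com/b"]
    responses <- forConcurrently urls $ \url -> do
        request <- parseRequest url
        httpLbs request manager
    mapM_ (print . responseStatus) responses
```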