-
Notifications
You must be signed in to change notification settings - Fork 622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: orphan chunk state witness pool #10613
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #10613 +/- ##
==========================================
+ Coverage 72.34% 72.42% +0.07%
==========================================
Files 730 732 +2
Lines 149292 149916 +624
Branches 149292 149916 +624
==========================================
+ Hits 108012 108578 +566
- Misses 36374 36417 +43
- Partials 4906 4921 +15
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
a2a8632
to
42caa7c
Compare
Sometimes when a ChunkStateWitness arrives to be processed, the block required to process it isn't available yet, we have to wait for it. near#10535 implemented a hacky way to do it by retrying the processing every 500ms until the required block arrives. This PR will implement a proper solution, so let's remove the hacky workaround.
The current function to verify state witness signatures takes the epoch_id from the previous block. For orphaned state witnesses the previous block isn't available, so it's impossible to use this function. Let's add another function that takes the epoch_id as an argument instead of fetching it from a block. It will allow us to verify signatures of orphaned chunk state witnesses.
Add a struct which keeps a cache of orphaned ChunkStateWitnesses. It provides two methods that allow to easily add a ChunkStateWitness to the pool, and then when a block arrrives, another method allows to easily fetch all orphaned ChunkStateWitnesses that were waiting for this block.
Let's split this function into two parts - one part which tries to find the previous block for the given ChunkStateWitness and another which processes the witness when the previous block is available. It will make handling orphaned witnesses easier - once the block arrives, we can just call the second function to process it. In the future it might also make it easier to process witnesses before the previous block is saved to the database, which could reduce latency.
736bc2b
to
3ea39f7
Compare
Add a function which tries to determine in which epoch the specified height is located. It looks at the previous, current and next epoch around the chain Tip and tries to figure out to which one of them this height belongs. It's not always possible to determine in which epoch a given height will be, as the start height of the next epoch isn't known in the current epoch, so the function returns a set of possible epochs where the height might be. For example if the tip is in the middle of the current epoch, and the height is about epoch_length blocks higher than the tip, this function would return [current_epoch_id, next_epoch_id], as this height could be in either one of those epochs. It will later be used to verify the signature of orphaned witnesses. To verify the signature we need to know what is the chunk producer who should produce a witness at some height, and for that we need to know what's the epoch_id at this height.
When the previous block isn't available, we can't process the ChunkStateWitness and it becomes an orphaned state witness, waiting for the required block to arrive. In such situations the witness is partialy validated and then put in the `OrphanStateWitnessPool`. It's impossible to fully validate the witness, as the previous block isn't available, but we can do some basic checks to curb abusive behavior - check the shard_id, signature, size, etc.
When a new block is accepted we can process all orphaned state witnesses that were waiting for this block. Witnesses which became ready are taken out of the pool and processed one by one.
There is no ideal size of the witness pool, node operators might want to adjust it to achieve optimal performance. Let's make this possible by adding a configuration option which allows to control the pool size.
Add metrics which provide information about witnesses kept in the pool. Metrics are increased and decreased using a RAII struct to ensure that they're always correct, no matter what happens with the pool.
There was no way to get the index of a client with some AccountId, let's add a method that enables it to `TestEnv`
Add a function that allows to wait until some chunk endorsement is sent out. It will later be used to verify that a validator sends out endorsements for an orphaned chunk state witness when its block arrives.
`stateless_validation` is a private module of `near_client`, so it's impossible to use items defined there in integration tests. Let's expose some of them to make their use in tests possible.
`run_chunk_validation_test` has a very useful function which allows to determine a block producer at some offset. I would like to also use it in my integration tests. Let's add it to `TestEnv` so that other tests can also use it.
Add a function analogous to `get_block_producer_at_offset`, but for chunk producers.
3ea39f7
to
0183075
Compare
@pugachAG I see that you self-requested a review. |
random thought: We may also want to clear the cache periodically. If I recall correctly currently we only remove the witness for which a block is received. This means we will never remove a witness from the cache unless we receive the block for it and eventually we'll always have full cache. I would suggest something like cleaning up witnesses with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
target: "client", | ||
witness_height, | ||
witness_shard, | ||
witness_chunk = ?chunk_header.chunk_hash(), | ||
witness_prev_block = ?chunk_header.prev_block_hash(), | ||
witness_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should, I believe the span log level only applies to when you have enabled logging of entering and closing spans.
// Finally try the previous epoch. | ||
// First and last blocks of the previous epoch are already known, so the situation is clear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that we already have the height -> epoch id mapping stored in the db in one form or another so we wouldn't need to actually walk the chain. But Longarithm's point is even better.
chain/epoch-manager/src/tests/mod.rs
Outdated
); | ||
|
||
let epoch1 = EpochId(h[0]); | ||
dbg!(&epoch1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use tracing? It's used throughout the codebase and everyone is familiar with it. This is a mini nit given it's only a test so - optional :)
prev_block: &Block, | ||
processing_done_tracker: Option<ProcessingDoneTracker>, | ||
) -> Result<(), Error> { | ||
if witness.inner.chunk_header.prev_block_hash() != prev_block.hash() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is one example where I think asserting is more appropriate than returning an error
- the contract of this function already assumes block to correspond to
witness.inner.chunk_header.prev_block_hash()
, so violating that is a programmatic error, not an "invalid input" kind of error - having error handling here might give an impression that this could actually happened under some input, which is not true
- returning error might mask an underlying issue and result in more time spent on debugging or completely hide it which is even worse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, I'm really scared of introducing code that could lead to DoS vulnerability. It might not be a big threat to normal services, but on a blockchain I imagine that a DoS means that it'd be possible to kick out validators and cause all sorts of mayhem.
IMO passing the wrong block to process_chunk_state_witness
is not a fatal error - it's invalid input, for which the function can safely fail.
I added it because I know that there are plans to process witnesses before the block is applied, and I was worried that someone might accidentally use it incorrectly once the complexity of things increases.
I don't see a reason to risk a panic here. This error will show up in the logs with an ERROR
log-level, and I think anyone debugging an issue looks for those, so the visibility is there. Maybe we could introduce a BUG
log-level, that would be used for non-critical bugs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feel free to merge the PR as it is and we can discuss this on protocol core weekly sync
chain/client/src/stateless_validation/chunk_validator/orphan_witness_handling.rs
Show resolved
Hide resolved
/// List of orphaned witnesses that wait for this block to appear. | ||
/// Maps block hash to entries in `witness_cache`. | ||
/// Must be kept in sync with `witness_cache`. | ||
waiting_for_block: HashMap<CryptoHash, HashSet<(ShardId, BlockHeight)>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really worth maintaining this lookup map in the given context? the default size limit is 25 and I don't think it makes sense to have it much higher than that (if we so behind then there is very little chance to send the endorsement in time). I think just iterating over witness_cache
would have been more than performant enough here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's true.
After doing so much leetcode it feels very wrong to use a suboptimal algorithm, but there's no point in introducing so much complexity and expects
when the cache holds 25 items. You're right, the rational choice is to use a naive algorithm.
Changed to a simpler implementation.
One thing I liked about the lookup map is that it provides a "safe" api - no matter what number someone passes, the cache will behave correctly. With a naive implementation using a higher value might lead to silent performance problems, so I added a warning if someone tries to create a cache with capacity of more than 128.
In previous commit I added debug spans to avoid writing witness_height, ... everywhere. I also removed them from the ejection message, but that was a mistake. The span contains information about the witness that is being added, not the one that is being ejected. Because of that the message about ejecting a witness would contain invalid witness_height and other parameters. Fix by using the correct parameters, and names which don't collide with the ones in the span.
I imported near-o11y to use the function for initializing a tracing-subscriber in epoch manager tests, but I didn't import the corresponding nightly packages, so a CI check failed. Let's fix this by importing the packages that the CI check prosposed.
Opened issues about it, those features can be added in a follow-up PR, this one is big enough already. |
Ran python3 scripts/fix_nightly_feature_flags.py fix to satisfy the CI
While resolving a merge conflict caused by near#10646, I accidentally removed a handler for chunk endorsement messages. I didn't mean to do it, I wanted to only modify handlers for chunk state witnesses. Let's undo the change and bring the chunk endorsement handler back.
This reverts commit 5cae3c3. It was actually necessary to remove the endorsement message handler, it doesn't compile otherwise 🤦
Do we have logging to monitor how much cache is consumed (so we can adjust as needed)? |
Missed reviewing this, but great PR! |
Description
This PR adds a pool for orphaned
ChunkStateWitnesses
.To process a
ChunkStateWitness
we need the previous block, but sometimes it isn't available immediately. The node might receive aChunkStateWitness
before the block that's required to process it. In such cases the witness becomes an "orphaned chunk state witness" and it's put inOrphanChunkStateWitnessPool
, where it waits for the desired block to appear. Once a new block is accepted, we fetch all orphaned witnesses that were waiting for this block from the pool and process them.Design of
OrphanStateWitnessPool
OrphanStateWitnessPool
keeps a cache which mapsshard_id
andheight
to an orphanChunkStateWitness
with these parameters:All
ChunkStateWitnesses
go through basic validation before being put in the orphan cache.@staffik
and@Longarithm
the observedChunkStateWitness
sIze was 16-32MB, so a 40MB limit should be alright. This PR only limits the size of orphaned witnesses, limiting the size of non-orphan witnesses is much more tricky, see the discussion in Limit ChunkStateWitness size to 16MB #10615.It's impossible to fully validate an orphaned witness, but this partial validation should be enough to protect against attacks on the orphan pool.
Under normal circumstances there should be only a few orphaned witnesses per shard. If the node has fallen behind by more than a few blocks, it has to catch up and its chunk endorsements don't matter.
The default cache capacity is set to 25 witnesses. With 5 shards it provides capacity for 5 orphaned witnesses on each shard, which should be enough.
Assuming that a single witness can take up 40 MB, the pool will consume at most 1GB at full capacity.
The changes are divided into individual commits, they can be reviewed commit-by-commit.
Fixes
Fixes: #10552
Fixes: near/stakewars-iv#15