Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make adversarial_behaviors tests pass with stateless validation #10571

Merged
merged 5 commits into from
Feb 6, 2024

Conversation

jancionear
Copy link
Contributor

Description

The tests in adversarial_behaviors.rs didn't work with stateless validation, let's make them pass again.
To achieve this we first need to propagate chunk state witnesses and endorsements, and then adjust the expected behavior to match the one exhibited by stateless validation.

The most complex part of this PR is modifying which blocks should be skipped in test_banning_chunk_producer_when_seeing_invalid_chunk_base().
This function creates a blockchain with one malicious chunk producer and runs it for a few epochs.
Which blocks will be skipped depends on whether we are using stateless validation or not.

Without stateless validation

In the old version of the protocol the invalid chunk is included in a block and then the whole block is validated. The chunk is invalid, so the whole block is also invalid and the block should be skipped. After that the malicious chunk producer is banned for the rest of the epoch, which means that block producers will reject chunks sent by this chunk producer, so the subsequent blocks won't contain any invalid chunks and won't be skipped.
Only the first block with invalid chunks in an epoch should be skipped.

With stateless validation

With stateless validation the situation is entirely different. When the malicious chunk producer produces an invalid chunk, chunk validators will refuse to send out endorsements for this chunk, so the chunk won't be included in any block. This means that no blocks should be skipped.

There is one exception - in the test the block produced at height 2 is invalid and gets skipped. I suspect that this is because the first few blocks after genesis don't use stateless validation, and they are still handled by old protocol. I'm not 100% sure if that's the case, @Longarithm could you confirm that this explanation makes sense?
Something is definitely different for the first few blocks, in the logs I can see client: Banning chunk producer for producing invalid at height 2, which seems to belong to the old protocol. In the new version malicious chunk producers are banned using BanPeer, without any log messages.

Refs: #10506, zulip discussion

The tests use `process_all_actor_messages()` to pass `PartialChunk` messages between peers.
The problem is that this function consumes (but ignores) `ChunkStateWitness`
and `ChunkEndorsement` messages. This means that these messages disappear and never reach their
destination. Let's use `handle_filtered` to process only `PartialChunk` messages, without consuming
any other ones. Thanks to this we will be able to process the other messages later.
…ents

Some tests produce invalid chunk state witnesses and/or invalid endorsements
to test how malicious behavior is handled. In these tests `process_chunk_state_witness`
and `process_chunk_endorsement` will sometimes return an error, which is expected.
At the moment functions used to propagate chunk state witnesses and endorsements don't
tolerate any errors, they `unwrap()` the result of processing, which causes a panic on any error.
Let's add an argument which allows to ignore the errors when propagating chunk witnesses
and endorsements. This will make it possible to use them in tests which test adversarial behaviors.
Propagate chunk state witnesses and endorsements in `test_non_adversarial_case`,
which makes the test pass with stateless validation. This is a non-adversarial case,
so there are no invalid/missing chunks in this test.
Fix two tests which test how the blockchain behaves when there is one malicious
chunk producer who produces invalid chunks. The tests didn't work with stateless validation.

To fix them we first have to properly propagate chunk state witnesses and chunk endorsements.
We should allow errors when processing witnesses and endorsements, as the malicious chunk producer
will cause some of the validations to fail.

Then it's also necessary to adjust the expected behavior. The previous version of the protocol
first included new chunks into a block and then validated the whole block. This means that if some
chunk was invalid, it would cause the whole block to be invalid, and the block would be skipped.
This is why the current code expected the first block with invalid chunks in an epoch to be skipped.
With stateless validation, the situation is different. Chunk validators will refuse to send endorsements
for the invalid chunk, which means that it won't be included in any blocks, and no blocks will be skipped.
The only exception is the block at height 2, which is skipped. I suspect that this is because the first few
blocks in the blockchain still use the old validation logic, not stateless validation.
@jancionear jancionear added the A-stateless-validation Area: stateless validation label Feb 6, 2024
@jancionear jancionear requested a review from a team as a code owner February 6, 2024 16:04
@Longarithm
Copy link
Member

height < 3

I suspect that my attempt to handle genesis state witness spills over to the rest of the code base, taking a look

@Longarithm
Copy link
Member

Longarithm commented Feb 6, 2024

Well, yes. In these tests we are lucky enough to have bad CP producing first chunk after genesis, which gets auto-endorsed in our current code (e.g.

// TODO(#10502): Handle production of state witness for first chunk after genesis.
// Properly handle case for chunk right after genesis.
// Context: We are currently unable to handle production of the state witness for the
// first chunk after genesis as it's not possible to run the genesis chunk in runtime.
)
How about we push this PR forward but tackle first-after-genesis chunk validation in a separate PR? I hope it is enough to put genesis chunk state root into main state transition and skip "apply new chunk" step.
I wasn't seeing any bans in logs, however.

@jancionear
Copy link
Contributor Author

jancionear commented Feb 6, 2024

Well, yes. In these tests we are lucky enough to have bad CP producing first chunk after genesis, which gets auto-endorsed in our current code. How about we push this PR forward but tackle first-after-genesis chunk validation in a separate PR? I hope it is enough to put genesis chunk state root into main state transition and skip "apply new chunk" step.

Thanks for the explanation. I updated the comment to better reflect what's going on and added a TODO(#10502) to remove this logic once we'll be able to properly handle state witness for the first blocks. The test tests the current behavior, so that seems ok. Later when the behavior changes we will also change the test.

I wasn't seeing any bans in logs, however.

In my logs (log.txt.zip) I can see the following lines:

0.333s ERROR start_process_block{provenance=PRODUCED block_height=2}: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.337s ERROR start_process_block{provenance=NONE block_height=2}: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.341s ERROR start_process_block{provenance=NONE block_height=2}: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.352s ERROR start_process_block{provenance=NONE block_height=2}: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.372s ERROR process_blocks_with_missing_chunks: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.386s ERROR process_blocks_with_missing_chunks: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.404s ERROR process_blocks_with_missing_chunks: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)
 0.415s ERROR process_blocks_with_missing_chunks: client: Banning chunk producer for producing invalid chunk chunk_producer=AccountId("test7") epoch_id=EpochId(11111111111111111111111111111111) chunk_hash=ChunkHash(6zGqyVzHRB1cgjCGq1DLfvdanbtrTyPAdcSejHEFPVMv)

Generated using:

cargo nextest run --package integration-tests --features nightly,test_features --nocapture test_banning_chunk_producer_when_seeing_invalid_chunk 2>&1 | tee log.txt

@Longarithm
Copy link
Member

Longarithm commented Feb 6, 2024

Oh, that's right...
I think the old behaviour happens because everyone still tracks all shards and is able to say that chunk is invalid after block processing, even when it gets included.
Well, after solving genesis cornercase it will go away.
This implies that with stateless validation, code path for banning CP due to invalid chunk in block won't be triggered anymore, which makes sense.

UPD: could be a good idea to completely remove such unused logic after stateless validation release

@jancionear
Copy link
Contributor Author

In these tests we are lucky enough to have bad CP producing first chunk after genesis,

I read a bit into the code and I think this is actually the second chunk after genesis.
It looks like height 0 is genesis, height 1 is the first block after genesis and height 2 is the second block after genesis:

0.180s DEBUG custominfo: Produced block. Height: 1 Prev block hash: HppBDhchEtKYbxDuoKFa112aEK4j6KNYWmtNr5BjjRLq    
0.322s DEBUG custominfo: Produced block. Height: 2 Prev block hash: CVWFUwZDcvT4MUZQpXGQUZ6U8pZNuCr2EcGo2pVZH1qn 

The invalid chunk is in block #2, so afaiu the state witness for this chunk should be processed normally 0_o

@Longarithm
Copy link
Member

Block 1 never has chunks, that's another specific of the code.
produce_chunks(h+1) is triggered only after successful processing of block h, but block 0 is never processed, it is just written to db :|

@jancionear
Copy link
Contributor Author

Block 1 never has chunks, that's another specific of the code.
produce_chunks(h+1) is triggered only after successful processing of block h, but block 0 is never processed, it is just written to db :|

Ahh ok. When there's no chunk in block #1, prev_chunk is in block 0 and prev_chunk_header.prev_block_hash() == &CryptoHash::default() is true. Makes sense, thanks 👍

@Longarithm Longarithm added this pull request to the merge queue Feb 6, 2024
Copy link

codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c198d8b) 71.92% compared to head (d47c961) 71.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10571      +/-   ##
==========================================
+ Coverage   71.92%   71.97%   +0.04%     
==========================================
  Files         724      724              
  Lines      147133   147140       +7     
  Branches   147133   147140       +7     
==========================================
+ Hits       105830   105908      +78     
+ Misses      36437    36380      -57     
+ Partials     4866     4852      -14     
Flag Coverage Δ
backward-compatibility 0.08% <0.00%> (-0.01%) ⬇️
db-migration 0.08% <0.00%> (-0.01%) ⬇️
genesis-check 1.25% <0.00%> (-0.01%) ⬇️
integration-tests 37.01% <100.00%> (+0.13%) ⬆️
linux 71.04% <42.10%> (-0.01%) ⬇️
linux-nightly 71.44% <100.00%> (+0.06%) ⬆️
macos 54.98% <0.00%> (-0.03%) ⬇️
pytests 1.46% <0.00%> (-0.01%) ⬇️
sanity-checks 1.26% <0.00%> (-0.01%) ⬇️
unittests 67.87% <0.00%> (-0.04%) ⬇️
upgradability 0.13% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Merged via the queue into near:master with commit 17f85c1 Feb 6, 2024
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-stateless-validation Area: stateless validation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants