Burst Denial of Service in Proof of Time #2142

Closed
vanhauser-thc opened this issue Oct 20, 2023 · 8 comments · Fixed by #2320
Labels: audit (Audit results) · node (Node: service library/node app) · performance (Related to performance measurement or improvement)

@vanhauser-thc (Collaborator)

[Medium] Burst Denial of Service in Proof of Time

Summary

An attacker can abuse the high cost associated with verifying Proof of Time (PoT).

Issue details

Based on the AES Latency Report, an ASIC is estimated to perform PoT generation 2.4x faster than COTS hardware.

When an ASIC produces PoTs every second, COTS hardware takes 2.4 seconds to generate it independently and 300ms to verify a PoT message, assuming it can execute 8 parallel tasks, one for each checkpoint.

Attack scenario 1: If up to seven received PoT messages are in the queue for an iteration
let correct_proof = if potentially_matching_proofs.len() < EXPECTED_POT_VERIFICATION_SPEEDUP { [1]
then one PoT message after another is checked, and the first that is successfully verified is taken. An attacker then sends a handful of faulty PoT messages from different nodes early into the network (e.g. so that the first messages are the faulty ones).
If e.g. the 6th message is the valid one, then the node has spent 5 * 300ms = 1.5 seconds on verification.

As the iterations to be performed are allowed to be 50% above the current iteration number on the chain, this becomes 2.25 seconds (proof.slot_iterations.get() <= current_slot_iterations.get() * 3 / 2).

Attack scenario 2: If eight or more PoT messages were received, the node itself does the PoT generation which will cost 2.4 seconds.
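
To make the two scenarios concrete, here is a minimal Rust sketch of the decision path described above. It is an illustration, not the actual subspace implementation: only EXPECTED_POT_VERIFICATION_SPEEDUP and the timings in the comments come from the code referenced in [1] and the estimates in this report; all other names are placeholders.

```rust
// Hypothetical sketch of the verify-vs-generate decision described above;
// PotProof, verify() and generate_proof_locally() are placeholders.
const EXPECTED_POT_VERIFICATION_SPEEDUP: usize = 8;

struct PotProof;

fn verify(_proof: &PotProof) -> bool {
    // stand-in for the ~300 ms checkpoint verification on COTS hardware
    true
}

fn generate_proof_locally() -> PotProof {
    // stand-in for the ~2.4 s local PoT generation on COTS hardware
    PotProof
}

fn resolve_proof(potentially_matching_proofs: Vec<PotProof>) -> PotProof {
    if potentially_matching_proofs.len() < EXPECTED_POT_VERIFICATION_SPEEDUP {
        // Scenario 1: candidates are checked one after another, so an attacker
        // who gets faulty messages queued first burns up to
        // (len - 1) * ~300 ms before the valid proof is reached.
        for proof in potentially_matching_proofs {
            if verify(&proof) {
                return proof;
            }
        }
    }
    // Scenario 2: with eight or more candidates (or none valid), the node
    // generates the proof itself at a cost of ~2.4 s.
    generate_proof_locally()
}
```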

A mitigation mechanism already in place bans nodes that send a faulty PoT. However, an attacker can easily spawn new instances with fresh IP addresses, especially when using IPv6.

Risk

A node, while busy with verification, can miss its opportunity to create a block or perform other actions on the chain.

Mitigation

Changes to the iteration count can only occur at the beginning of an era, which is currently not checked when allowing the 50% increase in the number of iterations. Checking this would then allow EXPECTED_POT_VERIFICATION_SPEEDUP to be increased by 1, with the worst case having the same speed as calculating the PoT itself.

Additionally, if fewer than EXPECTED_POT_VERIFICATION_SPEEDUP messages are present, a random selection mechanism for the next PoT to be verified could also help reduce the impact of such a burst DoS attack.
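
As an illustration of the random-selection idea, the sketch below shuffles the pending candidates before the sequential check so an attacker cannot rely on their faulty messages being verified first. It assumes the rand crate is available; PotProof and verify() are placeholders, not the real types.

```rust
// Hypothetical sketch of randomised candidate ordering; requires the `rand`
// crate, and PotProof/verify() are placeholders rather than the real types.
use rand::seq::SliceRandom;
use rand::thread_rng;

struct PotProof;

fn verify(_proof: &PotProof) -> bool {
    true // stand-in for the ~300 ms verification
}

fn pick_proof(mut candidates: Vec<PotProof>) -> Option<PotProof> {
    // Shuffling removes the attacker's control over which message is
    // verified first, so the expected number of wasted verifications drops.
    candidates.shuffle(&mut thread_rng());
    candidates.into_iter().find(|proof| verify(proof))
}
```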

Lastly, this could operate in two modes: the existing implementation and a trusted mode. In the trusted mode, if no faulty PoT message has been received so far, the EXPECTED_POT_VERIFICATION_SPEEDUP could be doubled.
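
A minimal sketch of the trusted-mode idea (only EXPECTED_POT_VERIFICATION_SPEEDUP comes from the referenced code; the function and flag are hypothetical):

```rust
// Hypothetical sketch of the proposed trusted mode: as long as no faulty PoT
// message has been seen, twice as many candidates are verified before the
// node falls back to generating the proof itself.
const EXPECTED_POT_VERIFICATION_SPEEDUP: usize = 8;

fn verification_budget(faulty_pot_seen: bool) -> usize {
    if faulty_pot_seen {
        EXPECTED_POT_VERIFICATION_SPEEDUP
    } else {
        EXPECTED_POT_VERIFICATION_SPEEDUP * 2
    }
}
```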

[1] https://github.com/subspace/subspace/blob/d31fe47911151223d4750bf740524265118e4241/crates/sc-proof-of-time/src/source/gossip.rs#L420

@nazar-pc (Member) commented Oct 23, 2023

> When an ASIC produces PoTs every second, COTS hardware takes 2.4 seconds to generate it independently and 300ms to verify a PoT message, assuming it can execute 8 parallel tasks, one for each checkpoint.

I'm not sure you understand it quite the same way, so I will clarify. It is not 8 parallel tasks that are being used, it is instruction-level parallelism in modern processors. Modern processors can verify 8 checkpoints (16 effective independent computations, since we can do both AES encoding and decoding in the opposite direction, because we know inputs and outputs, can go both ways, and check whether the middle value matches) on a single core ~8 times faster than the prover; we described it in https://forum.subspace.network/t/proof-of-time-optimistic-vs-pessimistic-verification-analysis/1623 (sections "Energy" and "Commodity Hardware"). By the time we have ASICs that are faster, we expect AVX512 (or AES10 or whatever they end up calling it) will also have wider AES instructions (VAES) enabled on all modern processors (currently it is "complicated"), and verification time will shrink further while efficiency improves. I recommend you read the linked forum post in full, it contains a lot of insights.

> Attack scenario 1: If up to seven received PoT messages are in the queue for an iteration
> let correct_proof = if potentially_matching_proofs.len() < EXPECTED_POT_VERIFICATION_SPEEDUP { [1]
> then one PoT message after another is checked, and the first that is successfully verified is taken. An attacker then sends a handful of faulty PoT messages from different nodes early into the network (e.g. so that the first messages are the faulty ones).
> If e.g. the 6th message is a valid, then the node has spent 5 * 300ms = 1.5 seconds for verification.
>
> As the iterations to be performed are allowed to be 50% above the current iteration number on the chain, this becomes 2.25 seconds (proof.slot_iterations.get() <= current_slot_iterations.get() * 3 / 2).
>
> Attack scenario 2: If eight or more PoT messages were received, the node itself does the PoT generation which will cost 2.4 seconds.

First of all, proof.slot_iterations.get() <= current_slot_iterations.get() * 3 / 2 is only a quick check, it doesn't lead to immediate verification. This is a sanity check; essentially we don't expect the number of iterations to change by more than 50% during a short period of time.

Potential proofs that pass all of the superficial (but cheap) checks of PotGossipValidator are stored for later, when they become candidates for the future proof. When that happens, in PotGossipWorker::handle_next_slot_input we do some more cheap checks, including the check that the inputs used for generating the potential proofs match.

Only if all of those checks succeeded (if any of them do not, the reputation of the peer is decreased a bit) and we have several candidates do we check whether it is cheaper computation-wise to generate the proof ourselves or to verify the candidates. Either way, those peers that sent us an incorrect proof will be banned on the network level and will not be able to send us proofs going forward.

> A mitigation mechanism already in place bans nodes that send a faulty PoT. However, an attacker can easily spawn new instances with fresh IP addresses, especially when using IPv6.

Yes, the bigger question of identifying "under attack" conditions and becoming more conservative with connections going forward is still applicable, but that is more of a network-related topic. I believe there are more cases like this in the protocol, and it is not just Subspace; you can attack any blockchain node this way, even though the cost here is much larger.

@dariolina we should probably prioritize work in this direction.

Additional proposed mitigation

I'm not sure the focus on EXPECTED_POT_VERIFICATION_SPEEDUP is important; the cost is high either way, and +50% doesn't change the situation drastically, it is still bad if this attack happens.

One of the mitigations I can think of for PoT specifically is to always default to verification, but sort peers by reputation and pick a smaller number of those that have the highest reputation first. Maybe even check the quorum: if most of those with high reputation have sent us the same set of checkpoints, chances are those are the checkpoints we should verify before checking anything else.
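
A rough sketch of what this could look like (all types, the reputation lookup, and the quorum size of 5 are placeholders, not actual node code):

```rust
// Hypothetical sketch of reputation-ordered verification with a quorum
// shortcut; PeerId, Reputation and PotCheckpoints stand in for whatever the
// node actually tracks.
use std::collections::HashMap;

type PeerId = u64;
type Reputation = i64;

#[derive(Clone, PartialEq, Eq, Hash)]
struct PotCheckpoints(Vec<u8>);

struct Candidate {
    sender: PeerId,
    checkpoints: PotCheckpoints,
}

fn order_candidates(
    mut candidates: Vec<Candidate>,
    reputation: &HashMap<PeerId, Reputation>,
) -> Vec<Candidate> {
    // Highest-reputation senders first: an attacker joining with fresh IP
    // addresses starts at default reputation and is therefore checked last.
    candidates.sort_by_key(|c| std::cmp::Reverse(*reputation.get(&c.sender).unwrap_or(&0)));

    // Quorum shortcut: if most of the top senders agree on one set of
    // checkpoints, move that set to the front of the verification queue.
    let top = candidates.len().min(5);
    let mut counts: HashMap<PotCheckpoints, usize> = HashMap::new();
    for candidate in &candidates[..top] {
        *counts.entry(candidate.checkpoints.clone()).or_insert(0) += 1;
    }
    let majority = counts
        .into_iter()
        .filter(|(_, n)| *n * 2 > top)
        .max_by_key(|(_, n)| *n)
        .map(|(checkpoints, _)| checkpoints);
    if let Some(majority) = majority {
        // `false` sorts before `true`, so matching candidates move to the front
        // while the stable sort preserves the reputation order within groups.
        candidates.sort_by_key(|c| c.checkpoints != majority);
    }
    candidates
}
```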

@vanhauser-thc (Collaborator, Author)

> When an ASIC produces PoTs every second, COTS hardware takes 2.4 seconds to generate it independently and 300ms to verify a PoT message, assuming it can execute 8 parallel tasks, one for each checkpoint.
>
> I'm not sure you understand it quite the same way, so I will clarify. It is not 8 parallel tasks that we are being used, it is instruction-level parallelism in modern processors. Modern processors can verify 8 checkpoints ([16 effective independent computations since we can do both AES encoding and decoding in opposite direction due to the fact that we know inputs and outputs and can go both ways and check if middle value matches]

Yes, we understood it like that; however, it is difficult to put into exact words :)

> One of the mitigations I can think of for PoT specifically is to always default to verification, but sort peers by reputation and pick smaller number of those that have the highest reputation first. Maybe even check the quorum and if most of those with high reputation have sent us the same set of checkpoints, chances are those are the checkpoints we should verify before checking anything else.

IMHO that is the better way to handle this. This way an attacker first has to play nice for quite some time for just a short attack span.

@dariolina (Member)

> for PoT specifically is to always default to verification, but sort peers by reputation and pick smaller number of those that have the highest reputation first

I agree this sounds promising. Since we already have to store information about the sender, we might as well prioritize verification of PoT from our "usual" Timekeepers.
However, I can't entirely agree with removing the limit on verifications, as this makes verification time theoretically unbounded. We can do a mixed approach: if we have many PoT messages that passed all preliminary checks, we can try to verify the top 3/5/7, and if that fails, fall back to local evaluation.
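
A sketch of that mixed approach might look like the following (hypothetical names; MAX_CANDIDATE_VERIFICATIONS stands in for the chosen 3/5/7 cap, and the candidates are assumed to already be ordered best-reputation-first):

```rust
// Hypothetical sketch of the mixed approach: verify at most a few of the
// best-ranked candidates, then fall back to local proving. PotProof,
// verify() and generate_proof_locally() are placeholders.
const MAX_CANDIDATE_VERIFICATIONS: usize = 5;

struct PotProof;

fn verify(_proof: &PotProof) -> bool {
    true // stand-in for the ~300 ms verification
}

fn generate_proof_locally() -> PotProof {
    PotProof // stand-in for the ~2.4 s local evaluation
}

/// `candidates` is assumed to already be sorted best-reputation-first.
fn resolve_with_cap(candidates: Vec<PotProof>) -> PotProof {
    // Capping the number of verification attempts keeps the worst case at
    // MAX_CANDIDATE_VERIFICATIONS * ~300 ms instead of being unbounded.
    for proof in candidates.into_iter().take(MAX_CANDIDATE_VERIFICATIONS) {
        if verify(&proof) {
            return proof;
        }
    }
    generate_proof_locally()
}
```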

@nazar-pc (Member)

I think we can even reduce it to one: just pick the sender with the highest reputation, it should be the most likely to send us a correct proof. We can fall back to more candidates or to proving in case that fails, which will not happen in most cases.

@dariolina (Member)

In the R&D sync we have discussed the need for a generally more robust "under attack" mode. The approach suggested here, to verify from the "best" peer first, surely doesn't hurt and prevents some attacks on the lower-effort end of the spectrum. @nazar-pc I will add this to the spec, if you are up to it.

@nazar-pc (Member)

Please do, it is a non-breaking protocol change we can introduce any time for those who upgrade. I'll implement it in the near future.

@nazar-pc (Member) commented Nov 6, 2023

There is a lack of a Substrate API for reputation querying right now, see paritytech/polkadot-sdk#2185

I'll look into exposing it in Substrate once I get some feedback on what it might look like upstream.

Un-assigning Dariia now, since the immediate fix for the scope of this issue is implementation-only.

@nazar-pc (Member)

Fix implemented in #2320
