Prevent DOS vector introduced by throttling #594

Closed
Tracked by #754
shaspitz opened this issue Dec 13, 2022 · 22 comments
Labels
type: feature-request New feature or request improvement

Comments

@shaspitz
Contributor

shaspitz commented Dec 13, 2022

Problem

Inspired by a Slack convo with @mpoke: if an attacker first causes the slash meter to go negative with some initial "valid" slash packets, then sends a bunch more slash packets, the packet queues could grow very large over multiple blocks until provider binaries panic. This scenario would not panic the provider WITHOUT throttling, because the spam packets would be handled/dropped immediately upon being received.

Closing criteria

Solve the attack vector with a patch to throttling, or deem it as not needed.

Problem details

A malicious consumer can easily send 10000 "valid" or "invalid" packets after causing the meter to go negative, and we can't assert anything about validity unless we inspect the queues upon every received slash packet. Even if we did add logic to inspect the queues, 10000 valid downtime slash packets in a row that are all duplicates would still count as valid with how we currently define validity. Eventually we would hit this logic, which would halt the provider.
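For illustration, the deterministic halt referenced here amounts to panicking once a persisted queue size crosses a configured maximum. The sketch below is hypothetical (names and signature are illustrative, not the actual provider keeper code):

```go
package provider

import "fmt"

// checkThrottledQueueSize is a hypothetical sketch of the size check
// referenced above: if the persisted slash packet data for a consumer chain
// grows past a configured maximum, the provider panics deterministically
// rather than degrading unpredictably across nodes.
func checkThrottledQueueSize(chainID string, currentSize, maxSize uint64) {
	if currentSize > maxSize {
		// Every correct node observes the same size at the same height,
		// so this halt is deterministic across the network.
		panic(fmt.Sprintf("throttled packet data queue for chain %s has size %d, exceeding max %d",
			chainID, currentSize, maxSize))
	}
}
```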

I see two potential solutions to this issue:

More validation logic

Add validation logic to ValidateSlashPacket such that a packet is thrown out before being queued unless all of the following hold (a minimal sketch follows this list):

  1. data.Validator maps to a validator that is found and bonded according to the provider, see Return IBC err ack when val not found for slash packet #546
  2. data.ValsetUpdateId is valid in that it maps to a persisted infraction height, and a VSC matured packet has not already been received for this VSC ID (see Outdated SlashPackets are not dropped #544)
  3. The infraction type is double sign or downtime
  4. Most important: an equivalent slash packet with this exact same data has not already been received (would require additional state, with appropriate pruning mechanisms)
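A minimal sketch of what these checks could look like, assuming illustrative keeper methods and types (not the actual provider module API):

```go
package provider

import "fmt"

// Illustrative types only; the real keeper and packet data live in the
// interchain-security provider module.
type InfractionType int

const (
	Downtime InfractionType = iota
	DoubleSign
)

type SlashPacketData struct {
	Validator      []byte
	ValsetUpdateId uint64
	Infraction     InfractionType
}

// slashValidationKeeper is a hypothetical interface standing in for the
// provider keeper methods these checks would need.
type slashValidationKeeper interface {
	ValidatorExistsAndBonded(consAddr []byte) bool
	InfractionHeightExists(chainID string, vscID uint64) bool
	HasEquivalentQueuedSlashPacket(chainID string, consAddr []byte, infraction InfractionType) bool
}

func validateSlashPacket(k slashValidationKeeper, chainID string, data SlashPacketData) error {
	// 1. The validator must be known and bonded on the provider (#546).
	if !k.ValidatorExistsAndBonded(data.Validator) {
		return fmt.Errorf("validator not found or unbonded")
	}
	// 2. ValsetUpdateId must map to a persisted infraction height, and no
	//    VSCMatured packet may have been received for it yet (#544).
	if !k.InfractionHeightExists(chainID, data.ValsetUpdateId) {
		return fmt.Errorf("no infraction height persisted for vscID %d", data.ValsetUpdateId)
	}
	// 3. The infraction type must be downtime or double sign.
	if data.Infraction != Downtime && data.Infraction != DoubleSign {
		return fmt.Errorf("invalid infraction type")
	}
	// 4. An equivalent slash packet must not already have been received.
	if k.HasEquivalentQueuedSlashPacket(chainID, data.Validator, data.Infraction) {
		return fmt.Errorf("duplicate slash packet")
	}
	return nil
}
```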

Don't stop iteration through queue

On endblocker, instead of stopping iteration through the global queue at the instant the slash meter goes negative, we keep iterating through all global queue entries. Any entry which would result in no voting power changes (val already jailed, tombstoned, not found, etc.) would be handled, no matter the value of the slash meter. Global entries which do affect voting power (and are therefore fully valid) would act as before, and be throttled by the slash meter value (see the sketch below). This solution would require changes to the current throttling implementation to make sure that VSC Matured packets in the chain-specific queue are still handled only after all other slash packets for that chain have been handled.
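A rough sketch of this alternative endblocker iteration, with a hypothetical entry type and a hypothetical handle callback standing in for the real packet handling:

```go
package provider

// globalEntry is an illustrative stand-in for a global slash queue entry.
type globalEntry struct {
	chainID    string
	consAddr   []byte
	isNoOp     bool  // validator already jailed, tombstoned, or not found
	slashPower int64 // voting power removed if this entry is handled
}

// handleGlobalQueue iterates the whole queue: no-op entries are handled
// regardless of the meter, while entries that change voting power are only
// handled while the slash meter is non-negative.
func handleGlobalQueue(entries []globalEntry, meter int64, handle func(globalEntry)) int64 {
	for _, e := range entries {
		if e.isNoOp {
			handle(e) // no voting power change; handle and drop it
			continue
		}
		if meter < 0 {
			continue // fully valid entries stay throttled
		}
		handle(e)
		meter -= e.slashPower
	}
	return meter
}
```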

@shaspitz shaspitz added type: feature-request New feature or request improvement ccv-core labels Dec 13, 2022
@shaspitz shaspitz moved this to Todo in Replicated Security Dec 13, 2022
@shaspitz
Contributor Author

My hunch is that the second solution would be a lot easier to implement and wrap our heads around. Seems like it could have fewer edge cases as well.

@mpoke
Contributor

mpoke commented Dec 14, 2022

Eventually we will hit this logic which would halt the provider.

IIUC, this is triggered in DeliverTx, which would just abort the tx.

@mpoke
Contributor

mpoke commented Dec 14, 2022

Add validation logic to ValidateSlashPacket such that a packet is thrown out before being queued unless all of the following hold:

I would suggest first adding validation for 1-3, since it's straightforward.

Regarding 4, it only requires a reverse index that should be pruned when the entry is removed from the queue, i.e.,

GlobalSlashEntry: GlobalSlashEntryBytePrefix | recvTime | ibcSeqNum | chainID -> GlobalSlashEntry{recvTime, ibcSeqNum, chainID, consAddr}
GlobalSlashEntryRev: GlobalSlashEntryRevBytePrefix | chainID | consAddr | infractionType -> {recvTime, ibcSeqNum}

The infraction type is needed since the provider shouldn't accept

  • two SlashPackets from the same chain for double-signing committed by the same validator (as the first packet would result in the validator being tombstoned anyway)
  • two SlashPackets from the same chain for downtime committed by the same validator (as the consumer should not have sent a second downtime due to the outstanding downtime flag)
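A sketch of how the reverse-index key from the layout above might be assembled; the prefix value and encoding here are placeholders, not the module's actual key prefixes:

```go
package provider

import "encoding/binary"

// Placeholder prefix; the real module defines its own byte prefixes.
const globalSlashEntryRevBytePrefix byte = 0x11

// globalSlashEntryRevKey builds
// globalSlashEntryRevBytePrefix | chainID | consAddr | infractionType,
// pointing to {recvTime, ibcSeqNum}, so a duplicate check (and pruning when
// the forward entry is removed) can be done with a single store lookup.
func globalSlashEntryRevKey(chainID string, consAddr []byte, infractionType byte) []byte {
	key := []byte{globalSlashEntryRevBytePrefix}
	// Length-prefix the chainID so the key segments stay unambiguous.
	lenBz := make([]byte, 8)
	binary.BigEndian.PutUint64(lenBz, uint64(len(chainID)))
	key = append(key, lenBz...)
	key = append(key, []byte(chainID)...)
	key = append(key, consAddr...)
	key = append(key, infractionType)
	return key
}
```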

@mpoke
Contributor

mpoke commented Dec 14, 2022

the provider shouldn't accept two SlashPackets from the same chain for downtime committed by the same validator (as the consumer should not have sent a second downtime due to the outstanding downtime flag)

This led me to think of the following edge case:

  • a validator val is down on multiple consumer chains, which results in SlashPackets sent to the provider (one packet from each consumer)
  • the first SlashPacket goes through and it jails the validator on the provider
  • the other SlashPackets are in the queue behind other SlashPackets for other validators
  • the throttling meter goes negative and as a result the other SlashPackets for val are not handled
  • val unjails itself after 10m
  • the other SlashPackets for val are handled and val is jailed again

This behavior is different from the behavior w/o jail throttling, where val would be slashed multiple times but jailed only once. Not sure that's a problem though. Could be annoying for validators. @jtremback What do you think?

@mpoke
Contributor

mpoke commented Dec 14, 2022

Don't stop iteration through queue

This would handle packets in a different order than the one in which they were received, and thus break the assumption of an ordered CCV channel.

@shaspitz
Contributor Author

I would suggest first adding validation for 1-3, since it's straightforward.

This sounds good to me since there are already issues open; 4 is also not terribly challenging.

This led me think to the following edge case

Valid edge case 👍 this is indeed a change in behavior. But imo the new (throttling-introduced) behavior doesn't cause any issues.

For all of these changes, I'd prefer we implement them as their own PRs into main after the large throttling branch is merged.

@mpoke mpoke mentioned this issue Dec 16, 2022
@jtremback
Contributor

Hold on, why do we need to fix this? The slash throttle is ultimately intended to give validators time to halt the chain in an attack; this "problem" just lets the attacker do it for us. @mpoke @smarshall-spitzbart

@shaspitz
Contributor Author

shaspitz commented Dec 16, 2022

If our goal was to automatically halt the chain when too many slash packets come in, the throttling PR could have been way less complex.

This issue focuses on throwing out slash packets where possible in order to avoid automatic halts (w/o affecting throttling), allowing validators to act as they see fit during an attack (which may be a halt most of the time, or just a swift consumer removal?).

@mpoke
Contributor

mpoke commented Dec 16, 2022

The idea is that slash throttling could enable the censorship of SlashPackets. All you need is a malicious consumer that sends many SlashPackets so that no other consumer can get their SlashPackets in. This is due to the fact that the provider panics if a queue exceeds the max size.

@shaspitz
Contributor Author

This is due to the fact that the provider panics if a queue exceeds the max size.

Not just this, but if too many slash packets are persisted in state, binaries could start to see unexpected behavior and halt in a non-deterministic manner. That's why the explicit, deterministic halt was implemented.

@shaspitz
Contributor Author

CC @MSalopek: once the More validation logic section from above is implemented, we could remove the halt code altogether. Feel free to make that its own issue if you see fit, or we can keep it a part of this issue.

@shaspitz
Contributor Author

Replying to @mpoke's comment

This would handle packets in a different order than the one in which they were received, and thus break the assumption of an ordered CCV channel.

Correct me if I'm wrong, but an ordered channel only guarantees that the order in which packets are received is the same as the order in which they were sent. We can handle things in whatever order is appropriate for the designed protocol. Therefore, an adaptation of Don't stop iteration through queue could possibly still have some merit.

(just brainstorming here, as More validation logic is becoming pretty hard for me to wrap my head around).

Note that throttling already breaks Provider Slashing Warranty from the spec, and that property will need to be updated eventually

@mpoke
Contributor

mpoke commented Dec 23, 2022

Correct me if I'm wrong, but an ordered channel only guarantees that the order in which packets are received is the same as the order in which they were sent. We can handle things in whatever order is appropriate for the designed protocol. Therefore, an adaptation of Don't stop iteration through queue could possibly still have some merit.

From an IBC perspective, we can handle things in whatever way. However, ICS relies on the Channel Order property. This entails that the packets MUST be handled in the same order they were sent. Figuring out whether it's safe to handle slash packets out of order needs careful analysis.

Note that throttling already breaks Provider Slashing Warranty from the spec, and that property will need to be updated eventually

Indeed. I assume that's why the diff testing for throttling is not working yet.

@mpoke
Contributor

mpoke commented Dec 23, 2022

Regarding this issue, I don't think we should panic the provider when the queues exceed a certain size. Just remove the consumer and clean up the corresponding state.

@shaspitz
Contributor Author

However, ICS relies on the Channel Order property. This entails that the packets MUST be handled in the same order they were sent.

We'll need to update the language for the Channel Order property then; it currently says:

Channel Order: If a packet P1 is sent over a CCV channel before a packet P2, then P2 MUST NOT be received by the other end of the channel before P1.

I agree that we'll need to carefully analyze solutions to this issue

@shaspitz
Contributor Author

Discussion for after the holidays: remove the panic on the provider when queues get too large, vs. making the max queue size 10k instead of 1k.

@mpoke mpoke added this to the Untrusted consumer chains milestone Jan 27, 2023
@mpoke mpoke added this to Cosmos Hub Jan 27, 2023
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Jan 27, 2023
@ivan-gavran
Contributor

I was just thinking about reporting this potential problem when I saw you already discussed it at length here :)
What is then the final decision on this issue: going with 100k as the default size?

I am wondering why this is a solution. Couldn't a combination of a broken/malicious consumer and a collaborating relayer cause the panic even with this increased size?

@shaspitz
Contributor Author

shaspitz commented Feb 6, 2023

I was just thinking about reporting this potential problem when I saw you already discussed it at length here :) What is then the final decision on this issue: going with 100k as the default size?

I am wondering why this is a solution. Couldn't a combination of a broken/malicious consumer and a collaborating relayer cause the panic even with this increased size?

Hi Ivan, it's correct that a malicious consumer is still able to halt the provider if this scenario were to play out. However, without the throttling mechanism, safety could be violated by applying each slash packet immediately. The decision to go with the current design is to sacrifice liveness in an edge case scenario for the sake of security.

See #713 for the seemingly best solution to this issue

@ivan-gavran
Contributor

@smarshall-spitzbart, on second thought: it can't halt the chain after all. The panic happens inside SetThrottledPacketDataSize, which is called either from DeleteThrottledPacketData (originating from EndBlocker) or from IncrementThrottledPacketDataSize (originating from a transaction).

In the case of deletion, it will not panic, because the size would have had to exceed the max size of the queue even before the deletion.
In the case of addition (and incrementing the queue size), it can panic, but that won't halt the chain.

@shaspitz
Contributor Author

@smarshall-spitzbart, on second thought: it can't halt the chain after all. The panic happens inside SetThrottledPacketDataSize, which is called either from DeleteThrottledPacketData (originating from EndBlocker) or from IncrementThrottledPacketDataSize (originating from a transaction).

In the case of deletion, it will not panic, because the size would have had to exceed the max size of the queue even before the deletion. In the case of addition (and incrementing the queue size), it can panic, but that won't halt the chain.

Ah interesting, so if a transaction causes the binary to panic, the chain will still produce blocks, and also accept other transactions in subsequent blocks?

@ivan-gavran
Contributor

Ah interesting, so if a transaction causes the binary to panic, the chain will still produce blocks, and also accept other transactions in subsequent blocks?

Yes, because the panic would be caught by the recover function inside runTx.
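For context, a minimal sketch of the defer/recover pattern being referred to; the actual recovery middleware lives in cosmos-sdk's baseapp, and the function below is purely illustrative:

```go
package baseapp

import "fmt"

// runTxSketch illustrates how a panic raised while executing a transaction
// is converted into a failed tx result instead of crashing the node, so the
// chain keeps producing blocks and accepting later transactions.
func runTxSketch(process func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic in tx execution: %v", r)
		}
	}()
	return process()
}
```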

@mpoke mpoke removed this from the Untrusted consumers milestone Mar 5, 2023
@mpoke mpoke moved this from 🩹 Triage to 📥 Todo in Cosmos Hub Mar 5, 2023
@shaspitz
Contributor Author

shaspitz commented Jun 9, 2023

This issue will become obsolete once we replace throttling v1 with either #713 or #761

@shaspitz shaspitz closed this as completed Jun 9, 2023
@github-project-automation github-project-automation bot moved this from 📥 Todo to ✅ Done in Cosmos Hub Jun 9, 2023