Prevent DOS vector introduced by throttling #594

shaspitz · 2022-12-13T23:20:33Z

Problem

Inspired by slack convo with @mpoke, if an attacker first causes the slash meter to go negative with some initial "valid" slash packets, then sends a bunch more slash packets, we have a scenario where the packet queues could grow very large over multiple blocks until provider binaries panic. This scenario would not panic the provider WITHOUT throttling, because the spam packets would be handled/dropped immediately upon being recv.

Closing criteria

Solve the attack vector with a patch to throttling, or deem it as not needed.

Problem details

A malicious consumer can easily send 10000 "valid" or "invalid" packets after causing the meter to go negative, where we can't assert logic about validity unless we inspect the queues upon every recv slash packet. Even if we did add logic where we inspect queues, 10000 valid downtime slash packets in a row that're all duplicates would still be valid with how we define it. Eventually we will hit this logic which would halt the provider.

I see two potential solutions to this issue:

More validation logic

Add validation logic to ValidateSlashPacket such that the following conditions result in the packet being thrown out before it is queued.:

data.Validator has an unfound or unbonded validator according to the provider, this is Return IBC err ack when val not found for slash packet #546
data.ValsetUpdateId is valid in that it maps to a persisted infraction height, and a VSC matured packet has not already been recv for this VSCId (see Outdated SlashPackets are not dropped #544)
Infraction type is double sign or downtime
Most important: an equivalent slash packet with this exact same data has not already been recv (would require additional state, with appropriate pruning mechanisms)

Don't stop iteration through queue

On endblocker, instead of stopping iteration through the global queue at the instance that the slash meter goes negative, we keep iterating through all global queue entries. Any entry which would result in no voting power changes (val already jailed, tombstoned, not found, etc.) would be handled, no matter the value of the slash meter. Global entries which do affect voting power (and are therefore fully valid) would act as before, and be throttled by slash meter value. This solution would require changes to the current throttling implementation to make sure that VSC Matured packets in the chain specific queue are still handled only after all other slash packets for that chain have been handled.

The text was updated successfully, but these errors were encountered:

shaspitz · 2022-12-13T23:28:20Z

My hunch is that the second solution would be a lot easier to implement and wrap our heads around. Seems like it could have less edge cases as well.

mpoke · 2022-12-14T08:20:49Z

Eventually we will hit this logic which would halt the provider.

IIUC, this triggered in DeliverTx, which would just abort the TX.

mpoke · 2022-12-14T08:45:12Z

Add validation logic to ValidateSlashPacket such that the following conditions result in the packet being thrown out before it is queued.:

I would suggest to first add validation for 1-3 since it's straightforward.

Regarding 4, it only requires a reverse index that should be pruned when the entry is removed from the queue, i.e.,

GlobalSlashEntry: GlobalSlashEntryBytePrefix | recvTime | ibcSeqNum | chainID -> GlobalSlashEntry{recvTime, ibcSeqNum, chainID, consAddr}
GlobalSlashEntryRev: GlobalSlashEntryRevBytePrefix | chainID | consAddr | infractionType -> {recvTime, ibcSeqNum}

The infraction type is needed since the provider shouldn't accept

two SlashPackets from the same chain for double-signing committed by the same validator (as the first packet would result in the validator will be tombstoned anyway)
two SlashPackets from the same chain for downtime committed by the same validator (as the consumer should not have sent a second downtime due to the outstanding downtime flag)

mpoke · 2022-12-14T08:59:11Z

the provider shouldn't accept two SlashPackets from the same chain for downtime committed by the same validator (as the consumer should not have sent a second downtime due to the outstanding downtime flag)

This led me think to the following edge case:

a validator val is down on multiple consumer chains, which results in SlashPackets sent to the provider (one packet from each consumer)
the first SlashPacket goes through and it jails the validator on the provider
the other SlashPackets are in the queue behind other SlashPackets for other validators
the throttling meter goes negative and as a result the other SlashPackets for val are not handled
val unjails itself after 10m
the other SlashPackets for val are handled and val is jailed again

This behavior is different than the behavior w/o jail throttling where val would be slashed multiple times but jailed only once. Not sure though that's a problem though. Could be annoying for validators. @jtremback What do you think?

mpoke · 2022-12-14T09:01:50Z

Don't stop iteration through queue

This would handled packets in a different order than the one they were received, and thus break the assumption of an order CCV channel.

shaspitz · 2022-12-14T16:22:39Z

I would suggest to first add validation for 1-3 since it's straightforward.

This sounds good to me since there's issues open, 4 is also not crazy challenging.

This led me think to the following edge case

Valid edge case 👍 this is indeed a change in behavior. But, imo the new (throttling introduced) behavior doesn't cause any issues.

For all of these convos, I'd prefer we implement them as their own PRs into main after the large throttling branch is merged.

jtremback · 2022-12-16T15:55:00Z

Hold on- Why do we need to fix this? The slash throttle is ultimately intended to give validators time to halt the chain in an attack, this "problem" just lets the attacker do it for us. @mpoke @smarshall-spitzbart

shaspitz · 2022-12-16T16:25:59Z

If our goal was to automatically halt the chain when too many slash packets come in, the throttling PR could have been way less complex.

This issue focuses on throwing out slash packets where it's possible to avoid automatic halts (w/o affecting throttling) allowing validators to act as they see fit during an attack (which may be a halt most of the time, or just a swift consumer removal?).

mpoke · 2022-12-16T23:03:27Z

The idea is that slash throttling could enable the censorship of SlashPackets. All you need is a malicious consumer, send many SlashPackets so no other consumer would get their SlashPackets in. This is due to the fact that the provider panics if a queue is exceeding the max size.

shaspitz · 2022-12-19T17:59:41Z

This is due to the fact that the provider panics if a queue is exceeding the max size.

Not just this, but if too many slash packets are persisted in state, binaries could start to see unexpected behavior and start to halt in a non-deterministic manner. That's why the explicit, deterministic halt was implemented

shaspitz · 2022-12-19T18:01:17Z

CC @MSalopek once the More validation logic section from above is implemented, we could remove the halt code altogether. Feel free to make that its own issue if you see fit, or we keep it a part of this issue

shaspitz · 2022-12-22T20:53:37Z

Replying to @mpoke's comment

This would handle packets in a different order than the one they were received, and thus break the assumption of an order CCV channel.

Correct me if I'm wrong, but an ordered channel only guarantees the order in which packets are recv is the same as the order they were sent. We can handle things is whatever order is appropriate for the designed protocol. Therefore, an adaptation of Don't stop iteration through queue could possibly still have some merit.

(just brainstorming here, as More validation logic is becoming pretty hard for me to wrap my head around).

Note that throttling already breaks Provider Slashing Warranty from the spec, and that property will need to be updated eventually

mpoke · 2022-12-23T08:23:01Z

Correct me if I'm wrong, but an ordered channel only guarantees the order in which packets are recv is the same as the order they were sent. We can handle things is whatever order is appropriate for the designed protocol. Therefore, an adaptation of Don't stop iteration through queue could possibly still have some merit.

From an IBC perspective, we can handle things in whatever way. However, ICS relies on the Channel Order property. This entails that the packets MUST be handled in the same order they were sent. Figuring out whether it's safe to handle slash packets out of order needs careful analysis.

Note that throttling already breaks Provider Slashing Warranty from the spec, and that property will need to be updated eventually

Indeed. I assume that's why the diff testing for throttling is not working yet.

mpoke · 2022-12-23T08:29:26Z

Regarding this issue, I don't think we should panic the provider when the queues are exceeding a certain size. Just remove the consumer and cleanup the corresponding state.

shaspitz · 2022-12-23T16:32:09Z

However, ICS relies on the Channel Order property. This entails that the packets MUST be handled in the same order they were sent.

We'll need to update the language for the channel order property then, it current says:

Channel Order: If a packet P1 is sent over a CCV channel before a packet P2, then P2 MUST NOT be received by the other end of the channel before P1.

I agree that we'll need to carefully analyze solutions to this issue

shaspitz · 2022-12-23T17:20:41Z

Discussion for after holidays: remove panic on provider when queues get too large vs making the max queue size 10k instead of 1k

ivan-gavran · 2023-01-31T14:47:56Z

I was just thinking about reporting this potential problem, when I saw you already discussed it at length here :)
What is then the final decision on this issue - going with 100k as the default size?

I am wondering, why is this a solution? Couldn't a combination of a broken/malicious consumer and a collaborating relayer cause the panicking even with this increased size?

shaspitz · 2023-02-06T20:28:45Z

I was just thinking about reporting this potential problem, when I saw you already discussed it at length here :) What is then the final decision on this issue - going with 100k as the default size?

I am wondering, why is this a solution? Couldn't a combination of a broken/malicious consumer and a collaborating relayer cause the panicking even with this increased size?

Hi Ivan, it's correct that a malicious consumer is still able to halt the provider if this scenario were to play out. However, without the throttling mechanism, safety could be violated by applying each slash packet immediately. The decision to go with the current design is to sacrifice liveness in an edge case scenario for the sake of security.

See #713 for the seemingly best solution to this issue

ivan-gavran · 2023-02-09T13:39:00Z

@smarshall-spitzbart , on the second thought: it can't halt the chain after all. The panic happens inside SetThrottledPacketDataSize, which is either called from DeleteThrottledPacketData (originating from EndBlocker) or it is called from IncrementThrottledPacketDataSize (originating from a transaction).

In the case of deletion, it will not panic because it had to be larger than the max size of the queue even before deletion.
In the case of addition (and incrementing the queue size), it can panic but that won't halt the chain.

shaspitz · 2023-02-10T22:45:53Z

@smarshall-spitzbart , on the second thought: it can't halt the chain after all. The panic happens inside SetThrottledPacketDataSize, which is either called from DeleteThrottledPacketData (originating from EndBlocker) or it is called from IncrementThrottledPacketDataSize (originating from a transaction).

In the case of deletion, it will not panic because it had to be larger than the max size of the queue even before deletion. In the case of addition (and incrementing the queue size), it can panic but that won't halt the chain.

Ah interesting, if a transaction panics the binary then the chain will still produce blocks, and also accept other transactions in subsequent blocks?

ivan-gavran · 2023-02-13T09:53:01Z

Ah interesting, if a transaction panics the binary then the chain will still produce blocks, and also accept other transactions in subsequent blocks?

Yes, because the panic would be caught by this recover function inside runTx

shaspitz · 2023-06-09T21:03:58Z

This issue will become obsolete once we replace throttling v1 with either #713 or #761

shaspitz added type: feature-request New feature or request improvement ccv-core labels Dec 13, 2022

shaspitz added this to Replicated Security Dec 13, 2022

shaspitz moved this to Todo in Replicated Security Dec 13, 2022

mpoke mentioned this issue Dec 16, 2022

Fix double iterator bug #605

Merged

2 tasks

shaspitz mentioned this issue Dec 22, 2022

Outdated SlashPackets are not dropped #544

Closed

5 tasks

shaspitz mentioned this issue Jan 5, 2023

Change default max throttled packets params + small log fix #639

Merged

1 task

mpoke added this to the Untrusted consumer chains milestone Jan 27, 2023

mpoke added this to Cosmos Hub Jan 27, 2023

github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Jan 27, 2023

mpoke removed this from Replicated Security Jan 27, 2023

shaspitz mentioned this issue Feb 6, 2023

EPIC: Improve throttling mechanism with retries #713

Closed

mpoke removed this from the Untrusted consumers milestone Mar 5, 2023

mpoke moved this from 🩹 Triage to 📥 Todo in Cosmos Hub Mar 5, 2023

mpoke mentioned this issue Mar 6, 2023

EPIC: Provider assumption of trusted consumer too strong #754

Closed

shaspitz closed this as completed Jun 9, 2023

github-project-automation bot moved this from 📥 Todo to ✅ Done in Cosmos Hub Jun 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent DOS vector introduced by throttling #594

Prevent DOS vector introduced by throttling #594

shaspitz commented Dec 13, 2022 •

edited

Loading

shaspitz commented Dec 13, 2022

mpoke commented Dec 14, 2022

mpoke commented Dec 14, 2022

mpoke commented Dec 14, 2022

mpoke commented Dec 14, 2022

shaspitz commented Dec 14, 2022

jtremback commented Dec 16, 2022

shaspitz commented Dec 16, 2022 •

edited

Loading

mpoke commented Dec 16, 2022

shaspitz commented Dec 19, 2022

shaspitz commented Dec 19, 2022

shaspitz commented Dec 22, 2022

mpoke commented Dec 23, 2022 •

edited

Loading

mpoke commented Dec 23, 2022

shaspitz commented Dec 23, 2022

shaspitz commented Dec 23, 2022

ivan-gavran commented Jan 31, 2023

shaspitz commented Feb 6, 2023

ivan-gavran commented Feb 9, 2023

shaspitz commented Feb 10, 2023

ivan-gavran commented Feb 13, 2023

shaspitz commented Jun 9, 2023

Prevent DOS vector introduced by throttling #594

Prevent DOS vector introduced by throttling #594

Comments

shaspitz commented Dec 13, 2022 • edited Loading

Problem

Closing criteria

Problem details

More validation logic

Don't stop iteration through queue

shaspitz commented Dec 13, 2022

mpoke commented Dec 14, 2022

mpoke commented Dec 14, 2022

mpoke commented Dec 14, 2022

mpoke commented Dec 14, 2022

shaspitz commented Dec 14, 2022

jtremback commented Dec 16, 2022

shaspitz commented Dec 16, 2022 • edited Loading

mpoke commented Dec 16, 2022

shaspitz commented Dec 19, 2022

shaspitz commented Dec 19, 2022

shaspitz commented Dec 22, 2022

mpoke commented Dec 23, 2022 • edited Loading

mpoke commented Dec 23, 2022

shaspitz commented Dec 23, 2022

shaspitz commented Dec 23, 2022

ivan-gavran commented Jan 31, 2023

shaspitz commented Feb 6, 2023

ivan-gavran commented Feb 9, 2023

shaspitz commented Feb 10, 2023

ivan-gavran commented Feb 13, 2023

shaspitz commented Jun 9, 2023

shaspitz commented Dec 13, 2022 •

edited

Loading

shaspitz commented Dec 16, 2022 •

edited

Loading

mpoke commented Dec 23, 2022 •

edited

Loading