Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throttle bug fixes + req refactors #565

Merged
merged 10 commits into from
Dec 8, 2022
Merged

Conversation

shaspitz
Copy link
Contributor

@shaspitz shaspitz commented Dec 6, 2022

Description

This PR solves two bugs found from diff tests on #462, thank you for finding these @danwt! It also closes an issue raised about the integration of key assignment and throttling.

Bug 1: #462 (comment)
Bug 2: #462 (comment)
Issue fixed: #538

Linked issues

Closes #538

Type of change

  • Bugfixes

How was the feature tested?

  • E2E tests
  • Integration

Other information

This will need to be cherry-picked into goc-december and circuit-breaker

@shaspitz shaspitz changed the base branch from main to circuit-breaker December 6, 2022 21:42
@shaspitz shaspitz marked this pull request as ready for review December 6, 2022 21:46
Copy link
Contributor

@danwt danwt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heyoo
It's very important to define precise invariant(s) and property(s) of your ideal system before you start implementing a solution to get there. There are problems cropping up because the invariants defined so far are not precise.

The invariant is here

## Throttling Invariant
Using on-chain params and the sub protocol defined, slash packet throttling is implemented such that the following invariant is maintained (in addition to those already defined in the CCV spec).
For the following invariant to hold, these points must be true:
- The final slashed validator does not have more than `SlashMeterReplenishFraction` of total voting power on the provider.
- `SlashMeterReplenishFraction` is large enough to avoid rounding errors.
- `SlashMeterReplenishPeriod` is sufficiently longer than the time it takes to produce a block.
Invariant:
> The time it takes to jail `X`% of the provider validator set from consumer initiated slash requests will be greater than or equal to `(X * SlashMeterReplenishPeriod / SlashMeterReplenishFraction) - 2 * SlashMeterReplenishPeriod`
Intuition: If jailings begin when the slash meter is full, then `SlashMeterReplenishFraction` of the provider validator set can be jailed immediately. The remaining jailings are only applied when the slash meter is positive in value, so the time it takes to jail the remaining `X - SlashMeterReplenishFraction` of the provider validator set is `(X - SlashMeterReplenishFraction) * SlashMeterReplenishPeriod / SlashMeterReplenishFraction`. However, the final validator could be jailed during the final replenishment period, with the meter being very small in value (causing it to go negative after jailing). So we subtract another `SlashMeterReplenishPeriod` term in the invariant to account for this.
Note this invariant could be adjusted with different slash meter protocols, but the current scheme is the simplest to implement and understand.
This invariant is useful because it allows us to reason about the time it takes to jail a certain percentage of the provider validator set from consumer initiated slash requests. For example, if `SlashMeterReplenishFraction` is set to 0.06, then it takes no less than 4 replenishment periods to jail 33% of the provider validator set on the Cosmos Hub. Note that as of writing this on 11/29/22, the Cosmos Hub does not have a validator with more than 6% of total voting power.
Note also that 4 replenishment period is a worst case scenario that depends on well crafted attack timings.

but it's not sensible in the current form

The time it takes to jail X% of the provider validator set from consumer initiated slash requests ...

because the validator set is always changing every block. So the sentence doesn't really make sense: which validator set is being talked about?

I bring this up because development without invariants leads to many problems. The core issue is that you cannot judge a solution/implementation without invariants. Moreover, you cannot describe what the code is supposed to be doing, except in a hand wavy way.

I bring this up here because, in this PR, changes are being made which affect the slash guarantees in subtle ways, and it cannot be said if they're correct or not, because the goal (the invariant) isn't defined.

To be concrete, with a problem here. One of the issues being addressed is the one from this comment

#462 (comment)

which says that setting replenishFraction=1 is not sufficient to avoid packet delays. But this code does not address that AFAIU.

Note that inactive validators are not the same as unbonded valiators. All unbonded validators are inactive, but not all inactive validators are unbonded. Some inactive validators are unbonding, and they must be slashable. Slash packets targeting active and inactive (but not unbonded) validators can all arrive at the same time and they must all be handled as per whatever invariant.

I think it's very important to precisely write down the throttle invariants, and then work from there. We can work together on doing that!

@mpoke mpoke requested a review from MSalopek December 7, 2022 10:57
Copy link
Contributor

@MSalopek MSalopek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because needed in goc-december testnet release.

Integration tests will be added in some other PR.

x/ccv/types/ccv.go Outdated Show resolved Hide resolved
x/ccv/provider/keeper/throttle.go Outdated Show resolved Hide resolved
x/ccv/provider/keeper/throttle.go Outdated Show resolved Hide resolved
@shaspitz shaspitz changed the title Throttle bug fixes Throttle bug fixes + req refactors Dec 7, 2022
@mergify mergify bot mentioned this pull request Dec 7, 2022
24 tasks
@shaspitz
Copy link
Contributor Author

shaspitz commented Dec 7, 2022

Everything passes locally, including golancgci-lint run, so ignore the failed GH actions

@shaspitz
Copy link
Contributor Author

shaspitz commented Dec 7, 2022

Responding to @danwt

There are problems cropping up because the invariants defined so far are not precise.

The problems that have came up so far seem like normal bugs that're a part of the development process. E2e/unit tests are limited, this is why we decided not to merge the slash packet throttling feature into main until we either had diff tests passing, or throttle specific integration tests passing (both of which are capable of finding this critical bug). Finding bugs now is a good thing imo and I welcome such efforts!

the goal (the invariant) isn't defined.

Imo this is not the case. There is a lack of finalized spec rn, but that does not mean we're throwing invariants out the window. Provider Slashing Warranty is the only invariant that slash packet throttling should adjust in the current CCV spec. If there are more existing CCV invariants that this throttling implementation violates, then we can have a discussion about that, and make changes to spec or implementation as needed.

which validator set is being talked about?

You have a valid concern about the invariant I've defined in throttle.md. I will improve that document to define that I'm referring to the validator set that exists at the start of any such attack. If it interests you to collab on making the language more granular, then by all means I'd welcome the help, CC @mpoke who will write the spec.

This PR does address #462 (comment). See TestSlashSameValidator, which would not pass in circuit-breaker without this PR. Setting SlashMeterReplenishFraction = 1 should now allow all slash packets to be handled immediately (until you slash 100% of validators), even if they're recv in the same block. If I'm understanding the bug incorrectly and it's not fixed by this PR, then lets discuss.

Note that inactive validators are not the same as unbonded valiators.

None of the functionality you've written in this paragraph should be violated by slash packet throttling. If it is violated, that's a bug we need to fix.

@shaspitz shaspitz requested a review from mpoke December 7, 2022 23:49
Copy link
Contributor

@mpoke mpoke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just some suggestions on simplifying the code. Also, why are the GH actions failing?

ctx.BlockTime(), // recv time
chainID, // consumer chain id that sent the packet
packet.Sequence, // IBC sequence number of the packet
providerConsAddr)) // Provider consensus address of val to be slashed

// Queue slash packet data in the same (consumer chain specific) queue as vsc matured packet data,
// to enforce order of handling between the two packet types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not update the packet data to include the providerConsAddr and avoid calling GetProviderAddrFromConsumerAddr later on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not update the packet data

By this, I believe you're referring to the SlashPacketData type. This is the protobuf generated type that is sent over the wire from consumer to provider. Adding a new field to this type (that's only used on the provider) seems unnecessary to me. However, if you feel this is important, I can refactor the base branch once the merge conflicts are solved. I'm hesitant to do much more refactoring in this PR due to how disconnected from main it is at this time :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more like updating SlashPacketData.Validator.Address to the provider address obtained from GetProviderAddrFromConsumerAddr. But it's not that important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, that's a reasonable change, done in 1f920b4

x/ccv/provider/keeper/relay.go Show resolved Hide resolved
Comment on lines +38 to +45
var valPower int64
if !found || val.IsJailed() {
// If validator is not found, or found but jailed, it's power is 0. This path is explicitly defined since the
// staking keeper's LastValidatorPower values are not updated till the staking keeper's endblocker.
valPower = 0
} else {
valPower = k.stakingKeeper.GetLastValidatorPower(ctx, val.GetOperator())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something but doesn't this change allow validator to be slashed all the way to 0 in a single block?

If a consumer sends many slash packets for the same validator?


Besides that, I don't think this addressing the problem in this comment

#462 (comment)

To state the problem more clearly:

We want it to be possible to set the slashFraction or other params so that the throttle is 'invisible'. Ie. the behavior of the system is exactly as in the current spec,
This means, we do not want any slash packets to be held up/delayed (still talking about when fraction=1, replenishPeriod = 1 second ect).

Consider the following

Imagine there are n validators total on the provider, with k<n active. Imagine the actives are v1,v2..vk and the inactives are vk+1,..vn

A consumer sends slash packets for every single provider validator v1..,vn and they are all delivered to the provider in the same block.

It is possible, that all of the slashes for the active validators v1..vk are handled first. This will set the meter to 0 so the slashes for vk+1,.. vn will be blocked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this design we can slash up to but not including 100% of the "initial val set" instantaneously when the replenish fraction is 1. If we want to include 100%, its a one line inequality change. I like 100% not being allowed, but I'm not super attached to that idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change we discussed today is included in 777a3f6, thank you for the efforts and feedback. Even with disagreements at times, the dialogue ended up useful and productive! It will be cool to get this feature properly specced and thoroughly tested before prod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and before I forget,

I might be missing something but doesn't this change allow validator to be slashed all the way to 0 in a single block?

Yes this is possible with or without slash packet throttling. Does #544 address your concern? If not, might be useful to add onto that issue or create a new one

@shaspitz
Copy link
Contributor Author

shaspitz commented Dec 8, 2022

Also, why are the GH actions failing?

I think this has to do with some oddities of how GH handles CI for a branch off a branch that's disconnected from main 😆, I just ran make test locally on the most recent commit, and it all checks out

@shaspitz shaspitz merged commit a759c07 into circuit-breaker Dec 8, 2022
@shaspitz shaspitz deleted the throttle-bug-fixes branch December 8, 2022 21:33
@shaspitz
Copy link
Contributor Author

shaspitz commented Dec 8, 2022

@mpoke @danwt, I've merged, but have done my best to address your PR comments here. Please lmk in the base PR if there's anything that feels unaddressed, or incorrect.

Refactors will be a lot easier to reason about once circuit-breaker is up to date with main, and particularly #542

danwt pushed a commit that referenced this pull request Dec 9, 2022
commit 01b13d3
Author: Shawn Marshall-Spitzbart <[email protected]>
Date:   Thu Dec 8 18:34:49 2022 -0800

    weird

commit a759c07
Author: Shawn Marshall-Spitzbart <[email protected]>
Date:   Thu Dec 8 13:33:34 2022 -0800

    Throttle bug fixes + req refactors (#565)

    * fixes

    * Update throttle.go

    * fix tests

    * set replenish frac = 1.0 for all test runs

    * rm unmarshal func

    * req refactors

    * small lint fix

    * comment adjustment

    * found <-> jailed order swap

    * 100% change
jtremback added a commit that referenced this pull request Dec 16, 2022
* wip

* Update module.go

* wip

* tests pass

* Update relay.go

* Update relay.go

* Update relay.go

* Update relay.go

* wip

* still wip

* panic with too many packets

* Update relay.go

* wip

* checkpoint

* Update throttle.go

* make relay.go closer to main

* merge fixes

* Update expected_keepers.go

* Update throttle_test.go

* Update throttle.go

* second queue is now implemented

* smalls

* Update throttle.go

* removed pointer silliness

* Update throttle_test.go

* test improvements

* small

* where it's called

* Update relay.go

* comments n stuff

* Update throttle.go

* Update throttle.go

* comments

* wip

* callbacks

* cleans

* mas tests

* Update README.md

* less diff

* mas

* keys

* wip

* changes

* Update params.go

* size constraints

* Update keys_test.go

* Update throttle_test.go

* wip

* on recv new behavior and test

* on recv slash packet

* so close

* clean

* Update slashing.go

* cleans

* wip

* params

* wip

* tests

* changes

* mas

* Update README.md

* Update README.md

* Update README.md

* sorry for the friday night emails

* Update instance_test.go

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* works

* Update generic_setup.go

* Update setup.go

* smol

* small

* path to ccv chan setup

* todos

* Update setup.go

* Create debug_test.go

* democ

* Update debug_test.go

* setup all ccv channels

* bump to main

* another bump, missed one

* fix after merge main

* fixes

* Update slashing.go

* expired client tests

* checks

* fixed the stuff

* smol

* changes

* updates

* cleans

* clean

* todo

* fixes

* cleans

* Update slashing.go

* Update slashing.go

* mod tidy

* fix build errs

* test fixes

* Update slashing.go

* Update throttle.go

* base rounding

* mas unit tests

* updates

* comments

* comments

* cleans

* clean

* small

* sin gas

* more understandable logic

* crypto rand

* utils

* one e2e

* helpers

* Update slashing.go

* helpers

* cleans

* shiz works

* tcs

* smalls

* un mas

* allowance changing test

* smol

* smol

* e2e tests are done

* lets try this

* Key Assignment -> goc-december (#527)

* add MsgAssignConsumerKey

* add MsgAssignConsumerKey

* fix package name

* add keys

* add keeper methods for key assignment

* handle MsgAssignConsumerKey

* map addresses in slash requests

* prune old consumer addresses

* move AssignConsumerKey logic to keeper

* update consumer initial valset

* add ApplyKeyAssignmentToValUpdates

* fix client creation

* do not check init valset on consumer

* clean state on val removal

* fix TestAssignConsensusKeyForConsumerChain

* delete on val removal

* remove reverse mapping on val removal

* remove pending key assignment in EndBlock

* add query endpoints
add summary of indexes
change ConsumerValidatorByVscID to ConsumerAddrsToPrune

* Refactor AssignConsumerKey for clarity (IMO)

* finish key assignment genesis code- untested

* FIxed mocks compile issue - not sure if it works right though.

* add test for init and export genesis

* set after get in AssignConsumerKey

* enable AssignConsumerKey to be called twice

* remove key assignment on chain removal

* apply some review comments

* fix bug: two validator with same consumer key

* rename key: ConsumerValidatorsByVscIDBytePrefix -> ConsumerAddrsToPruneBytePrefix

* PendingKeyAssignment -> KeyAssignmentReplacements

* msg.ProviderAddr is a validator addr

* fix: key assignment genesis tests (#517)

* Fix consumer init genesis test

* fix provider genesis tests

* fix key assignement handler

* fix linter

* fix merge conflict

* fix ProviderValidatorAddress

* remove unused expectation

Co-authored-by: Marius Poke <[email protected]>

* add key assignment CRUD operations unit tests (#516)

* test val consumer key related CRUD

* test val consumer addr related CRUD

* test pending key assignments related CRUD

* refactor after review session

* refactor after review session

* add prune key CRUD tests

* renamings in testfiles

* improve KeyAssignmentReplacement set and get

* remove ApplyKeyAssignmentToInitialValset (redundant)

* add invariant to docstring of AppendConsumerAddrsToPrune

* fix address conversion

* adding e2e tests

* fix linter

* add queries; setup integration tests (#519)

* add queries; setup integration testse

* test key assignment before chain start

* fix state queries; refactor

* rm extra comment

* rm unused action field

* bump voting times in all tests

* add provider address query to tests

* Adds some very basic random testing and unit tests (#522)

* Adds imports

* Does multi iterations: fails!

* Handle errs

* checkpoint debug

* Pre introduce dynamic mock

* Issue seems to be resolved

* Removes prints in key asisgn

* Removes debug, pre reintroduce all test features

* Fix some magic numbers, bring back prune check

* Pre rework initial assignments

* Refactor and tidyup

* Better docs, clarity, org

Co-authored-by: Daniel <[email protected]>

* Enable key assignment testing for all e2e tests (#524)

* split CCVTestSuite.setupCallback in two

* pre-assign keys for all vals of first consumer

* fix linter

* remove TestConsumerGenesis

Co-authored-by: mpoke <[email protected]>
Co-authored-by: Simon Noetzlin <[email protected]>
Co-authored-by: MSalopek <[email protected]>
Co-authored-by: Daniel T <[email protected]>
Co-authored-by: Daniel <[email protected]>

* Fix errors in merge commit, comment out failing TestRelayAndApplySlashPacket test

* add simon's test fixes

* Update state.go

* last replenish time -> last full time

* readme

* increment i

* shared method

* iteration change

* Meter allowance lockstep (#553)

changes

* log

* requested key format changes (#560)

changes

* Throttle garbage collection (#557)

* changes

* Update proposal.go

* add log

* Update throttle_test.go

* fixes

* update invariant

* Throttle bug fixes + req refactors (#565)

* fixes

* Update throttle.go

* fix tests

* set replenish frac = 1.0 for all test runs

* rm unmarshal func

* req refactors

* small lint fix

* comment adjustment

* found <-> jailed order swap

* 100% change

* weird

* merge fixes to build

* tests now pass

* name change

* better ordering tests

* remove integration test diff

* avoid double call to address mapping

* Update throttle.go

* 0 included in iteration start

* naming refactors

* Update keys_test.go

* more refactors

* clarify allowance terminology

* update doc with explanation on min value

* md clarification

* Update throttle.md

* swap replenish order

* add max limit note

* #533 Adds normal operation diff testing

* reb

* reb

* Del unused

Co-authored-by: Daniel <[email protected]>

* progress save

* cleans

* name change

* Bugfix (#605)

* Circuit breaker refactor (#606)

* quick fix

* small key correction

* smol

* don't store time length

* use big endian, shawn you dingus

Co-authored-by: Jehan <[email protected]>
Co-authored-by: mpoke <[email protected]>
Co-authored-by: Simon Noetzlin <[email protected]>
Co-authored-by: MSalopek <[email protected]>
Co-authored-by: Daniel T <[email protected]>
Co-authored-by: Daniel <[email protected]>
Co-authored-by: Jehan Tremback <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Throttle uses consumer validator address to lookup validator power but it should use provider address
5 participants