
Precise per-component ETH-denominated rewards tracking in ncli_db validatorDb #3107

Closed

wants to merge 2 commits into unstable from detailed-rewards-and-penalties

Conversation

@zah (Contributor) commented Nov 16, 2021

Also adds a Jupyter notebook for producing validator performance reports

Precise per-component ETH-denominated rewards tracking in ncli_db `validatorDb`

Also adds a Jupyter notebook for producing validator performance reports
@@ -76,7 +76,7 @@ type
## database - this may have a number of "natural" causes such as switching
## between different versions of the client and accidentally using an old
## database.
db: SqStoreRef
db*: SqStoreRef
Member:

this should not be exposed - expose a readOnly function instead

Member:

...and add a test - the db has fairly extensive coverage and should have coverage for this as well

github-actions bot commented Nov 16, 2021

Unit Test Results

12 files, 744 suites, 36m 30s ⏱️
1 464 tests: 1 462 ✔️, 2 💤, 0 ❌
8 924 runs: 8 916 ✔️, 8 💤, 0 ❌

Results for commit d51905e.

♻️ This comment has been updated with latest results.

@@ -606,7 +640,9 @@ type SomeMergeBlock =
proc process_block*(
cfg: RuntimeConfig,
state: var merge.BeaconState, blck: SomeMergeBlock, flags: UpdateFlags,
cache: var StateCache): Result[void, cstring] {.nbench.}=
cache: var StateCache,
outSlotRewards: SlotRewards | type DontTrackRewards = DontTrackRewards):
Member:

this design puts too much noise and responsibilities into the state transition function - what should be done instead is to follow the info model and extract a minimal set of information that's computed anyway, then do the math outside (in ncli_db in this case).

Like this, it becomes hard to audit the state transition for its main purpose - ncli_db is not a main purpose so we design accordingly and don't invade the state transition for it.

Also, many of the same computations are needed for the val-mon branch - the aim is thus to find an architecture that allows both these features to extract information with minimal impact on "normal" state transitioning and without magic compile-time modes that are only used sometimes.

Finally, like this, upstream test suite that verifies that computations are done correctly doesn't cover the numbers - in EpochInfo, the idea is that the cache that's built up to add up the rewards and penalties internally in the state transition is also used outside - this ensures that the exported information is as accurate as the one used inside.

To summarize: what I see should happen here is the introduction of a BlockInfo (it's not a SlotInfo - slots don't carry rewards) object that not only gets exported but is also used as part of the "normal" flow of the function. Further, where possible, the function should not export computed numbers - instead, it should export the minimal and orthogonal information needed to recreate the computations, because the context in which the numbers get used may vary - that means, for example, not summing rewards and penalties, and where possible, exporting flags and base rewards instead of final numbers. Ideally we would get rid of the track/don't-track difference because the impact should be minimal anyway - the way we test any change to the spec is by using the db benchmarking tool for the last 50k mainnet blocks, like so: #3089 - the same should be done here.
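(A rough sketch of this "building blocks" direction, with illustrative type, field and flag names rather than the actual nimbus-eth2 ones: the state transition exports per-validator facts, and ncli_db derives the numbers from them.

type
  ValidatorBlockInfo = object
    flags: set[RewardFlags]  # which duties the validator performed
    base_reward: Gwei        # enough to recompute every reward component

  BlockInfo = object
    validators: seq[ValidatorBlockInfo]

# ncli_db, not the state transition, derives the numbers it needs, e.g.:
func sourceReward(info: ValidatorBlockInfo): Gwei =
  if isSourceAttester in info.flags:  # hypothetical flag name
    info.base_reward * TIMELY_SOURCE_WEIGHT div WEIGHT_DENOMINATOR
  else:
    Gwei(0)
)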

Contributor Author:

I've tried briefly to redo the calculations in ncli_db, but this led to significant duplication of the logic in the spec that would have been quite fragile and difficult to maintain in the future (not to mention that it was not trivial to get right in the first place). Having "too much noise" is a much smaller problem in my opinion.

Comparing the new features to the existing EpochInfo functionality is not quite right, because in ncli_db we replace EpochInfo with DetailedEpochInfo (which carries more fields that have to be updated in more places and summed up at the end). Are you suggesting that DetailedEpochInfo should be the default?

Contributor Author:

Running the spec benchmark should be unnecessary right now, because the new code is not compiled in the normal spec execution flow. The added DontTrackRewards type parameter is erased from the final run-time code.

Member:

Yes, I'm suggesting to replace it with a single implementation that conveys all necessary information and benchmark that.

Further down, I also argue that separating the concerns and minimising the information transfer will give design advantages because it becomes more clear what information is simple (to someone that understands the reward code) to compute, and what information indeed needs complex data to represent, which in turn will inform a better database design.

I'm not at all comfortable with the idea of having parts of ncli_db in here - the minor duplication of logic is the lesser evil because the spec code doesn't change much except when the spec changes, while ncli_db should be quick and easy to modify - from a higher level, the spec code is not that difficult either - there are a few rewards, and you either get them or not - the most complex piece in here is the base reward - once you have that, you simply split it up between the various categories of rewards in a weighted fashion.
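(As a concrete sketch of that weighted split, using the Altair weight constants from the consensus spec - the helper name here is hypothetical:

func componentReward(base_reward: Gwei, weight: uint64): Gwei =
  # each reward category is a weighted slice of the base reward
  base_reward * weight div WEIGHT_DENOMINATOR

let
  source = componentReward(base_reward, TIMELY_SOURCE_WEIGHT)  # 14/64
  target = componentReward(base_reward, TIMELY_TARGET_WEIGHT)  # 26/64
  head   = componentReward(base_reward, TIMELY_HEAD_WEIGHT)    # 14/64
)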

When debugging and reasoning about what the spec code does, this kind of code is a major distraction and obstacle, in a moment when fast and accurate understanding is critical and time matters.

On the other hand, updating ncli_db is never critical and the way we use it changes over time - we should not introduce barriers to making changes in ncli_db, ergo we put the logic there, separately, to do exactly what ncli_db needs.

Contributor Author:

Same answer as here:
#3107 (comment)

@@ -483,6 +483,54 @@ type

flags*: set[RewardFlags]

DetailedRewardsAndPenalties* = object
Member:

from what I can tell, most of this information is already present in EpochInfo, albeit in compressed form - same as a comment below, we should instead extract raw, minimal and orthogonal information from here as part of the "normal" flow for the state transition, then do any computations that ncli_db needs in ncli_db instead of polluting the state transition with pure ncli_db concerns.

@zah (Contributor Author) commented Nov 16, 2021:

The compression in EpochInfo is a lossy one though, which is the reason it needs the unified delta value for each validator. I didn't want to increase the memory usage of the normal state transition by introducing multiple per-component deltas.

Member:

we can fix both problems by exporting building blocks instead of computed data ;)

Contributor Author:

My opinion is that this is too complicated and can be explored in future PRs. I can perhaps put online the alternative code that I abandoned as a start of such an effort.

A defining feature of the current approach is that we currently have assertions that the component-wise balances match the actual balance changes that end up in the BeaconState. This really requires you to replicate the logic of every single source of balance change in ncli_db, which will result in quite a lot of code that is much harder to maintain. Only the fact that the spec is not really expected to change provides some justification for such a heavy duplication of delicate logic.
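(A sketch of the kind of cross-check meant here, with hypothetical field and variable names:

# after epoch processing, the per-component sums must reconcile with the
# balance movement actually recorded in the state
let computed =
  detailed.source_outcome + detailed.target_outcome +
  detailed.head_outcome + detailed.inclusion_delay_outcome
doAssert computed == state.balances[vidx].int64 - balanceBefore.int64,
  "component-wise outcomes must match the state balance delta"
)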

Member:

I can perhaps put online the alternative code that I abandoned as a start of such an effort.

let's start there - an actual example will make it easier to judge.

That said, from the point of view of ncli_db, of course it makes sense to put the code here - the sticking point is that ncli_db is not a priority here, and that as far as possible, code for it should be kept out of here. This was a distinctive and clear priority in the architecture of the original solution and division of labour, both when writing the spec code and when creating ncli_db and its validator performance analysis features.

Contributor Author:

The alternative code is not finished though, and I'm personally not convinced that spending more time on it is beneficial. If you strongly disagree with this and you are willing to take over the maintenance of the more complicated version of ncli_db, then finishing this code on your own in a separate effort makes the most sense to me.

@@ -567,8 +583,11 @@ func get_inactivity_penalty_delta*(validator: RewardStatus,

delta

func get_outcome(delta: RewardDelta): int64 =
Member:

functions like this, which introduce significant risk of overflow Defects, should be kept out of the state transition codebase entirely - from what I can tell, it's not needed or used for any "ordinary" state transition flow - in the spec, we use either saturating, explicitly failing, or in rare cases, overflowing math, but never Defect, as far as is humanly possible, with nim injecting defects left and right.

Contributor Author:

This function can be removed since it is trivial and indeed used only in the reward tracking branches of the code, but I don't think it can overflow considering the maximum rewards and penalties are bounded and far below the int64 capacity.

Member:

no, but when it's here, it's easy to call by accident - it also is distracting because during review, it's one more thing to consider from a security pov - we've had enough bugs of this nature.

Contributor Author:

Well, it's limited to the RewardDelta type. How can you construct overflowing instances of this type by accident?

Member:

by copy-pasting the poor programming style that converts uint64 to int64 without range-checking, on the assumption that the caller has done it already - a comment would perhaps solve it, but nevertheless, the original argument stands: the code in the spec implements the python spec as closely as possible, using the conventions of the spec - for better or worse, the spec uses saturating and/or overflowing logic, and if we can stick to it, it's easier to reason about.
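(For illustration, the two styles under discussion, assuming RewardDelta carries uint64 rewards and penalties fields as in the spec code:

# unchecked narrowing: relies on rewards/penalties staying far below int64.high
func get_outcome(delta: RewardDelta): int64 =
  delta.rewards.int64 - delta.penalties.int64

# range-checked variant: fails loudly instead of silently assuming the bounds
func get_outcome_checked(delta: RewardDelta): int64 =
  doAssert delta.rewards <= uint64(int64.high) and
           delta.penalties <= uint64(int64.high)
  delta.rewards.int64 - delta.penalties.int64
)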

target_attester INTEGER NOT NULL,
head_attester INTEGER NOT NULL,
source_outcome INTEGER NOT NULL,
max_source_reward INTEGER NOT NULL,
Member:

the database is already quite heavy - this looks like it's storing redundant information that trivially could be computed instead - it would probably net a significant payoff in terms of performance and size to not store these redundant records

@zah (Contributor Author) commented Nov 16, 2021:

Can you clarify which information you consider trivially computable or redundant? The values that end up here can be unique per validator, and they depend on a number of epoch properties.

Where should the computation happen? Should the Python code in the Jupyter notebook replicate the reward computation logic of the spec?

Member:

I imagine storing the base reward per epoch + validator flags (one int) per validator is quite a lot more efficient. The reason I care is that I run this analysis for all validators to get baselines and other statistics. I initially ran it for all "perfect" records as well, but this completely blew up the database.

the computation can be done either in the database with a view or indeed in jupyter (potentially by importing the pyspec) - the functions are mostly trivial.

that said, this doesn't really matter for this iteration of the PR - it can be adjusted later - I do think that it's important to remove the computations from the state transition itself and redo them in ncli_db directly, producing whatever format and information ncli_db needs/wants - when designing the entire functionality that way, it will also become immediately clear what the minimal information set in the database is, and how far we should take the normalization of the data therein.
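(A sketch of the compact scheme floated above, all names hypothetical: one small flags integer plus the base reward per validator, with the per-component rewards derived on read:

type
  ValidatorEpochRecord = object
    flags: uint8       # bit 0: timely source, bit 1: target, bit 2: head, ...
    base_reward: Gwei  # stored once per validator and epoch

func packFlags(source, target, head: bool): uint8 =
  source.uint8 or (target.uint8 shl 1) or (head.uint8 shl 2)
)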

@zah (Contributor Author) commented Nov 17, 2021:

Same answer as here:
#3107 (comment)

Indeed, the produced database is huge, but why is this a problem? It's easy to get enough storage on a machine where you intend to run the tool. Considering that updates to the database are incremental, speed is not much of a problem either.

ncli/ncli_db.nim: two outdated review threads (resolved, not shown)
## `RewardStatus` augmented with detailed `RewardInfo`.
detailed_info*: DetailedRewardsAndPenalties

SlotRewards* = ref object
Member:

why is this ref?

@zah (Contributor Author) commented Nov 16, 2021:

Nim has an issue with var types that appear in the branches of or types: when a function parameter has such an or type, the parameter symbol ends up being treated as non-var within the body of the function even when the var branch is instantiated. I should have added a TODO comment explaining this, indeed.
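(A reduced illustration of the workaround, not the actual nimbus-eth2 code: since the or-type parameter cannot reliably be var, mutation goes through a ref instead.

type
  SlotRewards = ref object
    data: seq[int64]
  DontTrackRewards = object

proc process(outSlotRewards: SlotRewards | type DontTrackRewards) =
  when outSlotRewards is SlotRewards:
    outSlotRewards.data.add 1  # mutating through the ref needs no `var`
)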

Member:

seems like the lesser evil still is to do var (SR | DT) then - the conventions help significantly when deciphering the state transition - what is safe and what is not safe to do with the various helper functions - in fact, I wish nim had call-site annotations for mutability, so you'd have to call with f(a, var b)

Contributor Author:

var (A | B) is not possible here because B is a type parameter:

outSlotRewards: SlotRewards | type DontTrackRewards

Member:

this is the kind of trickery that I very much like to avoid in spec code.

the feature / corner case was complicated enough that the person who introduced it in the compiler couldn't implement it well - it's not a good sign, in general, for the odds that people reading the code will understand what's going on without relying on complex and rarely used language details.

@zah (Contributor Author) commented Nov 17, 2021:

What's complex to understand precisely? You can claim that it's unfamiliar, sure, but claiming that it is complex is wildly exaggerated.

Also, in your accusations, you often confuse "complicated for the person who introduced it in the compiler to implement" with "the person who implemented it in the compiler had very limited development time and wasn't able to follow a perfect Q/A process".

By paying the price of using a less familiar trick, you get to follow the zero overhead principle - you pay nothing for a feature that you don't use.

Member:

That's the point I'm trying to make: it's not zero-overhead - there's a significant cognitive overhead of sorting out the cleverness and filtering out the noise when reading the code, which is something we do a lot when reverse engineering problems.

The approach of making DontTrackRewards a var and not breaking convention is also zero-overhead in terms of execution time - compilers automatically remove unused arguments from the function signature (under the right conditions).

The approach of not making these clever things at all and simply check the performance with a benchmark goes even further: it makes the argument that if the overhead cannot be detected in real life, it's not there, and the complexity is actually unwarranted, no matter what theory you had in mind when writing the code.

@zah (Contributor Author) commented Nov 17, 2021:

significant cognitive overhead

I find this to be an exaggerated claim. Benchmarking the code, measuring the effects on compilation time, making sure that there won't be any regression in your results in the future (e.g. due to changes in the used compilers), etc. is more development overhead and technical debt than learning a new Nim trick that everybody can benefit from in other situations.

@@ -255,7 +259,9 @@ proc state_transition_block_aux(
altair.SigVerifiedSignedBeaconBlock | altair.TrustedSignedBeaconBlock |
merge.TrustedSignedBeaconBlock | merge.SigVerifiedSignedBeaconBlock |
merge.SignedBeaconBlock,
cache: var StateCache, flags: UpdateFlags): bool {.nbench.} =
cache: var StateCache, flags: UpdateFlags,
outSlotRewards: SlotRewards | type DontTrackRewards = DontTrackRewards):
Member:

throughout, we use var to signify mutability instead of Hungarian naming - we don't use ref unless the function takes a copy of the reference for some reason, which these functions don't do.

Contributor Author:

Explained here:
#3107 (comment)

# https://github.com/ethereum/consensus-specs/blob/v1.1.5/specs/altair/beacon-chain.md#modified-slash_validator
# https://github.com/ethereum/consensus-specs/blob/v1.1.4/specs/merge/beacon-chain.md#modified-slash_validator
# Extracted from `slash_validator` in order to reuse in ncli_db for detailed rewards tracking
proc get_slashing_penalty*(
Member:

this could simply use overloading, which is what most of the state transition uses - it's easier to recognise, generally

Contributor Author:

The code this was extracted from was not using overloading. In general, I disagree that overloading is superior to a series of when statements unless you are designing a library aiming to take advantage of the open-closed principle (which is not the case here). Overloading leads to inferior error messages, it has more compilation failure modes, and it's more expensive for the compiler.
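(For reference, a sketch of the closed-set when style argued for here, using the fork constants from the consensus spec - illustrative, not the PR's actual code:

func get_slashing_penalty(state: SomeBeaconState,
                          validator_effective_balance: Gwei): Gwei =
  when state is phase0.BeaconState:
    validator_effective_balance div MIN_SLASHING_PENALTY_QUOTIENT
  elif state is altair.BeaconState:
    validator_effective_balance div MIN_SLASHING_PENALTY_QUOTIENT_ALTAIR
  elif state is merge.BeaconState:
    validator_effective_balance div MIN_SLASHING_PENALTY_QUOTIENT_MERGE
  else:
    {.error: "unsupported state type".}
)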

Member:

my point was mostly about conventions, and how we can make this new function stand out less - the conventions help guide the reader with less effort than a custom-written when jungle that breaks the convention (for example because it's easier for tooling to understand it) - in this particular case, it's trivial either way - in the larger functions, making sense of a when jungle with the extra indentation it brings is difficult, so overloading was chosen as the dominant strategy, and it has worked pretty well so far as far as I can tell, including when deciphering error messages.

the error messages are perhaps more relevant when you've created a complex and deep type machinery with many layers of generics and significant risk of having the wrong functions being called accidentally, like happens with nim-serialization - this code however is as simple as it gets, and it pays off, error messages included.

these style questions are a bit like colors, everyone has a favorite - mixing them all in one place however rarely ends up well (unless you like brown).

@zah (Contributor Author) commented Nov 16, 2021:

If the convention is wrong, let's fix the convention.

Also, the convention you are describing is hardly followed in the spec code. After all, I've just copy-pasted the exact body of this function which was inlined in another spec function.

Member:

well, yes, you copied an exception - I mentioned in the review how it can be brought in line with the rest, but also noted it's a minor point.

if you want to change the convention, the way to go about it would be through a separate PR that changes the convention across the board - I suspect it will be ugly mainly because it's not enforced by a language rule but rather by the whims of whoever writes the code, which is hard to maintain at a good level.

anyway.

Contributor Author:

I've come up with a few more arguments against unnecessary use of overloading in general.

My primary way of navigating the project is through the "Jump to function name" functionality provided by Vim's CTRL+P plugin and its ctags integration. It looks like this:

[screenshot: Vim's Ctrl+P fuzzy-finder list]

... but you get to jump directly to function names. Needless to say, this workflow works much better when no overloading is present.

Your assertion that the error messages are just fine also doesn't match my experience. During the development of this feature I've made some trivial typos and I had to face some pretty bad error messages due to the heavy overloading of process_block.

In general, if we follow the general philosophy that the least powerful tool should be used to solve any particular problem, I think there is a clear gradient of power/complexity that goes like this:

  1. Functions are the simplest tool
  2. Templates are needed sometimes to produce better code or abstract away control flow constructs
  3. Closed sets of alternatives are better handled with case objects or a series of when statements
  4. Open sets of alternatives need to be handled with overloading or methods (dynamic dispatch is considered more complex than static dispatch)

Contributor Author:

Hrm, GitHub doesn't seem to like uploading gifs. Here is the Ctrl+P animation I wanted to share:
https://miro.medium.com/max/1400/1*-DMqWWZ_btpEiyNgKpc8NQ.gif

Member:

Jump to function name

well, with a when jungle, you're jumping to the beginning of a potentially long tree of complex statements that you manually have to disambiguate - first, you have to remember what type you were jumping from (which likely was deduced as well), then find the right when clause by executing the type comparison code in your head and so on, and there is no hope that a tool will ever understand this.

With overloading, a tool like nimsuggest can understand that it's supposed to make a contextual jump and gets you to the right spot without the extra processing that you, the developer, have to do.

when statements are free-form - they provide no constraining structure that you can learn - at best, you can try to write a style guide with conventions, but that's generally not good enough for tooling, or reading by experience.

For anything larger than a few lines of code, the extra level of indent also makes a significant dent in readability - you don't know where the phase0 logic vs the altair logic begins and ends. Soon you end up with a fairly convoluted structure: a manually written when jungle that forwards calls to an actual implementation that has non-overloaded names.

All that said, more important than the particular style is the fact that having two or more styles for this problem brings out all the negatives of either approach, but none of the benefits - you end up having to deal with both styles and nobody is happy - this is what my comment primarily was about: most of the spec uses overloading to solve this problem and this new function represents an opportunity to align the code with that convention simply based on a majority decision. It's a completely different argument whether the convention should be changed or not, but no matter what your opinion is on this subject, I think we can agree that this PR is not the time and place for making that change.

The function previously used a when jungle because it had a "shared" section - the new function has no shared code - this difference gives the overload argument more weight. If we really want to make this piece of code clean, we'd simply select the constant based on type using a much smaller helper function, then make a generic function with the division logic.

Ah, the joy of arguing code style philosophy over morning ☕ :)
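(A sketch of that last suggestion: a small overloaded helper selects the constant by type, and a single generic function keeps the division logic - the helper name is hypothetical:

func slashingPenaltyQuotient(T: type phase0.BeaconState): uint64 =
  MIN_SLASHING_PENALTY_QUOTIENT
func slashingPenaltyQuotient(T: type altair.BeaconState): uint64 =
  MIN_SLASHING_PENALTY_QUOTIENT_ALTAIR
func slashingPenaltyQuotient(T: type merge.BeaconState): uint64 =
  MIN_SLASHING_PENALTY_QUOTIENT_MERGE

func get_slashing_penalty(state: SomeBeaconState,
                          validator_effective_balance: Gwei): Gwei =
  # generic division logic; only the constant is fork-specific
  validator_effective_balance div slashingPenaltyQuotient(typeof(state))
)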

Member:

Here is the Ctrl+P animation I wanted to share:

very ambitious - however, I'd recommend picking a language that has overloading - free choice makes for a much more compelling argument 😛

zah added a commit that referenced this pull request Dec 29, 2021
This is an alternative take on #3107
that aims for more minimal interventions in the spec modules at the expense of
duplicating more of the spec logic in ncli_db.
zah added a commit that referenced this pull request Dec 29, 2021
This is an alternative take on #3107
that aims for more minimal interventions in the spec modules at the expense of
duplicating more of the spec logic in ncli_db.
@arnetheduck force-pushed the unstable branch 2 times, most recently from 657f9d5 to a4667d1 (January 6, 2022 16:14)
zah added a commit that referenced this pull request Jan 17, 2022
This is an alternative take on #3107
that aims for more minimal interventions in the spec modules at the expense of
duplicating more of the spec logic in ncli_db.
zah added a commit that referenced this pull request Jan 17, 2022
This is an alternative take on #3107
that aims for more minimal interventions in the spec modules at the expense of
duplicating more of the spec logic in ncli_db.
@zah closed this Jan 18, 2022
@arnetheduck deleted the detailed-rewards-and-penalties branch February 16, 2022 10:09