Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip System::Events in TryDecodeEntireState #2560

Closed
liamaharon opened this issue Nov 30, 2023 · 15 comments · Fixed by #3454
Closed

Skip System::Events in TryDecodeEntireState #2560

liamaharon opened this issue Nov 30, 2023 · 15 comments · Fixed by #3454
Assignees
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. I5-enhancement An additional feature request. T1-FRAME This PR/Issue is related to core FRAME, the framework.

Comments

@liamaharon
Copy link
Contributor

liamaharon commented Nov 30, 2023

@joepetrowski encountered this failed try-runtime check on one of his PRs https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4551392

This is the offending block https://rococo.subscan.io/block/0x565751cc2bfaf946216484bb34466c1e5a4f75c444235557d48f822974987048?tab=event

This is likely due to events being emitted in that block using an old storage layout. We avoid this issue with other storage items by running migrations prior to checking if state is decodable. This doesn't work for System::Events.

We should modify our decode state checks to skip System::Events.

@liamaharon liamaharon added the T1-FRAME This PR/Issue is related to core FRAME, the framework. label Nov 30, 2023
@liamaharon liamaharon self-assigned this Nov 30, 2023
@liamaharon liamaharon added the C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. label Jan 29, 2024
@ggwpez ggwpez added D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. I5-enhancement An additional feature request. labels Feb 4, 2024
@kianenigma kianenigma changed the title Flakey System::Events state decoding skip System::Events in TryDecodeEntireState Feb 5, 2024
@kianenigma kianenigma changed the title skip System::Events in TryDecodeEntireState Skip System::Events in TryDecodeEntireState Feb 5, 2024
@xlc
Copy link
Contributor

xlc commented Feb 23, 2024

can you just modify the try runtime check to call initialize first, which will remove events and bunch other storage or just ignore the events. I just don't get why we need macro magic for this.

@liamaharon
Copy link
Contributor Author

We want this to be a general solution that doesn't make the assumption that the storage has the same characteristics as System::Events. @bkchr pointed out it's possible for downstream users have similar storage, which we can't assume will be cleared out during on_initialize.

@xlc
Copy link
Contributor

xlc commented Feb 23, 2024

Can you name an example? Otherwise I will say it is overengineering to build a solution to solve a problem that does not yet exist.

@bkchr
Copy link
Member

bkchr commented Feb 24, 2024

Given the amount of code written, I would not say it is overengineered. Theoretically you could have values set by inherents that do not require to be migrated because you change them in every block any way. Generally, if the macro that generates the code for checking that all types can be decoded wouldn't be the same that we now extend with this option, I would not have proposed to do it this way.

Generally you could also say that having this entire machinery written in Rust is the wrong way, because you can also just write some piece of javascript that can decode the entire state using the metadata.

@xlc
Copy link
Contributor

xlc commented Feb 25, 2024

Given the amount of code written, I would not say it is overengineered.

It may not be overengineered too much, nonetheless could still be poor code in the repo.

Let's define the problem. We have a tool try-runtime that's trying to decode all the storages and ensure all of them are decodable. Right now, it decodes the stroages at end of a block, using the metadata from next block? And because metadata could change, it have false positive.

Shouldn't the solution to be using the right metadata?

For the case you want to validate a new runtime that is still able to decode all the existing data, just do like what a real runtime does. Call initialize, maybe call inherents, and then decode that data.

Generally you could also say that having this entire machinery written in Rust is the wrong way, because you can also just write some piece of javascript that can decode the entire state using the metadata.

No. I am pointing out a classic mistake of not following separation of concern. The try runtime tool is just another software, not part of FRAME. (It maybe now to some people, but it shouldn't be for many reasons) This is a bug of try runtime tool and we should fix the try runtime tool. Not modify low level FRAME code to add a new attribute that only exists to fix the bug in try runtime tool.

Image I want to add a new attribute just because it allows a tool that Acala developed to achieve something, I am sure you will have a lot of questions about it. Try runtime tool should not be an exception here.

We can do this if we have to, but I don't see reason that we have to do this way.

Another issue of this implementation is that it is only accessible via Rust code directly to my understanding. It is not exposed by metadata or any other way. This means we cannot for example using JS to implement this feature. Tell me this is not tight coupling.

There are just too many downsides. It won't work with old blocks. It requires upgrade of pokladot-sdk to work. It makes alternative try-runtime implementation miserable. It makes FRAME harder to learn (due to one more feature no body going to use but still needs to be documented).

Let's fix the problem properly, please. This is exactly why FRAME is complicated and beginner found it is hard to learn.

@liamaharon
Copy link
Contributor Author

liamaharon commented Feb 25, 2024

The try runtime tool is just another software, not part of FRAME. (It maybe now to some people, but it shouldn't be for many reasons) This is a bug of try runtime tool and we should fix the try runtime tool. Not modify low level FRAME code to add a new attribute that only exists to fix the bug in try runtime tool.

I think there is some misunderstanding here. It is important to understand the distinction between the Substrate try-runtime feature and the try-runtime-cli tool.

The first is an integral part of FRAME and Substrate, the second is an independent tool. They are seperate. The changes in the proposed solution in this PR are for the try-runtime feature (which is part of FRAME), they have nothing to do with the downstream Parity maintained cli.

If you are unhappy with or have suggestions for how to improve the general direction of the try-runtime feature, I suggest you open up dedicated discussion for that.

Image I want to add a new attribute just because it allows a tool that Acala developed to achieve something, I am sure you will have a lot of questions about it. Try runtime tool should not be an exception here.

I don't see why there would be any problem with Acala adding an improvement to try-runtime features or any other features in Substrate?

Another issue of this implementation is that it is only accessible via Rust code directly to my understanding. It is not exposed by metadata or any other way.

This change has nothing to do with which languages can call into the try-runtime runtime APIs. If there is a problem with that, please open a dedicated issue so we can discuss making them easier to access with your language of choice.

It makes alternative try-runtime implementation miserable.

I don't understand what you are getting at here. This fixes a bug for tools which use try-runtime feature gated code for checking state decoding. It doesn't impact any other tools. Please explain what you mean by "alternative try-runtime implementation" and second how this makes it "miserable".

@xlc
Copy link
Contributor

xlc commented Feb 26, 2024

Maybe I misunderstood how #3454 works as we clearly not on a same picture. So let's get back to step one.

Can you describe the solution in details? To the point that I can understand how it works without reading any code.

@ggwpez
Copy link
Member

ggwpez commented Feb 26, 2024

Maybe we can also put it on ParachainSystem::HostConfiguration polkadot-fellows/runtimes#200 ?

@liamaharon
Copy link
Contributor Author

Maybe I misunderstood how #3454 works as we clearly not on a same picture. So let's get back to step one.

Can you describe the solution in details? To the point that I can understand how it works without reading any code.

Runtimes expose try-runtime gated apis allowing developers to check stuff including that state decodes using the latest runtime. This api allows the developer to whitelist particular storage that they don't want to be included in these checks.

@xlc
Copy link
Contributor

xlc commented Feb 27, 2024

Ok. Yeah I did misunderstand the whole thing a bit. However, I still think this is not the best way forward.

As I described in #2108, I would like to see a proper solution to this mess code, rather creating more interdependent unmaintainable code.

The goal is able to use write some code in pallet, feature gated with try-runtime flag, that is able to perform extra assertion of the onchain state.

The current situation: we have try-runtime API that does a lot of things, live in this repo. We have try-runtime-cli that uses the try-runtime API to do the work. We are adding low level code to FRAME to support a particular use case of the try-runtime API.

My suggestions:

The quick solution: avoid special case, just do like a normal runtime does. i.e. invoke on_initialize hook before checking state. This mimic real runtime and enforce the undocumented invariant: the storages are only guaranteed to be decodeable after on_initialize hook. Note this invariant is never specified but it is the code implies. We might as well formalize it rather introduce more hidden assumptions.

The proper solution: Move as much code as possible from polkadot-sdk to somewhere else and in this case, try-runtime-cli. There is zero reason that we need a runtime API that decodes everything. We can just write the logic in try-runtime-cli. This simplifies polkadot-sdk, make it easier to learn and use, and makes development iterator faster as we will be writing nicely decoupled code and try-runtime-cli can do whatever it needs without impacting polkadot-sdk at all. An extra benefit is that it works with production runtime and we can easily use such tool with any production chain without worrying about how to make a try-runtime build.

If you still don't agree with me, at least make the implementation geneic. Allow people to annotate custom metadata. I requested it 5 years ago paritytech/substrate#3686. I got a workaround, but still not as nice as I would like to.

@liamaharon
Copy link
Contributor Author

liamaharon commented Feb 27, 2024

The proper solution: Move as much code as possible from polkadot-sdk to somewhere else and in this case, try-runtime-cli. There is zero reason that we need a runtime API that decodes everything. We can just write the logic in try-runtime-cli. This simplifies polkadot-sdk, make it easier to learn and use, and makes development iterator faster as we will be writing nicely decoupled code and try-runtime-cli can do whatever it needs without impacting polkadot-sdk at all. An extra benefit is that it works with production runtime and we can easily use such tool with any production chain without worrying about how to make a try-runtime build.

I do agree this would be a nicer approach to try-runtime-related features in general: make the runtime apis as simple and composable as general, moving the complexity into the tools using those runtime apis.

make the implementation geneic. Allow people to annotate custom metadata.

I'm not familiar with metadata internals. Do you have ideas of what the api and implementation of this would look like?

The quick solution: avoid special case, just do like a normal runtime does. i.e. invoke on_initialize hook before checking state.

This would not resolve the HostConfiguration case unfortunately.

@xlc
Copy link
Contributor

xlc commented Feb 27, 2024

I'm not familiar with metadata internals. Do you have ideas of what the api and implementation of this would look like?

The API could be something like #[pallet::annotate(frame_support::try_runtime::DisableTryDecodeStorage, true)] and add a BTreeMap field to the metadata here. It will require a major metadata version bump but that shouldn't be a blocker.

This would not resolve the HostConfiguration case unfortunately.

Just also creates inherents? But at the same time I will argue the code handle HostConfiguration is poorly written. For example, that means we cannot read it from on_initialize hook. It is yet another special case people were fine about it and then we need more special cases to deal with it.

Again, I am not happy with many aspects of polkadot-sdk and I want to prevent it from keep adding complexity, which is the wrong direction. We need to stop making it worse and start thinking about how to fix the past mistakes.

@liamaharon
Copy link
Contributor Author

The API could be something like #[pallet::annotate(frame_support::try_runtime::DisableTryDecodeStorage, true)] and add a BTreeMap field to the metadata here. It will require a major metadata version bump but that shouldn't be a blocker.

TBH I think a better solution would be the try-runtime api just allows configuration of which storage items to try to decode to be specified by the user, rather than being hardcoded into the runtime (you have suggested improvements like this in the past). I think we need some new set of try-runtime runtime apis, and deprecate the olds ones.

Just also creates inherents? But at the same time I will argue the code handle HostConfiguration is poorly written. For example, that means we cannot read it from on_initialize hook. It is yet another special case people were fine about it and then we need more special cases to deal with it.

It's not trivial to understand which inherents each runtime needs, so this is not so simple. But I like the suggested general direction. maybe there could be some other try-runtime gated api that handles creating inherents to be included in empty blocks, to make this kind of inherent-dependent functionality easier to implement.

Since this issue is pressing and I don't see a straight forward alternative that's less complex than this 70 line change, I'm going to merge the pr. I do think though that you make valid points about it being preferable for the complexity of deciding what exactly try-runtime gated features should do should be moved to the caller rather than the runtime. I think ideally, we create a new, more flexible set of runtime apis and deprecate the existing ones. TBH I don't think we will have capacity at Parity to work on it in the coming months, but good to get the discussion started.

@xlc
Copy link
Contributor

xlc commented Feb 28, 2024

Given the fix is already done so I won't block it anymore but yeah it will be great if we can revisit the implementation in future and build it properly.

@liamaharon
Copy link
Contributor Author

#3499

github-merge-queue bot pushed a commit that referenced this issue Feb 28, 2024
…it on `System::Events` and `ParachainSystem::HostConfiguration` (#3454)

Closes #2560

Allows marking storage items with `#[disable_try_decode_storage]`, and
uses it with `System::Events`.

Question: what's the recommended way to write a test for this? I
couldn't find a test for similar existing macro `#[whitelist_storage]`.
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
…it on `System::Events` and `ParachainSystem::HostConfiguration` (paritytech#3454)

Closes paritytech#2560

Allows marking storage items with `#[disable_try_decode_storage]`, and
uses it with `System::Events`.

Question: what's the recommended way to write a test for this? I
couldn't find a test for similar existing macro `#[whitelist_storage]`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. I5-enhancement An additional feature request. T1-FRAME This PR/Issue is related to core FRAME, the framework.
Projects
None yet
4 participants