Validators entering the active set are slow on validation because PVF artifacts are not compiled #4324

Closed
alexggh opened this issue Apr 29, 2024 · 5 comments · Fixed by #4791
Assignees
Labels
I2-bug The node fails to follow expected behavior.

Comments

@alexggh
Contributor

alexggh commented Apr 29, 2024

PVF artifacts are cleaned up after 24h if unused:

artifact_ttl: Duration::from_secs(3600 * 24),

So, when nodes join the active set for the first time, or after they have been out of the active set for 24 hours, they won't have the PVF artifacts ready for approving or backing blocks, and they will have to compile each PVF the first time they need to execute a block from that parachain.
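
To make the expiry behaviour concrete, here is a minimal sketch of time-based pruning. It is not the node's actual code: the `Artifacts` map and the `prune_unused` function are hypothetical names, and only the 24h value comes from the `artifact_ttl` setting quoted above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Same value as the `artifact_ttl` quoted above.
const ARTIFACT_TTL: Duration = Duration::from_secs(3600 * 24);

/// Hypothetical stand-in for the artifact table: artifact id -> last time it was used.
type Artifacts = HashMap<String, Instant>;

/// Removes every artifact that has been unused for at least `ttl`
/// and returns the removed ids in sorted order.
fn prune_unused(artifacts: &mut Artifacts, now: Instant, ttl: Duration) -> Vec<String> {
    let mut removed: Vec<String> = artifacts
        .iter()
        .filter(|(_, last_used)| now.saturating_duration_since(**last_used) >= ttl)
        .map(|(id, _)| id.clone())
        .collect();
    removed.sort();
    for id in &removed {
        artifacts.remove(id);
    }
    removed
}
```

Under this model, a validator that has been out of the active set for more than a day ends up with an empty table, so every PVF has to be compiled again on first use.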

Each PVF compilation takes around 3 seconds or more, and compiling all Polkadot PVFs takes around 3 minutes, so the validator will cause no-shows in approval-voting and will probably miss some backing points when the PVF of the parachain is not compiled yet.

This is a transient problem, since all the PVFs should be compiled after around 3-4 minutes. However, it could probably be fixed relatively easily by proactively compiling the PVFs before the node becomes active.

@alexggh alexggh added the I2-bug The node fails to follow expected behavior. label Apr 29, 2024
@sandreim
Contributor

Good idea, nodes should already be able to tell that they are about to become active by checking the next session keys.
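
As a rough illustration of that check (a sketch only, with a hypothetical `AuthorityId` type and function name rather than the node's real key types or runtime API), the node compares the keys it holds against the authority set of the next session:

```rust
/// Hypothetical stand-in for a validator's session public key.
type AuthorityId = [u8; 32];

/// Returns true if any locally held key appears in the next session's authority set.
fn is_next_session_authority(
    local_keys: &[AuthorityId],
    next_authorities: &[AuthorityId],
) -> bool {
    local_keys.iter().any(|key| next_authorities.contains(key))
}
```

How the next session's authorities are fetched is left out here; the sketch assumes they are already available as a slice.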

@s0me0ne-unkn0wn
Contributor

I had another idea lost somewhere in discussions: instead of invalidating the cache after 24h, make it size-bounded and only remove the stalest artifact if the cache size overflows. Not 100% sure but sounds like it's somewhat easier to implement than the lookahead compilation.

@alexggh
Contributor Author

alexggh commented May 29, 2024

I had another idea lost somewhere in discussions: instead of invalidating the cache after 24h, make it size-bounded and only remove the stalest artifact if the cache size overflows

That would work for nodes that enter and leave the active set; however, it won't work when the node is a fresh node that has just joined the active set.

@s0me0ne-unkn0wn
Contributor

That would work for nodes that enter and leave the active set; however, it won't work when the node is a fresh node that has just joined the active set.

Fair enough, but that's just a single no-show. Running a validator from scratch is not something that happens very often, I believe.
Maybe we could implement both? Mine would save them some CPU for the price of some storage, and yours would address all the remaining corner cases.

@alexggh
Contributor Author

alexggh commented May 29, 2024

Maybe we could implement both?

Yes, implementing both makes sense.

@AndreiEres AndreiEres self-assigned this May 29, 2024
@AndreiEres AndreiEres moved this from Backlog to In Progress in parachains team board May 29, 2024
github-merge-queue bot pushed a commit that referenced this issue Jun 6, 2024
Part of #4324
We don't change but extend the existing cleanup strategy.
- We still don't touch artifacts that have been stale for less than 24h
- We attempt pruning only once we hit the cache limit (10 GB)
- If, after we hit 10 GB, the least used artifact has been stale for
less than 24h, we don't remove it.
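
The three rules above can be sketched as a selection function. This is an illustration under stated assumptions, not the merged implementation: the `Artifact` struct and `select_for_pruning` are hypothetical, and removing artifacts until the cache fits the limit again is an assumption the commit message does not spell out.

```rust
use std::time::Duration;

/// Hypothetical, simplified view of one cached artifact.
#[derive(Debug, Clone, PartialEq)]
struct Artifact {
    id: String,
    size_bytes: u64,
    /// Time elapsed since the artifact was last used.
    idle: Duration,
}

/// Picks artifacts to delete. Nothing is selected while the cache fits within
/// `max_bytes`. Above the limit, the longest-idle artifacts go first, but an
/// artifact idle for less than `min_idle` is never selected.
fn select_for_pruning(artifacts: &[Artifact], max_bytes: u64, min_idle: Duration) -> Vec<String> {
    let mut total: u64 = artifacts.iter().map(|a| a.size_bytes).sum();
    let mut candidates: Vec<&Artifact> =
        artifacts.iter().filter(|a| a.idle >= min_idle).collect();
    // Longest idle first.
    candidates.sort_by(|a, b| b.idle.cmp(&a.idle));

    let mut to_remove = Vec::new();
    for artifact in candidates {
        if total <= max_bytes {
            break;
        }
        total -= artifact.size_bytes;
        to_remove.push(artifact.id.clone());
    }
    to_remove
}
```

With a 10 GB limit and a 24h minimum idle time, a cache holding only recently used artifacts can stay above the limit, which matches the last rule.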

---------

Co-authored-by: s0me0ne-unkn0wn <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
@AndreiEres AndreiEres moved this from In Progress to Review in progress in parachains team board Jun 18, 2024
github-merge-queue bot pushed a commit that referenced this issue Jul 22, 2024
Closes #4324
- On every active leaf, the candidate-validation subsystem checks whether
the node is an authority in the next session.
- If it is, it fetches backed candidates and prepares unknown PVFs.
- We limit the number of PVFs per block so as not to overload the subsystem.
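
A minimal sketch of the per-leaf selection described above, assuming a hypothetical `CodeHash` type and `pvfs_to_prepare` function (the real subsystem's types, messages, and limit value are not shown in this thread):

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a PVF code hash.
type CodeHash = u32;

/// Picks which PVFs to compile ahead of time for one active leaf.
/// Returns nothing unless the node is an authority in the next session;
/// otherwise returns the not-yet-prepared PVFs of the backed candidates,
/// deduplicated and capped at `max_per_block`.
fn pvfs_to_prepare(
    is_next_session_authority: bool,
    backed_candidate_pvfs: &[CodeHash],
    already_prepared: &HashSet<CodeHash>,
    max_per_block: usize,
) -> Vec<CodeHash> {
    if !is_next_session_authority {
        return Vec::new();
    }
    let mut seen = HashSet::new();
    backed_candidate_pvfs
        .iter()
        .copied()
        .filter(|hash| !already_prepared.contains(hash) && seen.insert(*hash))
        .take(max_per_block)
        .collect()
}
```

In this sketch, PVFs that exceed the cap are simply not returned for this leaf; if they show up again among the backed candidates of a later leaf, they would be picked then.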
@github-project-automation github-project-automation bot moved this from Review in progress to Completed in parachains team board Jul 22, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
AndreiEres added a commit that referenced this issue Aug 5, 2024
Projects
Status: Completed
4 participants