
Cleanup PVF artifact by cache limit and stale time #4662

Merged: 20 commits, Jun 6, 2024

Conversation

@AndreiEres (Contributor) commented May 31, 2024

Part of #4324
We don't change the existing cleanup strategy, we extend it.

  • We still don't touch artifacts that have been stale for less than 24h.
  • The first time we attempt pruning is when we hit the cache limit (10 GB).
  • If we hit the 10 GB limit but the least used artifact has been stale for less than 24h, we don't remove it.

A rough sketch of how these rules combine is given below.
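The sketch below is only illustrative: `CleanupConfig`, `Entry` and `select_for_cleanup` are assumed names, not the PR's actual API. It shows how the cache limit and the 24h minimum stale time interact, under the assumption that entries are evicted in least-recently-used order.

```rust
use std::time::{Duration, SystemTime};

/// Illustrative config mirroring the description above (names are assumptions).
struct CleanupConfig {
    /// e.g. 10 GB
    cache_limit: u64,
    /// e.g. 24 hours
    min_stale_time: Duration,
}

struct Entry {
    size: u64,
    last_time_needed: SystemTime,
}

/// Returns the indices of entries to prune. Nothing is touched while the total size is
/// within the limit, and even over the limit only artifacts that have been stale for
/// longer than `min_stale_time` are evicted, least recently used first.
fn select_for_cleanup(entries: &[Entry], config: &CleanupConfig, now: SystemTime) -> Vec<usize> {
    let mut total: u64 = entries.iter().map(|e| e.size).sum();
    let mut order: Vec<usize> = (0..entries.len()).collect();
    // Least recently used entries come first.
    order.sort_by_key(|&i| entries[i].last_time_needed);

    let mut to_remove = Vec::new();
    for i in order {
        if total <= config.cache_limit {
            break;
        }
        let used_recently = now
            .duration_since(entries[i].last_time_needed)
            .map(|stale_time| stale_time < config.min_stale_time)
            .unwrap_or(true); // clock went backwards: treat as recently used
        if used_recently {
            // Everything after this entry is even fresher, so stop here.
            break;
        }
        to_remove.push(i);
        total -= entries[i].size;
    }
    to_remove
}
```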

@AndreiEres changed the title from "[WIP] Add ability to remove PVF by size" to "[WIP] Add ability to remove PVF artifact by size" on May 31, 2024
```rust
pub enum CleanupBy {
	// Inactive time after which artefact is deleted
	Time(Duration),
	// Max size in bytes. Reaching it older artefacts are deleted
```
Contributor:
The comment seems wrong: we delete the least used ones. At least to me, "older" usually means in relation to creation time.

Contributor Author:
Fixed


```rust
for (k, v) in self.inner.iter() {
	if let ArtifactState::Prepared { ref path, last_time_needed, .. } = *v {
		if let Ok(metadata) = fs::metadata(path) {
```
Contributor:
What bothers me here is that we're running a (possibly large) number of synchronous blocking filesystem operations in Artifacts::prune(), which is called from the PVF host's main loop running on a non-blocking threadpool. In case of I/O problems the whole PVF host will stall. I think we should either make use of tokio::fs::metadata(), or, even better, store the artifact's size as a property of the artifact itself along with the other data, and then no filesystem access is required inside prune().

Contributor Author:
Yeah, added the size to the artifact's state.
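For illustration, storing the size in the prepared state could look roughly like the sketch below. The real `ArtifactState` in polkadot-sdk carries more fields and variants, so treat the exact shape here as an assumption.

```rust
use std::path::PathBuf;
use std::time::SystemTime;

// Sketch only: the real enum carries more data (preparation stats, failure states, ...).
enum ArtifactState {
    Prepared {
        /// Path of the artifact on disk.
        path: PathBuf,
        /// Last time the artifact was needed; used for LRU ordering.
        last_time_needed: SystemTime,
        /// Size on disk, recorded at preparation time so that pruning never
        /// has to call `fs::metadata` from the PVF host's main loop.
        size: u64,
    },
    /// Other states elided in this sketch.
    Preparing,
}
```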

Comment on lines 296 to 302
```rust
artifact_sizes.sort_by_key(|&(_, _, _, last_time_needed)| last_time_needed);

while total_size > *size_limit {
	let Some((artifact_id, path, size, _)) = artifact_sizes.pop() else { break };
	to_remove.push((artifact_id, path));
	total_size -= size;
}
```
Contributor:
A unit test to check the correctness of this behavior would be definitely appreciated :)

Contributor Author:
Added (and it saved me because the implementation was not correct)
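For reference, a hedged sketch of the kind of test meant here. The helper below is a simplified stand-in for the pruning selection, not the actual function or test that landed in the PR.

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for the pruning selection: given (id, size, last_time_needed)
/// entries and a size limit, return the ids to evict, least recently used first.
fn select_for_pruning(mut entries: Vec<(u32, u64, SystemTime)>, size_limit: u64) -> Vec<u32> {
    let mut total: u64 = entries.iter().map(|e| e.1).sum();
    // Least recently used entries come first after this sort.
    entries.sort_by_key(|&(_, _, last_time_needed)| last_time_needed);
    let mut evicted = Vec::new();
    let mut iter = entries.into_iter();
    while total > size_limit {
        let Some((id, size, _)) = iter.next() else { break };
        evicted.push(id);
        total -= size;
    }
    evicted
}

#[test]
fn evicts_least_recently_used_until_under_limit() {
    let now = SystemTime::now();
    let entries = vec![
        (1, 4, now - Duration::from_secs(3_000)), // least recently used
        (2, 4, now - Duration::from_secs(2_000)),
        (3, 4, now - Duration::from_secs(1_000)), // most recently used
    ];
    // With a limit of 5 bytes, evicting artifacts 1 and 2 brings the total from 12 down to 4.
    assert_eq!(select_for_pruning(entries, 5), vec![1, 2]);
}
```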

@AndreiEres changed the title from "[WIP] Add ability to remove PVF artifact by size" to "Add ability to remove PVF artifact by size" on Jun 3, 2024
@alexggh self-requested a review on Jun 3, 2024 at 12:23
@AndreiEres changed the title from "Add ability to remove PVF artifact by size" to "Cleanup PVF artifact by cache limit and stale time" on Jun 5, 2024
@AndreiEres added the T0-node, I9-optimisation and T8-polkadot labels on Jun 5, 2024
@AndreiEres requested a review from sandreim on Jun 5, 2024 at 11:35
```rust
if now
	.duration_since(last_time_needed)
	.map(|stale_time| stale_time < cleanup_config.min_stale_time)
	.unwrap_or(false)
```
Contributor:
> Returns an Err if earlier is later than self, and the error contains how far from self the time is.

In case of error we should return true to break here; we don't want to delete things if time moves backwards for whatever reason.

Contributor Author:
good catch!
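As context, `SystemTime::duration_since` returns `Err` when the clock has gone backwards, so defaulting that case to "used recently" keeps the artifact. A condensed sketch of the check (mirroring the fix shown further down; wrapping it in a free function is only for illustration):

```rust
use std::time::{Duration, SystemTime};

// If `now` is earlier than `last_time_needed` (clock moved backwards),
// `duration_since` returns Err; `unwrap_or(true)` then treats the artifact
// as recently used, so it is never deleted in that situation.
fn used_recently(now: SystemTime, last_time_needed: SystemTime, min_stale_time: Duration) -> bool {
    now.duration_since(last_time_needed)
        .map(|stale_time| stale_time < min_stale_time)
        .unwrap_or(true)
}
```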

@alexggh (Contributor) left a comment:
Looks alright to me.

@paritytech-cicd-pr commented:
The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6412873

@s0me0ne-unkn0wn (Contributor) left a comment:
Nice, thank you!

polkadot/node/core/pvf/src/artifacts.rs (outdated comment, resolved)
Comment on lines +315 to +321
```rust
let used_recently = now
	.duration_since(last_time_needed)
	.map(|stale_time| stale_time < cleanup_config.min_stale_time)
	.unwrap_or(true);
if used_recently {
	break;
}
```
Contributor:
Hmmm 🤔 So, the cache may grow unbounded as long as all the artifacts are fresh? I mean, this mimics the current behavior, so no new attack vectors are introduced here for sure, but probably we should review the old ones... I don't call for any change here right now, it's just for discussion.

Contributor Author:
> the cache may grow unbounded as long as all the artifacts are fresh

I believe it grows no more than it does now, because we use the same 24h limit.

Contributor:
Yes, exactly. I'm just wondering if there's some scenario where someone buys some cheap coretime leftovers and starts pushing tons of PVFs around within 24 hours to overflow validators' disk space. Sounds unlikely, just paranoia, probably.

polkadot/node/core/pvf/src/artifacts.rs (outdated comments, resolved)
Comment on lines 292 to 293
```rust
/// Remove artifacts older than the given TTL or the total artifacts size limit and return id
/// and path of the removed ones.
```
Contributor:
This doc comment needs fixing: we evict LRU artifacts only if we go over cache limit.
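One possible wording for the corrected doc comment, based on the behavior described in this PR (a suggestion only, not necessarily the text that landed):

```rust
/// Remove the least recently used artifacts once the total cache size exceeds the limit,
/// skipping any artifact that was needed within the minimum stale time, and return the
/// id and path of the removed ones.
```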

@AndreiEres enabled auto-merge on Jun 6, 2024 at 18:59
@AndreiEres added this pull request to the merge queue on Jun 6, 2024
Merged via the queue into master with commit 494448b on Jun 6, 2024
156 of 157 checks passed
@AndreiEres deleted the AndreiEres/remove-pvf-by-size branch on Jun 6, 2024 at 19:49
ordian added a commit that referenced this pull request Jun 13, 2024
* master: (29 commits)
  Append overlay optimization. (#1223)
  finalization: Skip tree route calculation if no forks present (#4721)
  Remove unncessary call remove_from_peers_set (#4742)
  add pov-recovery unit tests and support for elastic scaling (#4733)
  approval-voting: Add no shows debug information (#4726)
  Revamp the Readme of the parachain template (#4713)
  Update README.md to move the PSVM link under a "Tooling" section under the "Releases" section (#4734)
  frame/proc-macro: Refactor code for better readability (#4712)
  Contracts:  update wasmi to 0.32 (#3679)
  Backport style changes from P<>K bridge to R<>W bridge (#4732)
  New reference doc for Custom RPC V2 (#4654)
  Frame Pallets: Clean a lot of test setups (#4642)
  Fix occupied core handling (#4691)
  statement-distribution: Fix false warning (#4727)
  Update the README to include a link to the Polkadot SDK Version Manager (#4718)
  Cleanup PVF artifact by cache limit and stale time (#4662)
  Update link to a latest polkadot release (#4711)
  [CI] Delete cargo-deny config (#4677)
  fix build on MacOS: bump secp256k1 and secp256k1-sys to patched versions (#4709)
  Unify dependency aliases (#4633)
  ...
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
Part of paritytech#4324
We don't change the existing cleanup strategy, we extend it.
- We still don't touch artifacts that have been stale for less than 24h
- The first time we attempt pruning is when we hit the cache limit (10 GB)
- If we hit the 10 GB limit but the least used artifact has been stale for less than 24h, we don't remove it

---------

Co-authored-by: s0me0ne-unkn0wn <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>