Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pageserver): ensure upload happens after delete #9844

Merged
merged 6 commits into from
Nov 22, 2024
Merged

Conversation

skyzh
Copy link
Member

@skyzh skyzh commented Nov 21, 2024

Problem

Follow up of #9682, that patch didn't fully address the problem: what if shutdown fails due to whatever reason and then we reattach the tenant? Then we will still remove the future layer. The underlying problem is that the fix for #5878 gets voided because of the generation optimizations.

Of course, we also need to ensure that delete happens after uploads, but note that we only schedule deletes when there are no ongoing upload tasks, so that's fine.

Summary of changes

  • Add a test case to reproduce the behavior (by changing the original test case to attach the same generation).
  • If layer upload happens after the deletion, drain the deletion queue before uploading.
    • If blocked_deletion is enabled, directly remove it from the blocked_deletion queue.
  • Local fs backend fix to avoid race between deletion and preload.
  • test_emergency_mode does not need to wait for uploads (and it's generally not possible to wait for uploads).
  • Optimize deletion executor to skip validation if there are no files to delete. this doesn't work

Copy link

github-actions bot commented Nov 21, 2024

5544 tests run: 5327 passed, 0 failed, 217 skipped (full report)


Flaky tests (2)

Postgres 17

Postgres 15

Code coverage* (full report)

  • functions: 31.4% (7962 of 25349 functions)
  • lines: 49.3% (63161 of 128148 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2a7d002 at 2024-11-22T18:38:22.198Z :recycle:

Signed-off-by: Alex Chi Z <[email protected]>
@skyzh
Copy link
Member Author

skyzh commented Nov 22, 2024

I have no idea why 9e7fa98 will cause a lot of test failures but after reverting everything seems so good.

Signed-off-by: Alex Chi Z <[email protected]>
@skyzh skyzh requested a review from jcsp November 22, 2024 17:23
@jcsp
Copy link
Collaborator

jcsp commented Nov 22, 2024

I see how this PR addresses the case of deleting+recreating the same image layer within the lifetime of a Timeline, but what about the case where some earlier shutdown went wrong and we start a new Timeline while some deletions are still in the global deletion queue? Don't we still need a change similar to #9796 to make it flush the deletion queue on start?

@skyzh
Copy link
Member Author

skyzh commented Nov 22, 2024

I see how this PR addresses the case of deleting+recreating the same image layer within the lifetime of a Timeline, but what about the case where some earlier shutdown went wrong and we start a new Timeline while some deletions are still in the global deletion queue? Don't we still need a change similar to #9796 to make it flush the deletion queue on start?

Yes; in theory, this could happen, but it has not been observed yet, and I assume it is rarer to happen. wWe can fix it in another patch. The problem I would like to address is the race between future layer deletion and the upload of the same layer. Both operations happen after the tenant reattach.

@skyzh
Copy link
Member Author

skyzh commented Nov 22, 2024

... and if I'm going to fix it, I will populate the recently_deleted hashset with the initial deletion list.

@jcsp
Copy link
Collaborator

jcsp commented Nov 22, 2024

problem I would like to address is the race between future layer deletion and the upload of the same layer. Both operations happen after the tenant reattach.

Makes sense

@skyzh skyzh enabled auto-merge November 22, 2024 18:28
@skyzh skyzh added this pull request to the merge queue Nov 22, 2024
Merged via the queue into main with commit c1937d0 Nov 22, 2024
74 checks passed
@skyzh skyzh deleted the skyzh/barrier-try-2 branch November 22, 2024 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants