Deleted timelines reappear in attach [P1:S1] #3560
I investigated one case reported by @shanyp for the https://console.neon.tech/admin/projects/round-voice-411502 project:
We have two timelines, ffe543561c4c40d710e75d5d6f8693e5 and 7177efa8250e462d5ca9330b4878320e. 7177 was forked from ffe5 at LSN=0/3162380 and contains no changes (nothing was done in this timeline, or was it removed?). The parent timeline ffe5 doesn't contain any layer covering the LSN range prior to 0/3162380.
Looks like a bug in GC, but I failed to locate pageserver logs for the dates before today. There are two timelines; one branch is empty and is branched from main at 0/3162380.
Everything interesting happened with this project on February 9: on that date the branch was created, and the latest image layer also corresponds to that date. Unfortunately, we do not have pageserver logs older than February 11.
There are no remote layers, so the problem is not caused by eviction of layers to S3.
It can somehow explain the problem. Most likely, after a pageserver restart (normal or abnormal), we find the timeline metadata file on disk, which causes the pageserver to recreate the timeline. Or maybe some information about the timeline was received from the console. In any case, the problem seems to be related not to GC but to pageserver restart or console-pageserver interaction. I am not so familiar with this stuff, so maybe somebody else can look at this issue?
I have a hypothesis on that: I think attach can resurrect timelines because we don't delete S3 data. I'll validate it and post here.
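To make the suspected mechanism concrete, here is a minimal sketch of the failure mode, assuming attach simply recreates a timeline for every prefix still present in the bucket (all names here are illustrative, not the actual pageserver code):

```rust
// Illustrative sketch only: names and listing logic are invented to show
// the failure mode, they do not match the real pageserver code.
use std::collections::BTreeSet;

/// Pretend S3 listing: timeline ids that still have objects under
/// `tenants/<tenant_id>/timelines/`. Deleting a timeline only removed its
/// local directory, so its objects are still in the bucket.
fn list_timeline_prefixes_in_s3(_tenant_id: &str) -> BTreeSet<String> {
    BTreeSet::from([
        "ffe543561c4c40d710e75d5d6f8693e5".to_string(), // main, alive
        "7177efa8250e462d5ca9330b4878320e".to_string(), // deleted locally!
    ])
}

/// Attach recreates a timeline for every prefix it finds. Nothing in the
/// bucket records that the second timeline was deleted, so it comes back.
fn attach(tenant_id: &str) -> Vec<String> {
    list_timeline_prefixes_in_s3(tenant_id).into_iter().collect()
}

fn main() {
    assert_eq!(attach("round-voice-411502").len(), 2); // expected 1
}
```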
Yes, the hypothesis is correct. Thanks for investigating, @knizhnik! I wrote a failing test to showcase the problem: https://github.com/neondatabase/neon/compare/dkr/timeline-resurrection-on-attach?expand=1#diff-652bdcd9c6677e36bf9518135f852095f086dbd57f3f02453c8e06562fb9d19eR727 There, two timelines appear instead of one. Thinking about a solution: I think we need to start deleting data from S3 for real. There are some edge cases, so we cannot just enumerate all files in index_part.json and delete them. That's because deletion needs to be completed even in the presence of pageserver restarts (and even in the event of pageserver loss). I think the most appealing option is to use control plane request retries to drive this when needed. Other options:
Opinions? @neondatabase/storage
I think we should not try to delete the files on S3 from the pageserver side for real:
Can you elaborate? The only thing preventing us from starting multiple pageservers on the same bucket is the mutable index_part.json.
Yeah, I'm thinking about a schedule + poll type of thing for deletes.
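A rough shape of that idea (an entirely hypothetical API, not existing pageserver endpoints): the schedule call only records the intent and returns, a background task does the slow S3 work, and the control plane polls the status, safely retrying the schedule call after restarts because it is idempotent:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical schedule + poll deletion API; not the actual pageserver code.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DeletionStatus {
    Scheduled, // intent recorded, background task not finished yet
    Done,      // all S3 objects gone; control plane can stop polling
}

#[derive(Default, Clone)]
struct DeletionQueue {
    statuses: Arc<Mutex<HashMap<String, DeletionStatus>>>,
}

impl DeletionQueue {
    /// Idempotent: calling it again for the same timeline is a no-op, so
    /// the control plane can blindly retry after pageserver restarts.
    fn schedule(&self, timeline_id: &str) {
        self.statuses
            .lock()
            .unwrap()
            .entry(timeline_id.to_string())
            .or_insert(DeletionStatus::Scheduled);
        // a background task would pick this up and delete the S3 objects,
        // flipping the status to Done when the bucket prefix is empty
    }

    /// Polled by the control plane until it returns `Done`.
    fn status(&self, timeline_id: &str) -> Option<DeletionStatus> {
        self.statuses.lock().unwrap().get(timeline_id).copied()
    }
}
```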
I don't think so; it looks like it should be easy to implement on top of detach. And such a move, if done manually, will be quite time-consuming in the case of big projects.
It's the same with attach; I believe common parts can be shared. I understand the concerns, but you haven't proposed a solution for the original problem.
AFAIR it accepts up to 1k object IDs; with big projects (tens of thousands of layers) it can still take a significant amount of time.
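For a sense of scale, S3's bulk delete (DeleteObjects) takes at most 1,000 keys per request, so tens of thousands of layers mean dozens of sequential round trips. A back-of-the-envelope sketch, where `delete_batch` is a stand-in for the real S3 client call:

```rust
/// S3 DeleteObjects accepts at most 1,000 keys per request.
const MAX_KEYS_PER_DELETE: usize = 1_000;

/// Stand-in for the real S3 client call; assumed, not an actual SDK API.
fn delete_batch(keys: &[String]) -> Result<(), String> {
    println!("deleting {} objects", keys.len());
    Ok(())
}

fn delete_all_layers(layer_keys: Vec<String>) -> Result<(), String> {
    // 50,000 layers => 50 sequential DeleteObjects requests; at tens of
    // milliseconds each, that is seconds of work, and it must survive
    // pageserver restarts part-way through.
    for chunk in layer_keys.chunks(MAX_KEYS_PER_DELETE) {
        delete_batch(chunk)?;
    }
    Ok(())
}
```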
Yes, the mutable index_part.json is the biggest culprit. Another way could be some sync-over-broker between such PS nodes.
Indeed, hence I'm also not sure whether it's really worth it to remove the data from within the pageserver. Both deletions and "moves" (which are copy + deletion) are super slow, and what's the point of spending pageserver resources on that?
I think I did: do not add the actual deletion (and whatever related slow processes) inside the pageserver; add it elsewhere.
It doesn't solve the problem of deleted timelines reappearing during attach.
Exactly because this operation is slow and we're async, it seems that not many resources will actually be used on our side: it's only AWS doing the job, and we're waiting for it to complete. It may make sense if we can move the whole GC process off of the pageserver, and it's possible to do so for compaction as well. But we need to investigate how this will impact S3 costs. If we constantly download and upload stuff to compact it, and let other pageservers download the result, it may be more expensive than doing this locally.
I see the benefit of everything living inside the pageserver: it's simpler to treat the component as a black box that provides an API that can be safely used from outside.
If your answer to that is "control plane should take care of that", then we add a startup dependency on the control plane for storage. That may not be a bad thing, but it will be there. And if, in the future, the control plane starts to store its data in Neon itself, we're in a chicken-and-egg problem. So IMO it is better to keep storage as independent as possible.
Implement the delete RFC
…ne resurrection (#3919)

Before this patch, the following sequence would lead to the resurrection of a deleted timeline:

- create timeline
- wait for its index part to reach S3
- delete timeline
- wait an arbitrary amount of time, including 0 seconds
- detach tenant
- attach tenant
- the timeline is there and Active again

This happens because we only kept track of the deletion in the tenant dir (by deleting the timeline dir) but not in S3.

The solution is to turn the deleted timeline's IndexPart into a tombstone. The deletion status of the timeline is expressed in the `deleted_at: Option<NaiveDateTime>` field of IndexPart. It's `None` while the timeline is alive and `Some(deletion timestamp)` if it is deleted. We change the timeline deletion handler to upload this tombstoned IndexPart. The handler does not return success if the upload fails.

Coincidentally, this fixes the long-standing TODO about `std::fs::remove_dir_all` being non-atomic. It need not be atomic anymore, because we set `deleted_at = Some(...)` before starting the `remove_dir_all`.

The tombstone is in the IndexPart only, not in the `metadata` file. So, we only get the tombstone and the `remove_dir_all` benefits mentioned above if remote storage is configured. This was a conscious trade-off, because there is no good format-evolution story for the current metadata file format.

The introduction of this additional step into `delete_timeline` was painful, because delete_timeline needs to be:

1. cancel-safe
2. idempotent
3. safe to call concurrently

These are mostly self-inflicted limitations that can be avoided by using request coalescing. PR #4159 will do that.

fixes #3560
refs #3889 (part of tenant relocation)

Co-authored-by: Joonas Koivunen <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
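A condensed sketch of the ordering the commit message describes, with the types and the upload call as simplified stand-ins for the real pageserver code (assumes the chrono crate):

```rust
// Condensed sketch of the tombstone-before-removal ordering; the struct
// and the `upload_index_part` parameter are simplified stand-ins, not the
// actual pageserver API.
use chrono::NaiveDateTime;
use std::path::Path;

struct IndexPart {
    // ... layer file listing, timeline metadata, etc. ...
    /// `None` while the timeline is alive; `Some(deletion timestamp)`
    /// once deleted. Attach must skip timelines carrying this tombstone.
    deleted_at: Option<NaiveDateTime>,
}

fn delete_timeline(
    index_part: &mut IndexPart,
    upload_index_part: impl Fn(&IndexPart) -> std::io::Result<()>,
    local_timeline_dir: &Path,
) -> std::io::Result<()> {
    // 1. Tombstone first, and make it durable in S3. If the upload fails,
    //    the deletion handler reports failure and the caller retries.
    index_part.deleted_at = Some(chrono::Utc::now().naive_utc());
    upload_index_part(index_part)?;

    // 2. Only then remove the local directory. Because the tombstone is
    //    already durable, `remove_dir_all` no longer needs to be atomic:
    //    a crash mid-removal is resolved on the next restart or attach.
    std::fs::remove_dir_all(local_timeline_dir)
}
```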
Steps to reproduce
Prior to the errors appearing in the logs, I attached these tenants using the /attach call.
Expected result
No errors
Actual result
This happened to two relatively big and active projects. One of them triggered #3532
Environment
prod
Logs, links
UPD: #3560 (comment)