Deleted timelines reappear in attach [P1:S1] #3560

Closed
LizardWizzard opened this issue Feb 7, 2023 · 15 comments · Fixed by #3919
Labels
c/storage/pageserver (Component: storage: pageserver), c/storage (Component: storage), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@LizardWizzard
Contributor

LizardWizzard commented Feb 7, 2023

Steps to reproduce

Prior to the error appearing in the logs, I attached these tenants using the /attach call.

Expected result

No errors

Actual result

2023-02-06T16:03:40.650792Z ERROR compaction_loop{tenant_id=X}:compact_timeline{timeline=Y}: could not compact, repartitioning keyspace failed: could not find data for key 000000000000000000000000000000000000 at LSN 0/3AA2681, for request at LSN 0/3AA2680

This happened to two relatively big and active projects. One of them triggered #3532

Environment

prod

Logs, links

UPD: #3560 (comment)

@LizardWizzard added the t/bug (Issue Type: Bug), c/storage/pageserver (Component: storage: pageserver), and c/storage (Component: storage) labels on Feb 7, 2023
@knizhnik
Contributor

I investigated one case reported by @shanyp for https://console.neon.tech/admin/projects/round-voice-411502 project:

2023-02-15T13:48:44.313412Z ERROR compaction_loop{tenant_id=39fa0a33ca08f1948bd01b5f1e4ae898}:compact_timeline{timeline=7177efa8250e462d5ca9330b4878320e}: could not compact, repartitioning keyspace failed: could not find data for key 000000000000000000000000000000000000 at LSN 0/3162381, for request at LSN 0/3162380

We have two timelines, ffe543561c4c40d710e75d5d6f8693e5 and 7177efa8250e462d5ca9330b4878320e. 7177 was forked from ffe5 at LSN 0/3162380 and contains no changes (either nothing was done in this timeline, or it was removed?). The parent timeline ffe5 doesn't contain any layer covering the LSN range prior to 0/3162380.
The only layers containing key 000..000 are image layers whose LSNs are much larger than 0/3162380:

-rw-r--r-- 1 pageserver pageserver 118800384 Feb  9 12:11 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AD5DA5B8
-rw-r--r-- 1 pageserver pageserver 118816768 Feb  9 12:11 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AE2C11E8
-rw-r--r-- 1 pageserver pageserver 119406592 Feb 14 12:12 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AFC5C608
-rw-r--r-- 1 pageserver pageserver 120373248 Feb 15 13:40 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000B0E89D08
-rw-r--r-- 1 pageserver pageserver 122650624 Feb  9 12:15 000000000000000000000000000000000000-000000067F00004002000060DB0300000000__000000009E94AD30
-rw-r--r-- 1 pageserver pageserver     24576 Feb 16 15:12 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000B1E5A879-00000000B1E5C1C9
-rw-r--r-- 1 pageserver pageserver     24576 Feb 16 15:22 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000B1E5C1C9-00000000B1E5DFB1
-rw-r--r-- 1 pageserver pageserver 101048320 Feb  9 12:15 000000067F000032AC000040040000000000-030000000000000000000000000000000002__000000009E481691-00000000A3D29089

Looks like a bug in GC, but I failed to locate pageserver logs for the dates before today.
Also, most of the messages produced by GC are at Debug level, so they are not dumped.
It is only possible to check the number of layers removed by GC.

There are two timelines; one branch is empty and is branched from main at 0/3162380,
but for some reason there are no layers in the main timeline covering the LSN range preceding 0/3162380.
Looks like a GC problem.
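
For reference, the layer file names in the listing above encode a key range plus either a single LSN (image layers: KEYSTART-KEYEND__LSN) or an LSN range (delta layers: KEYSTART-KEYEND__LSNSTART-LSNEND). Below is a rough decoder written only to illustrate why none of the listed layers can serve a read at the branch point 0/3162380 (0x3162380 as a 64-bit value); it is a simplified sketch, not the pageserver's actual layer-name parser.

```rust
#[derive(Debug)]
enum LayerName {
    Image { key_range: (String, String), lsn: u64 },
    Delta { key_range: (String, String), lsn_range: (u64, u64) },
}

fn parse_layer_name(name: &str) -> Option<LayerName> {
    let (keys, lsns) = name.split_once("__")?;
    let (key_start, key_end) = keys.split_once('-')?;
    let key_range = (key_start.to_string(), key_end.to_string());
    Some(match lsns.split_once('-') {
        // Delta layers carry an LSN range ("START-END"), image layers a single LSN.
        Some((start, end)) => LayerName::Delta {
            key_range,
            lsn_range: (
                u64::from_str_radix(start, 16).ok()?,
                u64::from_str_radix(end, 16).ok()?,
            ),
        },
        None => LayerName::Image {
            key_range,
            lsn: u64::from_str_radix(lsns, 16).ok()?,
        },
    })
}

fn main() {
    // The branch point 0/3162380 as a single 64-bit value.
    let branch_lsn: u64 = 0x0000_0000_0316_2380;
    let name = "000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AD5DA5B8";
    match parse_layer_name(name) {
        Some(LayerName::Image { lsn, .. }) => {
            // Reconstructing a page at the branch point needs layers at or below it;
            // this image layer sits far above, so it cannot serve that read.
            println!("image layer LSN {lsn:#x} <= branch LSN {branch_lsn:#x}: {}", lsn <= branch_lsn);
        }
        other => println!("{other:?}"),
    }
}
```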

@knizhnik
Contributor

Everything interesting happened with this project on 9 February: the branch was created on that date, and the latest image layer also corresponds to it. Unfortunately, we do not have pageserver logs older than 11 February,
so I could not check which operations were performed and how it is possible that GC or something else removed the old layers.

@knizhnik
Contributor

There are no remote layers, so the problem is not caused by evicting layers to S3.

@LizardWizzard
Contributor Author

Timeline 7177efa8250e462d5ca9330b4878320e is marked as deleted in the console. Not sure how that happened; there is definitely a bug somewhere, maybe in the deletion logic.

@knizhnik
Contributor

That could somehow explain the problem.
I would rather call it a "resurrected timeline": the timeline is deleted, GC normally deletes layers beyond the PITR boundary, and then the timeline is somehow resurrected. It could have been downloaded from S3, but that doesn't look like the case: there are no remote layers for this timeline in S3.

Most likely, after a pageserver restart (normal or abnormal), we find the timeline metadata file on disk and that causes the pageserver to recreate the timeline. Or maybe some information about the timeline was received from the console.

In any case, the problem seems to be related not to GC but to a pageserver restart or the console-pageserver interaction. I am not so familiar with this stuff, so maybe somebody else can look at this issue?

@LizardWizzard
Contributor Author

I have a hypothesis on that: I think attach can resurrect timelines because we don't delete the S3 data. I'll validate it and post here.
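
To spell out the hypothesis, here is a highly simplified sketch of the suspected attach flow; every name in it (list_timeline_prefixes, load_index_part, and so on) is a hypothetical stand-in for illustration, not the real pageserver code.

```rust
type TenantId = &'static str;
type TimelineId = String;

struct IndexPart; // stand-in for the contents of index_part.json

fn list_timeline_prefixes(_tenant: TenantId) -> Vec<TimelineId> {
    // In reality: an S3 listing of the tenant prefix. The deleted timeline's prefix
    // (and its index_part.json) is still there, because deletion never touched S3.
    vec!["7177efa8250e462d5ca9330b4878320e".to_string()]
}

fn load_index_part(_tenant: TenantId, _timeline: &TimelineId) -> IndexPart {
    IndexPart
}

fn attach_tenant(tenant: TenantId) {
    for timeline in list_timeline_prefixes(tenant) {
        let _index_part = load_index_part(tenant, &timeline);
        // Nothing records that this timeline was deleted, so attach happily
        // recreates it and it becomes Active again.
        println!("resurrecting timeline {timeline}");
    }
}

fn main() {
    attach_tenant("39fa0a33ca08f1948bd01b5f1e4ae898");
}
```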

@LizardWizzard
Contributor Author

Yes, the hypothesis is correct. Thanks for investigating @knizhnik!

I wrote a failing test to showcase the problem: https://github.com/neondatabase/neon/compare/dkr/timeline-resurrection-on-attach?expand=1#diff-652bdcd9c6677e36bf9518135f852095f086dbd57f3f02453c8e06562fb9d19eR727

Here there will be two timelines instead of one.

Thinking about a solution: I think we need to start deleting data from S3 for real. There are some edge cases, so we cannot just enumerate all files in index_part.json and delete them. That's because the deletion needs to complete even in the presence of pageserver restarts (and even in the event of pageserver loss). I think the most appealing option is to use control plane request retries to drive this when needed (a rough sketch of that flow follows below). Other options:

  • Have a local tombstone - only solves the problem of pageserver restarts; won't help with pageserver loss
  • Have some tombstone flag in index_part.json - needs some process that will check the flag and continue the deletion

Opinions? @neondatabase/storage
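
For the control-plane-retry option mentioned above, here is a minimal sketch of the intended flow; the endpoint behavior (202 while deletion is in progress, 404 once everything is gone) and the function names are assumptions for illustration, not an existing API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::{thread, time::Duration};

// Stand-in for an HTTP DELETE against the pageserver: it would mark the timeline as
// deleted, start removing its S3 objects in the background, answer 202 while the
// deletion is still in progress, and 404 once nothing is left. The counter just
// simulates the deletion finishing after a couple of polls.
static CALLS: AtomicUsize = AtomicUsize::new(0);

fn delete_timeline_on_pageserver(_tenant: &str, _timeline: &str) -> u16 {
    if CALLS.fetch_add(1, Ordering::Relaxed) < 2 { 202 } else { 404 }
}

fn control_plane_delete(tenant: &str, timeline: &str) {
    // The control plane owns the retry loop, so a pageserver restart, or even losing
    // the pageserver entirely, cannot strand a half-deleted timeline: the next retry
    // (possibly against a different pageserver) simply resumes the deletion.
    while delete_timeline_on_pageserver(tenant, timeline) != 404 {
        thread::sleep(Duration::from_millis(100));
    }
}

fn main() {
    control_plane_delete(
        "39fa0a33ca08f1948bd01b5f1e4ae898",
        "7177efa8250e462d5ca9330b4878320e",
    );
    println!("timeline fully deleted");
}
```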

@LizardWizzard changed the title from "could not compact, repartitioning keyspace failed: could not find data for key 000000000000000000000000000000000000" to "Deleted timelines reappear in attach" on Feb 20, 2023
@shanyp
Contributor

shanyp commented Feb 21, 2023

There are some edge cases, so we cannot just enumerate all files in index_part.json and delete them.
Is there a batch-delete API in S3 that is all-or-nothing that we can leverage here?

@SomeoneToIgnore
Contributor

I think we should not try to actually delete the files on S3 from the pageserver side:

  • if PS nodes won't have any S3 deletion and common file editing code (not the case now, though), we'll be able to safely start multiple "virtual" pageservers on the same S3 bucket

  • AFAIK, S3 removals per se can take a while if you have many files. We will have more layers the more data the users have, so the "blocking" S3 timeline removal will work worse over time

  • the code to handle the restart + finish delete cases could be big
    If we ever want to change the deletion process, say, instead of S3 deletion, start moving the files into some cold storage with auto-removal, it could lead to big changes in the pageserver.
    I'd prefer to make these in a separate component, to separate the concerns.

@LizardWizzard
Contributor Author

if PS nodes won't have any S3 deletion and common file editing code (not the case now, though), we'll be able to safely start multiple "virtual" pageservers on the same S3 bucket

Can you elaborate? The only thing preventing us from starting multiple pageservers on the same bucket is the mutable index_part.json.

AFAIK, S3 removals per se can take a while if you have many files. We will have more layers the more data the users have, so the "blocking" S3 timeline removal will work worse over time

Yeah, I'm thinking about a schedule + poll type of thing for the delete.

instead of S3 deletion, start moving the files into some cold storage with auto-removal, it could lead to big changes in the pageserver

I don't think so; it looks like it should be easy to implement on top of detach. And such a move, if done manually, will be quite time-consuming for big projects.

the code to handle the restart + finish delete cases could be big

It's the same with attach; I believe common parts can be shared.


I understand the concerns, but you haven't proposed a solution for the original problem.

@LizardWizzard
Contributor Author

Is there a batch-delete API in S3 that is all-or-nothing that we can leverage here?

AFAIR it accepts up to 1k object IDs; with big projects (tens of thousands of layers) it can still take a significant amount of time.
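
As a rough illustration of the batching (the DeleteObjects limit of 1000 keys per request is documented by S3; `delete_batch` here is a hypothetical stand-in for the real SDK call):

```rust
const MAX_KEYS_PER_DELETE: usize = 1000;

fn delete_batch(keys: &[String]) -> Result<(), String> {
    // In the real implementation this would be one S3 DeleteObjects request.
    println!("deleting {} objects", keys.len());
    Ok(())
}

fn delete_all(keys: &[String]) -> Result<(), String> {
    // Tens of thousands of layer files still mean dozens of requests, each with its
    // own latency and failure modes, hence the "significant amount of time".
    for chunk in keys.chunks(MAX_KEYS_PER_DELETE) {
        delete_batch(chunk)?;
    }
    Ok(())
}

fn main() -> Result<(), String> {
    // e.g. 25_000 layer files -> 25 DeleteObjects requests
    let keys: Vec<String> = (0..25_000).map(|i| format!("tenant/timeline/layer-{i}")).collect();
    delete_all(&keys)
}
```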

@SomeoneToIgnore
Contributor

SomeoneToIgnore commented Feb 23, 2023

Can you elaborate?

Yes, the mutable index_part.json is the biggest culprit.
Technically, there are also compaction and GC S3 removals happening, which would be good to remove for the sake of purism, but that doesn't seem to be a big problem now.
The point is, if none of this exists in the PS, and nothing else changes S3, it's safe to start arbitrary "virtual" pageservers on certain timelines in parallel and check the hypothesis.

Another way could be some sync-over-broker between such PS nodes.

And such a move, if done manually, will be quite time-consuming for big projects.

Indeed, hence I'm also not sure it's really worthwhile to remove the data from within the pageserver. Both deletions and "moves" (which are copy + deletion anyway) are super slow, and what's the point of spending pageserver resources on that?
I'd have some slow, standalone process handle the final GC the way we want.

I understand the concerns, but you haven't proposed a solution for the original problem.

I think I did: do not add the actual deletion (and whatever related slow processes) inside the pageserver, add it elsewhere.

@LizardWizzard changed the title from "Deleted timelines reappear in attach" to "Deleted timelines reappear in attach [P1:S1]" on Feb 27, 2023
@LizardWizzard
Contributor Author

I think I did: do not add the actual deletion (and whatever related slow processes) inside the pageserver, add it elsewhere.

It doesn't solve the problem of deleted timelines reappearing during attach.

and what's the point of spending pageserver resources on that

Exactly because this operation is slow and we're async, it seems that not that many resources will actually be used on our side; it's only AWS doing the job and us waiting for it to complete. It may make sense if we can move the whole GC process off the pageserver, and it's possible to do so for compaction as well. But we need to investigate how this will impact S3 costs: if we constantly download and upload stuff to compact it and let other pageservers download the result, it may be more expensive than doing this locally.

@LizardWizzard
Contributor Author

LizardWizzard commented Feb 27, 2023

I see the benefit of everything living inside the pageserver: it's simpler to treat the component as a black box that provides an API that can be safely used from outside.

It doesn't solve the problem of deleted timelines reappearing during attach.

If your answer to that is "the control plane should take care of that", then we add a startup dependency on the control plane for storage. It may not be a bad thing, but it will be there. And if in the future the control plane starts to store its data in Neon itself, we're in a chicken-and-egg problem. So IMO it is better to keep storage as independent as possible.

@shanyp
Contributor

shanyp commented Mar 23, 2023

Implement the delete RFC.
Remove the actual leftover timelines.

LizardWizzard added a commit that referenced this issue Apr 21, 2023
Resolves #3560
If the flag is set, the timeline will be ignored in attach and during initial loading.

This is the first step in #3889
problame added a commit that referenced this issue May 10, 2023
…ne resurrection (#3919)

Before this patch, the following sequence would lead to the resurrection of a deleted timeline:

- create timeline
- wait for its index part to reach s3
- delete timeline
- wait an arbitrary amount of time, including 0 seconds
- detach tenant
- attach tenant
- the timeline is there and Active again

This happens because we only kept track of the deletion in the tenant dir (by deleting the timeline dir) but not in S3.

The solution is to turn the deleted timeline's IndexPart into a tombstone.
The deletion status of the timeline is expressed in the `deleted_at: Option<NaiveDateTime>` field of IndexPart.
It's `None` while the timeline is alive and `Some(deletion timestamp)` if it is deleted.

We change the timeline deletion handler to upload this tombstoned IndexPart.
The handler does not return success if the upload fails.
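
A trimmed-down sketch of the tombstone described above, assuming serde, serde_json, and chrono with its serde feature; this is not the actual IndexPart definition, just the shape of the `deleted_at` field and how attach could use it:

```rust
use chrono::{NaiveDateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct IndexPart {
    // ...the real struct also carries the layer file listing, metadata, etc...
    /// `None` while the timeline is alive, `Some(deletion timestamp)` once deleted.
    #[serde(default)]
    deleted_at: Option<NaiveDateTime>,
}

impl IndexPart {
    fn is_deleted(&self) -> bool {
        self.deleted_at.is_some()
    }
}

fn main() {
    // The deletion handler would set the tombstone and upload the new index_part.json,
    // reporting success only if that upload succeeds; attach then skips any timeline
    // whose index_part says is_deleted().
    let mut index_part = IndexPart { deleted_at: None };
    index_part.deleted_at = Some(Utc::now().naive_utc());
    println!("{}", serde_json::to_string_pretty(&index_part).unwrap());
    assert!(index_part.is_deleted());
}
```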

Coincidentally, this fixes the long-standing TODO about `std::fs::remove_dir_all` not being atomic.
It need not be atomic anymore because we set `deleted_at = Some(...)` before starting the `remove_dir_all`.

The tombstone is in the IndexPart only, not in the `metadata`.
So, we only have the tombstone and the `remove_dir_all` benefits mentioned above if remote storage is configured.
This was a conscious trade-off because there's no good format evolution story for the current metadata file format.

The introduction of this additional step into `delete_timeline` was painful because delete_timeline needs to be
1. cancel-safe
2. idempotent
3. safe to call concurrently
These are mostly self-inflicted limitations that can be avoided by using request-coalescing.
PR #4159 will do that.

fixes #3560

refs #3889 (part of tenant relocation)


Co-authored-by: Joonas Koivunen <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>