Deleted timelines reappear in attach [P1:S1] #3560

Closed
LizardWizzard opened this issue Feb 7, 2023 · 15 comments · Fixed by #3919
Labels
c/storage/pageserver (Component: storage: pageserver), c/storage (Component: storage), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@LizardWizzard
Contributor

LizardWizzard commented Feb 7, 2023

Steps to reproduce

Prior to the error appearing in the logs, I attached these tenants using the /attach call.

Expected result

No errors

Actual result

2023-02-06T16:03:40.650792Z ERROR compaction_loop{tenant_id=X}:compact_timeline{timeline=Y}: could not compact, repartitioning keyspace failed: could not find data for key 000000000000000000000000000000000000 at LSN 0/3AA2681, for request at LSN 0/3AA2680

This happened to two relatively big and active projects. One of them triggered #3532

Environment

prod

Logs, links

UPD: #3560 (comment)

@LizardWizzard added the t/bug (Issue Type: Bug), c/storage/pageserver (Component: storage: pageserver), and c/storage (Component: storage) labels on Feb 7, 2023
@knizhnik
Contributor

I investigated one case reported by @shanyp for https://console.neon.tech/admin/projects/round-voice-411502 project:

2023-02-15T13:48:44.313412Z ERROR compaction_loop{tenant_id=39fa0a33ca08f1948bd01b5f1e4ae898}:compact_timeline{timeline=7177efa8250e462d5ca9330b4878320e}: could not compact, repartitioning keyspace failed: could not find data for key 000000000000000000000000000000000000 at LSN 0/3162381, for request at LSN 0/3162380

We have two timelines, ffe543561c4c40d710e75d5d6f8693e5 and 7177efa8250e462d5ca9330b4878320e. 7177 was forked from ffe5 at LSN 0/3162380 and contains no changes (either nothing was done in this timeline, or it was removed?). The parent timeline ffe5 doesn't contain any layer covering the LSN range prior to 0/3162380.
The only layers containing key 000..000 are image layers whose LSNs are much larger than 0/3162380:

-rw-r--r-- 1 pageserver pageserver 118800384 Feb  9 12:11 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AD5DA5B8
-rw-r--r-- 1 pageserver pageserver 118816768 Feb  9 12:11 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AE2C11E8
-rw-r--r-- 1 pageserver pageserver 119406592 Feb 14 12:12 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AFC5C608
-rw-r--r-- 1 pageserver pageserver 120373248 Feb 15 13:40 000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000B0E89D08
-rw-r--r-- 1 pageserver pageserver 122650624 Feb  9 12:15 000000000000000000000000000000000000-000000067F00004002000060DB0300000000__000000009E94AD30
-rw-r--r-- 1 pageserver pageserver     24576 Feb 16 15:12 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000B1E5A879-00000000B1E5C1C9
-rw-r--r-- 1 pageserver pageserver     24576 Feb 16 15:22 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000B1E5C1C9-00000000B1E5DFB1
-rw-r--r-- 1 pageserver pageserver 101048320 Feb  9 12:15 000000067F000032AC000040040000000000-030000000000000000000000000000000002__000000009E481691-00000000A3D29089

Looks like a bug in GC, but I failed to locate pageserver logs for the dates before today.
Also, most of the messages produced by GC are at Debug level, so they are not dumped.
It is only possible to check the number of layers removed by GC.

There are two timelines; one branch is empty and is branched from main at 0/3162380,
but for some reason there are no layers in the main timeline covering the LSN range preceding 0/3162380.
Looks like a GC problem.
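
For reference, the layer file names in the listing above encode a key range plus either a single LSN (image layers: KEYSTART-KEYEND__LSN) or an LSN range (delta layers: KEYSTART-KEYEND__LSNSTART-LSNEND). Below is a rough decoder written only to illustrate why none of the listed layers can serve a read at the branch point 0/3162380 (0x3162380 as a 64-bit value); it is a simplified sketch, not the pageserver's actual layer-name parser.

```rust
#[derive(Debug)]
enum LayerName {
    Image { key_range: (String, String), lsn: u64 },
    Delta { key_range: (String, String), lsn_range: (u64, u64) },
}

fn parse_layer_name(name: &str) -> Option<LayerName> {
    let (keys, lsns) = name.split_once("__")?;
    let (key_start, key_end) = keys.split_once('-')?;
    let key_range = (key_start.to_string(), key_end.to_string());
    Some(match lsns.split_once('-') {
        // Delta layers carry an LSN range ("START-END"), image layers a single LSN.
        Some((start, end)) => LayerName::Delta {
            key_range,
            lsn_range: (
                u64::from_str_radix(start, 16).ok()?,
                u64::from_str_radix(end, 16).ok()?,
            ),
        },
        None => LayerName::Image {
            key_range,
            lsn: u64::from_str_radix(lsns, 16).ok()?,
        },
    })
}

fn main() {
    // The branch point 0/3162380 as a single 64-bit value.
    let branch_lsn: u64 = 0x0000_0000_0316_2380;
    let name = "000000000000000000000000000000000000-000000067F00004002000060C20200000000__00000000AD5DA5B8";
    match parse_layer_name(name) {
        Some(LayerName::Image { lsn, .. }) => {
            // Reconstructing a page at the branch point needs layers at or below it;
            // this image layer sits far above, so it cannot serve that read.
            println!("image layer LSN {lsn:#x} <= branch LSN {branch_lsn:#x}: {}", lsn <= branch_lsn);
        }
        other => println!("{other:?}"),
    }
}
```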

@knizhnik
Contributor

Everything interesting happened with this project on 9 February: the branch was created on that date, and the latest image layer also corresponds to it. Unfortunately, we do not have pageserver logs older than 11 February,
so I could not check which operations were performed and how it is possible that GC or something else removed the old layers.

@knizhnik
Contributor

There are no remote layers, so the problem is not caused by evicting layers to S3.

@LizardWizzard
Contributor Author

Timeline 7177efa8250e462d5ca9330b4878320e is marked as deleted in the console. Not sure how that happened; there is definitely a bug somewhere, maybe in the deletion logic.

@knizhnik
Contributor

That could somehow explain the problem.
I would rather call it a "resurrected timeline": the timeline is deleted, GC normally deletes layers beyond the PITR boundary, and then the timeline is somehow resurrected. It could have been downloaded from S3, but that doesn't look like the case: there are no remote layers for this timeline in S3.

Most likely, after a pageserver restart (normal or abnormal), we find the timeline metadata file on disk and that causes the pageserver to recreate the timeline. Or maybe some information about the timeline was received from the console.

In any case, the problem seems to be related not to GC but to a pageserver restart or the console-pageserver interaction. I am not so familiar with this stuff, so maybe somebody else can look at this issue?

@LizardWizzard
Contributor Author

I have a hypothesis on that: I think attach can resurrect timelines because we don't delete the S3 data. I'll validate it and post here.
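
To spell out the hypothesis, here is a highly simplified sketch of the suspected attach flow; every name in it (list_timeline_prefixes, load_index_part, and so on) is a hypothetical stand-in for illustration, not the real pageserver code.

```rust
type TenantId = &'static str;
type TimelineId = String;

struct IndexPart; // stand-in for the contents of index_part.json

fn list_timeline_prefixes(_tenant: TenantId) -> Vec<TimelineId> {
    // In reality: an S3 listing of the tenant prefix. The deleted timeline's prefix
    // (and its index_part.json) is still there, because deletion never touched S3.
    vec!["7177efa8250e462d5ca9330b4878320e".to_string()]
}

fn load_index_part(_tenant: TenantId, _timeline: &TimelineId) -> IndexPart {
    IndexPart
}

fn attach_tenant(tenant: TenantId) {
    for timeline in list_timeline_prefixes(tenant) {
        let _index_part = load_index_part(tenant, &timeline);
        // Nothing records that this timeline was deleted, so attach happily
        // recreates it and it becomes Active again.
        println!("resurrecting timeline {timeline}");
    }
}

fn main() {
    attach_tenant("39fa0a33ca08f1948bd01b5f1e4ae898");
}
```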

@LizardWizzard
Contributor Author

Yes, the hypothesis is correct. Thanks for investigating @knizhnik!

I wrote a failing test to showcase the problem: https://github.com/neondatabase/neon/compare/dkr/timeline-resurrection-on-attach?expand=1#diff-652bdcd9c6677e36bf9518135f852095f086dbd57f3f02453c8e06562fb9d19eR727

Here there will be two timelines instead of one.

Thinking about a solution: I think we need to start deleting data from S3 for real. There are some edge cases, so we cannot just enumerate all files in index_part.json and delete them. That's because the deletion needs to complete even in the presence of pageserver restarts (and even in the event of pageserver loss). I think the most appealing option is to use control plane request retries to drive this when needed (a rough sketch of that flow follows below). Other options:

  • Have a local tombstone - only solves the problem of pageserver restarts; won't help with pageserver loss
  • Have some tombstone flag in index_part.json - needs some process that will check the flag and continue the deletion

Opinions? @neondatabase/storage
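
For the control-plane-retry option mentioned above, here is a minimal sketch of the intended flow; the endpoint behavior (202 while deletion is in progress, 404 once everything is gone) and the function names are assumptions for illustration, not an existing API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::{thread, time::Duration};

// Stand-in for an HTTP DELETE against the pageserver: it would mark the timeline as
// deleted, start removing its S3 objects in the background, answer 202 while the
// deletion is still in progress, and 404 once nothing is left. The counter just
// simulates the deletion finishing after a couple of polls.
static CALLS: AtomicUsize = AtomicUsize::new(0);

fn delete_timeline_on_pageserver(_tenant: &str, _timeline: &str) -> u16 {
    if CALLS.fetch_add(1, Ordering::Relaxed) < 2 { 202 } else { 404 }
}

fn control_plane_delete(tenant: &str, timeline: &str) {
    // The control plane owns the retry loop, so a pageserver restart, or even losing
    // the pageserver entirely, cannot strand a half-deleted timeline: the next retry
    // (possibly against a different pageserver) simply resumes the deletion.
    while delete_timeline_on_pageserver(tenant, timeline) != 404 {
        thread::sleep(Duration::from_millis(100));
    }
}

fn main() {
    control_plane_delete(
        "39fa0a33ca08f1948bd01b5f1e4ae898",
        "7177efa8250e462d5ca9330b4878320e",
    );
    println!("timeline fully deleted");
}
```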

@LizardWizzard changed the title from "could not compact, repartitioning keyspace failed: could not find data for key 000000000000000000000000000000000000" to "Deleted timelines reappear in attach" on Feb 20, 2023
@shanyp
Contributor

shanyp commented Feb 21, 2023

There are some edge cases, so we cannot just enumerate all files in index_part.json and delete them.
Is there a batch-delete API in S3 that is all-or-nothing that we can leverage here?

@SomeoneToIgnore
Contributor

I think we should not try to actually delete the files on S3 from the pageserver side:

  • if PS nodes won't have any S3 deletion and common file editing code (not the case now, though), we'll be able to safely start multiple "virtual" pageservers on the same S3 bucket

  • AFAIK, S3 removals per se can take a while if you have many files. We will have more layers the more data the users have, so the "blocking" S3 timeline removal will work worse over time

  • the code to handle the restart + finish delete cases could be big
    If we ever want to change the deletion process, say, instead of S3 deletion, start moving the files into some cold storage with auto-removal, it could lead to big changes in the pageserver.
    I'd prefer to make these in a separate component, to separate the concerns.

@LizardWizzard
Contributor Author

if PS nodes won't have any S3 deletion and common file editing code (not the case now, though), we'll be able to safely start multiple "virtual" pageservers on the same S3 bucket

Can you elaborate? The only thing preventing us from starting multiple pageservers on the same bucket is the mutable index_part.json.

AFAIK, S3 removals per se can take a while if you have many files. We will have more layers the more data the users have, so the "blocking" S3 timeline removal will work worse over time

Yeah, I'm thinking about a schedule + poll type of thing for the delete.

instead of S3 deletion, start moving the files into some cold storage with auto-removal, it could lead to big changes in the pageserver

I don't think so; it looks like it should be easy to implement on top of detach. And such a move, if done manually, will be quite time-consuming for big projects.

the code to handle the restart + finish delete cases could be big

It's the same with attach; I believe common parts can be shared.


I understand the concerns, but you haven't proposed a solution for the original problem.

@LizardWizzard
Contributor Author

Is there a batch-delete API in S3 that is all-or-nothing that we can leverage here?

AFAIR it accepts up to 1k object IDs; with big projects (tens of thousands of layers) it can still take a significant amount of time.
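
As a rough illustration of the batching (the DeleteObjects limit of 1000 keys per request is documented by S3; `delete_batch` here is a hypothetical stand-in for the real SDK call):

```rust
const MAX_KEYS_PER_DELETE: usize = 1000;

fn delete_batch(keys: &[String]) -> Result<(), String> {
    // In the real implementation this would be one S3 DeleteObjects request.
    println!("deleting {} objects", keys.len());
    Ok(())
}

fn delete_all(keys: &[String]) -> Result<(), String> {
    // Tens of thousands of layer files still mean dozens of requests, each with its
    // own latency and failure modes, hence the "significant amount of time".
    for chunk in keys.chunks(MAX_KEYS_PER_DELETE) {
        delete_batch(chunk)?;
    }
    Ok(())
}

fn main() -> Result<(), String> {
    // e.g. 25_000 layer files -> 25 DeleteObjects requests
    let keys: Vec<String> = (0..25_000).map(|i| format!("tenant/timeline/layer-{i}")).collect();
    delete_all(&keys)
}
```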

@SomeoneToIgnore
Contributor

SomeoneToIgnore commented Feb 23, 2023

Can you elaborate?

Yes, the mutable index_part.json is the biggest culprit.
Technically, there are also compaction and GC S3 removals happening, which would be good to remove for the sake of purism, but that doesn't seem to be a big problem now.
The point is, if none of this exists in the PS, and nothing else changes S3, it's safe to start arbitrary "virtual" pageservers on certain timelines in parallel and check the hypothesis.

Another way could be some sync-over-broker between such PS nodes.

And such a move, if done manually, will be quite time-consuming for big projects.

Indeed, hence I'm also not sure it's really worthwhile to remove the data from within the pageserver. Both deletions and "moves" (which are copy + deletion anyway) are super slow, and what's the point of spending pageserver resources on that?
I'd have some slow, standalone process handle the final GC the way we want.

I understand the concerns, but you haven't proposed a solution for the original problem.

I think I did: do not add the actual deletion (and whatever related slow processes) inside the pageserver, add it elsewhere.

@LizardWizzard changed the title from "Deleted timelines reappear in attach" to "Deleted timelines reappear in attach [P1:S1]" on Feb 27, 2023
@LizardWizzard
Contributor Author

I think I did: do not add the actual deletion (and whatever related slow processes) inside the pageserver, add it elsewhere.

It doesn't solve the problem of deleted timelines reappearing during attach.

and what's the point of spending pageserver resources on that

Exactly because this operation is slow and we're async, it seems that not that many resources will actually be used on our side; it's only AWS doing the job and us waiting for it to complete. It may make sense if we can move the whole GC process off the pageserver, and it's possible to do so for compaction as well. But we need to investigate how this will impact S3 costs: if we constantly download and upload stuff to compact it and let other pageservers download the result, it may be more expensive than doing this locally.

@LizardWizzard
Contributor Author

LizardWizzard commented Feb 27, 2023

I see the benefit of everything living inside the pageserver: it's simpler to treat the component as a black box that provides an API that can be safely used from outside.

It doesn't solve the problem of deleted timelines reappearing during attach.

If your answer to that is "the control plane should take care of that", then we add a startup dependency on the control plane for storage. It may not be a bad thing, but it will be there. And if in the future the control plane starts to store its data in Neon itself, we're in a chicken-and-egg problem. So IMO it is better to keep storage as independent as possible.

@shanyp
Contributor

shanyp commented Mar 23, 2023

Implement the delete RFC.
Remove the actual leftover timelines.

LizardWizzard added a commit that referenced this issue Apr 21, 2023
Resolves #3560
If the flag is set, the timeline will be ignored in attach and during initial loading.

This is the first step in #3889
problame added a commit that referenced this issue May 10, 2023
…ne resurrection (#3919)

Before this patch, the following sequence would lead to the resurrection of a deleted timeline:

- create timeline
- wait for its index part to reach s3
- delete timeline
- wait an arbitrary amount of time, including 0 seconds
- detach tenant
- attach tenant
- the timeline is there and Active again

This happens because we only kept track of the deletion in the tenant dir (by deleting the timeline dir) but not in S3.

The solution is to turn the deleted timeline's IndexPart into a tombstone.
The deletion status of the timeline is expressed in the `deleted_at: Option<NaiveDateTime>` field of IndexPart.
It's `None` while the timeline is alive and `Some(deletion timestamp)` if it is deleted.

We change the timeline deletion handler to upload this tombstoned IndexPart.
The handler does not return success if the upload fails.
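
A trimmed-down sketch of the tombstone described above, assuming serde, serde_json, and chrono with its serde feature; this is not the actual IndexPart definition, just the shape of the `deleted_at` field and how attach could use it:

```rust
use chrono::{NaiveDateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct IndexPart {
    // ...the real struct also carries the layer file listing, metadata, etc...
    /// `None` while the timeline is alive, `Some(deletion timestamp)` once deleted.
    #[serde(default)]
    deleted_at: Option<NaiveDateTime>,
}

impl IndexPart {
    fn is_deleted(&self) -> bool {
        self.deleted_at.is_some()
    }
}

fn main() {
    // The deletion handler would set the tombstone and upload the new index_part.json,
    // reporting success only if that upload succeeds; attach then skips any timeline
    // whose index_part says is_deleted().
    let mut index_part = IndexPart { deleted_at: None };
    index_part.deleted_at = Some(Utc::now().naive_utc());
    println!("{}", serde_json::to_string_pretty(&index_part).unwrap());
    assert!(index_part.is_deleted());
}
```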

Coincidentally, this fixes the long-standing TODO about `std::fs::remove_dir_all` not being atomic.
It need not be atomic anymore because we set `deleted_at = Some(...)` before starting the `remove_dir_all`.

The tombstone is in the IndexPart only, not in the `metadata`.
So, we only have the tombstone and the `remove_dir_all` benefits mentioned above if remote storage is configured.
This was a conscious trade-off because there's no good format evolution story for the current metadata file format.

The introduction of this additional step into `delete_timeline` was painful because delete_timeline needs to be
1. cancel-safe
2. idempotent
3. safe to call concurrently
These are mostly self-inflicted limitations that can be avoided by using request-coalescing.
PR #4159 will do that.

fixes #3560

refs #3889 (part of tenant relocation)


Co-authored-by: Joonas Koivunen <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>