Compactor: Downsample second pass creates duplicate blocks #7488

Closed
tghartland opened this issue Jun 26, 2024 · 2 comments · Fixed by #7492

Comments

@tghartland
Contributor

Thanos, Prometheus and Golang version used:
As reported by this image:

$ podman run -it docker.io/bitnami/thanos:0.35.1-debian-12-r1 --version
thanos, version 0.35.1 (branch: HEAD, revision: 086a698b2195adb6f3463ebbd032e780f39d2050)
  build user:       root@8216e8f68329
  build date:       20240604-21:03:34
  go version:       go1.21.11
  platform:         linux/amd64
  tags:             netgo

Prometheus is version 2.51.2 but shouldn't be relevant.

Object Storage Provider:
NetApp ONTAP S3, using generic S3 config.

What happened:
On the second downsampling pass, the compactor creates duplicates of blocks that were created during the first pass.

First pass:

ts=2024-06-26T05:32:12.111011957Z caller=compact.go:1516 level=info msg="start of compactions"
ts=2024-06-26T05:32:12.111160372Z caller=compact.go:1552 level=info msg="compaction iterations done"
ts=2024-06-26T05:32:12.111203973Z caller=compact.go:447 level=info msg="start first pass of downsampling"
ts=2024-06-26T05:32:12.189283173Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=78.067037ms duration_ms=78 cached=182 returned=6 partial=0
ts=2024-06-26T05:32:14.280048192Z caller=downsample.go:364 level=info msg="downloaded block" id=01J19GEH5AKEHPGAVAQXS2E0TD duration=2.090626326s duration_ms=2090
ts=2024-06-26T05:32:22.239824617Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=61.764617ms duration_ms=61 cached=182 returned=182 partial=0
ts=2024-06-26T05:33:01.793542374Z caller=streamed_block_writer.go:178 level=info msg="finalized downsampled block" mint=1719187200072 maxt=1719360000000 ulid=01J19GFGYACEHGJBP3C87GA0QB resolution=300000
ts=2024-06-26T05:33:01.793625559Z caller=downsample.go:392 level=info msg="downsampled block" from=01J19GEH5AKEHPGAVAQXS2E0TD to=01J19GFGYACEHGJBP3C87GA0QB duration=45.59324678s duration_ms=45593
ts=2024-06-26T05:33:07.30057477Z caller=downsample.go:426 level=info msg="uploaded block" id=01J19GFGYACEHGJBP3C87GA0QB duration=4.06305243s duration_ms=4063

Second pass:

ts=2024-06-26T05:33:07.458500509Z caller=compact.go:479 level=info msg="start second pass of downsampling"
ts=2024-06-26T05:33:07.531635487Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=73.096756ms duration_ms=73 cached=183 returned=7 partial=0
ts=2024-06-26T05:33:09.711916021Z caller=downsample.go:364 level=info msg="downloaded block" id=01J19GEH5AKEHPGAVAQXS2E0TD duration=2.180079153s duration_ms=2180
ts=2024-06-26T05:33:22.258682081Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=80.507454ms duration_ms=80 cached=183 returned=183 partial=0
ts=2024-06-26T05:33:57.435555986Z caller=streamed_block_writer.go:178 level=info msg="finalized downsampled block" mint=1719187200072 maxt=1719360000000 ulid=01J19GH72N2FD744DPMT4FYCAZ resolution=300000
ts=2024-06-26T05:33:57.435619844Z caller=downsample.go:392 level=info msg="downsampled block" from=01J19GEH5AKEHPGAVAQXS2E0TD to=01J19GH72N2FD744DPMT4FYCAZ duration=45.800286131s duration_ms=45800
ts=2024-06-26T05:34:03.291810053Z caller=downsample.go:426 level=info msg="uploaded block" id=01J19GH72N2FD744DPMT4FYCAZ duration=4.398772668s duration_ms=4398
ts=2024-06-26T05:34:03.447003903Z caller=compact.go:499 level=info msg="downsampling iterations done"

The block 01J19GEH5AKEHPGAVAQXS2E0TD is downsampled into two duplicate blocks 01J19GFGYACEHGJBP3C87GA0QB and 01J19GH72N2FD744DPMT4FYCAZ.

By deleting both of those blocks, I can consistently get the compactor to repeat the same process.

On the next GC pass, the duplicate block is marked for deletion:

ts=2024-06-26T05:36:17.094390278Z caller=compact.go:1493 level=info msg="start of GC"
ts=2024-06-26T05:36:17.094458183Z caller=compact.go:202 level=info msg="marking outdated block for deletion" block=01J19GH72N2FD744DPMT4FYCAZ
ts=2024-06-26T05:36:17.103300364Z caller=block.go:203 level=info msg="block has been marked for deletion" block=01J19GH72N2FD744DPMT4FYCAZ

I have tried deleting the duplicate block manually (instead of waiting), and the compactor does not recreate it.

My assumption was that the NetApp S3 storage does not provide strong consistency, and that, because the second metadata sync takes place only 0.2 seconds after the upload of the first-pass block, it did not see the newly uploaded block.
I can't find any NetApp documentation that either supports or refutes that assumption, though.

However, from some testing with a Python script using boto3's list_objects and upload_file functions, the store always returns the new object immediately after it is uploaded:

2024-06-26 15:18:16,252 - [list_objects] Starting with 31 objects in bucket
2024-06-26 15:18:16,261 - [upload_file]  Uploaded 1 object
2024-06-26 15:18:16,266 - [list_objects] Now 32 objects in bucket
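
For reference, an equivalent read-after-write check can also be written in Go with the minio-go client; this is only a sketch, and the endpoint, credentials, bucket and object names below are placeholders rather than the real values:

package main

import (
	"context"
	"log"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// countObjects lists the bucket and returns the number of objects seen.
func countObjects(ctx context.Context, c *minio.Client, bucket string) int {
	n := 0
	for obj := range c.ListObjects(ctx, bucket, minio.ListObjectsOptions{Recursive: true}) {
		if obj.Err != nil {
			log.Fatal(obj.Err)
		}
		n++
	}
	return n
}

func main() {
	ctx := context.Background()

	// Placeholder endpoint and credentials for the store under test.
	c, err := minio.New("s3.example.internal", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	bucket := "thanos-consistency-test" // placeholder bucket name
	log.Printf("[list_objects] starting with %d objects in bucket", countObjects(ctx, c, bucket))

	// Upload one small object, then list again immediately to see whether
	// the store returns it straight away.
	_, err = c.PutObject(ctx, bucket, "consistency-check", strings.NewReader("x"), 1, minio.PutObjectOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("[upload_file]  uploaded 1 object")
	log.Printf("[list_objects] now %d objects in bucket", countObjects(ctx, c, bucket))
}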

To test further, I copied all the data from the bucket over to a completely different object store, using Ceph's object gateway S3 API, and switched the compactor to point to that.
I have another Thanos setup using this same object store on an older v0.31.0; it has been running for around a year and I have never seen this duplication issue with it.

On the second object store I saw exactly the same behaviour. First pass:

ts=2024-06-26T15:32:16.352101857Z caller=compact.go:447 level=info msg="start first pass of downsampling"                                                                                                         
ts=2024-06-26T15:32:16.987346209Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=635.21877ms duration_ms=635 cached=194 returned=7 partial=0
ts=2024-06-26T15:32:20.083100002Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=424.92897ms duration_ms=424 cached=194 returned=194 partial=0
ts=2024-06-26T15:32:26.120016851Z caller=downsample.go:364 level=info msg="downloaded block" id=01J19GEH5AKEHPGAVAQXS2E0TD duration=9.132319206s duration_ms=9132                                                 
ts=2024-06-26T15:33:14.347153837Z caller=streamed_block_writer.go:178 level=info msg="finalized downsampled block" mint=1719187200072 maxt=1719360000000 ulid=01J1AJTGV25ZA71G5N1Z7YNCSC resolution=300000
ts=2024-06-26T15:33:14.347242516Z caller=downsample.go:392 level=info msg="downsampled block" from=01J19GEH5AKEHPGAVAQXS2E0TD to=01J1AJTGV25ZA71G5N1Z7YNCSC duration=46.219717383s duration_ms=46219
ts=2024-06-26T15:33:20.076440956Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=418.574699ms duration_ms=418 cached=194 returned=194 partial=0
ts=2024-06-26T15:33:32.717756553Z caller=downsample.go:426 level=info msg="uploaded block" id=01J1AJTGV25ZA71G5N1Z7YNCSC duration=16.947900538s duration_ms=16947

And the second pass:

ts=2024-06-26T15:33:32.890904564Z caller=compact.go:479 level=info msg="start second pass of downsampling"
ts=2024-06-26T15:33:33.857250449Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=966.301615ms duration_ms=966 cached=195 returned=8 partial=0
ts=2024-06-26T15:33:40.402263657Z caller=downsample.go:364 level=info msg="downloaded block" id=01J19GEH5AKEHPGAVAQXS2E0TD duration=6.544688254s duration_ms=6544
ts=2024-06-26T15:34:20.069405034Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=411.543445ms duration_ms=411 cached=195 returned=195 partial=0
ts=2024-06-26T15:34:30.13755673Z caller=streamed_block_writer.go:178 level=info msg="finalized downsampled block" mint=1719187200072 maxt=1719360000000 ulid=01J1AJWS9MPHRDR4FJ3Z6YH733 resolution=300000
ts=2024-06-26T15:34:30.137618027Z caller=downsample.go:392 level=info msg="downsampled block" from=01J19GEH5AKEHPGAVAQXS2E0TD to=01J1AJWS9MPHRDR4FJ3Z6YH733 duration=47.81596869s duration_ms=47815
ts=2024-06-26T15:34:40.402831795Z caller=downsample.go:426 level=info msg="uploaded block" id=01J1AJWS9MPHRDR4FJ3Z6YH733 duration=8.671214389s duration_ms=8671
ts=2024-06-26T15:34:40.568702257Z caller=compact.go:499 level=info msg="downsampling iterations done"

And marking for deletion:

ts=2024-06-26T15:37:17.015359585Z caller=compact.go:1493 level=info msg="start of GC"
ts=2024-06-26T15:37:17.015443832Z caller=compact.go:202 level=info msg="marking outdated block for deletion" block=01J1AJWS9MPHRDR4FJ3Z6YH733
ts=2024-06-26T15:37:17.073574209Z caller=block.go:203 level=info msg="block has been marked for deletion" block=01J1AJWS9MPHRDR4FJ3Z6YH733

The meta.json files of these two blocks are identical apart from the ULID of the resulting block:
01J1AJWS9MPHRDR4FJ3Z6YH733-meta.json
01J1AJTGV25ZA71G5N1Z7YNCSC-meta.json

At this point I'm fairly confident that it isn't something related to the storage backend, and it must be something introduced since v0.31.0.

What you expected to happen:
Downsampling should not create duplicate blocks.

How to reproduce it (as minimally and precisely as possible):
This affects downsampling of 2-day blocks to 5m resolution on two different storage backends, one of which was previously known to be good.
I haven't tested with a more standard AWS or GCS storage, but I could try that if needed.

Full logs to relevant components:

Anything else we need to know:

@tghartland
Contributor Author

I was looking at downsample.go yesterday, which is why I couldn't find any related diff.

Now that I'm looking in compact.go, there is this commit: 68bef3f.

Before this change, sy.Metas() was called in the downsampleBucket(...) call for both the first and second pass.

The change that added filtering moves that call to before the first pass, filters the list, and then passes filteredMetas to both downsampleBucket calls.

sy.SyncMetas(ctx) is still called before the second pass, but it does not refresh the filteredMetas list, so the same, now-stale list of blocks is passed to both the first- and second-pass downsamplings.

I will test and prepare a fix to regenerate the filtered list after the sync preceding the second pass.
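
To illustrate, here is a rough, self-contained sketch of that control flow. It is not the actual Thanos code: the types and helpers (meta, syncMetas, filterForDownsampling, newULID) are stand-ins for sy.SyncMetas/sy.Metas, the filtering from 68bef3f, and downsampleBucket.

package main

import "fmt"

type meta struct {
	ulid       string
	source     string // ULID of the raw block this was downsampled from ("" for raw blocks)
	resolution int64  // 0 = raw, 300000 = 5m
}

var ulidCounter int

func newULID() string {
	ulidCounter++
	return fmt.Sprintf("ULID-%02d", ulidCounter)
}

// syncMetas stands in for sy.SyncMetas(ctx) followed by sy.Metas(): it returns
// the block metadata currently visible in the bucket, including anything
// uploaded since the previous call.
func syncMetas(bucket map[string]meta) map[string]meta {
	out := make(map[string]meta, len(bucket))
	for k, v := range bucket {
		out[k] = v
	}
	return out
}

// filterForDownsampling stands in for the filtering added in 68bef3f; the
// details of the filter don't matter here, only that its output is computed
// once from a particular sync.
func filterForDownsampling(metas map[string]meta) map[string]meta {
	filtered := make(map[string]meta, len(metas))
	for k, v := range metas {
		filtered[k] = v
	}
	return filtered
}

// downsampleBucket stands in for downsampleBucket(...): every raw block in
// the list it was given that has no 5m counterpart in that same list gets
// downsampled into a new block with a fresh ULID.
func downsampleBucket(bucket, metas map[string]meta) {
	have5m := map[string]bool{}
	for _, m := range metas {
		if m.resolution == 300000 {
			have5m[m.source] = true
		}
	}
	for _, m := range metas {
		if m.resolution != 0 || have5m[m.ulid] {
			continue
		}
		id := newULID()
		bucket[id] = meta{ulid: id, source: m.ulid, resolution: 300000}
		fmt.Printf("downsampled %s -> %s\n", m.ulid, id)
	}
}

func main() {
	bucket := map[string]meta{"RAW": {ulid: "RAW"}}

	filteredMetas := filterForDownsampling(syncMetas(bucket))

	downsampleBucket(bucket, filteredMetas) // first pass: RAW -> ULID-01

	_ = syncMetas(bucket) // sy.SyncMetas(ctx) runs again here, but...
	// ...filteredMetas is not rebuilt from it, so the second pass still sees
	// RAW without its new 5m counterpart and downsamples it a second time:
	downsampleBucket(bucket, filteredMetas) // duplicate: RAW -> ULID-02

	// The fix: regenerate the filtered list from the fresh sync before the
	// second pass, i.e.
	//     filteredMetas = filterForDownsampling(syncMetas(bucket))
	// after which the second downsampleBucket call no longer re-downsamples RAW.
}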

@tghartland
Contributor Author

Regenerating the filtered list works: a single downsampled block is produced, as expected.

First pass:

ts=2024-06-27T14:19:10.888463617Z caller=compact.go:447 level=info msg="start first pass of downsampling"
ts=2024-06-27T14:19:11.080666564Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=192.18201ms duration_ms=192 cached=88 returned=10 partial=0
ts=2024-06-27T14:19:13.953316251Z caller=downsample.go:371 level=info msg="downloaded block" id=01J19GEH5AKEHPGAVAQXS2E0TD duration=2.872277655s duration_ms=2872
ts=2024-06-27T14:19:15.617337241Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=119.378994ms duration_ms=119 cached=88 returned=88 partial=0
ts=2024-06-27T14:20:04.643177789Z caller=streamed_block_writer.go:178 level=info msg="finalized downsampled block" mint=1719187200072 maxt=1719360000000 ulid=01J1D116NT0KF4NJ5NJQ10DHVG resolution=300000
ts=2024-06-27T14:20:04.643249129Z caller=downsample.go:399 level=info msg="downsampled block" from=01J19GEH5AKEHPGAVAQXS2E0TD to=01J1D116NT0KF4NJ5NJQ10DHVG duration=48.619147549s duration_ms=48619
ts=2024-06-27T14:20:10.787352179Z caller=downsample.go:433 level=info msg="uploaded block" id=01J1D116NT0KF4NJ5NJQ10DHVG duration=4.577254287s duration_ms=4577

Second pass:

ts=2024-06-27T14:20:10.984754236Z caller=compact.go:479 level=info msg="start second pass of downsampling"
ts=2024-06-27T14:20:11.195070096Z caller=fetcher.go:626 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=210.268301ms duration_ms=210 cached=89 returned=11 partial=0
ts=2024-06-27T14:20:11.195155859Z caller=compact.go:491 level=info msg="re-filtered list of blocks"
ts=2024-06-27T14:20:11.19546523Z caller=compact.go:508 level=info msg="downsampling iterations done"

No duplicate block is created.

I don't have enough data in this bucket to test whether the second pass would then do the 5m->1h downsampling, but I expect it would.
Without the fix, that would have had to wait until the next downsampling loop.

Preparing an MR.
