Design doc for Upload/AssetBlob garbage collection #2068

Open
wants to merge 3 commits into base: master


mvandenburgh
Member

This document proposes a design for garbage collection of Upload and AssetBlob records. A prerequisite for enabling this was the "trailing delete" feature, which has now landed in production (see doc).

A design document for garbage collection of Asset records will follow this one.

@yarikoptic @satra PTAL and comment here if you have any objections or general thoughts about the proposed process.

mvandenburgh force-pushed the asset-blob-upload-gc-doc branch 2 times, most recently from 98572f7 to bd9740f, on November 5, 2024 16:09
@@ -0,0 +1,43 @@
# Upload and Asset Blob Garbage Collection

This document presents a design for garbage collection of uploads and asset blobs in the context of the S3 trailing delete feature. It explains the need for garbage collection and describes the scenarios of orphaned uploads and orphaned asset blobs. The implementation involves introducing a new daily task to query and delete uploads and asset blobs that meet certain criteria. The document also mentions the recoverability of uploaded data and provides a GitHub branch for the implementation.
Member

Please add the GitHub branch to this doc.

Member

kabilar left a comment

Thank you, @mvandenburgh. I am new to this garbage collection effort, but this design doc looks good from a high-level perspective.

Member

kabilar commented Nov 18, 2024

Hi @yarikoptic, when you have a chance, can you please review this design doc? Thank you.

Member

yarikoptic left a comment

Overall, the design sounds straightforward and sensible, and it would be great to see it implemented (as promised in the opening, which mentions a branch?).

I left only a few points and minor formatting tune-ups.

doc/design/garbage-collection-uploads-asset-blobs-2.md (outdated)

### Orphaned AssetBlobs

In this case, assume that the user properly completes the multipart upload flow and "finalizes" the `Upload` record such that it is now an `AssetBlob`, but they do not send a request to associate the new blob with an `Asset`. That `AssetBlob` record and associated S3 object will remain in the database/bucket indefinitely.
Member

Or if the Asset was removed somehow, e.g. via a future implementation of Asset GC.

Member Author

Updated the wording to include this in 959e9b4.


Due to the trailing delete lifecycle rule, the actual uploaded data will remain recoverable for up to 30 days after this deletion, after which the lifecycle rule will clear it out of the bucket permanently.

In order to facilitate restoration of deleted data, as well as for general auditability of the garbage collection feature, a new database table will be created to store information on garbage-collection events. Rows in this new table will be garbage-collected themselves every 30 days, since that is the number of days that the trailing delete feature waits before deleting expired object versions. In other words, once the blob is no longer recoverable via trailing delete in S3, the corresponding `GarbageCollectionEvent` should be deleted as well.
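The retention rule in the paragraph above can be sketched roughly as follows. This is only an illustration in plain Python, not the proposed database model: the `GarbageCollectionEvent` fields, the `RESTORATION_WINDOW` constant, and the `prune_events` helper are all hypothetical names, since the design text does not specify the table's columns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: mirrors the 30-day trailing delete window described in the doc.
RESTORATION_WINDOW = timedelta(days=30)


@dataclass
class GarbageCollectionEvent:  # hypothetical shape, not the real table
    record_type: str   # e.g. "Upload" or "AssetBlob"
    record_id: int
    created: datetime  # when the garbage collection ran


def prune_events(events, now):
    """Keep only events whose data may still be recoverable in S3."""
    return [e for e in events if now - e.created <= RESTORATION_WINDOW]
```

Once an event falls outside the window, the underlying object version can no longer be restored, so the event row carries no restoration value and is dropped.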
Member

Why not include this information in the Audit table?

  • It might be trickier to collate all information on a user's activities if we need to harvest it across different tables
    • e.g. to collect stats on per-user uploads
  • Eventually, I hope we will get a "legit" API endpoint for canceling/removing uploads. Such events should also go into Audit. So why not place GC events there as well?

Member Author

@waxlamp and I discussed this as a possibility when I was writing this doc, and came to the conclusion that audit events are really intended to represent "user-driven" actions, while garbage collection is a "system-driven" action. In other words, an event in the audit table should be associated with a specific user. Garbage collection is an automated process that doesn't fit into that category.

If someday we decided to expose data restoration as a user-facing feature via the API, that would be a different matter and would likely go into the audit table. I expect cancelling uploads via the API would also go into the audit table, as that would be user-driven as well.

We will introduce a new celery-beat task that runs daily. This task will

- Query for and delete any uploads that are older than the multipart upload presigned URL expiration time (this is currently 7 days).
- Query for and delete any AssetBlobs that are (1) not associated with any Assets, and (2) older than 7 days.
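
The two selection criteria listed above could be expressed along these lines. This is a minimal sketch using in-memory stand-ins rather than the real models or ORM queries; the `Upload` and `AssetBlob` dataclasses, the `asset_count` field, and both helper functions are hypothetical, and both thresholds are assumed to be the 7 days stated in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: both thresholds are 7 days, as stated in the design text.
UPLOAD_EXPIRATION = timedelta(days=7)
ASSET_BLOB_GRACE_PERIOD = timedelta(days=7)


@dataclass
class Upload:  # hypothetical stand-in for the real Upload record
    id: int
    created: datetime


@dataclass
class AssetBlob:  # hypothetical stand-in for the real AssetBlob record
    id: int
    created: datetime
    asset_count: int  # number of Assets referencing this blob


def collectable_uploads(uploads, now):
    """Uploads older than the presigned URL expiration time."""
    return [u for u in uploads if now - u.created > UPLOAD_EXPIRATION]


def collectable_asset_blobs(blobs, now):
    """AssetBlobs with no Assets that are also older than the grace period."""
    return [
        b for b in blobs
        if b.asset_count == 0 and now - b.created > ASSET_BLOB_GRACE_PERIOD
    ]
```

Note that an AssetBlob must satisfy both conditions: a recently finalized blob with no Asset yet is left alone, so a client still in the middle of the upload flow is not affected.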
Member

There is a possible race condition: a user uploads a blob that is already known to the archive and is about to be GC'ed, and then creates a new Asset by passing the AssetBlob id just as the blob is GC'ed.
This is very unlikely, but it can happen. We would then get an error during creation of the new Asset and would need to somehow inform the client that it must redo the upload logic.

I do not know whether we should make provisions to avoid this race, since I do not see an uncomplicated solution. But maybe the implementation could minimize the time from "query" to "delete", so that the DELETE is not delayed for a considerable time after the query.
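
One way to shrink the query-to-delete window might be to fold the selection into the DELETE statement itself, so that no separately fetched list of ids can go stale in between. The sketch below illustrates the idea with Python's built-in `sqlite3` module and a made-up two-table schema (`asset_blob`, `asset`); the table names, columns, and `delete_orphaned_blobs` helper are assumptions for illustration only. It does not cover deletion of the S3 objects, and it narrows the race rather than eliminating it.

```python
import sqlite3

# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE asset_blob (id INTEGER PRIMARY KEY, created TEXT NOT NULL);
    CREATE TABLE asset (id INTEGER PRIMARY KEY, blob_id INTEGER NOT NULL);
""")


def delete_orphaned_blobs(conn, cutoff):
    """Select and delete in a single statement; returns the number of rows deleted."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            DELETE FROM asset_blob
            WHERE created < ?
              AND NOT EXISTS (
                  SELECT 1 FROM asset WHERE asset.blob_id = asset_blob.id
              )
            """,
            (cutoff,),
        )
    return cur.rowcount
```

Here `cutoff` is an ISO 8601 date string, which compares correctly as text in SQLite.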

Member Author

I think in that case, the server should return a 400 error when attempting to create the Asset, and it would be up to the client to infer that the AssetBlob doesn't exist anymore. Since creating an asset requires multiple API requests, I don't see any way to guard against this. One could make the argument that once an API response comes back, the info is immediately stale and potentially out of date.
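
To make the behavior described in this reply concrete, here is a rough sketch of how a client might react to the 400. Everything in it is hypothetical: `BadRequest` stands in for an HTTP 400 response, `create_asset` for the server-side check, and `reupload` for whatever the client does to upload the blob again. It is not the actual API or client code.

```python
class BadRequest(Exception):
    """Stand-in for an HTTP 400 response."""
    status_code = 400


def create_asset(blob_id, existing_blob_ids):
    """Hypothetical server-side check: reject an unknown AssetBlob id."""
    if blob_id not in existing_blob_ids:
        raise BadRequest(f"AssetBlob {blob_id} does not exist")
    return {"blob": blob_id}


def create_asset_with_reupload(blob_id, existing_blob_ids, reupload):
    """Hypothetical client-side logic: on a 400, upload again and retry once."""
    try:
        return create_asset(blob_id, existing_blob_ids)
    except BadRequest:
        # The blob may have been garbage-collected after the client last saw it.
        new_blob_id = reupload()
        return create_asset(new_blob_id, existing_blob_ids)
```

The retry happens only once, so a second failure still surfaces to the caller instead of looping.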

4 participants