Suppose you add a new storage provider (see #1006). This means all new bags will be replicated to the new provider, but how do you backfill all the existing bags?
## Assumptions
Every existing bag will be replicated to every storage provider. It would be additional work to support mixed locations.
## Prior art
When we first built the storage service, we only replicated bags to Amazon S3.
We added support for Azure Blob later; backfilling the existing bags was a somewhat manual and hacked-together process that wouldn't be easily repeatable. We should try to find a more robust approach.
## High-level proposal
Within the storage service, an ingest is a record of some processing on a bag. Currently we support two ingest types:
- create a new copy of a bag
- update an existing copy of a bag
Replicating an existing bag to a new location could be another type of ingest.
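As a rough sketch, the new type could sit alongside the existing two. The names below are illustrative, not the storage service's actual identifiers:

```scala
// A sketch of how the ingest type might gain a third case.
// These names are illustrative, not the actual identifiers
// used in the storage service codebase.
sealed trait IngestType { val id: String }

case object CreateIngestType extends IngestType { val id = "create" }
case object UpdateIngestType extends IngestType { val id = "update" }

// Hypothetical new case: replicate an existing bag to a new
// location, without re-ingesting or re-versioning the bag itself.
case object BackfillIngestType extends IngestType { val id = "backfill" }
```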
You'd start the job using the ingests API (exact design tbd), track it through the ingests API and ingests tracker, and the new location would be added to the storage manifest when the replication completed. It'd look something like this:
```mermaid
graph LR
    A[... pipeline for<br/>new bags] --> EV[verifiers for<br/>new bags]
    EV --> RA[Replica aggregator]
    IA[ingests API] --> BRe[backfill replicator]
    BRe --> BV[backfill verifier]
    BV --> RA
    RA --> BR[Bag register]
```
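Since the exact API design is tbd, here's a purely hypothetical sketch of what a request to start a backfill might carry; every field name here is an assumption, not a real part of the ingests API:

```scala
// Hypothetical request payload for starting a backfill ingest.
// The fields loosely mirror the create/update style of the
// ingests API, but none of this is the actual API design.
case class BackfillRequest(
  ingestType: String,         // "backfill"
  space: String,              // the space the bag lives in
  externalIdentifier: String, // which bag to replicate
  targetProvider: String      // ID of the new storage provider
)
```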
## Considerations
The storage service reporting would let you work out which bags haven't been backfilled into the new location.
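For example, since the storage manifest records every location a bag has been replicated to, a report could filter for manifests that don't yet mention the new provider. The shapes below are illustrative, not the real manifest model:

```scala
// Hypothetical shapes, loosely based on the storage manifest,
// which records every location a bag has been replicated to.
case class Location(provider: String, bucket: String, path: String)
case class StorageManifest(id: String, locations: List[Location])

// Which bags don't yet have a replica with the new provider?
def bagsToBackfill(
  manifests: Seq[StorageManifest],
  newProvider: String
): Seq[String] =
  manifests
    .filterNot(_.locations.exists(_.provider == newProvider))
    .map(_.id)
```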
You'd likely start from the "warm" replica location, which is S3, but objects in this location aren't always available for retrieval, e.g. sometimes objects get cycled to Glacier (see Large Things Living in Cold Places). Would we need a "bag warmer" step that retrieves any objects from Glacier before kicking off the replication step?
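If so, a minimal sketch of that warming step might use the S3 `RestoreObject` API; the bucket, key, and restore parameters here are illustrative:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{
  GlacierJobParameters, RestoreObjectRequest, RestoreRequest, Tier
}

// A sketch of what a "bag warmer" might do for a single object:
// ask S3 to restore it from Glacier so the replicator can read it.
// A real warmer would also need to poll for restore completion
// before the replication step starts.
def warmObject(s3: S3Client, bucket: String, key: String): Unit =
  s3.restoreObject(
    RestoreObjectRequest.builder()
      .bucket(bucket)
      .key(key)
      .restoreRequest(
        RestoreRequest.builder()
          .days(7) // how long the restored copy stays available
          .glacierJobParameters(
            GlacierJobParameters.builder().tier(Tier.BULK).build()
          )
          .build()
      )
      .build()
  )
```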