Suppose you add a new storage provider (see #1006). This means all new bags will be replicated to the new provider, but how do you backfill all the existing bags?
## Assumptions
Every existing bag will be replicated to every storage provider. It would be additional work to support mixed locations.
## Prior art
When we first built the storage service, we only replicated bags to Amazon S3.
We added support for Azure Blob later; backfilling the existing bags was a somewhat manual and hacked-together process that wouldn't be easily repeatable. We should try to find a more robust approach.
## High-level proposal
Within the storage service, an ingest is a record of some processing on a bag. Currently we support two ingest types:
- create a new copy of a bag
- update an existing copy of a bag
Replicating an existing bag to a new location could be another type of ingest.
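As a rough sketch, the new type could sit alongside the existing two. The names below are illustrative, not the storage service's actual identifiers:

```scala
// A sketch of how the ingest type might gain a third case.
// These names are illustrative, not the actual identifiers
// used in the storage service codebase.
sealed trait IngestType { val id: String }

case object CreateIngestType extends IngestType { val id = "create" }
case object UpdateIngestType extends IngestType { val id = "update" }

// Hypothetical new case: replicate an existing bag to a new
// location, without re-ingesting or re-versioning the bag itself.
case object BackfillIngestType extends IngestType { val id = "backfill" }
```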
You'd start the job using the ingests API (exact design tbd), track it through the ingests API and ingests tracker, and the new location would be added to the storage manifest when the replication completed. It'd look something like this:
```mermaid
graph LR
    A[... pipeline for<br/>new bags] --> EV[verifiers for<br/>new bags]
    EV --> RA[Replica aggregator]
    IA[ingests API] --> BRe[backfill replicator]
    BRe --> BV[backfill verifier]
    BV --> RA
    RA --> BR[Bag register]
```
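Since the exact API design is tbd, here's a purely hypothetical sketch of what a request to start a backfill might carry; every field name here is an assumption, not a real part of the ingests API:

```scala
// Hypothetical request payload for starting a backfill ingest.
// The fields loosely mirror the create/update style of the
// ingests API, but none of this is the actual API design.
case class BackfillRequest(
  ingestType: String,         // "backfill"
  space: String,              // the space the bag lives in
  externalIdentifier: String, // which bag to replicate
  targetProvider: String      // ID of the new storage provider
)
```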
## Considerations
The storage service reporting would let you work out which bags haven't been backfilled into the new location.
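For example, since the storage manifest records every location a bag has been replicated to, a report could filter for manifests that don't yet mention the new provider. The shapes below are illustrative, not the real manifest model:

```scala
// Hypothetical shapes, loosely based on the storage manifest,
// which records every location a bag has been replicated to.
case class Location(provider: String, bucket: String, path: String)
case class StorageManifest(id: String, locations: List[Location])

// Which bags don't yet have a replica with the new provider?
def bagsToBackfill(
  manifests: Seq[StorageManifest],
  newProvider: String
): Seq[String] =
  manifests
    .filterNot(_.locations.exists(_.provider == newProvider))
    .map(_.id)
```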
You'd likely start from the "warm" replica location, which is S3, but objects in this location aren't always available for retrieval, e.g. sometimes objects get cycled to Glacier (see Large Things Living in Cold Places). Would we need a "bag warmer" step that retrieves any objects from Glacier before kicking off the replication step?
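If so, a minimal sketch of that warming step might use the S3 `RestoreObject` API; the bucket, key, and restore parameters here are illustrative:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{
  GlacierJobParameters, RestoreObjectRequest, RestoreRequest, Tier
}

// A sketch of what a "bag warmer" might do for a single object:
// ask S3 to restore it from Glacier so the replicator can read it.
// A real warmer would also need to poll for restore completion
// before the replication step starts.
def warmObject(s3: S3Client, bucket: String, key: String): Unit =
  s3.restoreObject(
    RestoreObjectRequest.builder()
      .bucket(bucket)
      .key(key)
      .restoreRequest(
        RestoreRequest.builder()
          .days(7) // how long the restored copy stays available
          .glacierJobParameters(
            GlacierJobParameters.builder().tier(Tier.BULK).build()
          )
          .build()
      )
      .build()
  )
```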