Too many bags – especially bags with lots of files or large files – fail when we try to verify them in Azure. This is blocking the migration (#4744) and more generally is bad for reliability.
One way we help reliability in S3 is to tag objects once they've been verified. We have a few options for tagging in Azure, none of them ideal:
Use Azure Metadata. This is what we've implemented in the AzureBlobMetadata class, but metadata is immutable once a blob is written.
Use Azure Tags. This is closest to what we have in S3, but this feature is only in preview and not available to us yet.
Use DynamoDB. Kinda icky to have metadata split across Azure and AWS, but hopefully it's only temporary. Eventually we'll switch to tags and drop the tables; the tag data is entirely reproducible.
Tagging objects allows a verification to be retried, but somebody needs to retry the verifications manually. Boo!
Looking at the logs in Kibana, I see two common failures:
DeterministicFailure(reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 5000ms in 'flatMap' (and no fallback has been configured),Some(root=azure://wellcomecollection-storage-replica-netherlands/digitised/b20414341/v1, status=incomplete, ingestId=2e0d696e-4e3b-4530-82b1-78779155027c, duration=PT47M17.058418S, durationSeconds=2837))
DeterministicFailure(java.lang.RuntimeException: Unable to read range OpenByteRange(0) from azure://wellcomecollection-storage-replica-netherlands/digitised/b21467742/v1/data/alto/b21467742_0004_0033.xml,Some(root=azure://wellcomecollection-storage-replica-netherlands/digitised/b21467742/v1, status=incomplete, ingestId=ef5dcbbe-1276-4c9f-803d-94bd7232ec14, duration=PT1H33.899271S, durationSeconds=3633))
It should be possible to notice both of these, and mark them as retryable failures at the SQS level.
There’s a private method buildStepResult() in the BagVerifier class, which maps a result to IngestFailed/Succeeded. A private method can't be overridden, so we'd first have to make it protected. Then we could override it in the AzureBagVerifier, catch these two exceptions, and return IngestRetry instead.
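A rough sketch of how the two failures might be recognised follows. It is not the real BagVerifier code: `StepResult`, `IngestFailed` and `IngestRetry` here are simplified stand-ins for the real result types, and `RetryableAzureErrors` is a hypothetical helper name. It assumes the TimeoutException is reachable through the cause chain of the ReactiveException (which is what the first log line suggests), and that the second failure can be identified by the "Unable to read range" message prefix seen in the logs.

```scala
import java.util.concurrent.TimeoutException

// Simplified stand-ins for the real step result types (assumption).
sealed trait StepResult
case class IngestFailed(e: Throwable) extends StepResult
case class IngestRetry(e: Throwable) extends StepResult

// Hypothetical helper: decides whether an Azure verification error
// looks transient enough to be retried.
object RetryableAzureErrors {

  // The exception plus its chain of causes, capped at 20 entries
  // so that a cyclic cause chain cannot loop forever.
  def causes(e: Throwable): List[Throwable] =
    Iterator.iterate(e)(_.getCause).takeWhile(_ != null).take(20).toList

  def isRetryable(e: Throwable): Boolean =
    causes(e).exists {
      // Failure 1: Reactor timed out waiting for a signal from Azure.
      case _: TimeoutException => true
      // Failure 2: a ranged read failed; matched on the message prefix
      // from the logs, which is an assumption about how it is thrown.
      case r: RuntimeException =>
        Option(r.getMessage).exists(_.startsWith("Unable to read range"))
      case _ => false
    }

  // Map an error to a retry if it looks transient, otherwise to a failure.
  def classify(e: Throwable): StepResult =
    if (isRetryable(e)) IngestRetry(e) else IngestFailed(e)
}
```

An overridden buildStepResult() could call something like `RetryableAzureErrors.classify(err)` on the failure branch and leave every other result untouched. Matching on a message prefix is brittle; a dedicated exception type for range-read failures would be more robust if we're willing to change the reader.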