Verification in Azure isn't reliable enough #4774

Closed
alexwlchan opened this issue Sep 9, 2020 · 2 comments
Labels: 🐛 Bug, 📦 Storage service (Work related to the storage service)


alexwlchan commented Sep 9, 2020

Too many bags – especially bags with lots of files or large files – fail when we try to verify them in Azure. This is blocking the migration (#4744) and more generally is bad for reliability.

@alexwlchan alexwlchan added 🐛 Bug 📦 Storage service Work related to the storage service labels Sep 9, 2020
@alexwlchan alexwlchan self-assigned this Sep 9, 2020
alexwlchan (Contributor, Author) commented

One way we help reliability in S3 is to tag objects once they've been verified, so a retried verification can skip them. We have a few options for tagging in Azure, none of them ideal:

  • Use Azure Metadata. This is what we've implemented in the AzureBlobMetadata class, but metadata is immutable once a blob is written.
  • Use Azure Tags. This is closest to what we have in S3, but this feature is only in preview and not available to us yet.
  • Use DynamoDB. Kinda icky to have metadata split across Azure and AWS, but hopefully it's only temporary. Eventually we'll switch to tags and drop the tables; the tag data is entirely reproducible.

This is done in wellcomecollection/storage-service#729
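To illustrate the idea behind the DynamoDB option, here is a minimal sketch of a verifier that records a tag after a successful check and skips the expensive read on a retry. Everything in it is hypothetical: `TagStore`, `MemoryTagStore`, `TaggingVerifier`, and the `Content-SHA256` tag name are invented for illustration and are not taken from the storage service code, and the in-memory map merely stands in for a DynamoDB table keyed on the blob location.

```scala
import scala.collection.mutable

// Hypothetical interface; not the storage service's real API.
trait TagStore {
  def get(location: String): Map[String, String]
  def put(location: String, tags: Map[String, String]): Unit
}

// In-memory stand-in for a DynamoDB table keyed on the blob location.
class MemoryTagStore extends TagStore {
  private val table = mutable.Map.empty[String, Map[String, String]]

  def get(location: String): Map[String, String] =
    table.getOrElse(location, Map.empty)

  // Merge new tags into whatever is already stored for this location.
  def put(location: String, tags: Map[String, String]): Unit =
    table.update(location, get(location) ++ tags)
}

// computeChecksum is the expensive step: it has to read the whole blob.
class TaggingVerifier(tags: TagStore, computeChecksum: String => String) {
  private val tagName = "Content-SHA256" // assumed tag name

  // Returns true if the blob matches the expected checksum.
  def verify(location: String, expected: String): Boolean =
    tags.get(location).get(tagName) match {
      // Verified on an earlier attempt: compare against the tag, skip the read.
      case Some(previous) => previous == expected
      case None =>
        val ok = computeChecksum(location) == expected
        if (ok) tags.put(location, Map(tagName -> expected))
        ok
    }
}
```

Because the tag is only written after a successful check, the table can be rebuilt from the blobs themselves at any time, which is why dropping it later in favour of Azure Tags would lose nothing.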

alexwlchan (Contributor, Author) commented

Tagging objects allows a verification to be retried, but somebody needs to retry the verifications manually. Boo!

Looking at the logs in Kibana, I see two common failures:

DeterministicFailure(reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 5000ms in 'flatMap' (and no fallback has been configured),Some(root=azure://wellcomecollection-storage-replica-netherlands/digitised/b20414341/v1, status=incomplete, ingestId=2e0d696e-4e3b-4530-82b1-78779155027c, duration=PT47M17.058418S, durationSeconds=2837))

DeterministicFailure(java.lang.RuntimeException: Unable to read range OpenByteRange(0) from azure://wellcomecollection-storage-replica-netherlands/digitised/b21467742/v1/data/alto/b21467742_0004_0033.xml,Some(root=azure://wellcomecollection-storage-replica-netherlands/digitised/b21467742/v1, status=incomplete, ingestId=ef5dcbbe-1276-4c9f-803d-94bd7232ec14, duration=PT1H33.899271S, durationSeconds=3633))

It should be possible to notice both of these, and mark them as retryable failures at the SQS level.

There’s a private method buildStepResult() in the BagVerifier class, which maps a result to IngestFailed/Succeeded. A private method can't be overridden as-is, but if we made it protected and overrode it in the AzureBagVerifier to catch these two exceptions, we could replace them with IngestRetry instead.
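A rough sketch of how the two failures above might be recognised and turned into a retry. The types here (`StepResult`, `Failed`, `ShouldRetry`, `AzureFailures`) are hypothetical stand-ins rather than the service's real IngestFailed/IngestRetry classes, and matching on the "Unable to read range" message prefix is an assumption based only on the log line quoted above.

```scala
import java.util.concurrent.TimeoutException
import scala.annotation.tailrec

// Hypothetical result types, standing in for the service's own classes.
sealed trait StepResult
case class Failed(e: Throwable) extends StepResult
case class ShouldRetry(e: Throwable) extends StepResult

object AzureFailures {
  // Walk the cause chain: in the logs the TimeoutException arrives
  // wrapped inside a ReactiveException, so checking only the top-level
  // exception would miss it.
  @tailrec
  def isTransient(e: Throwable): Boolean =
    e match {
      case null                => false
      case _: TimeoutException => true
      case r: RuntimeException
          if Option(r.getMessage).exists(_.startsWith("Unable to read range")) =>
        true
      case other => isTransient(other.getCause)
    }

  // Transient Azure errors become retries; everything else stays a failure.
  def classify(e: Throwable): StepResult =
    if (isTransient(e)) ShouldRetry(e) else Failed(e)
}
```

An overridden buildStepResult() could call something like `AzureFailures.classify` on the underlying exception before falling back to the default mapping, so genuine verification failures such as a checksum mismatch are still reported as failed.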
