Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WX-1153 Azure blob read md5 from metadata for large files #7204

Merged
merged 22 commits into from
Aug 17, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
359ef4d
Initial commit
aednichols Aug 11, 2023
0b9aca0
Consistently use `Try` for IO
aednichols Aug 11, 2023
cad86dc
Whitespace disabled from working on WDL parser...
aednichols Aug 11, 2023
ade9145
Fix sneaky `None.get` disguised as `Map.get()`
aednichols Aug 12, 2023
39024c2
Add test, fix thing test found
aednichols Aug 14, 2023
108f60e
Merge remote-tracking branch 'origin/develop' into aen_wx_1153
aednichols Aug 14, 2023
c4fbfb1
Try populating Azure creds for unit tests
aednichols Aug 15, 2023
ac4bd02
See what happens with other tests
aednichols Aug 15, 2023
cf96a7c
You work locally, will you work in CI?
aednichols Aug 15, 2023
d348428
Don't hide the exception when tests fail
aednichols Aug 15, 2023
c5d991e
Can we, uh, see the error maybe?
aednichols Aug 16, 2023
b62c9de
Do you pass in CI now?
aednichols Aug 17, 2023
c4e3ece
Do you pass in CI now?
aednichols Aug 17, 2023
4c9a642
Merge remote-tracking branch 'origin/develop' into aen_wx_1153
aednichols Aug 17, 2023
b4872d9
Cleanup
aednichols Aug 17, 2023
c36e3e5
Add PR suggested test
aednichols Aug 17, 2023
2fe419c
Unignore other tests
jgainerdewar Aug 17, 2023
23217a2
Enhance `FileSystemAPI` to allow injection
aednichols Aug 17, 2023
35ec356
Merge branch 'aen_wx_1153' of github.com:broadinstitute/cromwell into…
jgainerdewar Aug 17, 2023
f845d8b
Merge branch 'aen_wx_1153' of github.com:broadinstitute/cromwell into…
jgainerdewar Aug 17, 2023
d17274f
Re-add test I accidentally deleted
jgainerdewar Aug 17, 2023
ba60ef8
Merge branch 'develop' into aen_wx_1153
aednichols Aug 17, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@

import java.net.{MalformedURLException, URI}
import java.nio.file.{Files, LinkOption}
import scala.jdk.CollectionConverters._
import scala.language.postfixOps
import scala.util.{Failure, Success, Try}

Expand Down Expand Up @@ -78,6 +79,17 @@
// format the library expects
// 2) If the path looks like <container>:<path>, strip off the <container>: to leave the absolute path inside the container.
private val brokenPathRegex = "https:/([a-z0-9]+).blob.core.windows.net/([-a-zA-Z0-9]+)/(.*)".r

// Blob files larger than 5 GB upload in parallel parts [0][1] and do not get a native `CONTENT-MD5` property.
// Instead, some uploaders such as TES [2] may optionally calculate the MD5 themselves and store it under this key in metadata.
// N.B. most if not virtually all large files in the wild will NOT have this key populated because they were not created by TES. [3]
//
// [0] https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets
// [1] https://learn.microsoft.com/en-us/rest/api/storageservices/version-2019-12-12
// [2] https://github.com/microsoft/ga4gh-tes/pull/236
// [3] As of 2023-08 there are zero search engine results for `md5_hashlist_root_hash` and the only sure-thing client is TES
private val largeBlobFileMetadataKey = "md5_hashlist_root_hash"

def cleanedNioPathString(nioString: String): String = {
val pathStr = nioString match {
case brokenPathRegex(_, containerName, pathInContainer) =>
Expand Down Expand Up @@ -116,16 +128,32 @@
def blobFileAttributes: Try[AzureBlobFileAttributes] =
Try(Files.readAttributes(nioPath, classOf[AzureBlobFileAttributes]))

def blobFileMetadata: Try[Option[Map[String, String]]] = blobFileAttributes.map { attrs =>
// `metadata()` has a documented `null` case
Option(attrs.metadata()).map(_.asScala.toMap)
}

def md5HexString: Try[Option[String]] = {
blobFileAttributes.map(h =>
Option(h.blobHttpHeaders().getContentMd5) match {
case None => None
case Some(arr) if arr.isEmpty => None
// Convert the bytes to a hex-encoded string. Note that this value
// is rendered in base64 in the Azure web portal.
case Some(bytes) => Option(bytes.map("%02x".format(_)).mkString)
def md5FromMetadata: Option[String] = (blobFileMetadata map { maybeMetadataMap: Option[Map[String, String]] =>
maybeMetadataMap flatMap { metadataMap: Map[String, String] =>
metadataMap.get(BlobPath.largeBlobFileMetadataKey)
}
}).toOption.flatten

// Convert the bytes to a hex-encoded string. Note that the value
// is rendered in base64 in the Azure web portal.
def hexString(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString

blobFileAttributes.map { attr: AzureBlobFileAttributes =>
(Option(attr.blobHttpHeaders().getContentMd5), md5FromMetadata) match {
case (None, None) => None
case (None, Some(metadataMd5)) => Option(metadataMd5)

Check warning on line 150 in filesystems/blob/src/main/scala/cromwell/filesystems/blob/BlobPathBuilder.scala

View check run for this annotation

Codecov / codecov/patch

filesystems/blob/src/main/scala/cromwell/filesystems/blob/BlobPathBuilder.scala#L149-L150

Added lines #L149 - L150 were not covered by tests
case (Some(headerMd5Bytes), None) if headerMd5Bytes.isEmpty => None
// (Some, Some) could happen if an uploader redundantly populates an md5 for a small file.
// Doesn't seem like an erroneous condition so just choose the native one.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This choice seems logical to me but we should run it by the TES folks to ensure we're not missing some reason to prefer the other hash. Not 100% sure but I believe they are always setting the metadata value on upload, regardless of file size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the decision from today's discussion.

case (Some(headerMd5Bytes), _) => Option(hexString(headerMd5Bytes))
}
)
}
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,30 @@ class BlobPathBuilderSpec extends AnyFlatSpec with Matchers with MockSugar {
new BlobPathBuilder(store, endpoint)(fsm)
}

it should "read md5 from small files <5g" in {
val builder = makeBlobPathBuilder(endpoint, store)
val evalPath = "/test/inputFile.txt"
val testString = endpoint.value + "/" + store + evalPath
val blobPath1: BlobPath = builder build testString getOrElse fail()
blobPath1.md5HexString.toOption.get should equal(Some("967f0f086992f1a8b48f0a533f80290b"))
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add test cases for the two types of missing hashes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the tests are working... with pleasure!

it should "read md5 from large files >5g" in {
val builder = makeBlobPathBuilder(endpoint, store)
val evalPath = "/test/Rocky-9.2-aarch64-dvd.iso"
val testString = endpoint.value + "/" + store + evalPath
val blobPath1: BlobPath = builder build testString getOrElse fail()
blobPath1.md5HexString.toOption.get should equal(Some("13cb09331d2d12c0f476f81c672a4319"))
}

it should "choose the native md5 over the metadata md5 for files that have both" in {
val builder = makeBlobPathBuilder(endpoint, store)
val evalPath = "/test/redundant_md5_test.txt"
val testString = endpoint.value + "/" + store + evalPath
val blobPath1: BlobPath = builder build testString getOrElse fail()
blobPath1.md5HexString.toOption.get should equal(Some("9e5ceec07c8730b593a3a4b4ae324475"))
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really happy with immediately ignoreing these tests, gonna mess around to see if they can be made to work in CI.

ignore should "resolve an absolute path string correctly to a path" in {
val builder = makeBlobPathBuilder(endpoint, store)
val rootString = s"${endpoint.value}/${store.value}/cromwell-execution"
Expand Down
Loading