Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WX-1410] Sanitize 4 byte UTF-8 characters before inserting into METADATA_ENTRY #7414

Merged
merged 9 commits into from
May 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions cromwell.example.backends/cromwell.examples.conf
Original file line number Diff line number Diff line change
Expand Up @@ -498,6 +498,11 @@ services {
# # count against this limit.
# metadata-read-row-number-safety-threshold = 1000000
#
# # Remove any UTF-8 mb4 (4 byte) characters from metadata keys in the list.
# # These characters (namely emojis) will cause metadata writing to fail in database collations
# # that do not support 4 byte UTF-8 characters.
# metadata-keys-to-sanitize-utf8mb4 = ["submittedFiles:workflow", "commandLine"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment!

#
# metadata-write-statistics {
# # Not strictly necessary since the 'metadata-write-statistics' section itself is enough for statistics to be recorded.
# # However, this can be set to 'false' to disable statistics collection without deleting the section.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -110,12 +110,14 @@ case class MetadataServiceActor(serviceConfig: Config, globalConfig: Config, ser
val metadataWriteStatisticsConfig = MetadataStatisticsRecorderSettings(
serviceConfig.as[Option[Config]]("metadata-write-statistics")
)
val metadataKeysToClean = serviceConfig.getOrElse[List[String]]("metadata-keys-to-sanitize-utf8mb4", List())
val writeActor = context.actorOf(
WriteMetadataActor.props(dbBatchSize,
dbFlushRate,
serviceRegistryActor,
LoadConfig.MetadataWriteThreshold,
metadataWriteStatisticsConfig
metadataWriteStatisticsConfig,
metadataKeysToClean
),
"WriteMetadataActor"
)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,11 @@
import cromwell.core.Mailbox.PriorityMailbox
import cromwell.core.WorkflowId
import cromwell.core.instrumentation.InstrumentationPrefixes
import cromwell.services.metadata.MetadataEvent
import cromwell.services.metadata.{MetadataEvent, MetadataValue}
import cromwell.services.metadata.MetadataService._
import cromwell.services.metadata.impl.MetadataStatisticsRecorder.MetadataStatisticsRecorderSettings
import cromwell.services.{EnhancedBatchActor, MetadataServicesStore}
import wdl.util.StringUtil

import scala.concurrent.duration._
import scala.util.{Failure, Success}
Expand All @@ -18,7 +19,8 @@
override val flushRate: FiniteDuration,
override val serviceRegistryActor: ActorRef,
override val threshold: Int,
metadataStatisticsRecorderSettings: MetadataStatisticsRecorderSettings
metadataStatisticsRecorderSettings: MetadataStatisticsRecorderSettings,
metadataKeysToClean: List[String]
) extends EnhancedBatchActor[MetadataWriteAction](flushRate, batchSize)
with ActorLogging
with MetadataDatabaseAccess
Expand All @@ -27,9 +29,10 @@
private val statsRecorder = MetadataStatisticsRecorder(metadataStatisticsRecorderSettings)

override def process(e: NonEmptyVector[MetadataWriteAction]) = instrumentedProcess {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small recommendation for performance reasons:

  • We can avoid re-loading/re-computing the metadataKeysToClean list every time process is called by assigning it to alazy val instead of a val.
    • It might be easiest to do this by making a function that returns the list for you (after reading stuff from config) and assigning that to a lazy val.
  • This lets us avoid calling a (potentially slow because of file i/o) ConfigFactory.load() function more than once.

https://leobenkel.com/2020/07/skb-scala-val-lazy-def/#:~:text=The%20keyword%20lazy%20allows%20a,as%20the%20value%20is%20declared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is resolved by passing in the metadataKeysToClean at actor creation time, see Janet's comment below

val cleanedMetadataWriteActions = if (metadataKeysToClean.isEmpty) e else sanitizeInputs(e)
val empty = (Vector.empty[MetadataEvent], List.empty[(Iterable[MetadataEvent], ActorRef)])

val (putWithoutResponse, putWithResponse) = e.foldLeft(empty) {
val (putWithoutResponse, putWithResponse) = cleanedMetadataWriteActions.foldLeft(empty) {
case ((putEvents, putAndRespondEvents), action: PutMetadataAction) =>
(putEvents ++ action.events, putAndRespondEvents)
case ((putEvents, putAndRespondEvents), action: PutMetadataActionAndRespond) =>
Expand All @@ -46,7 +49,7 @@
case Success(_) =>
putWithResponse foreach { case (ev, replyTo) => replyTo ! MetadataWriteSuccess(ev) }
case Failure(cause) =>
val (outOfTries, stillGood) = e.toVector.partition(_.maxAttempts <= 1)
val (outOfTries, stillGood) = cleanedMetadataWriteActions.toVector.partition(_.maxAttempts <= 1)

Check warning on line 52 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L52

Added line #L52 was not covered by tests

handleOutOfTries(outOfTries, cause)
handleEventsToReconsider(stillGood)
Expand All @@ -55,6 +58,23 @@
dbAction.map(_ => allPutEvents.size)
}

def sanitizeInputs(
metadataWriteActions: NonEmptyVector[MetadataWriteAction]
): NonEmptyVector[MetadataWriteAction] =
metadataWriteActions.map { metadataWriteAction =>

Check warning on line 64 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L64

Added line #L64 was not covered by tests
val metadataEvents =
metadataWriteAction.events.map { event =>
event.value match {
case Some(eventVal) => event.copy(value = Option(MetadataValue(StringUtil.cleanUtf8mb4(eventVal.value))))
case None => event

Check warning on line 69 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L66-L69

Added lines #L66 - L69 were not covered by tests
}
}
metadataWriteAction match {
case action: PutMetadataAction => action.copy(events = metadataEvents)
case actionAndResp: PutMetadataActionAndRespond => actionAndResp.copy(events = metadataEvents)

Check warning on line 74 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L73-L74

Added lines #L73 - L74 were not covered by tests
}
}

private def countActionsByWorkflow(writeActions: Vector[MetadataWriteAction]): Map[WorkflowId, Int] =
writeActions.flatMap(_.events).groupBy(_.key.workflowId).map { case (k, v) => k -> v.size }

Expand Down Expand Up @@ -106,9 +126,18 @@
flushRate: FiniteDuration,
serviceRegistryActor: ActorRef,
threshold: Int,
statisticsRecorderSettings: MetadataStatisticsRecorderSettings
statisticsRecorderSettings: MetadataStatisticsRecorderSettings,
metadataKeysToClean: List[String]
): Props =
Props(new WriteMetadataActor(dbBatchSize, flushRate, serviceRegistryActor, threshold, statisticsRecorderSettings))
Props(
new WriteMetadataActor(dbBatchSize,
flushRate,
serviceRegistryActor,
threshold,
statisticsRecorderSettings,
metadataKeysToClean
)
)
.withDispatcher(ServiceDispatcher)
.withMailbox(PriorityMailbox)
}
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ class WriteMetadataActorBenchmark extends TestKitSuite with AnyFlatSpecLike with

it should "provide good throughput" taggedAs IntegrationTest in {
val writeActor =
TestFSMRef(new WriteMetadataActor(1000, 5.seconds, registry, Int.MaxValue, MetadataStatisticsDisabled) {
TestFSMRef(new WriteMetadataActor(1000, 5.seconds, registry, Int.MaxValue, MetadataStatisticsDisabled, List()) {
override val metadataDatabaseInterface: MetadataSlickDatabase = dataAccess
})

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ class WriteMetadataActorSpec extends TestKitSuite with AnyFlatSpecLike with Matc

it should "process jobs in the correct batch sizes" in {
val registry = TestProbe().ref
val writeActor = TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue) {
val writeActor = TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue, List()) {
override val metadataDatabaseInterface = mockDatabaseInterface(0)
})

Expand Down Expand Up @@ -71,9 +71,10 @@ class WriteMetadataActorSpec extends TestKitSuite with AnyFlatSpecLike with Matc
failuresBetweenSuccessValues foreach { failureRate =>
it should s"succeed metadata writes and respond to all senders even with $failureRate failures between each success" in {
val registry = TestProbe().ref
val writeActor = TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue) {
override val metadataDatabaseInterface = mockDatabaseInterface(failureRate)
})
val writeActor =
TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue, List()) {
override val metadataDatabaseInterface = mockDatabaseInterface(failureRate)
})

def metadataEvent(index: Int, probe: ActorRef) =
PutMetadataActionAndRespond(List(
Expand Down Expand Up @@ -111,7 +112,7 @@ class WriteMetadataActorSpec extends TestKitSuite with AnyFlatSpecLike with Matc

it should s"fail metadata writes and respond to all senders with failures" in {
val registry = TestProbe().ref
val writeActor = TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue) {
val writeActor = TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue, List()) {
override val metadataDatabaseInterface = mockDatabaseInterface(100)
})

Expand Down Expand Up @@ -146,6 +147,90 @@ class WriteMetadataActorSpec extends TestKitSuite with AnyFlatSpecLike with Matc
writeActor.stop()
}

it should s"test removing emojis from metadata works as expected" in {
val registry = TestProbe().ref
val writeActor =
TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue, List("metadata_key")) {
override val metadataDatabaseInterface = mockDatabaseInterface(100)
})

def metadataEvent(index: Int, probe: ActorRef) = PutMetadataActionAndRespond(
List(
MetadataEvent(MetadataKey(WorkflowId.randomId(), None, "metadata_key"), MetadataValue(s"🎉_$index"))
),
probe
)

val probes = (0 until 43)
.map { _ =>
val probe = TestProbe()
probe
}
.zipWithIndex
.map { case (probe, index) =>
probe -> metadataEvent(index, probe.ref)
}

val metadataWriteActions = probes.map(probe => probe._2).toVector
val metadataWriteActionNE = NonEmptyVector(metadataWriteActions.head, metadataWriteActions.tail)

val sanitizedWriteActions = writeActor.underlyingActor.sanitizeInputs(metadataWriteActionNE)

sanitizedWriteActions.map { writeAction =>
writeAction.events.map { event =>
if (event.value.getOrElse(fail("Removed value from metadata event")).value.matches("[\\x{10000}-\\x{FFFFF}]")) {
fail("Metadata event contains emoji")
}

if (!event.value.getOrElse(fail("Removed value from metadata event")).value.contains("\uFFFD")) {
fail("Incorrect character used to replace emoji")
}
}
}
}

it should s"test removing emojis from metadata which doesn't contain emojis returns the string" in {
val registry = TestProbe().ref
val writeActor =
TestFSMRef(new BatchSizeCountingWriteMetadataActor(10, 10.millis, registry, Int.MaxValue, List("metadata_key")) {
override val metadataDatabaseInterface = mockDatabaseInterface(100)
})

def metadataEvent(index: Int, probe: ActorRef) = PutMetadataActionAndRespond(
List(
MetadataEvent(MetadataKey(WorkflowId.randomId(), None, "metadata_key"), MetadataValue(s"hello_$index"))
),
probe
)

val probes = (0 until 43)
.map { _ =>
val probe = TestProbe()
probe
}
.zipWithIndex
.map { case (probe, index) =>
probe -> metadataEvent(index, probe.ref)
}

val metadataWriteActions = probes.map(probe => probe._2).toVector
val metadataWriteActionNE = NonEmptyVector(metadataWriteActions.head, metadataWriteActions.tail)

val sanitizedWriteActions = writeActor.underlyingActor.sanitizeInputs(metadataWriteActionNE)

sanitizedWriteActions.map { writeAction =>
writeAction.events.map { event =>
if (event.value.getOrElse(fail("Removed value from metadata event")).value.matches("[\\x{10000}-\\x{FFFFF}]")) {
fail("Metadata event contains emoji")
}

if (event.value.getOrElse(fail("Removed value from metadata event")).value.contains("\uFFFD")) {
fail("Incorrectly replaced character in metadata event")
}
}
}
}

// Mock database interface.
// A customizable number of failures occur between each success
def mockDatabaseInterface(failuresBetweenEachSuccess: Int) = new MetadataSqlDatabase with SqlDatabase {
Expand Down Expand Up @@ -382,8 +467,15 @@ object WriteMetadataActorSpec {
class BatchSizeCountingWriteMetadataActor(override val batchSize: Int,
override val flushRate: FiniteDuration,
override val serviceRegistryActor: ActorRef,
override val threshold: Int
) extends WriteMetadataActor(batchSize, flushRate, serviceRegistryActor, threshold, MetadataStatisticsDisabled) {
override val threshold: Int,
val metadataKeysToClean: List[String]
) extends WriteMetadataActor(batchSize,
flushRate,
serviceRegistryActor,
threshold,
MetadataStatisticsDisabled,
metadataKeysToClean
) {

var batchSizes: Vector[Int] = Vector.empty
var failureCount: Int = 0
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import cromwell.core.ExecutionEvent
import cromwell.core.logging.JobLogger
import mouse.all._
import PipelinesUtilityConversions._
import wdl.util.StringUtil

import scala.language.postfixOps

Expand Down Expand Up @@ -67,7 +68,7 @@ trait PipelinesUtilityConversions {
// characters (typically emoji). Some databases have trouble storing these; replace them with the standard
// "unknown character" unicode symbol.
val name = Option(event.getContainerStopped) match {
case Some(_) => cleanUtf8mb4(event.getDescription)
case Some(_) => StringUtil.cleanUtf8mb4(event.getDescription)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the config to include this metadata in what we're sanitizing rather than cleaning it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup: in theory we could, but we would need to check every single executionEvent, of which there are many, because their keys are all like:

executionEvents[1173839490]:startTime
executionEvents[1173839490]:endTime
executionEvents[1173839490]:grouping
executionEvents[1173839490]:description

Leaving this comment in case anyone else has this idea, but I'm OK leaving this as-is for now. Would be great to add a comment explaining this context, though (we sanitize other metadata values elsewhere, but are handing this one differently).

We'll also need to keep in mind that we may need to do something different for Batch. I'll create a ticket in that epic.

case _ => event.getDescription
}

Expand Down Expand Up @@ -101,9 +102,4 @@ object PipelinesUtilityConversions {
None
}
}

lazy val utf8mb4Regex = "[\\x{10000}-\\x{FFFFF}]"
lazy val utf8mb3Replacement = "\uFFFD" // This is the standard char for replacing invalid/unknown unicode chars
def cleanUtf8mb4(in: String): String =
in.replaceAll(utf8mb4Regex, utf8mb3Replacement)
}

This file was deleted.

17 changes: 17 additions & 0 deletions wdl/model/draft2/src/test/scala/wdl/util/StringUtilSpec.scala
Original file line number Diff line number Diff line change
Expand Up @@ -70,4 +70,21 @@ class StringUtilSpec extends AnyFlatSpec with CromwellTimeoutSpec with Matchers
}
}

it should "not modify strings that contain only ascii characters" in {
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
val input = "hi there!?"
StringUtil.cleanUtf8mb4(input) shouldBe input
}

it should "not modify strings with 3-byte unicode characters" in {
val input = "Here is my non-ascii character: \u1234 Do you like it?"
StringUtil.cleanUtf8mb4(input) shouldBe input
}

it should "replace 4-byte unicode characters" in {
val cry = new String(Character.toChars(Integer.parseInt("1F62D", 16)))
val barf = new String(Character.toChars(Integer.parseInt("1F92E", 16)))
val input = s"When I try to put an emoji in the database it $barf and then I $cry"
val cleaned = "When I try to put an emoji in the database it \uFFFD and then I \uFFFD"
StringUtil.cleanUtf8mb4(input) shouldBe cleaned
}
}
10 changes: 10 additions & 0 deletions wom/src/main/scala/wdl/util/StringUtil.scala
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ import scala.annotation.tailrec
* WOMmy TaskDefinition. That should get straightened out. */
object StringUtil {
val Ws = Pattern.compile("[\\ \\t]+")
val utf8mb4Regex = "[\\x{10000}-\\x{FFFFF}]"
val utf8mb3Replacement = "\uFFFD" // This is the standard char for replacing

/**
* 1) Remove all leading newline chars
Expand Down Expand Up @@ -63,4 +65,12 @@ object StringUtil {

start(0)
}

/**
* Remove all utf8mb4 exclusive characters (emoji) from the given string.
* @param in String to clean
* @return String with all utf8mb4 exclusive characters removed
*/
def cleanUtf8mb4(in: String): String =
in.replaceAll(utf8mb4Regex, utf8mb3Replacement)
}
Loading