Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WX-1410] Sanitize 4 byte UTF-8 characters before inserting into METADATA_ENTRY #7414

Merged
merged 9 commits into from
May 2, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions core/src/main/resources/reference.conf
Original file line number Diff line number Diff line change
Expand Up @@ -540,6 +540,7 @@ services {
config {
# See cromwell.examples.conf for details on settings one can use here as they depend on the implementation
# being used.
metadata-keys-to-sanitize-utf8mb4: []
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
}
}
Instrumentation {
Expand Down
2 changes: 2 additions & 0 deletions core/src/main/scala/cromwell/core/WorkflowMetadataKeys.scala
Original file line number Diff line number Diff line change
Expand Up @@ -37,4 +37,6 @@ object WorkflowMetadataKeys {
val Labels = "labels"
val MetadataArchiveStatus = "metadataArchiveStatus"
val Message = "message"

val CommandLine = "commandLine"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I added that in a previous iteration -- its been removed.

}
5 changes: 5 additions & 0 deletions cromwell.example.backends/cromwell.examples.conf
Original file line number Diff line number Diff line change
Expand Up @@ -498,6 +498,11 @@ services {
# # count against this limit.
# metadata-read-row-number-safety-threshold = 1000000
#
# # Remove any UTF-8 mb4 (4 byte) characters from metadata keys in the list.
# # These characters (namely emojis) will cause metadata writing to fail in database collations
# # that do not support 4 byte UTF-8 characters.
# metadata-keys-to-sanitize-utf8mb4 = ["submittedFiles:workflow", "commandLine"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment!

#
# metadata-write-statistics {
# # Not strictly necessary since the 'metadata-write-statistics' section itself is enough for statistics to be recorded.
# # However, this can be set to 'false' to disable statistics collection without deleting the section.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,17 @@

import akka.actor.{ActorLogging, ActorRef, Props}
import cats.data.NonEmptyVector
import com.typesafe.config.ConfigFactory
import cromwell.core.Dispatcher.ServiceDispatcher
import cromwell.core.Mailbox.PriorityMailbox
import cromwell.core.WorkflowId
import cromwell.core.instrumentation.InstrumentationPrefixes
import cromwell.services.metadata.MetadataEvent
import cromwell.services.metadata.{MetadataEvent, MetadataValue}
import cromwell.services.metadata.MetadataService._
import cromwell.services.metadata.impl.MetadataStatisticsRecorder.MetadataStatisticsRecorderSettings
import cromwell.services.{EnhancedBatchActor, MetadataServicesStore}
import net.ceedubs.ficus.Ficus._
import wdl.util.StringUtil

import scala.concurrent.duration._
import scala.util.{Failure, Success}
Expand All @@ -27,9 +30,14 @@
private val statsRecorder = MetadataStatisticsRecorder(metadataStatisticsRecorderSettings)

override def process(e: NonEmptyVector[MetadataWriteAction]) = instrumentedProcess {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small recommendation for performance reasons:

  • We can avoid re-loading/re-computing the metadataKeysToClean list every time process is called by assigning it to alazy val instead of a val.
    • It might be easiest to do this by making a function that returns the list for you (after reading stuff from config) and assigning that to a lazy val.
  • This lets us avoid calling a (potentially slow because of file i/o) ConfigFactory.load() function more than once.

https://leobenkel.com/2020/07/skb-scala-val-lazy-def/#:~:text=The%20keyword%20lazy%20allows%20a,as%20the%20value%20is%20declared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is resolved by passing in the metadataKeysToClean at actor creation time, see Janet's comment below

val config = ConfigFactory.load().getConfig("services.MetadataService.config")
val metadataKeysToCleanOption = config.as[Option[List[String]]]("metadata-keys-to-sanitize-utf8mb4")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be reading this wrong or misunderstanding how configs work - but to me it looks like the config value in reference.conf is a List[String] and not an Option[List[String]]. Would that be a more appropriate type to pass to as?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reason to use Option[List[String]] is to not crash if the config item is entirely missing. More or less important depending on whether we include the key with a default value in reference.conf, but good defensive programming either way.

val metadataKeysToClean = metadataKeysToCleanOption.getOrElse(List())
rsaperst marked this conversation as resolved.
Show resolved Hide resolved

val cleanedMetadataWriteActions = if (metadataKeysToClean.isEmpty) e else sanitizeInputs(e, metadataKeysToClean)
val empty = (Vector.empty[MetadataEvent], List.empty[(Iterable[MetadataEvent], ActorRef)])

val (putWithoutResponse, putWithResponse) = e.foldLeft(empty) {
val (putWithoutResponse, putWithResponse) = cleanedMetadataWriteActions.foldLeft(empty) {
case ((putEvents, putAndRespondEvents), action: PutMetadataAction) =>
(putEvents ++ action.events, putAndRespondEvents)
case ((putEvents, putAndRespondEvents), action: PutMetadataActionAndRespond) =>
Expand All @@ -46,7 +54,7 @@
case Success(_) =>
putWithResponse foreach { case (ev, replyTo) => replyTo ! MetadataWriteSuccess(ev) }
case Failure(cause) =>
val (outOfTries, stillGood) = e.toVector.partition(_.maxAttempts <= 1)
val (outOfTries, stillGood) = cleanedMetadataWriteActions.toVector.partition(_.maxAttempts <= 1)

Check warning on line 57 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L57

Added line #L57 was not covered by tests

handleOutOfTries(outOfTries, cause)
handleEventsToReconsider(stillGood)
Expand All @@ -55,6 +63,43 @@
dbAction.map(_ => allPutEvents.size)
}

private def sanitizeInputs(
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
metadataWriteActions: NonEmptyVector[MetadataWriteAction],
metadataKeysToClean: List[String]
): NonEmptyVector[MetadataWriteAction] =
metadataWriteActions.map { metadataWriteAction =>

Check warning on line 70 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L70

Added line #L70 was not covered by tests
val metadataEvents =
metadataWriteAction.events.map { event =>

Check warning on line 72 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L72

Added line #L72 was not covered by tests
if (
metadataKeysToClean.contains(event.key.key)
|| (event.key.key.contains(":")
&& metadataKeysToClean.contains(event.key.key.substring(0, event.key.key.indexOf(":"))))

Check warning on line 76 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L74-L76

Added lines #L74 - L76 were not covered by tests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're doing something a little more complicated than matching on the keys in the config - we're also matching on keys prefixed with them? Was that necessary for the usage we've seen?

It might be easier to think about if we did something like this, which will match the full key or a colon-delimited prefix of one or more items:

if (metadataKeysToClean.exists(kc => event.key.key == kc  || event.key.key.startsWith(s"${kc}:")) {
    ...
}

Either way, we should update the comment in cromwell.examples.conf to reflect the prefix behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this was from an old iteration of code, we only need to sanitize on exact matches. This has been fixed in the new PR.

) {
event.value match {
case Some(_) =>
val value = event.value.get
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
MetadataEvent(event.key.copy(),
Option(MetadataValue(StringUtil.cleanUtf8mb4(value.value), value.valueType)),
event.offsetDateTime

Check warning on line 83 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L78-L83

Added lines #L78 - L83 were not covered by tests
)
case _ => event

Check warning on line 85 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L85

Added line #L85 was not covered by tests
}
} else {
event

Check warning on line 88 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L88

Added line #L88 was not covered by tests
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
}
}
metadataWriteAction match {
case action: PutMetadataAction =>
PutMetadataAction(metadataEvents, action.maxAttempts)

Check warning on line 93 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L93

Added line #L93 was not covered by tests
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
case action: PutMetadataActionAndRespond =>
PutMetadataActionAndRespond(

Check warning on line 95 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L95

Added line #L95 was not covered by tests
metadataEvents,
action.replyTo,
action.maxAttempts

Check warning on line 98 in services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala

View check run for this annotation

Codecov / codecov/patch

services/src/main/scala/cromwell/services/metadata/impl/WriteMetadataActor.scala#L97-L98

Added lines #L97 - L98 were not covered by tests
)
}
}

private def countActionsByWorkflow(writeActions: Vector[MetadataWriteAction]): Map[WorkflowId, Int] =
writeActions.flatMap(_.events).groupBy(_.key.workflowId).map { case (k, v) => k -> v.size }

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import cromwell.core.ExecutionEvent
import cromwell.core.logging.JobLogger
import mouse.all._
import PipelinesUtilityConversions._
import wdl.util.StringUtil

import scala.language.postfixOps

Expand Down Expand Up @@ -67,7 +68,7 @@ trait PipelinesUtilityConversions {
// characters (typically emoji). Some databases have trouble storing these; replace them with the standard
// "unknown character" unicode symbol.
val name = Option(event.getContainerStopped) match {
case Some(_) => cleanUtf8mb4(event.getDescription)
case Some(_) => StringUtil.cleanUtf8mb4(event.getDescription)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the config to include this metadata in what we're sanitizing rather than cleaning it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup: in theory we could, but we would need to check every single executionEvent, of which there are many, because their keys are all like:

executionEvents[1173839490]:startTime
executionEvents[1173839490]:endTime
executionEvents[1173839490]:grouping
executionEvents[1173839490]:description

Leaving this comment in case anyone else has this idea, but I'm OK leaving this as-is for now. Would be great to add a comment explaining this context, though (we sanitize other metadata values elsewhere, but are handing this one differently).

We'll also need to keep in mind that we may need to do something different for Batch. I'll create a ticket in that epic.

case _ => event.getDescription
}

Expand Down Expand Up @@ -101,9 +102,4 @@ object PipelinesUtilityConversions {
None
}
}

lazy val utf8mb4Regex = "[\\x{10000}-\\x{FFFFF}]"
lazy val utf8mb3Replacement = "\uFFFD" // This is the standard char for replacing invalid/unknown unicode chars
def cleanUtf8mb4(in: String): String =
in.replaceAll(utf8mb4Regex, utf8mb3Replacement)
}

This file was deleted.

17 changes: 17 additions & 0 deletions wdl/model/draft2/src/test/scala/wdl/util/StringUtilSpec.scala
Original file line number Diff line number Diff line change
Expand Up @@ -70,4 +70,21 @@ class StringUtilSpec extends AnyFlatSpec with CromwellTimeoutSpec with Matchers
}
}

it should "not modify strings that contain only ascii characters" in {
rsaperst marked this conversation as resolved.
Show resolved Hide resolved
val input = "hi there!?"
StringUtil.cleanUtf8mb4(input) shouldBe input
}

it should "not modify strings with 3-byte unicode characters" in {
val input = "Here is my non-ascii character: \u1234 Do you like it?"
StringUtil.cleanUtf8mb4(input) shouldBe input
}

it should "replace 4-byte unicode characters" in {
val cry = new String(Character.toChars(Integer.parseInt("1F62D", 16)))
val barf = new String(Character.toChars(Integer.parseInt("1F92E", 16)))
val input = s"When I try to put an emoji in the database it $barf and then I $cry"
val cleaned = "When I try to put an emoji in the database it \uFFFD and then I \uFFFD"
StringUtil.cleanUtf8mb4(input) shouldBe cleaned
}
}
10 changes: 10 additions & 0 deletions wom/src/main/scala/wdl/util/StringUtil.scala
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ import scala.annotation.tailrec
* WOMmy TaskDefinition. That should get straightened out. */
object StringUtil {
val Ws = Pattern.compile("[\\ \\t]+")
val utf8mb4Regex = "[\\x{10000}-\\x{FFFFF}]"
val utf8mb3Replacement = "\uFFFD" // This is the standard char for replacing

/**
* 1) Remove all leading newline chars
Expand Down Expand Up @@ -63,4 +65,12 @@ object StringUtil {

start(0)
}

/**
* Remove all utf8mb4 exclusive characters (emoji) from the given string.
* @param in String to clean
* @return String with all utf8mb4 exclusive characters removed
*/
def cleanUtf8mb4(in: String): String =
in.replaceAll(utf8mb4Regex, utf8mb3Replacement)
}
Loading