Feature: Add Row Level Result Treatment Options for Uniqueness and Completeness #532

eycho-am · 2024-02-13T15:31:09Z

Issue #, if available: #530

Description of changes:
This PR adds the option FilteredRow to AnalyzerOptions, which defines how filtered rows are labeled when retrieving row-level results. The two options are True and Null; the default is True.

This PR defines the behavior for the Completeness and Uniqueness analyzers; other analyzers will be updated in future PRs.
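To make the two labeling options concrete, here is a minimal plain-Scala sketch that uses neither Spark nor deequ. `FilteredRowSketch`, `completenessRowLevel`, and the index-based `keep` filter are hypothetical names introduced only for illustration; they model how a row excluded by a filter could be reported as true or as null in row-level results.

```scala
// Hypothetical stand-in for the FilteredRow option described above (not the deequ type).
object FilteredRowSketch extends Enumeration {
  val TRUE, NULL = Value
}

// Row-level completeness for one column: Some(true/false) for rows that pass the
// filter, and either Some(true) or None (standing in for SQL NULL) for filtered rows.
def completenessRowLevel(
    values: Seq[Option[String]],
    keep: Int => Boolean, // hypothetical filter over the row index
    treatment: FilteredRowSketch.Value = FilteredRowSketch.TRUE
): Seq[Option[Boolean]] =
  values.zipWithIndex.map { case (value, idx) =>
    if (keep(idx)) Some(value.isDefined)
    else if (treatment == FilteredRowSketch.TRUE) Some(true)
    else None
  }

val column = Seq(Some("a"), None, Some("c"), None)
val keepFirstTwo: Int => Boolean = _ < 2

// Default treatment: filtered rows are labeled true.
val asTrue = completenessRowLevel(column, keepFirstTwo)
// Null treatment: filtered rows carry no result at all.
val asNull = completenessRowLevel(column, keepFirstTwo, FilteredRowSketch.NULL)
```

With the default, a filtered row is indistinguishable from a passing row; with the null treatment, a consumer can tell that the row was never evaluated.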

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

src/main/scala/com/amazon/deequ/VerificationRunBuilder.scala

src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala

src/main/scala/com/amazon/deequ/utilities/RowLevelFilterTreatement.scala

…-level results

…rows as null for row-level results

…dition

…ed rows will be labeled (default True)

…able instead of extending, create RowLevelAnalyzer trait

…lterTreatment trait

rdsharma26 · 2024-02-15T19:04:44Z

src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala

@@ -255,12 +256,18 @@ case class NumMatchesAndCount(numMatches: Long, count: Long, override val fullCo
  }
 }

-case class AnalyzerOptions(nullBehavior: NullBehavior = NullBehavior.Ignore)
+case class AnalyzerOptions(nullBehavior: NullBehavior = NullBehavior.Ignore,
+                           filteredRow: FilteredRow = FilteredRow.TRUE)


How about filteredRowOutcome or filteredRowEvaluationStatus. AnalyzerOptions is a public facing API, and filteredRow could be confusing for customers.

rdsharma26 · 2024-02-15T19:08:00Z

src/main/scala/com/amazon/deequ/VerificationRunBuilder.scala

@@ -25,7 +25,7 @@ import com.amazon.deequ.repository._
 import org.apache.spark.sql.{DataFrame, SparkSession}

 /** A class to build a VerificationRun using a fluent API */
-class VerificationRunBuilder(val data: DataFrame) {
+class VerificationRunBuilder(val data: DataFrame)  {


nit: excess whitespace before the opening brace.

rdsharma26 · 2024-02-15T19:14:27Z

src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala

+  : Column = {
+    conditionColumn
+      .map { condition => {
+        when(not(condition), expr(filterTreatment)).when(condition, selection)


nit: we can remove the { after => and its enclosing }

rdsharma26 · 2024-02-15T19:35:16Z

src/main/scala/com/amazon/deequ/analyzers/Completeness.scala

@@ -51,4 +53,16 @@ case class Completeness(column: String, where: Option[String] = None) extends

  @VisibleForTesting // required by some tests that compare analyzer results to an expected state
  private[deequ] def criterion: Column = conditionalSelection(column, where).isNotNull
+
+  @VisibleForTesting


Do we need this annotation? The method is accessible to classes in com.amazon.deequ which the tests are under.

rdsharma26 · 2024-02-15T19:42:21Z

src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala

+  : Column = {
+    conditionColumn
+      .map { condition => {
+        when(not(condition), expr(filterTreatment)).when(condition, selection)


Can we delegate the expr(filterTreatment) to the parameter of the method? We can update the type of filterTreatment to a type of FilteredRow. The expressions for each type of FilteredRow enumerations can sit inside FilteredRow itself. Right now, we have a .toString which breaks the connection between FilteredRow and this method. Ideally, we want to keep that connection to aid in refactoring and general readability of the code.
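One way to read this suggestion, sketched in plain Scala without Spark: each treatment carries its own SQL expression, and the helper accepts the typed value rather than a string. `FilteredRowTreatment`, `LabelAsTrue`, `LabelAsNull`, and `conditionalSelectionSql` are hypothetical names, and the fallback to the unfiltered selection when no condition is given is an assumption. The helper builds the CASE WHEN text that the when/when chain in the diff would correspond to.

```scala
// Hypothetical redesign: each treatment owns the SQL expression used for filtered rows,
// so callers pass the typed value instead of a String produced by toString.
sealed trait FilteredRowTreatment {
  def sqlExpression: String
}
case object LabelAsTrue extends FilteredRowTreatment {
  val sqlExpression: String = "TRUE"
}
case object LabelAsNull extends FilteredRowTreatment {
  val sqlExpression: String = "NULL"
}

// Builds the CASE WHEN text for a selection guarded by an optional filter condition.
// Assumption: without a condition, the selection is returned unchanged.
def conditionalSelectionSql(
    selection: String,
    where: Option[String],
    treatment: FilteredRowTreatment
): String =
  where
    .map(condition =>
      s"CASE WHEN NOT ($condition) THEN ${treatment.sqlExpression} " +
        s"WHEN ($condition) THEN $selection END")
    .getOrElse(selection)
```

Because the parameter is a sealed type, renaming or adding a treatment is caught by the compiler instead of surfacing as a malformed expression string at runtime.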

rdsharma26 · 2024-02-15T20:18:33Z

src/main/scala/com/amazon/deequ/analyzers/UniqueValueRatio.scala

+  private def getRowLevelFilterTreatment: FilteredRow = {
+    analyzerOptions
+      .map { options => options.filteredRow }
+      .getOrElse(FilteredRow.TRUE)
+  }


nit: This is repeated in a few places, so could go into the base class.
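A rough sketch of hoisting the helper into a shared trait, in plain Scala. `RowTreatment`, `OptionsSketch`, `FilterTreatmentSupport`, and the two analyzer classes are hypothetical stand-ins, not the deequ types; only the map/getOrElse fallback mirrors the diff above.

```scala
// Hypothetical stand-ins for FilteredRow and AnalyzerOptions, kept minimal for illustration.
object RowTreatment extends Enumeration {
  val TRUE, NULL = Value
}
case class OptionsSketch(filteredRow: RowTreatment.Value = RowTreatment.TRUE)

// Shared base trait: the fallback to TRUE is written once instead of in every analyzer.
trait FilterTreatmentSupport {
  def analyzerOptions: Option[OptionsSketch]

  protected def rowLevelFilterTreatment: RowTreatment.Value =
    analyzerOptions.map(_.filteredRow).getOrElse(RowTreatment.TRUE)
}

// Two hypothetical analyzers that inherit the helper rather than redefining it.
case class UniquenessSketch(analyzerOptions: Option[OptionsSketch] = None)
    extends FilterTreatmentSupport {
  def treatment: RowTreatment.Value = rowLevelFilterTreatment
}
case class UniqueValueRatioSketch(analyzerOptions: Option[OptionsSketch] = None)
    extends FilterTreatmentSupport {
  def treatment: RowTreatment.Value = rowLevelFilterTreatment
}
```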

rdsharma26 · 2024-02-15T20:21:27Z

src/main/scala/com/amazon/deequ/analyzers/UniqueValueRatio.scala

+      rowLevelColumn => {
+        conditionColumn.map {
+          condition => {
+            when(not(condition), expr(getRowLevelFilterTreatment.toString))


Same comment for expr(getRowLevelFilterTreatment.toString) as above

rdsharma26 · 2024-02-15T20:25:58Z

src/main/scala/com/amazon/deequ/checks/Check.scala

+  def hasUniqueness(column: String, assertion: Double => Boolean, hint: Option[String],
+                    analyzerOptions: Option[AnalyzerOptions])
+  : CheckWithLastConstraintFilterable = {
+    hasUniqueness(Seq(column), assertion, hint, analyzerOptions)
+  }


Can't we combine all the hasUniqueness into one method and update the body of the method based on the existence of parameters?
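A sketch of what a single entry point with default arguments might look like, in plain Scala. `CheckSketch`, `OptionsSketch`, and `UniquenessConstraint` are hypothetical; the real method returns CheckWithLastConstraintFilterable and registers a constraint, which is not modeled here.

```scala
// Hypothetical stand-ins; the real Check API returns CheckWithLastConstraintFilterable.
case class OptionsSketch(filteredRowLabel: String = "TRUE")
case class UniquenessConstraint(
    columns: Seq[String],
    assertion: Double => Boolean,
    hint: Option[String],
    analyzerOptions: Option[OptionsSketch])

class CheckSketch {
  // One entry point: optional arguments default to None instead of separate overloads.
  def hasUniqueness(
      columns: Seq[String],
      assertion: Double => Boolean,
      hint: Option[String] = None,
      analyzerOptions: Option[OptionsSketch] = None
  ): UniquenessConstraint =
    UniquenessConstraint(columns, assertion, hint, analyzerOptions)
}

val check = new CheckSketch
val isFullyUnique: Double => Boolean = _ == 1.0

// Callers supply only what they need; named arguments skip over the hint.
val basic = check.hasUniqueness(Seq("id"), isFullyUnique)
val withOptions =
  check.hasUniqueness(Seq("id"), isFullyUnique, analyzerOptions = Some(OptionsSketch("NULL")))
```

One caveat: Scala allows default arguments on only one overloaded alternative of a method, so a single-column String variant would either remain a separate overload without defaults or callers would pass Seq(column). Collapsing existing overloads may also affect source and binary compatibility for current callers, so this is a trade-off rather than a pure simplification.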

rdsharma26 · 2024-02-15T20:27:10Z

src/test/scala/com/amazon/deequ/analyzers/runners/AnalyzerContextTest.scala

+    //    assert(SimpleResultSerde.deserialize(jsonA) ==
+    //      SimpleResultSerde.deserialize(jsonB))


Can we remove these comments here and in other files below?

rdsharma26 · 2024-02-15T20:28:15Z

src/test/scala/com/amazon/deequ/analyzers/runners/AnalysisRunnerTests.scala

+        separateResults.asInstanceOf[Set[DoubleMetric]].foreach( result => {
+            assert(runnerResults.toString.contains(result.toString))
+          }
+          )
      }


This block of code has some excess whitespace and misaligned brackets. In a future PR, can we add rules for these to Scalastyle so that the build fails on such violations?

…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results
* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results
* Adjustments for modifying the calculate method to take in a filterCondition
* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)
* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait
* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait
* Modify computeStateFrom to take in optional filterCondition