Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add Row Level Result Treatment Options for Uniqueness and Completeness #532

Merged
merged 8 commits into from
Feb 15, 2024

Conversation

eycho-am
Copy link
Contributor

@eycho-am eycho-am commented Feb 13, 2024

Issue #, if available: #530

Description of changes:
This PR adds the option FileteredRow to AnalyzerOptions that defines how filtered rows will be labeled as when retrieving row level results. The two options are True and Null and this defaults to True.

This PR defines the behavior for the Completeness and Uniqueness analyzers, and will be updated for other analyzers in future PRs.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mentekid mentekid merged commit 5b818be into awslabs:master Feb 15, 2024
1 check passed
@@ -255,12 +256,18 @@ case class NumMatchesAndCount(numMatches: Long, count: Long, override val fullCo
}
}

case class AnalyzerOptions(nullBehavior: NullBehavior = NullBehavior.Ignore)
case class AnalyzerOptions(nullBehavior: NullBehavior = NullBehavior.Ignore,
filteredRow: FilteredRow = FilteredRow.TRUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about filteredRowOutcome or filteredRowEvaluationStatus. AnalyzerOptions is a public facing API, and filteredRow could be confusing for customers.

@@ -25,7 +25,7 @@ import com.amazon.deequ.repository._
import org.apache.spark.sql.{DataFrame, SparkSession}

/** A class to build a VerificationRun using a fluent API */
class VerificationRunBuilder(val data: DataFrame) {
class VerificationRunBuilder(val data: DataFrame) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: excess

: Column = {
conditionColumn
.map { condition => {
when(not(condition), expr(filterTreatment)).when(condition, selection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can remove the { after => and its enclosing }

@@ -51,4 +53,16 @@ case class Completeness(column: String, where: Option[String] = None) extends

@VisibleForTesting // required by some tests that compare analyzer results to an expected state
private[deequ] def criterion: Column = conditionalSelection(column, where).isNotNull

@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this annotation? The method is accessible to classes in com.amazon.deequ which the tests are under.

: Column = {
conditionColumn
.map { condition => {
when(not(condition), expr(filterTreatment)).when(condition, selection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delegate the expr(filterTreatment) to the parameter of the method? We can update the type of filterTreatment to a type of FilteredRow. The expressions for each type of FilteredRow enumerations can sit inside FilteredRow itself. Right now, we have a .toString which breaks the connection between FilteredRow and this method. Ideally, we want to keep that connection to aid in refactoring and general readability of the code.

Comment on lines +57 to +61
private def getRowLevelFilterTreatment: FilteredRow = {
analyzerOptions
.map { options => options.filteredRow }
.getOrElse(FilteredRow.TRUE)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is repeated in a few places, so could go into the base class.

rowLevelColumn => {
conditionColumn.map {
condition => {
when(not(condition), expr(getRowLevelFilterTreatment.toString))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for expr(getRowLevelFilterTreatment.toString) as above

Comment on lines +353 to +357
def hasUniqueness(column: String, assertion: Double => Boolean, hint: Option[String],
analyzerOptions: Option[AnalyzerOptions])
: CheckWithLastConstraintFilterable = {
hasUniqueness(Seq(column), assertion, hint, analyzerOptions)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we combine all the hasUniqueness into one method and update the body of the method based on the existence of parameters?

Comment on lines +149 to +150
// assert(SimpleResultSerde.deserialize(jsonA) ==
// SimpleResultSerde.deserialize(jsonB))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove these comments here and in other files below?

Comment on lines +207 to 211
separateResults.asInstanceOf[Set[DoubleMetric]].foreach( result => {
assert(runnerResults.toString.contains(result.toString))
}
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code has some excess whitespace and misaligned brackets. In a future PR, can we add these rules into scala-style such that the build fails otherwise?

eycho-am added a commit to eycho-am/deequ that referenced this pull request Feb 20, 2024
eycho-am added a commit that referenced this pull request Feb 21, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
eycho-am added a commit that referenced this pull request Feb 21, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
eycho-am added a commit that referenced this pull request Feb 21, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
eycho-am added a commit that referenced this pull request Feb 21, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
eycho-am added a commit that referenced this pull request Feb 22, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
rdsharma26 pushed a commit that referenced this pull request Apr 17, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
rdsharma26 pushed a commit that referenced this pull request Apr 17, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
rdsharma26 pushed a commit that referenced this pull request Apr 17, 2024
…mpleteness (#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
rdsharma26 pushed a commit that referenced this pull request Apr 17, 2024
…um (#535)

* Address comments on PR #532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024
…mpleteness (awslabs#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024
…um (awslabs#535)

* Address comments on PR awslabs#532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants