Exposing Helpful Anomaly Detection Metadata from Anomaly Strategies (ie Anomaly Check Range/Thresholds) through backwards compatible function #593

arsenalgunnershubert777 · 2024-11-08T22:50:28Z

Description of changes:

This PR adds functionality to expose anomaly detection metadata for anomaly checks.

This PR follows up on a legacy PR and addresses the feedback given there by making the anomaly detection metadata functionality backwards compatible.

AnomalyDetectionExtendedResult is a newly exposed field that contains metadata for anomaly check runs. Users can read these details from ConstraintResult.
The extended results are available when the check is added with the new addAnomalyCheckWithExtendedResults method in the verification suite builder (instead of addAnomalyCheck).
The original addAnomalyCheck behaves as before, although it now internally uses the new detectWithExtendedResults function (see the point about that below).
The following changes make this happen:
- A new AnomalyDetectionDataPoint class is added
  - This class has the following fields:
    - dataMetricValue: The metric value that is the data point.
    - anomalyMetricValue: The metric value that the anomaly detection strategy checks, which isn't always equal to the dataMetricValue.
    - anomalyCheckRange: The range of bounds used in the anomaly check; the anomalyMetricValue is compared to this range.
    - isAnomaly: True if the anomalyMetricValue is outside the anomalyCheckRange.
    - confidence: The confidence of the anomaly detection.
    - detail: An optional detail message.
  - AnomalyDetectionExtendedResult is currently just a wrapper for this AnomalyDetectionDataPoint class.
- All the anomaly detection strategies extend a new trait called AnomalyDetectionStrategyWithExtendedResults
  - For example, the batch normal strategy implements the new detectWithExtendedResults function from this trait.
- The regular detect function now calls the new detectWithExtendedResults function and maps each AnomalyDetectionDataPoint to the original Anomaly class, since both use the same underlying calculations. If it is preferred to keep them independent, I can separate them.
- Using the detectWithExtendedResults function, all the anomaly strategies output, via ExtendedDetectionResults, an AnomalyDetectionDataPoint with the extra anomaly details for every data point in the search interval
- The last AnomalyDetectionDataPoint, which corresponds to the new metric being checked, is passed along through the following:
  - The new isNewestPointNonAnomalousWithExtendedResults function
  - The new Check.isNewestPointNonAnomalousWithExtendedResults function called by the above
  - The new getNewestPointAnomalyResults function called by the above, which returns a new kind of anomaly assertion function
  - This new anomaly assertion function returns a new AnomalyDetectionAssertionResult.
  - The new AnomalyDetectionAssertionResult contains both an assertion boolean and the AnomalyDetectionExtendedResult wrapping the newest data point. The assertion boolean is determined from that data point's isAnomaly field.
    - This differs from the previous anomaly assertion function, which just returns a boolean, because it allows the anomaly detection data point to be passed through.
  - This new anomaly assertion function uses a new AnomalyExtendedResultsConstraint class to handle the assertion function.
  - In the future, if the anomaly check covers multiple data points instead of just one, the AnomalyDetectionExtendedResult can gain another field and be adjusted accordingly.
- The anomaly detection metadata is added to the ConstraintResult class as an optional field
  - In the future this could be moved into a separate class that inherits from a common trait, since the AnomalyDetectionMetadata does not pertain to non-anomaly checks, but that would take a lot more refactoring.
- Tests are updated to reflect the above changes
- Documentation, with links to a runnable example, is added in a new anomaly detection with extended results page that is linked from the existing anomaly detection page
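
To make the flow above easier to follow, here is a minimal, self-contained Scala sketch of how the pieces could fit together. It is not the code in this PR: the field types, the Bound and BoundedRange helpers, the toy FixedRangeStrategy, the hasNoAnomaly field name, and the assertNewestPoint helper are all assumptions made for illustration. Only the class, trait, and field names listed in the description are taken from it.

```scala
// Illustrative sketch only: types and signatures are assumptions modelled on
// the PR description, not the library's actual code.

// Hypothetical bound: both the value and its inclusivity must be supplied.
final case class Bound(value: Double, inclusive: Boolean)

// Hypothetical range type standing in for anomalyCheckRange.
final case class BoundedRange(lowerBound: Bound, upperBound: Bound) {
  def contains(x: Double): Boolean = {
    val aboveLower =
      if (lowerBound.inclusive) x >= lowerBound.value else x > lowerBound.value
    val belowUpper =
      if (upperBound.inclusive) x <= upperBound.value else x < upperBound.value
    aboveLower && belowUpper
  }
}

// Fields as listed in the description; the types are assumed.
final case class AnomalyDetectionDataPoint(
    dataMetricValue: Double,
    anomalyMetricValue: Double,
    anomalyCheckRange: BoundedRange,
    isAnomaly: Boolean,
    confidence: Double,
    detail: Option[String] = None)

// Currently just a wrapper around a single data point.
final case class AnomalyDetectionExtendedResult(
    anomalyDetectionDataPoint: AnomalyDetectionDataPoint)

// Assertion boolean plus the metadata for the newest point.
final case class AnomalyDetectionAssertionResult(
    hasNoAnomaly: Boolean,
    anomalyDetectionExtendedResult: AnomalyDetectionExtendedResult)

// Simplified stand-in for the legacy Anomaly class.
final case class Anomaly(
    value: Option[Double],
    confidence: Double,
    detail: Option[String] = None)

trait AnomalyDetectionStrategy {
  def detect(
      dataSeries: Vector[Double],
      searchInterval: (Int, Int)): Seq[(Int, Anomaly)]
}

trait AnomalyDetectionStrategyWithExtendedResults {
  def detectWithExtendedResults(
      dataSeries: Vector[Double],
      searchInterval: (Int, Int)): Seq[(Int, AnomalyDetectionDataPoint)]
}

// Toy strategy (hypothetical): flags values outside a fixed inclusive range.
final case class FixedRangeStrategy(lower: Double, upper: Double)
    extends AnomalyDetectionStrategy
    with AnomalyDetectionStrategyWithExtendedResults {

  private val range =
    BoundedRange(Bound(lower, inclusive = true), Bound(upper, inclusive = true))

  // Emits one data point per index in the half-open search interval.
  override def detectWithExtendedResults(
      dataSeries: Vector[Double],
      searchInterval: (Int, Int)): Seq[(Int, AnomalyDetectionDataPoint)] = {
    val (start, end) = searchInterval
    (start until end).map { i =>
      val v = dataSeries(i)
      i -> AnomalyDetectionDataPoint(v, v, range, !range.contains(v), 1.0)
    }
  }

  // The legacy function reuses the extended results and keeps only anomalies.
  override def detect(
      dataSeries: Vector[Double],
      searchInterval: (Int, Int)): Seq[(Int, Anomaly)] =
    detectWithExtendedResults(dataSeries, searchInterval).collect {
      case (i, p) if p.isAnomaly =>
        i -> Anomaly(Some(p.dataMetricValue), p.confidence, p.detail)
    }
}

object ExtendedResultsSketch {
  // Hypothetical helper: checks only the newest point and passes its
  // metadata through alongside the assertion boolean.
  def assertNewestPoint(
      strategy: AnomalyDetectionStrategyWithExtendedResults,
      history: Vector[Double],
      newest: Double): AnomalyDetectionAssertionResult = {
    val series = history :+ newest
    val interval = (series.length - 1, series.length)
    val (_, point) = strategy.detectWithExtendedResults(series, interval).last
    AnomalyDetectionAssertionResult(
      !point.isAnomaly,
      AnomalyDetectionExtendedResult(point))
  }
}
```

With this sketch, calling ExtendedResultsSketch.assertNewestPoint(FixedRangeStrategy(0.0, 10.0), Vector(1.0, 2.0), 42.0) would report an anomaly while still carrying the range that was checked, which is the kind of metadata a plain boolean assertion cannot pass through.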

Let me know if anyone has any questions/feedback, thanks!

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…mpleteness (awslabs#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition

…um (awslabs#535) * Address comments on PR awslabs#532 * Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers * Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic

…slabs#543) * [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.

…aluation (awslabs#547) * [MinLength/MaxLength] Apply filtered row behavior at the row level evaluation - For certain scenarios, the filtered row behavior for MinLength and MaxLength was not working correctly. - For example, when using both minLength and maxLength constraints in a single check, and with both using == <value> as an assertion. This was resulting in the row level outcome of the filtered rows to be false. This was because we were replacing values for filtered rows for Min to MaxValue and for Max to MinValue. But a number could not equal both at the same time. - Updated the logic of the row level assertion to MinLength/MaxLength to match what was done for Min/Max.

- The satisfies constraint was incorrectly using the provided assertion to evaluate the row level outcomes. The assertion should only be used to evaluate the final outcome. - As part of this change, we have updated the row level results to return a true/false. The cast to an integer happens as part of the aggregation result. - Added a test to verify the row level results using checks made up of different assertions.

* Fix flaky KLL test * Move CustomSql state to CustomSql analyzer * Implement new Analyzer to count columns * Improve documentation, remove unused parameter, replace if/else with map --------- Co-authored-by: Yannis Mentekidis <[email protected]>

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Add ConfidenceIntervalStrategy * Add Separate Wilson and Wald Interval Test * Add License information, Fix formatting * Add License information * formatting fix * Update documentation * Make WaldInterval the default strategy for now * Formatting import to per line * Separate group import to per line import

rdsharma26 · 2024-11-13T15:45:49Z

@arsenalgunnershubert777 Thanks for the PR! Please allow for additional time to review the PR given the amount of changes. In the meantime, would it be possible to rebase your branch off of latest master so that it reduces the number of commits in the PR?

arsenalgunnershubert777 · 2024-11-13T17:17:45Z

Hey @rdsharma26, thanks for reaching out. I'm running into some issues with the conflicts (and I could also improve my understanding of Git).
So for now I may just create a clean branch, apply the changes on top of it manually, and open a new PR, so there's just 1 commit.

rdsharma26 · 2024-11-13T17:35:50Z

Thanks @arsenalgunnershubert777
After syncing your forked repository with the main repository, you could try:

git checkout master
git pull
git checkout exposeAnomalyThresholdBackwardsCompatible
git rebase master
# Rebasing rewrites the branch history, so a plain push would be rejected
git push --force-with-lease origin exposeAnomalyThresholdBackwardsCompatible

arsenalgunnershubert777 · 2024-11-13T17:41:33Z

@rdsharma26 got it. Yeah, when I run that I get more conflicts, and I notice some classes have reverted to previous states.

arsenalgunnershubert777 and others added 24 commits September 20, 2024 20:20

added more tests to the anomaly detection with extended results changes

5338fe4

Add Spark 3.5 support (awslabs#514)

6040cef

* Add Spark 3.5 support * Replace with DataTypeUtils.fromAttributes * Remove unintended new line

fix merge conflicts

02ed720

Skip SparkTableMetricsRepositoryTest iceberg test when SupportsRowLev…

185ce01

…elOperations is not available (awslabs#536)

Add analyzerOption to add filteredRowOutcome for isPrimaryKey Check (a…

efeec97

…wslabs#537) * Add analyzerOption to add filteredRowOutcome for isPrimaryKey Check * Add analyzerOption to add filteredRowOutcome for hasUniqueValueRatio Check

Fix bug in MinLength and MaxLength analyzers where given the NullBeha…

9fa5096

…vior.EmptyString option, the where filter wasn't properly applied (awslabs#538)

fix merge conflicts

c2e862f

New analyzer, RatioOfSums (awslabs#552)

b69f2b8

* Added RatioOfSums analyzer and tests * Unit test for divide by zero and code cleanup. * More detailed Scaladoc * Fixed docs to include Double.NegativeInfinity * Add copyright to new file

Update breeze to match spark 3.5 breeze version (awslabs#545)

a4a8aa6

Configurable RetainCompletenessRule (awslabs#564)

572d776

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

Optional specification of instance name in CustomSQL analyzer metric. (…

dc9ba7e

…awslabs#569) Co-authored-by: Tyler Mcdaniel <[email protected]>

CustomAggregator (awslabs#572)

ee26d1c

* Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]>

fix typo (awslabs#574)

97f7a3e

Fix performance of building row-level results (awslabs#577)

9d92d94

* Generate row-level results with withColumns Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns. * Add back UNIQUENESS_ID

fix merge conflicts

0f81982

updating anomaly check bounds to not have defaults and require inputs…

fdebce5

… for the bound value and isThresholdInclusive, also adding anomaly detection with extended results README, and adding anomaly detection test with 2 anomaly checks on the same suite

add accidentally removed import

5da25c4

arsenalgunnershubert777 mentioned this pull request Nov 8, 2024

Exposing Helpful Anomaly Detection Metadata from Anomaly Strategies (ie Anomaly Thresholds) #525

Open

update readme to be more clear about the anomalyMetricValue

198a41f
