#97 Aggregate control type strategy #107

Merged
merged 6 commits into from
Aug 12, 2021

Collaborator

@dk1844 dk1844 commented Aug 5, 2021

This PR directly addresses the feature request in #97. In summary, the ControlMeasureBuilder now has a new API for setting up aggregate columns (also described in depth in the README.md):

import za.co.absa.atum.utils.controlmeasure.ControlMeasureBuilder
import ControlMeasureBuilder.ControlTypeStrategy.{Default, Specific}
import za.co.absa.atum.core.ControlType.{Count, DistinctCount, AggregatedTotal, AbsAggregatedTotal, HashCrc32}

// controlMeasureBuilder obtainable by ControlMeasureBuilder.forDf(df)
// with Default, the ControlType will be chosen based on the field type (AbsAggregatedTotal for numeric, HashCrc32 otherwise)
val updatedBuilder1 = controlMeasureBuilder.withAggregateColumns(Seq("col1", "col2")) // equivalent to .withAggregateColumns(Seq("col1", "col2"), Default)
val updatedBuilder2 = controlMeasureBuilder.withAggregateColumns(Seq("col1", "col2"), Specific(HashCrc32)) // all columns will use HashCrc32

val iterativelyUpdatedBuilder3 = controlMeasureBuilder
  .withAggregateColumn("col1") // equivalent to .withAggregateColumn("col1", Default). AbsAggregatedTotal used if col1 is numeric, HashCrc32 otherwise
  .withAggregateColumn("col2", Specific(DistinctCount)) // DistinctCount for this column's measurement

Some implementation details:

  • the original behavior is kept as the Default strategy
  • the original behavior duplicated code from MeasurementsProcessor, so MeasurementsProcessor was split into an object (with general processing methods) and a class (with support for processing based on existing measurements), and the general code is now reused by the ControlMeasureBuilder
  • unit tests have been added (existing unit tests show evidence of consistency between the original and the rework)
  • deliberately branched from master and targeting Atum 3.x
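
The Default and Specific strategies described above can be pictured with a small, self-contained sketch. It is not Atum's implementation: `FieldType`, `NumericField`, `OtherField` and `StrategySketch` are hypothetical stand-ins introduced for illustration, and only the mapping itself (AbsAggregatedTotal for numeric fields, HashCrc32 otherwise, or the explicitly requested type) is taken from the PR description.

```scala
// Illustrative stand-ins only; these are NOT Atum's real classes.
object StrategySketch {
  sealed trait FieldType
  case object NumericField extends FieldType
  case object OtherField extends FieldType

  sealed trait ControlType
  case object Count extends ControlType
  case object DistinctCount extends ControlType
  case object AggregatedTotal extends ControlType
  case object AbsAggregatedTotal extends ControlType
  case object HashCrc32 extends ControlType

  sealed trait ControlTypeStrategy
  case object Default extends ControlTypeStrategy
  final case class Specific(controlType: ControlType) extends ControlTypeStrategy

  // Default: choose by field type. Specific: always use the requested control type.
  def deriveControlType(fieldType: FieldType, strategy: ControlTypeStrategy): ControlType =
    strategy match {
      case Default =>
        fieldType match {
          case NumericField => AbsAggregatedTotal
          case OtherField   => HashCrc32
        }
      case Specific(controlType) => controlType
    }
}
```

Under these assumptions, `deriveControlType(NumericField, Default)` yields `AbsAggregatedTotal`, while `deriveControlType(NumericField, Specific(HashCrc32))` yields `HashCrc32` regardless of the field type.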

…o needs testing, cleanup & documentation update

MeasurementProcessor split into object/class to offer generic processing methods to be reusable.
…al only-default `cmBuilder.calculateMeasurement` removed
@dk1844 dk1844 marked this pull request as ready for review August 5, 2021 07:30
@dk1844 dk1844 linked an issue Aug 5, 2021 that may be closed by this pull request
@dk1844 dk1844 modified the milestone: 3.6.0 Aug 5, 2021
Comment on lines +50 to +54
case AbsAggregatedTotal =>
(ds: Dataset[Row]) => {
val aggCol = sum(abs(col(valueColumnName)))
aggregateColumn(ds, controlCol, aggCol)
}
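
To make the effect of `sum(abs(col(...)))` concrete, here is a plain-collection analogue that needs no Spark session. It is a sketch only: `AggregationSketch` and its method names are hypothetical, and treating AggregatedTotal as a plain signed sum is an assumption, since only the AbsAggregatedTotal branch is shown in the snippet above.

```scala
object AggregationSketch {
  // Analogue of sum(col): signed values can cancel each other out.
  // Assumption: this is how AggregatedTotal behaves; the diff above does not show it.
  def aggregatedTotal(values: Seq[BigDecimal]): BigDecimal = values.sum

  // Analogue of sum(abs(col)): every value contributes its magnitude.
  def absAggregatedTotal(values: Seq[BigDecimal]): BigDecimal = values.map(_.abs).sum
}
```

For the values 100.50, -100.50 and 20, the signed sum is 20 while the absolute sum is 221.00, so the absolute variant still reflects the two large records that cancel each other in the signed sum.
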
Collaborator

Do we validate for proper field type?

Collaborator Author

We do, in ControlMeasureBuilder.deriveControlType(). There, if the Specific strategy is used and controlType == AggregatedTotal || controlType == AbsAggregatedTotal, the column type is checked for being numeric, and we issue a warning if it is not.

Collaborator

I am not sure whether that is too high in the hierarchy.
Also, is a warning alone enough? Do we know what happens if it is misused?

Collaborator Author

By "not too high in the hierarchy", do you mean that the warning should be raised right when the user attempts to set the control type, rather than later, when calculateMeasurement() is called and the controlType gets derived? That can be done. It is just a bit unfortunate that, because there are multiple API options for this, the check may need to be performed in multiple places (at least in private def withAggregateColumn and private def withAggregateColumnsDirectly). But sure, it would make sense, I guess.

As for warning vs. error, I am slightly siding with a warning, but I am open to being persuaded. I can imagine unusual scenarios where some experimental usage on non-numeric fields (Binary, TimeStamp, ...) might make sense, so I don't want to restrict it right away.

Collaborator Author

I have moved the check "higher": the datatype vs. controlType numericity check now runs in the builder methods (.withAggregateColumn(s)). I left it as a warning for now.
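
A minimal sketch of what such a warn-but-do-not-fail check could look like is given below. It is not the code from this PR: `ValidationSketch`, `numericityWarning` and the `isNumeric` flag are hypothetical names, and the control types are redefined locally so that the example is self-contained.

```scala
// Illustrative stand-ins only; these are NOT Atum's real classes.
object ValidationSketch {
  sealed trait ControlType
  case object Count extends ControlType
  case object DistinctCount extends ControlType
  case object AggregatedTotal extends ControlType
  case object AbsAggregatedTotal extends ControlType
  case object HashCrc32 extends ControlType

  // Control types that are arithmetic and therefore expect a numeric column.
  private val numericOnly: Set[ControlType] = Set(AggregatedTotal, AbsAggregatedTotal)

  // Returns a warning message rather than throwing, so unusual usages stay possible.
  def numericityWarning(column: String, isNumeric: Boolean, controlType: ControlType): Option[String] =
    if (numericOnly.contains(controlType) && !isNumeric)
      Some(s"Column '$column' is not numeric, but control type $controlType expects a numeric column.")
    else
      None
}
```

A builder method could log the returned message, if any, and continue. Turning the warning into an error later would then only mean throwing where the message is currently logged.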

Collaborator

@Zejnilovic Zejnilovic left a comment

Looks great. Thank you

@dk1844 dk1844 merged commit 7362a73 into master Aug 12, 2021
@dk1844 dk1844 deleted the feature/97-aggregateControlTypeStrategy branch August 12, 2021 10:45
Collaborator Author

dk1844 commented Aug 13, 2021

Note: a small readme update follows in #110

Zejnilovic added a commit that referenced this pull request Apr 13, 2022
* #88: Add some files to configure the project (#89)
* #88: Add some files to configure the project
* Git configuration
* Scalastyle support
* Ensured no Scalastyle errors
* Added CRLF for Windows *.bat and *.cmd files to .editorconfig
* Added Spark 3.1 build into the `build-all.sh` script
* Created Windows `build-all.cmd` script
* Upgrade from Spark 3.1.1 to 3.1.2 (fixes several issues of the previous version)
* `--no-transfer-progress` added to build.yml
* #97 Aggregate control type strategy (#107)
* #97 AggregateControlTypeStrategy suggested API for ControlMeasureBuilder usage
* #97 ControlMeasureBuilder.withAggregateColumn(s) implementations. Todo needs testing, cleanup & documentation update
MeasurementProcessor split into object/class to offer generic processing methods to be reusable.
* #97 ControlMeasureBuilder.withAggregateColumn(s) unit tests added (regression guard)
* #97 ControlMeasureBuilder.withAggregateColumn(s) in README.md, original only-default `cmBuilder.calculateMeasurement` removed
* [maven-release-plugin] prepare release v3.6.0
* [maven-release-plugin] prepare for next development iteration
* #97 readme update - ControlMeasureBuilder API (#110)
* #97 readme update (related to #97, too)
* #97 maven central version badge added
* Feature/113 info permissions config (#114)
* #113 atum info file permissions for hdfs loaded from `atum.hdfs.info.file.permissions` config value
  - tests use MiniDfsCluster to assert controlled correct behavior
  - test update (custom MiniDfsCluster with umask 000 allows max permissions)
  - HdfsFileUtils.DefaultFilePermissions is now publicly exposed; the user is expected to compose the default and the configured value on their own, e.g.:
`HdfsFileUtils.getInfoFilePermissionsFromConfig().getOrElse(HdfsFileUtils.DefaultFilePermissions)`
* #77 Fix parameter handling bug in CreateInfoFileToolCSV (#78)

* #121 sbt cross compilation
* #121 multiversion build (scala, spark, json4s).
* #121 hadoop3 used for Spark3/Scala2.12
* #121 sbt github autobuild
* #125 publish, pgp/gpg plugin added; sbt-sonatype howto referenced; model, parent prefixed by `atum-` to conform to the mvn publish, cleanup
* Upgrade dependencies and remove MiniDFSCluster
* GH Action fix
* Remove pom.xml files
* Add licence header and header check
* Fix examples

Co-authored-by: David Benedeki <[email protected]>
Co-authored-by: Daniel K <[email protected]>
Co-authored-by: Jan Scherbaum <[email protected]>
Development

Successfully merging this pull request may close these issues.

Enable user to specify measurement controlType