-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#97 Aggregate control type strategy #107
Conversation
…o needs testing, cleanup & documentation update MeasurementProcessor split into object/class to offer generic processing methods to be reusable.
…gression guard)
…al only-default `cmBuilder.calculateMeasurement` removed
atum/src/main/scala/za/co/absa/atum/utils/controlmeasure/ControlMeasureBuilder.scala
Show resolved
Hide resolved
atum/src/main/scala/za/co/absa/atum/core/MeasurementProcessor.scala
Outdated
Show resolved
Hide resolved
case AbsAggregatedTotal => | ||
(ds: Dataset[Row]) => { | ||
val aggCol = sum(abs(col(valueColumnName))) | ||
aggregateColumn(ds, controlCol, aggCol) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we validate for proper field type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do -- in ControlMeasureBuilder.deriveControlType()
. There, if Specific
strategy is used and controlType == AggregatedTotal || controlType == AbsAggregatedTotal
, the column type is checked to be numeric. We issue a warning in such a case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it is not too high in the hierarchy
Also, is only warning enough? Do we know what happens if it is misused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By not too high in the hierarchy, you mean that the warning should not occur when calculateMeasurement()
is called and the controlType gets derived then, but right when the user attempts to set it? It can be done, it is just a bit unfortunate that since there are multiple API options for this, there may need to be multiple places to perform the check (at least in private def withAggregateColumn
and private def withAggregateColumnsDirectly
) -- but sure, it would make sense I guess.
As for the warning vs error, in that case, I am slightly siding with a warning, but I let myself be persuaded. I can imagine weird scenarios where perhaps some obscure experimental usage might make sense on non-numeric fields, too (Binary, TimeStamp, ...), so I don't want to restrict it right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved the check "higher", now it checks the datatype vs controlType numericity on the builder methods (.withAggregateColumn(s)
). I left it as a warning for now.
atum/src/main/scala/za/co/absa/atum/utils/controlmeasure/ControlMeasureBuilder.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thank you
Note: a small readme update follows in #110 |
* #88: Add some files to configure the project (#89) * #88: Add some files configure the project * Git configuration * Scalastyle support * Ensured no Scalastyle errors * Added CRLF for Windows *.bat and *.cmd files to .editorconfig * Added Spark 3.1 build into the `build-all.sh` script * Created Windows `build-all.cmd` script * Upgrade from Spark 3.1.1 to 3.1.2 (fixes several issues of the previous version) * `--no-transfer-progress` added to build.yml * #97 Aggregate control type strategy (#107) * #97 AggregateControlTypeStrategy suggested API for ControlMeasureBuilder usage * #97 ControlMeasureBuilder.withAggregateColumn(s) implementations. Todo needs testing, cleanup & documentation update MeasurementProcessor split into object/class to offer generic processing methods to be reusable. * #97 ControlMeasureBuilder.withAggregateColumn(s) unit tests added (regression guard) * #97 ControlMeasureBuilder.withAggregateColumn(s) in README.md, original only-default `cmBuilder.calculateMeasurement` removed * [maven-release-plugin] prepare release v3.6.0 * [maven-release-plugin] prepare for next development iteration * #97 readme update - ControlMeasureBuilder API (#110) * #97 readme update (related to #97, too) * #97 maven central version badge added * Feature/113 info permissions config (#114) * #113 atum info file permissions for hdfs loaded from `atum.hdfs.info.file.permissions` config value - tests use MiniDfsCluster to assert controlled correct behavior - test update (custom MiniDfsCluster with umask 000 allows max permissions) - HdfsFileUtils.DefaultFilePermissions is now publicly exposed; the user is expected to call compose the default and configured value it on his own by e.g.: `HdfsFileUtils.getInfoFilePermissionsFromConfig().getOrElse(HdfsFileUtils.DefaultFilePermissions)` * #77 Fix parameter handling bug in CreateInfoFileToolCSV (#78) * #121 sbt cross comptilation * #121 multiversion build (scala, spark, json4s). * #121 hadoop3 used for Spark3/Scala2.12 * #121 sbt github autobuild * #125 publish, pgp/gpg plugin added; sbt-sonatype howto referenced; model, parent prefixed by `atum-` to conform to the mvn publish, cleanup * Upgrade dependencies and remove MiniDFSCluster * GH Action fix * Remove pom.xml files * Add licence header and header check * Fix examples Co-authored-by: David Benedeki <[email protected]> Co-authored-by: Daniel K <[email protected]> Co-authored-by: Jan Scherbaum <[email protected]>
This PR directly addresses the feature request in #97. In summary, the
ControlMeasurementBuilder
now has a new API foraggregateColumns
setup (in-depth described in the README.md, too):Some implementation details:
Default
strategyControlMeasurementBuilder
master
and targeting Atum 3.x - deliberately