Ensure correct metrics despite model failures on some CV folds #404
Conversation
…lculations correct
Codecov Report
@@ Coverage Diff @@
## master #404 +/- ##
==========================================
- Coverage 86.96% 86.95% -0.02%
==========================================
Files 337 337
Lines 11054 11060 +6
Branches 361 591 +230
==========================================
+ Hits 9613 9617 +4
- Misses 1441 1443 +2
Continue to review full report at Codecov.
log.info(s"Best set of parameters:\n${grid(bestIndex)}") | ||
require(folds.map(_.model.uid).toSet.size == 1) // Should be called only on instances of the same model | ||
val gridCounts = folds.map(_.grids.map(_ -> 1).toMap).reduce(_ + _) | ||
val maxFolds = gridCounts.maxBy(_._2)._2 |
For better readability, please replace _._2
in val maxFolds = gridCounts.maxBy(_._2)._2
with val (_, maxFolds) = gridCounts.maxBy { case (_, folds) => folds }
Same in a few operations below.
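A minimal, self-contained sketch of the suggested change; the gridCounts value and the count name here are illustrative, not taken from the PR:

// a hypothetical gridCounts value, standing in for the real one in the PR
val gridCounts: Map[String, Int] = Map("gridA" -> 3, "gridB" -> 2)

// positional-accessor version, as currently written
val maxFoldsPositional = gridCounts.maxBy(_._2)._2

// destructured version suggested above, arguably easier to read
val (_, maxFolds) = gridCounts.maxBy { case (_, count) => count }

assert(maxFoldsPositional == maxFolds) // both yield 3 here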
@tovbinm @gerashegalov I discovered some fun behavior using the algebird implicit for maps:
val test1 = Seq(Map("a" -> 1, "b" -> 1), Map("a" -> 1, "b" -> 1))
val test2 = Seq(Map("a" -> 0, "b" -> 0), Map("a" -> 0, "b" -> 0))
Think I may need to write out the reduce function...
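Presumably the surprise is that, when these maps are reduced with algebird's + for maps, entries whose combined value equals the value monoid's zero get dropped. A minimal sketch of what the comment appears to describe; the exact outputs are assumptions about algebird's default map monoid, not something stated in the PR:

import com.twitter.algebird.Operators._

val test1 = Seq(Map("a" -> 1, "b" -> 1), Map("a" -> 1, "b" -> 1))
val test2 = Seq(Map("a" -> 0, "b" -> 0), Map("a" -> 0, "b" -> 0))

// summing the non-zero maps behaves as expected
println(test1.reduce(_ + _)) // likely Map(a -> 2, b -> 2)

// summing the zero-valued maps can drop keys, because the default map
// monoid removes entries whose combined value is the value monoid's zero
println(test2.reduce(_ + _)) // likely Map() rather than Map(a -> 0, b -> 0)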
…eird aggregator behavior
else metrics.zipWithIndex.minBy(_._1)
log.info(s"Best set of parameters:\n${grid(bestIndex)}")
require(folds.map(_.model.uid).toSet.size == 1) // Should be called only on instances of the same model
val gridCounts = folds.map(_.grids.map(_ -> 1).toMap).reduce(_ + _) |
Is it the same as:
folds.flatMap(_.grids.map(_ -> 1)).sumByKey
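Assuming sumByKey here means summing the values grouped by key (as in algebird's MapAlgebra.sumByKey), the two formulations should agree. A plain-Scala sketch of the equivalence, with made-up grid names and a simplified fold representation:

// stand-in for folds: each element lists the param grids that succeeded in one CV fold
val foldGrids = Seq(Seq("gridA", "gridB"), Seq("gridA"), Seq("gridA", "gridB"))

// reduce-of-maps formulation, as in the PR, but with an explicit merge
val viaReduce = foldGrids.map(_.map(_ -> 1).toMap)
  .reduce((m1, m2) => (m1.keySet ++ m2.keySet).map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap)

// sumByKey-style formulation: flatten to (grid, 1) pairs and sum per key
val viaSumByKey = foldGrids.flatten.map(_ -> 1).groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

assert(viaReduce == viaSumByKey) // both: Map(gridA -> 3, gridB -> 2)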
val keys = m1.keySet.union(m2.keySet)
keys.map(k => k -> (m1.getOrElse(k, 0.0) + m2.getOrElse(k, 0.0))).toMap
}
.filterKeys(gridsIn.contains) |
Let us filter first so we have less to reduce; maybe:
val gridMetrics = folds.flatMap(f => f.grids.zip(f.metrics))
.collect { case (pm, met) if gridsIn.contains(pm) => (pm, met / maxFolds) }
.sumByKey
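A self-contained sketch of the filter-before-aggregate idea using plain Scala collections; the FoldResult type, fold contents, and metric values are made up, and sumByKey is approximated with groupBy plus a per-key sum:

// hypothetical per-fold results: the grids that ran and the metric each produced
case class FoldResult(grids: Seq[String], metrics: Seq[Double])

val folds = Seq(
  FoldResult(Seq("gridA", "gridB"), Seq(0.80, 0.70)),
  FoldResult(Seq("gridA"), Seq(0.82)),
  FoldResult(Seq("gridA", "gridB"), Seq(0.78, 0.72))
)
val gridsIn = Set("gridA") // grids that survived filtering (ran in every fold)
val maxFolds = 3           // number of folds each kept grid appeared in

// filter down to the grids of interest before aggregating, then average per grid
val gridMetrics = folds.flatMap(f => f.grids.zip(f.metrics))
  .collect { case (pm, met) if gridsIn.contains(pm) => (pm, met / maxFolds) }
  .groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
// gridMetrics: Map(gridA -> 0.8), the mean metric across the three folds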
@leahmcguire The Map behavior is as expected under monoid rules. In order to keep the map keys with 0 values, use a semigroup for the map values. Example:
import com.twitter.algebird._
import com.twitter.algebird.Operators._
implicit val longSemigroup = Semigroup.from[Long](_ + _)
implicit val mapLongMonoid = Monoid.mapMonoid[String, Long](longSemigroup)
// works!
(Map("a" -> 0L) + Map("b" -> 0L)) shouldBe Map("a" -> 0L, "b" -> 0L)
if (evaluator.isLargerBetter) metrics.zipWithIndex.maxBy(_._1)
else metrics.zipWithIndex.minBy(_._1)
log.info(s"Best set of parameters:\n${grid(bestIndex)}")
require(folds.map(_.model.uid).toSet.size == 1) // Should be called only on instances of the same model |
Good to be defensive, but my concern is that we call this in an iteration inside a private method that is completely within scope here, and by construction we already know that the folds are for the same model.
Sure, I can remove it and just put a description on the method.
LGTM, some comments
nice!
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)
Related issues
The GLM optimizer is not very stable, so runs will sometimes fail in some CV folds but not others. This was causing metrics and grids to be computed incorrectly, since we assumed that the same number of parameter grids ran successfully in each cross-validation fold.
Describe the proposed solution
Throw out grids which are not present in every CV run (since they are likely to fail in some modeling runs).
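A minimal plain-Scala sketch of the "keep only grids present in every fold" idea; the fold contents and grid names are made up, and the PR's actual types differ:

// hypothetical: which parameter grids completed successfully in each CV fold
val gridsPerFold: Seq[Set[String]] = Seq(
  Set("gridA", "gridB", "gridC"),
  Set("gridA", "gridC"),          // gridB failed in this fold
  Set("gridA", "gridB", "gridC")
)

// count how many folds each grid appears in, as the PR does
val gridCounts = gridsPerFold.flatMap(_.toSeq).groupBy(identity).map { case (g, occ) => g -> occ.size }
val maxFolds = gridCounts.values.max

// keep only grids that ran in every fold; their metrics can then be averaged safely
val keptGrids = gridCounts.collect { case (g, n) if n == maxFolds => g }.toSet
// keptGrids == Set("gridA", "gridC")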
Describe alternatives you've considered
Use grids paired with their metrics in a map to make sure that the correct metrics are combined across CV folds, and simply take the best metric.
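A sketch of this alternative, assuming it means keying metrics by parameter grid so only metrics for the same grid are ever combined, then picking the best; the grid names and metric values are illustrative:

// hypothetical per-fold (grid -> metric) results; a grid missing from a fold means it failed there
val foldMetrics: Seq[Map[String, Double]] = Seq(
  Map("gridA" -> 0.80, "gridB" -> 0.70),
  Map("gridA" -> 0.82),
  Map("gridA" -> 0.78, "gridB" -> 0.72)
)

// combine per grid, so metrics from different grids are never mixed up
val combined: Map[String, Double] = foldMetrics.flatten.groupBy(_._1)
  .map { case (grid, kvs) => grid -> kvs.map(_._2).sum / kvs.size } // mean metric per grid

// simply take the best metric (here assuming larger is better)
val (bestGrid, bestMetric) = combined.maxBy(_._2)
// bestGrid == "gridA" (mean metric ~= 0.80)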