
Ensure correct metrics despite model failures on some CV folds #404

Merged 17 commits from lm/cvFix into master, Sep 11, 2019

Conversation


@leahmcguire leahmcguire commented Sep 9, 2019

Related issues
GLM optimizer is not very stable so sometimes runs will fail in some CV folds but not others - this was causing metrics and grids to not be computed correctly since we assumed that the same number of parameter grids successfully run in each cross validation fold

Describe the proposed solution
throw out grids which are not in every cv run (since they are likely to fail in some modeling runs)

Describe alternatives you've considered
use grids with metrics in a map to make sure that the correct metrics are combined in CV and simply take the best metric

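The proposed fix, keeping only the parameter grids that completed in every fold, can be sketched in plain Scala. This is a simplified illustration with a hypothetical `FoldResult` type, not the actual `OpCrossValidation` code:

```scala
object GridFilterDemo {
  // Hypothetical simplified fold result: the parameter grids that succeeded in it
  case class FoldResult(grids: Seq[String])

  // Keep only grids that succeeded in every cross-validation fold; a grid that
  // failed in some folds would otherwise skew the averaged metrics
  def survivingGrids(folds: Seq[FoldResult]): Set[String] =
    folds.map(_.grids.toSet).reduce(_ intersect _)

  def main(args: Array[String]): Unit = {
    val folds = Seq(
      FoldResult(Seq("g1", "g2", "g3")),
      FoldResult(Seq("g1", "g3")), // g2 failed in this fold
      FoldResult(Seq("g1", "g2", "g3"))
    )
    println(survivingGrids(folds)) // only g1 and g3 survive
  }
}
```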


codecov bot commented Sep 9, 2019

Codecov Report

Merging #404 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #404      +/-   ##
==========================================
- Coverage   86.96%   86.95%   -0.02%     
==========================================
  Files         337      337              
  Lines       11054    11060       +6     
  Branches      361      591     +230     
==========================================
+ Hits         9613     9617       +4     
- Misses       1441     1443       +2
Impacted Files Coverage Δ
...orce/op/stages/impl/tuning/OpCrossValidation.scala 97.95% <100%> (+0.28%) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 85.71% <0%> (-4.09%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 95a77b1...6687a34.

@leahmcguire leahmcguire mentioned this pull request Sep 9, 2019
log.info(s"Best set of parameters:\n${grid(bestIndex)}")
require(folds.map(_.model.uid).toSet.size == 1) // Should be called only on instances of the same model
val gridCounts = folds.map(_.grids.map(_ -> 1).toMap).reduce(_ + _)
val maxFolds = gridCounts.maxBy(_._2)._2
Collaborator

For better readability, please replace `_._2` in `val maxFolds = gridCounts.maxBy(_._2)._2` with `val (_, maxFolds) = gridCounts.maxBy { case (_, folds) => folds }`.
Same in a few operations below.
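The suggested pattern-match form names the tuple fields, which makes the intent explicit. A minimal standalone comparison (the counts here are hypothetical, not the real grid data):

```scala
object MaxByDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical grid counts: parameter-grid id -> number of folds it ran in
    val gridCounts = Map("grid1" -> 3, "grid2" -> 2)

    // Terse form from the diff: the reader must decode what each _._2 refers to
    val maxFoldsTerse = gridCounts.maxBy(_._2)._2

    // Suggested form: the tuple components are named, so the intent is obvious
    val (_, maxFolds) = gridCounts.maxBy { case (_, folds) => folds }

    println(maxFoldsTerse) // 3
    println(maxFolds)      // 3
  }
}
```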

@leahmcguire (Collaborator, Author)

@tovbinm @gerashegalov I discovered some fun behavior using algebird implicit for maps:

val test = Seq(Map("a" -> 1, "b" -> 1), Map("a" -> 1, "b" -> 1))
val combined = test.reduce(_ + _) // Map("a" -> 2, "b" -> 2) as expected

val test2 = Seq(Map("a" -> 0, "b" -> 0), Map("a" -> 0, "b" -> 0))
val combined2 = test2.reduce(_ + _) // Map() :-(

Think I may need to write out the reduce function...

else metrics.zipWithIndex.minBy(_._1)
log.info(s"Best set of parameters:\n${grid(bestIndex)}")
require(folds.map(_.model.uid).toSet.size == 1) // Should be called only on instances of the same model
val gridCounts = folds.map(_.grids.map(_ -> 1).toMap).reduce(_ + _)
Contributor

Is it the same as

folds.flatMap(_.grids.map(_ -> 1)).sumByKey

val keys = m1.keySet.union(m2.keySet)
keys.map(k => k -> (m1.getOrElse(k, 0.0) + m2.getOrElse(k, 0.0))).toMap
}
.filterKeys(gridsIn.contains)
Contributor

Let's filter first, so we have less to reduce; maybe:

    val gridMetrics = folds.flatMap(f => f.grids.zip(f.metrics))
      .collect { case (pm, met) if gridsIn.contains(pm) => (pm, met / maxFolds) }
      .sumByKey
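In plain Scala, without algebird's `sumByKey`, the filter-first approach might look like the sketch below. The simplified `Fold` type and the metric values are hypothetical, chosen so the averaged metric is exact:

```scala
object GridMetricsDemo {
  // Hypothetical simplified fold: the parameter grids it ran and their metrics
  case class Fold(grids: Seq[String], metrics: Seq[Double])

  // Filter down to the surviving grids first, then sum the per-fold metric
  // contributions per grid; groupBy + sum stands in for algebird's sumByKey
  def gridMetrics(folds: Seq[Fold], gridsIn: Set[String], maxFolds: Int): Map[String, Double] =
    folds.flatMap(f => f.grids.zip(f.metrics))
      .collect { case (pm, met) if gridsIn.contains(pm) => (pm, met / maxFolds) }
      .groupBy(_._1)
      .map { case (pm, pairs) => pm -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val folds = Seq(
      Fold(Seq("g1", "g2"), Seq(1.0, 4.0)),
      Fold(Seq("g1"), Seq(3.0)) // g2 failed in this fold, so it has no metric here
    )
    // Only g1 survived every fold; its averaged metric is (1.0 + 3.0) / 2
    println(gridMetrics(folds, gridsIn = Set("g1"), maxFolds = 2)) // Map(g1 -> 2.0)
  }
}
```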


tovbinm commented Sep 10, 2019

@leahmcguire The Map behavior is as expected under monoid rules. To keep the map keys with 0 values, use a semigroup for the map values. Example for Map[String, Long] below:

import com.twitter.algebird._
import com.twitter.algebird.Operators._

implicit val longSemigroup = Semigroup.from[Long](_ + _)
implicit val mapLongMonoid = Monoid.mapMonoid[String, Long](longSemigroup)

// works!
(Map("a" -> 0L) + Map("b" -> 0L)) shouldBe Map("a" -> 0L, "b" -> 0L)
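Without pulling in algebird, the difference between the two merge behaviors can be sketched directly. Here `monoidMerge` mimics the pruning a map monoid performs (dropping entries whose combined value equals the value monoid's zero), while `semigroupMerge` keeps every key; the helper names are mine, and the pruning step is a simplification of algebird's actual behavior:

```scala
object MapMergeDemo {
  // Monoid-style merge: a map monoid prunes entries whose combined value equals
  // the value monoid's zero (0 for Long), so zero-valued keys vanish
  def monoidMerge(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
    val keys = m1.keySet ++ m2.keySet
    keys.map(k => k -> (m1.getOrElse(k, 0L) + m2.getOrElse(k, 0L)))
      .filter(_._2 != 0L) // the pruning step that surprised the author
      .toMap
  }

  // Semigroup-style merge: combines values only for keys present in both maps
  // and keeps everything else untouched, so zero-valued keys survive
  def semigroupMerge(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] =
    m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(_ + v))
    }

  def main(args: Array[String]): Unit = {
    val a = Map("a" -> 0L, "b" -> 0L)
    val b = Map("a" -> 0L, "b" -> 0L)
    println(monoidMerge(a, b))    // zero-valued keys are pruned away
    println(semigroupMerge(a, b)) // all keys survive with value 0
  }
}
```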

if (evaluator.isLargerBetter) metrics.zipWithIndex.maxBy(_._1)
else metrics.zipWithIndex.minBy(_._1)
log.info(s"Best set of parameters:\n${grid(bestIndex)}")
require(folds.map(_.model.uid).toSet.size == 1) // Should be called only on instances of the same model
Contributor

Good to be defensive; my concern is that we call this in iteration from a private method, entirely within scope here, and by construction we already know the folds are for the same model.

Collaborator (Author)

Sure, I can remove and just put a description on the method

@gerashegalov (Contributor) left a comment

LGTM, some comments

@tovbinm (Collaborator) left a comment

nice!

@gerashegalov gerashegalov changed the title allow for model to fail on some CV folds and still get the metrics calculations correct Ensure correct metrics despite model failures on some CV folds Sep 11, 2019
@gerashegalov gerashegalov merged commit d223c46 into master Sep 11, 2019
@gerashegalov gerashegalov deleted the lm/cvFix branch September 11, 2019 04:42
@leahmcguire leahmcguire restored the lm/cvFix branch September 11, 2019 18:50
@leahmcguire leahmcguire deleted the lm/cvFix branch September 11, 2019 19:23
gerashegalov added a commit that referenced this pull request Sep 11, 2019
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
@tovbinm tovbinm mentioned this pull request Sep 25, 2019