Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

made case class to deal with model selector metadata #39

Merged
merged 14 commits into from
Aug 9, 2018

Conversation

leahmcguire
Copy link
Collaborator

Related issues
metadata is terrible

Describe the proposed solution
we should have concrete classes to deal with it instead of nasty nested maps

Describe alternatives you've considered
leave the nasty nested maps and deal with them

@codecov
Copy link

codecov bot commented Aug 7, 2018

Codecov Report

Merging #39 into master will increase coverage by 0.68%.
The diff coverage is 93.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #39      +/-   ##
==========================================
+ Coverage   83.49%   84.18%   +0.68%     
==========================================
  Files         296      298       +2     
  Lines        9380     9749     +369     
  Branches      344      559     +215     
==========================================
+ Hits         7832     8207     +375     
+ Misses       1548     1542       -6
Impacted Files Coverage Δ
...p/evaluators/OpBinaryClassificationEvaluator.scala 81.57% <ø> (ø) ⬆️
...lesforce/op/evaluators/OpRegressionEvaluator.scala 91.66% <ø> (ø) ⬆️
...tages/impl/preparators/SanityCheckerMetadata.scala 92.2% <ø> (ø) ⬆️
...op/evaluators/OpMultiClassificationEvaluator.scala 94.66% <ø> (ø) ⬆️
...ce/op/stages/impl/selector/ModelSelectorBase.scala 98.87% <100%> (-0.02%) ⬇️
...cala/com/salesforce/op/evaluators/Evaluators.scala 96.72% <100%> (+3.61%) ⬆️
...m/salesforce/op/stages/OpPipelineStageParams.scala 91.17% <100%> (+0.26%) ⬆️
.../salesforce/op/stages/impl/tuning/DataCutter.scala 95.65% <100%> (ø) ⬆️
...salesforce/op/stages/impl/tuning/OpValidator.scala 98.55% <100%> (-0.03%) ⬇️
...sforce/op/stages/impl/selector/ModelSelector.scala 93.87% <100%> (+1.87%) ⬆️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9155e7c...7082038. Read the comment docs.

BinaryClassEvalMetrics.withNameInsensitiveOption(name)
.orElse(MultiClassEvalMetrics.withNameInsensitiveOption(name))
.orElse(RegressionEvalMetrics.withNameInsensitiveOption(name))
.orElse(OpEvaluatorNames.withNameInsensitiveOption(name))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it weird to have evaluator names to be an evaluator metrics. perhaps lets just have evaluation metric and drop the OpEvaluatorNames completely. wdyt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a change I introduced it was always the case because there are metrics that are grouped by the evaluator. if we want to restructure it it should be in a separate PR

}

object EvalMetric {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs please

*
* @param json encoded metrics
*/
def fromJson(className: String, json: String): EvaluationMetrics = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return type should be Try[EvaluationMetrics]

def error(c: Class[_]) = throw new IllegalArgumentException(
s"Could not extract metrics of type $c from ${json.mkString(",")}"
)
className match {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its better to compare the full class instead of simple class name, i.e.

val metricsClass = ReflectionUtils.classForName(className).asInstanceOf[Class[_ <: EvaluationMetrics]]

 metricsClass match {
      case n if n == classOf[MultiMetrics] =>
      case n if n == classOf[BinaryClassificationMetrics] =>
      // etc
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is totally unnecessary - I am going to delete this companion object and make this a private method inside the metadata so it is more clear how it is used and prevent people from trying to reuse it

@@ -521,15 +362,33 @@ case class Insights
case object ModelInsights {
@transient protected lazy val log = LoggerFactory.getLogger(this.getClass)

val SerFormats: Formats = Serialization.formats(FullTypeHints(List(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit more readable version:

val SerializationFormats: Formats = {
    val typeHints = FullTypeHints(List(
      classOf[Continuous], classOf[Discrete],
      classOf[DataBalancerSummary], classOf[DataCutterSummary], classOf[DataSplitterSummary],
      classOf[SingleMetric], classOf[MultiMetrics], classOf[BinaryClassificationMetrics], classOf[ThresholdMetrics],
      classOf[MultiClassificationMetrics], classOf[RegressionMetrics]
    ))
    val evalMetricsSerializer = new CustomSerializer[EvalMetric](_ =>
      ( { case JString(s) => EvalMetric.withNameInsensitive(s) },
        { case x: EvalMetric => JString(x.entryName) }
      )
    )
    Serialization.formats(typeHints) +
      EnumEntrySerializer.json4s[ValidationType](ValidationType) +
      EnumEntrySerializer.json4s[ProblemType](ProblemType) +
      new SpecialDoubleSerializer +
      evalMetricsSerializer
  }


private[op] object SplitterSummary {
val ClassName: String = "className"
def fromMap(map: Map[String, Any]): SplitterSummary = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

below is slightly modified and a better version (pros: returns Try, uses class comparison and handles the default case):

 def fromMap(map: Map[String, Any]): Try[SplitterSummary] = Try {
    val summaryClass = ReflectionUtils.classForName(map(ClassName).toString).asInstanceOf[Class[_ <: SplitterSummary]]
    summaryClass match {
      case s if s == classOf[DataSplitterSummary] => DataSplitterSummary()
      case s if s == classOf[DataBalancerSummary] => DataBalancerSummary(
        positiveLabels = map(ModelSelectorBaseNames.Positive).asInstanceOf[Long],
        negativeLabels = map(ModelSelectorBaseNames.Negative).asInstanceOf[Long],
        desiredFraction = map(ModelSelectorBaseNames.Desired).asInstanceOf[Double],
        upSamplingFraction = map(ModelSelectorBaseNames.UpSample).asInstanceOf[Double],
        downSamplingFraction = map(ModelSelectorBaseNames.DownSample).asInstanceOf[Double]
      )
      case s if s == classOf[DataCutterSummary] => DataCutterSummary(
        labelsKept = map(ModelSelectorBaseNames.LabelsKept).asInstanceOf[Array[Double]],
        labelsDropped = map(ModelSelectorBaseNames.LabelsDropped).asInstanceOf[Array[Double]]
      )
      case s =>
        throw new IllegalArgumentException(s"Unrecognised splitter summary class: $s")
    }
  }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this does not have general use it is only for metadata deser - that is why it is private i will move it and make it a private method

}
}

implicit val doubleOptEquality = new Equality[Option[Double]] {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! how useful are these? do you think we should have them in a trait somewhere in com.salesforce.op.test in our test kit?

decoded.bestModelName shouldEqual summary.bestModelName
decoded.bestModelType shouldEqual summary.bestModelType
decoded.validationResults shouldEqual summary.validationResults
decoded.trainEvaluation.toJson() shouldEqual summary.trainEvaluation.toJson()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this toJson() to avoid NaN issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trainEvaluation doesn't contain NaN, no?
Does scala know how to compare two complex case class objects?

@@ -107,7 +107,8 @@ class OpValidatorTest extends FlatSpec with TestSparkContext {
assertFractions(Array(1 - p, p), train)
assertFractions(Array(1 - p, p), validate)
}
balancer.get.metadataBuilder.build() should not be new MetadataBuilder().build()
println(balancer.get.summary)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove println

@@ -45,7 +45,8 @@ import org.apache.spark.sql.types.MetadataBuilder
import com.salesforce.op.utils.spark.RichDataset._

@RunWith(classOf[JUnitRunner])
class OpValidatorTest extends FlatSpec with TestSparkContext {
class
OpValidatorTest extends FlatSpec with TestSparkContext {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove redundant end line

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!!

@leahmcguire leahmcguire merged commit 8805218 into master Aug 9, 2018
@leahmcguire leahmcguire deleted the lm/metadata branch August 9, 2018 21:59
def areEqual(a: Option[Double], b: Any): Boolean = b match {
case None => a.isEmpty
case s: Option[Double] => (a.exists(_.isNaN) && s.exists(_.isNaN)) ||
(a.nonEmpty && a.toSeq.zip(s.toSeq).forall{ case (n, m) => n == m })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe (a.exists(_.isNaN) && s.exists(_.isNaN)) || (a == b)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants