
Added support for DateType and TimestampType primitive spark types #135

Conversation

@ajayborra (Contributor) commented Sep 19, 2018

Related issues
#115

Describe the proposed solution

  • Added conversions for the DateType and TimestampType Spark types for completeness.
  • Better error messages for unsupported Spark types.

case wt if wt <:< weakTypeOf[t.Binary] => (value: Any) =>
if (value == null) FeatureTypeDefaults.Binary.value else Some(value.asInstanceOf[Boolean])

// Date & Time
Collaborator

Why do you need special handling for Date and DateTime? Were you planning to take care of the TimestampType Spark type?

Contributor Author

Since Date is a 4-byte value that tracks how many days have passed since 1970-01-01 (the epoch), whereas Datetime is an 8-byte value, the numbers mean different things (days vs. milliseconds) and the ranges are different; hence I initially took this route.

But we do have separate types for Date and DateTime, where Date is the supertype in that relation. What is the idea here? Is it that, as long as the user consistently passes all values in a column as dates (or datetimes), the semantics don't change and we choose the type with the largest range?

Collaborator

Since the Date and DateTime types in TransmogrifAI inherit from the Integral type, their values are already captured by case wt if wt <:< weakTypeOf[t.Integral], and it works because the <:< operator on type tags checks inheritance.

Collaborator

If you would like to map the Spark types DateType and TimestampType onto the TransmogrifAI Date and DateTime types respectively, then you should move the cases case wt if wt <:< weakTypeOf[t.Date] and case wt if wt <:< weakTypeOf[t.DateTime] before the numeric checks.
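
To illustrate why the ordering matters, here is a minimal, self-contained sketch using a toy hierarchy (the class names below are stand-ins, not the real TransmogrifAI feature types): guarded cases in a pattern match are tried top to bottom, so the more specific Date/DateTime checks must precede the Integral check or they will never fire.

import scala.reflect.runtime.universe._

// Toy hierarchy mirroring the relationship described above: DateTime <: Date <: Integral.
// (Names only; not the real TransmogrifAI classes.)
class Integral
class Date extends Integral
class DateTime extends Date

// Guarded cases are evaluated in order; putting the supertype (Integral) first
// would swallow Date and DateTime before their own cases are reached.
def describe[T: WeakTypeTag]: String = weakTypeOf[T] match {
  case wt if wt <:< weakTypeOf[DateTime] => "DateTime"
  case wt if wt <:< weakTypeOf[Date]     => "Date"
  case wt if wt <:< weakTypeOf[Integral] => "Integral"
  case _                                 => "unsupported"
}

describe[Date]     // "Date" (would incorrectly be "Integral" if the Integral case came first)
describe[DateTime] // "DateTime"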

if (value == null) None else Some(value.asInstanceOf[Double])
value match {
case null => None
case _: Float => Some(value.asInstanceOf[Float].toDouble)
Collaborator

case v: Float => Some(v.toDouble) - and the same everywhere.

@tovbinm (Collaborator) commented Sep 20, 2018

In any case, the main problem is that this approach can result in losing precision. If the Spark type is, say, float, we convert it to double, the user applies operations on the doubles, and then we store the result back into the DataFrame as float, losing precision. Example:

scala> val f = Float.MaxValue
f: Float = 3.4028235E38
scala> val d = f.toDouble + f.toDouble
d: Double = 6.805646932770577E38
scala> val f2 = d.toFloat 
f2: Float = Infinity

The same applies to Short, Int, etc.

@tovbinm (Collaborator) commented Sep 20, 2018

So I think we can start by adding support for DateType and TimestampType; as for Int, Short, etc., let's discuss.

@ajayborra (Contributor Author)

> In any case, the main problem is that this approach can result in losing precision. If the Spark type is, say, float, we convert it to double, the user applies operations on the doubles, and then we store the result back into the DataFrame as float, losing precision. [...] The same applies to Short, Int, etc.

It is true that dropping precision can be a problem (introducing vanishing gradients, etc.). In your opinion, what is the best approach for handling these types?
There are a few approaches we could take here:

  • Auto-convert to the largest-range type and warn the user about the precision problems involved?
  • Provide a parameter to turn auto-conversion on or off?
  • Avoid auto-conversion and recommend that the user choose the type with the largest range in the category, throwing an error when we receive a lower-range type? (E.g., recommend Long when we see a Short.)

@tovbinm (Collaborator) commented Sep 20, 2018

I think for a start we can add a nicer error message for these types in FeatureSparkTypes.featureTypeTagOf. Something like this:

case IntegerType => throw new IllegalArgumentException("Spark IntegerType is currently not supported. Please use LongType instead.")
    

@tovbinm (Collaborator) commented Sep 20, 2018

I would rather have the user explicitly make the conversion from int -> long and float -> double than have us take care of precision issues.
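
For reference, a minimal sketch of what that explicit, user-side widening could look like on a DataFrame before features are defined (the column names age and score are made up for illustration; this is not part of the PR):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, LongType}

// Hypothetical column names (age, score): the caller widens the narrow columns
// explicitly instead of relying on any silent up-casting inside the library.
def widenPrimitives(df: DataFrame): DataFrame =
  df.withColumn("age", col("age").cast(LongType))       // IntegerType -> LongType
    .withColumn("score", col("score").cast(DoubleType)) // FloatType   -> DoubleType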

@ajayborra (Contributor Author)

Good point. Will update the PR with changes.

@tovbinm changed the title from "Added support for more primitive spark types" to "Added support for DateType and TimestampType primitive spark types" on Sep 21, 2018
@tovbinm (Collaborator) commented Sep 21, 2018

@ajayborra please add test cases for these changes in FeatureTypeSparkConverterTest.
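
For instance, a property in the style already used in that suite might look like the sketch below; it assumes the new conversions accept java.sql.Date and java.sql.Timestamp inputs, which may differ from the merged implementation.

// Sketch only: assumes the converter accepts java.sql.Date / java.sql.Timestamp values.
property("converts java.sql.Date to a Date feature type") {
  val date = FeatureTypeSparkConverter[Date]().fromSpark(java.sql.Date.valueOf("2018-09-19"))
  date shouldBe a[Date]
}
property("converts java.sql.Timestamp to a DateTime feature type") {
  val ts = FeatureTypeSparkConverter[DateTime]().fromSpark(java.sql.Timestamp.valueOf("2018-09-19 00:00:00"))
  ts shouldBe a[DateTime]
}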

case IntegerType =>
throw new IllegalArgumentException("Spark IntegerType is currently not supported. Please use LongType instead.")
case FloatType =>
throw new IllegalArgumentException("Spark FloatType is currently not supported. Please use DoubleType instead.")
Collaborator

I think we can just map these to our types. Once they are mapped into the TransmogrifAI types they will never be converted back to the Spark primitives, so it is OK to map them into the supertype.
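
A rough sketch of the mapping alternative being suggested here (hypothetical shape only; whether the final code took this route is not shown in this thread):

import org.apache.spark.sql.types._
import scala.reflect.runtime.universe._
import com.salesforce.op.features.types

// Hypothetical featureTypeTagOf-style mapping: narrower Spark primitives are
// mapped onto the wider TransmogrifAI feature types instead of being rejected.
def featureTypeTagOfSketch(sparkType: DataType): WeakTypeTag[_] = sparkType match {
  case LongType | IntegerType | ShortType | ByteType => weakTypeTag[types.Integral]
  case DoubleType | FloatType                        => weakTypeTag[types.Real]
  case DateType                                      => weakTypeTag[types.Date]
  case TimestampType                                 => weakTypeTag[types.DateTime]
  case t => throw new IllegalArgumentException(s"Spark type $t is not supported")
}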

@codecov bot commented Sep 24, 2018

Codecov Report

Merging #135 into master will increase coverage by 0.1%.
The diff coverage is 75.86%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #135     +/-   ##
========================================
+ Coverage   86.19%   86.3%   +0.1%     
========================================
  Files         299     299             
  Lines        9723    9749     +26     
  Branches      340     538    +198     
========================================
+ Hits         8381    8414     +33     
+ Misses       1342    1335      -7
Impacted Files Coverage Δ
...com/salesforce/op/features/FeatureSparkTypes.scala 95.68% <100%> (+5.06%) ⬆️
.../op/features/types/FeatureTypeSparkConverter.scala 90.74% <66.66%> (-5.93%) ⬇️
.../salesforce/op/features/FeatureBuilderMacros.scala 100% <0%> (+100%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61b48cb...36ab4c6. Read the comment docs.

@tovbinm (Collaborator) commented Sep 29, 2018

@ajayborra are you still planning to work on this one? ;)

@ajayborra force-pushed the ab/define_missing_spark_type_to_tftype_conversions branch from 3744aa4 to 36ab4c6 on September 29, 2018 13:00
case _ => throw new IllegalArgumentException(s"Date type mapping is not defined for ${value.getClass}")
}

// Numerals
Collaborator

// Numerics

weakTypeTag[types.GeolocationMap])
)

val sparkNonNullableTypeToTypeTagMappings = Seq(
Collaborator

RealNN is non-nullable; why is it Real here?

)

Spec(FeatureSparkTypes.getClass) should "assign appropriate feature type tags for valid types" in {
sparkTypeToTypeTagMappings.foreach(mapping =>
Collaborator

Avoid _1 and _2; better to name the args:

for { (sparkType, featureTypeTag) <- sparkTypeToTypeTagMappings } {
    FeatureSparkTypes.featureTypeTagOf(sparkType, isNullable = false) shouldBe featureTypeTag
}

@@ -52,6 +52,12 @@ class FeatureTypeSparkConverterTest
)
val bogusNames = Gen.alphaNumStr

val naturalNumbers = Table("NaturalNumbers", Byte.MaxValue, Short.MaxValue, Int.MaxValue, Long.MaxValue)
Collaborator

Are you sure that this generates numbers in these ranges correctly? What about negatives and special values?
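
One way to cover negatives and the boundary values would be a ScalaCheck generator instead of a fixed table (a sketch, not part of the PR; the val name is made up):

import org.scalacheck.Gen

// Sketch: Gen.chooseNum biases toward the "special" values (0, min, max) in addition
// to sampling the range, so negatives and boundaries get exercised as well.
val integralValues: Gen[Long] = Gen.chooseNum(Long.MinValue, Long.MaxValue)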

property("converts string to text feature type") {
val text = FeatureTypeSparkConverter[Text]().fromSpark("Simple")
text shouldBe a[Text]
text.value.get shouldBe a[String]
Collaborator

text.value shouldBe Some("Simple")

property("converts a Boolean to Binary feature type") {
val bool = FeatureTypeSparkConverter[Binary]().fromSpark(false)
bool shouldBe a[Binary]
bool.value.get shouldBe a[java.lang.Boolean]
Collaborator

bool.value shouldBe Some(false)

@tovbinm tovbinm merged commit f219465 into salesforce:master Sep 29, 2018
@ajayborra ajayborra deleted the ab/define_missing_spark_type_to_tftype_conversions branch November 7, 2018 19:16