
Added support for DateType and TimestampType primitive spark types #135

Conversation

@ajayborra (Contributor) commented Sep 19, 2018

Related issues
#115

Describe the proposed solution

  • Added conversions for the DateType and TimestampType Spark types for completeness.
  • Better error messages for unsupported Spark types.

case wt if wt <:< weakTypeOf[t.Binary] => (value: Any) =>
if (value == null) FeatureTypeDefaults.Binary.value else Some(value.asInstanceOf[Boolean])

// Date & Time
Collaborator

Why do you need special handling for Date and DateTime? Were you planning to take care of the TimestampType Spark type?

Contributor Author

Since Date is a 4-byte value that tracks how many days have passed since 1970-01-01 (the epoch), whereas Datetime is an 8-byte value, the numbers mean different things (days vs. milliseconds) and the ranges are different; hence I initially took this route.

But we do have separate types for Date and DateTime, where Date is the supertype in that relation. What is the idea here? Is it that, as long as the user consistently passes all values in a column as dates (or datetimes), the semantics don't change and we choose the type with the largest range?

Collaborator

Since the Date and DateTime types in TransmogrifAI inherit from the Integral type, their values are already captured by case wt if wt <:< weakTypeOf[t.Integral], and it works because the <:< operator on type tags checks inheritance.

Collaborator

If you would like to map the Spark types DateType and TimestampType onto the TransmogrifAI Date and DateTime types respectively, then you should move the cases case wt if wt <:< weakTypeOf[t.Date] and case wt if wt <:< weakTypeOf[t.DateTime] before the numeric checks.
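
To illustrate why the ordering matters, here is a minimal, self-contained sketch using a toy hierarchy (the class names below are stand-ins, not the real TransmogrifAI feature types): guarded cases in a pattern match are tried top to bottom, so the more specific Date/DateTime checks must precede the Integral check or they will never fire.

import scala.reflect.runtime.universe._

// Toy hierarchy mirroring the relationship described above: DateTime <: Date <: Integral.
// (Names only; not the real TransmogrifAI classes.)
class Integral
class Date extends Integral
class DateTime extends Date

// Guarded cases are evaluated in order; putting the supertype (Integral) first
// would swallow Date and DateTime before their own cases are reached.
def describe[T: WeakTypeTag]: String = weakTypeOf[T] match {
  case wt if wt <:< weakTypeOf[DateTime] => "DateTime"
  case wt if wt <:< weakTypeOf[Date]     => "Date"
  case wt if wt <:< weakTypeOf[Integral] => "Integral"
  case _                                 => "unsupported"
}

describe[Date]     // "Date" (would incorrectly be "Integral" if the Integral case came first)
describe[DateTime] // "DateTime"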

if (value == null) None else Some(value.asInstanceOf[Double])
value match {
case null => None
case _: Float => Some(value.asInstanceOf[Float].toDouble)
Collaborator

case v: Float => Some(v.toDouble) - and the same everywhere.

@tovbinm (Collaborator) commented Sep 20, 2018

In any case, the main problem is that this approach can result in losing precision. If the Spark type is, say, float, we convert it to double, the user applies operations on the doubles, and then we store the result back into the DataFrame as float, losing precision. Example:

scala> val f = Float.MaxValue
f: Float = 3.4028235E38
scala> val d = f.toDouble + f.toDouble
d: Double = 6.805646932770577E38
scala> val f2 = d.toFloat 
f2: Float = Infinity

The same applies to Short, Int, etc.

@tovbinm (Collaborator) commented Sep 20, 2018

So I think we can start by adding support for DateType and TimestampType; as for Int, Short, etc., let's discuss.

@ajayborra (Contributor Author)

> In any case, the main problem is that this approach can result in losing precision. If the Spark type is, say, float, we convert it to double, the user applies operations on the doubles, and then we store the result back into the DataFrame as float, losing precision. [...] The same applies to Short, Int, etc.

It is true that dropping precision can be a problem (introducing vanishing gradients, etc.). In your opinion, what is the best approach for handling these types?
There are a few approaches we could take here:

  • Auto-convert to the largest-range type and warn the user about the precision problems involved?
  • Provide a parameter to turn auto-conversion on or off?
  • Avoid auto-conversion and recommend that the user choose the type with the largest range in the category, throwing an error when we receive a lower-range type? (E.g., recommend Long when we see a Short.)

@tovbinm (Collaborator) commented Sep 20, 2018

I think for a start we can add a nicer error message for these types in FeatureSparkTypes.featureTypeTagOf. Something like this:

case IntegerType => throw new IllegalArgumentException("Spark IntegerType is currently not supported. Please use LongType instead.")
    

@tovbinm (Collaborator) commented Sep 20, 2018

I would rather have the user explicitly make the conversion from int -> long and float -> double than have us take care of precision issues.
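
For reference, a minimal sketch of what that explicit, user-side widening could look like on a DataFrame before features are defined (the column names age and score are made up for illustration; this is not part of the PR):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, LongType}

// Hypothetical column names (age, score): the caller widens the narrow columns
// explicitly instead of relying on any silent up-casting inside the library.
def widenPrimitives(df: DataFrame): DataFrame =
  df.withColumn("age", col("age").cast(LongType))       // IntegerType -> LongType
    .withColumn("score", col("score").cast(DoubleType)) // FloatType   -> DoubleType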

@ajayborra (Contributor Author)

Good point. Will update the PR with changes.

@tovbinm changed the title from "Added support for more primitive spark types" to "Added support for DateType and TimestampType primitive spark types" on Sep 21, 2018
@tovbinm (Collaborator) commented Sep 21, 2018

@ajayborra please add test cases for these changes in FeatureTypeSparkConverterTest.
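
For instance, a property in the style already used in that suite might look like the sketch below; it assumes the new conversions accept java.sql.Date and java.sql.Timestamp inputs, which may differ from the merged implementation.

// Sketch only: assumes the converter accepts java.sql.Date / java.sql.Timestamp values.
property("converts java.sql.Date to a Date feature type") {
  val date = FeatureTypeSparkConverter[Date]().fromSpark(java.sql.Date.valueOf("2018-09-19"))
  date shouldBe a[Date]
}
property("converts java.sql.Timestamp to a DateTime feature type") {
  val ts = FeatureTypeSparkConverter[DateTime]().fromSpark(java.sql.Timestamp.valueOf("2018-09-19 00:00:00"))
  ts shouldBe a[DateTime]
}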

case IntegerType =>
throw new IllegalArgumentException("Spark IntegerType is currently not supported. Please use LongType instead.")
case FloatType =>
throw new IllegalArgumentException("Spark FloatType is currently not supported. Please use DoubleType instead.")
Collaborator

I think we can just map these to our types. Once they are mapped into the TransmogrifAI types they will never be converted back to the Spark primitives, so it is OK to map them into the supertype.
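
A rough sketch of the mapping alternative being suggested here (hypothetical shape only; whether the final code took this route is not shown in this thread):

import org.apache.spark.sql.types._
import scala.reflect.runtime.universe._
import com.salesforce.op.features.types

// Hypothetical featureTypeTagOf-style mapping: narrower Spark primitives are
// mapped onto the wider TransmogrifAI feature types instead of being rejected.
def featureTypeTagOfSketch(sparkType: DataType): WeakTypeTag[_] = sparkType match {
  case LongType | IntegerType | ShortType | ByteType => weakTypeTag[types.Integral]
  case DoubleType | FloatType                        => weakTypeTag[types.Real]
  case DateType                                      => weakTypeTag[types.Date]
  case TimestampType                                 => weakTypeTag[types.DateTime]
  case t => throw new IllegalArgumentException(s"Spark type $t is not supported")
}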

@codecov bot commented Sep 24, 2018

Codecov Report

Merging #135 into master will increase coverage by 0.1%.
The diff coverage is 75.86%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #135     +/-   ##
========================================
+ Coverage   86.19%   86.3%   +0.1%     
========================================
  Files         299     299             
  Lines        9723    9749     +26     
  Branches      340     538    +198     
========================================
+ Hits         8381    8414     +33     
+ Misses       1342    1335      -7
Impacted Files Coverage Δ
...com/salesforce/op/features/FeatureSparkTypes.scala 95.68% <100%> (+5.06%) ⬆️
.../op/features/types/FeatureTypeSparkConverter.scala 90.74% <66.66%> (-5.93%) ⬇️
.../salesforce/op/features/FeatureBuilderMacros.scala 100% <0%> (+100%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61b48cb...36ab4c6. Read the comment docs.

@tovbinm (Collaborator) commented Sep 29, 2018

@ajayborra are you still planning to work on this one? ;)

@ajayborra force-pushed the ab/define_missing_spark_type_to_tftype_conversions branch from 3744aa4 to 36ab4c6 on September 29, 2018 13:00
case _ => throw new IllegalArgumentException(s"Date type mapping is not defined for ${value.getClass}")
}

// Numerals
Collaborator

// Numerics

weakTypeTag[types.GeolocationMap])
)

val sparkNonNullableTypeToTypeTagMappings = Seq(
Collaborator

RealNN is non-nullable; why is it Real here?

)

Spec(FeatureSparkTypes.getClass) should "assign appropriate feature type tags for valid types" in {
sparkTypeToTypeTagMappings.foreach(mapping =>
Collaborator

Avoid _1 and _2; better to name the args:

for { (sparkType, featureTypeTag) <- sparkTypeToTypeTagMappings } {
    FeatureSparkTypes.featureTypeTagOf(sparkType, isNullable = false) shouldBe featureTypeTag
}

@@ -52,6 +52,12 @@ class FeatureTypeSparkConverterTest
)
val bogusNames = Gen.alphaNumStr

val naturalNumbers = Table("NaturalNumbers", Byte.MaxValue, Short.MaxValue, Int.MaxValue, Long.MaxValue)
Collaborator

Are you sure that this generates numbers in these ranges correctly? What about negatives and special values?
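
One way to cover negatives and the boundary values would be a ScalaCheck generator instead of a fixed table (a sketch, not part of the PR; the val name is made up):

import org.scalacheck.Gen

// Sketch: Gen.chooseNum biases toward the "special" values (0, min, max) in addition
// to sampling the range, so negatives and boundaries get exercised as well.
val integralValues: Gen[Long] = Gen.chooseNum(Long.MinValue, Long.MaxValue)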

property("converts string to text feature type") {
val text = FeatureTypeSparkConverter[Text]().fromSpark("Simple")
text shouldBe a[Text]
text.value.get shouldBe a[String]
Collaborator

text.value shouldBe Some("Simple")

property("converts a Boolean to Binary feature type") {
val bool = FeatureTypeSparkConverter[Binary]().fromSpark(false)
bool shouldBe a[Binary]
bool.value.get shouldBe a[java.lang.Boolean]
Collaborator

bool.value shouldBe Some(false)

@tovbinm tovbinm merged commit f219465 into salesforce:master Sep 29, 2018
@ajayborra ajayborra deleted the ab/define_missing_spark_type_to_tftype_conversions branch November 7, 2018 19:16