Added support for DateType and TimestampType primitive spark types #135
Conversation
case wt if wt <:< weakTypeOf[t.Binary] => (value: Any) =>
  if (value == null) FeatureTypeDefaults.Binary.value else Some(value.asInstanceOf[Boolean])

// Date & Time
Why do you need special handling for Date and DateTime? Were you planning to take care of the TimestampType Spark type?
Since Date is a 4-byte value which keeps track of how many days have passed since 1970-01-01 (the epoch), whereas Datetime is an 8-byte value, the numbers mean different things (days vs. ms) and the ranges are different. Hence I picked this route initially.
But we do have separate types for date and datetime, where date is the supertype in the relation. What is the idea here? Is it that, as long as the user consistently passes in all of a column's values as dates (or as datetimes), the semantics don't change and we choose the type with the largest range?
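For context, an illustrative snippet (not from this PR) of the days-vs-milliseconds distinction: the same raw number decodes to very different moments depending on which unit it counts.

import java.time.{Instant, LocalDate}

// Spark DateType counts days since 1970-01-01 (the epoch)
LocalDate.ofEpochDay(18000L)   // 2019-04-14

// A millisecond-based datetime interprets the same number very differently
Instant.ofEpochMilli(18000L)   // 1970-01-01T00:00:18Z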
Since the Date and DateTime types in TransmogrifAI inherit from Integral types, their values are captured by this case: case wt if wt <:< weakTypeOf[t.Integral]. And it works, because the <:< operator on type tags checks inheritance.
If you would like to map the Spark types DateType and TimestampType into the TransmogrifAI Date and DateTime types accordingly, then you should move the cases case wt if wt <:< weakTypeOf[t.Date] and case wt if wt <:< weakTypeOf[t.DateTime] before the numeric checks.
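For illustration, a self-contained sketch (toy type names, not TransmogrifAI's actual classes) of why the case order matters when matching with <:< on type tags: the supertype case swallows subtype tags unless the more specific cases come first.

import scala.reflect.runtime.universe._

object CaseOrderSketch extends App {
  // Toy hierarchy mirroring the relation discussed above: DateTime <: Date <: Integral
  class Integral
  class Date extends Integral
  class DateTime extends Date

  def describe[T: WeakTypeTag]: String = weakTypeOf[T] match {
    // Most specific cases first: a DateTime tag also satisfies
    // wt <:< weakTypeOf[Date] and wt <:< weakTypeOf[Integral]
    case wt if wt <:< weakTypeOf[DateTime] => "DateTime"
    case wt if wt <:< weakTypeOf[Date]     => "Date"
    case wt if wt <:< weakTypeOf[Integral] => "Integral"
    case _                                 => "other"
  }

  println(describe[DateTime]) // prints "DateTime"; with reversed case order it would print "Integral"
}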
if (value == null) None else Some(value.asInstanceOf[Double])
value match {
  case null => None
  case _: Float => Some(value.asInstanceOf[Float].toDouble)
case v: Float => Some(v.toDouble) - and the same everywhere.

In any case, the main problem is that this approach can result in losing precision. If the Spark type is, say, float, we convert it to double, then the user applies an operation on doubles, but then we store it back into the Dataframe as float, losing precision. Example:

scala> val f = Float.MaxValue
f: Float = 3.4028235E38

scala> val d = f.toDouble + f.toDouble
d: Double = 6.805646932770577E38

scala> val f2 = d.toFloat
f2: Float = Infinity

The same applies to short, int, etc.
So I think we can start by adding support for …
It is true that dropping precision can be a problem, introducing vanishing gradients etc. In your opinion, what is the best approach to handle these types?
I think for a start we can add a nicer error message for these types, e.g. case IntegerType => throw new IllegalArgumentException("Spark IntegerType is currently not supported. Please use LongType instead.")
I would rather have the user explicitly make the conversion from int -> long and float -> double than take care of the precision issues.
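For illustration (hypothetical column names, not part of this PR), that explicit user-side conversion could look like:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Widen int -> long and float -> double up front, so any precision trade-off
// is the user's explicit choice instead of a silent round-trip loss
def widenColumns(df: DataFrame): DataFrame = df
  .withColumn("count", col("count").cast("long"))
  .withColumn("score", col("score").cast("double"))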
Good point. Will update the PR with changes.
@ajayborra please add test cases for these changes in …
case IntegerType =>
  throw new IllegalArgumentException("Spark IntegerType is currently not supported. Please use LongType instead.")
case FloatType =>
  throw new IllegalArgumentException("Spark FloatType is currently not supported. Please use DoubleType instead.")
I think we can just map these to our types. Once they are mapped into the TransmogrifAI types, they will never convert back to the Spark primitives, so it is OK to map them into the supertype.
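A sketch of what such a widening mapping could look like, in the case v: Float => Some(v.toDouble) style suggested above (illustrative helper names, not the PR's actual code):

// Narrow fractional values widen to Double (backing TransmogrifAI Real)
def toReal(value: Any): Option[Double] = value match {
  case null      => None
  case v: Float  => Some(v.toDouble)
  case v: Double => Some(v)
}

// Narrow integral values widen to Long (backing TransmogrifAI Integral)
def toIntegral(value: Any): Option[Long] = value match {
  case null     => None
  case v: Byte  => Some(v.toLong)
  case v: Short => Some(v.toLong)
  case v: Int   => Some(v.toLong)
  case v: Long  => Some(v)
}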
Codecov Report
@@            Coverage Diff            @@
##           master     #135    +/-   ##
=========================================
+ Coverage   86.19%    86.3%    +0.1%
=========================================
  Files         299      299
  Lines        9723     9749      +26
  Branches      340      538     +198
=========================================
+ Hits         8381     8414      +33
+ Misses       1342     1335       -7
Continue to review full report at Codecov.
@ajayborra are you planning to keep working on this one or? ;)
Force-pushed from 3744aa4 to 36ab4c6
case _ => throw new IllegalArgumentException(s"Date type mapping is not defined for ${value.getClass}")
}

// Numerals
// Numerics
  weakTypeTag[types.GeolocationMap])
)

val sparkNonNullableTypeToTypeTagMappings = Seq(
RealNN is non-nullable, why is it Real here?
)

Spec(FeatureSparkTypes.getClass) should "assign appropriate feature type tags for valid types" in {
  sparkTypeToTypeTagMappings.foreach(mapping =>
Avoid _1 and _2; better to name the args:

for { (sparkType, featureTypeTag) <- sparkTypeToTypeTagMappings } {
  FeatureSparkTypes.featureTypeTagOf(sparkType, isNullable = false) shouldBe featureTypeTag
}
@@ -52,6 +52,12 @@ class FeatureTypeSparkConverterTest
)
val bogusNames = Gen.alphaNumStr

val naturalNumbers = Table("NaturalNumbers", Byte.MaxValue, Short.MaxValue, Int.MaxValue, Long.MaxValue)
Are you sure that this generates numbers in these ranges correctly? What about negatives and special values?
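One possible way to cover full ranges, negatives, and special values is ScalaCheck's Gen.chooseNum, which biases generation toward boundary values such as the min, max, zero, and ±1 (a sketch; generator names are illustrative):

import org.scalacheck.Gen

// Full-range generators, biased toward boundary values
val longs: Gen[Long] = Gen.chooseNum(Long.MinValue, Long.MaxValue)
val ints:  Gen[Int]  = Gen.chooseNum(Int.MinValue, Int.MaxValue)

// Extra special values can be mixed in explicitly
val doubles: Gen[Double] =
  Gen.chooseNum(Double.MinValue, Double.MaxValue,
    Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity)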
property("converts string to text feature type") { | ||
val text = FeatureTypeSparkConverter[Text]().fromSpark("Simple") | ||
text shouldBe a[Text] | ||
text.value.get shouldBe a[String] |
text.value shouldBe Some("Simple")
property("converts a Boolean to Binary feature type") { | ||
val bool = FeatureTypeSparkConverter[Binary]().fromSpark(false) | ||
bool shouldBe a[Binary] | ||
bool.value.get shouldBe a[java.lang.Boolean] |
bool.value shouldBe Some(false)
Related issues
#115
Describe the proposed solution