Transmogrify to use smart vectorizer #63
Thanks for the contribution! It looks like @sxd929 is an internal user so signing the CLA is not required. However, we need to confirm this.
@@ -184,12 +184,18 @@ private[op] case object Transmogrifier {
case t if t =:= weakTypeOf[TextAreaMap] =>
val (f, other) = castAs[TextAreaMap](g)
// Explicitly set cleanText to false here in order to match behavior of Text vectorization
f.vectorize(shouldPrependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
f.smartVectorize(maxCategoricalCardinality = TextTokenizer.MaxCategoricalCardinality,
I think we should add MaxCategoricalCardinality and the rest of the missing defaults to TransmogrifierDefaults.
@sxd929 I also meant TextTokenizer.AutoDetectLanguage, etc.
wdyt? @leahmcguire
good catch! true, thanks, I think this arg is the only added default
talked to Leah and fixed, thanks!
@tovbinm I have reverted the changes, but it seems we could also set the defaults in transmogrify (for example, AutoDetectLanguage = TextTokenizer.AutoDetectLanguage) and limit their use to transmogrify and vectorize. What do you think?
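For readers following this thread, here is a minimal sketch of what collecting these defaults in one place could look like. `TextTokenizerSketch` and `TransmogrifierDefaultsSketch` are hypothetical stand-ins, not the library's actual objects, and the tokenizer values are placeholders. Only `MaxCategoricalCardinality = 30` comes from this PR's description.

```scala
// Hypothetical stand-in for the tokenizer's own defaults (placeholder values, not taken from the PR).
object TextTokenizerSketch {
  val AutoDetectLanguage: Boolean = false
  val MinTokenLength: Int = 1
  val ToLowercase: Boolean = true
}

// Hypothetical stand-in for TransmogrifierDefaults: one trait that gathers every default
// used by transmogrify and vectorize, re-exporting the tokenizer values instead of copying them.
trait TransmogrifierDefaultsSketch {
  val MaxCategoricalCardinality: Int = 30 // value stated in the PR description
  val AutoDetectLanguage: Boolean = TextTokenizerSketch.AutoDetectLanguage
  val MinTokenLength: Int = TextTokenizerSketch.MinTokenLength
  val ToLowercase: Boolean = TextTokenizerSketch.ToLowercase
}

object TransmogrifierDefaultsSketch extends TransmogrifierDefaultsSketch
```

With this shape, call sites refer to a single defaults object, and a change to a tokenizer default propagates without editing the transmogrify code.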
@sxd929 I've invited you to the org: https://github.com/salesforce Once accepted, you can kick the CLA bot: https://cla.salesforce.com/status/salesforce/TransmogrifAI/pull/63
Codecov Report
@@            Coverage Diff            @@
##           master     #63      +/-   ##
=========================================
+ Coverage   85.88%   85.88%   +<.01%
=========================================
  Files         294      294
  Lines        9521     9530       +9
  Branches      320      320
=========================================
+ Hits         8177     8185       +8
- Misses       1344     1345       +1
val vectorized = Seq(textMap).transmogrify()
it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {
val vectorized = textMap.vectorize(trackNulls = TransmogrifierDefaults.TrackNulls,
Minor: You can just do textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText).
fixed!
f.smartVectorize(maxCategoricalCardinality = MaxCategoricalCardinality,
numHashes = DefaultNumOfFeatures, autoDetectLanguage = TextTokenizer.AutoDetectLanguage,
minTokenLength = TextTokenizer.MinTokenLength, toLowercase = TextTokenizer.ToLowercase,
prependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
Is there a reason why these weren't following the defaults in TransmogrifierDefaults in the first place? CleanText is set to true there.
good point! it seems that this was an issue with vectorizer but smart vectorizer fixed it, fixed, thanks a lot!
@@ -541,7 +540,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
val retrieved = SanityCheckerSummary.fromMetadata(summary.getSummaryMetadata())
// Check that all of the hashed text columns (and the null indicator column itself) are thrown away
Can you change the comments to agree with the new behavior too? The text field is detected as categorical and pivoted now instead of being hashed.
fixed! thanks!
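To illustrate the behavior change raised in this thread (a text field being detected as categorical and pivoted instead of hashed), here is a rough sketch of a cardinality-based decision. It is not the SmartTextMapVectorizer implementation. `TextStrategy`, `Pivot`, `Hash`, and `chooseStrategy` are hypothetical names, and the inclusive `<=` comparison is an assumption.

```scala
// Hypothetical sketch: pick a vectorization strategy from the observed cardinality of a text field.
sealed trait TextStrategy
case object Pivot extends TextStrategy // treat as categorical: one column per distinct value
case object Hash extends TextStrategy  // treat as free text: hash tokens into a fixed-size space

object SmartVectorizeSketch {
  // Counts distinct non-empty values and compares against the cardinality limit.
  // Assumption: the limit is inclusive; the real vectorizer may use different statistics.
  def chooseStrategy(values: Seq[Option[String]], maxCategoricalCardinality: Int): TextStrategy = {
    val distinctCount = values.flatten.distinct.size
    if (distinctCount <= maxCategoricalCardinality) Pivot else Hash
  }
}
```

Under these assumptions, a small test fixture with only a handful of distinct strings falls under a limit of 30 and is pivoted, which is why test comments that describe a "hash space" no longer match the output.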
@@ -575,7 +574,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
// Drop the whole hash space but not the null indicator column (it has an indicator group, so does not get
Update comment here too
good catch! fixed! thanks!
val vectorized = Seq(textMap).transmogrify()
it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {
val vectorized = textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText)
this line seems redundant?
lgtm! let's merge this as it is now.
vector.v.size < TransmogrifierDefaults.DefaultNumOfFeatures + (TransmogrifierDefaults.TopK + 2) * 3 shouldBe true
vector.v.size >= TransmogrifierDefaults.DefaultNumOfFeatures + 6 shouldBe true
vector.v.size < (TransmogrifierDefaults.TopK + 2) * 5 shouldBe true
vector.v.size >= 10 shouldBe true
@sxd929 should there also be a computed value instead of 10?
@sxd929 also please use a better matcher syntax: vector.v.size should be >= 10, since it surfaces the error better.
fixed! thanks!!
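The matcher suggestion in this thread can be seen in a small standalone sketch. It assumes ScalaTest 3.0.x (where `org.scalatest.Matchers` is available) on the classpath. `MatcherSyntaxSketch`, `failureMessage`, and the value `7` standing in for `vector.v.size` are hypothetical and only serve to compare the two failure messages.

```scala
import org.scalatest.Matchers._
import org.scalatest.exceptions.TestFailedException

object MatcherSyntaxSketch {
  // Runs a check and returns its failure message, or None if the check passed.
  def failureMessage(check: => Any): Option[String] =
    try { check; None } catch { case e: TestFailedException => Option(e.getMessage) }

  val size: Int = 7 // hypothetical stand-in for vector.v.size

  // Boolean style: the comparison is evaluated first, so the matcher only sees `false`.
  val booleanStyle: Option[String] = failureMessage { size >= 10 shouldBe true }

  // Matcher style: the matcher sees both operands and can report the actual value.
  val matcherStyle: Option[String] = failureMessage { size should be >= 10 }
}
```

Both checks fail here, but only the matcher-style message mentions the actual size, which is what makes a failing run easier to diagnose.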
Related issues
Refer to issue(s) addressed in this pull request from [Issues]
change transmogrify to use smart text vectorizer
Describe the proposed solution
change transmogrify to use smart text vectorizer
add argument cleanKeys to SmartTextMapVectorizer
set MaxCategoricalCardinality to be 30 and use previous default for other settings
fix test that failed due to the change
Describe alternatives you've considered
N/A
Additional context
N/A