Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transmogrify to use smart vectorizer #63

Merged
merged 13 commits into from
Aug 20, 2018

Conversation

sxd929
Copy link
Contributor

@sxd929 sxd929 commented Aug 16, 2018

Related issues
Refer to issue(s) addressed in this pull request from [Issues]
change transmogrify to use smart text vectorizer

Describe the proposed solution
change transmogrify to use smart text vectorizer
add argument cleanKeys to SmartTextMapVectorizer
set MaxCategoricalCardinality to be 30 and use previous default for other settings
fix test that failed due to the change

Describe alternatives you've considered
N/A

Additional context
N/A

@salesforce-cla
Copy link

Thanks for the contribution! It looks like @sxd929 is an internal user so signing the CLA is not required. However, we need to confirm this.

@@ -184,12 +184,18 @@ private[op] case object Transmogrifier {
case t if t =:= weakTypeOf[TextAreaMap] =>
val (f, other) = castAs[TextAreaMap](g)
// Explicitly set cleanText to false here in order to match behavior of Text vectorization
f.vectorize(shouldPrependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
f.smartVectorize(maxCategoricalCardinality = TextTokenizer.MaxCategoricalCardinality,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It think we should add MaxCategoricalCardinality and he rest of the missing defaults to TransmogrifierDefaults

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sxd929 I also meant TextTokenizer.AutoDetectLanguage etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdyt? @leahmcguire

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! true, thanks, I think this arg is the only added default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked to Leah and fixed, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tovbinm I have reverted the changes, but it seems that we can also set default in transmogrify as, for example, AutoDetectLanguage = TextTokenizer.AutoDetectLanguage and limit the use to transmogrify and vectorize, what do you think?

@jamesward
Copy link
Contributor

@sxd929 I've invited you to the org: https://github.com/salesforce

Once accepted, you can kick the CLA bot: https://cla.salesforce.com/status/salesforce/TransmogrifAI/pull/63

@codecov
Copy link

codecov bot commented Aug 16, 2018

Codecov Report

Merging #63 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   85.88%   85.88%   +<.01%     
==========================================
  Files         294      294              
  Lines        9521     9530       +9     
  Branches      320      320              
==========================================
+ Hits         8177     8185       +8     
- Misses       1344     1345       +1
Impacted Files Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 78.94% <ø> (-5.27%) ⬇️
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.68% <100%> (+0.09%) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 89.58% <0%> (+4.16%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0d6ecb...ed77626. Read the comment docs.

val vectorized = Seq(textMap).transmogrify()
it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {

val vectorized = textMap.vectorize(trackNulls = TransmogrifierDefaults.TrackNulls,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: You can just do textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

f.smartVectorize(maxCategoricalCardinality = MaxCategoricalCardinality,
numHashes = DefaultNumOfFeatures, autoDetectLanguage = TextTokenizer.AutoDetectLanguage,
minTokenLength = TextTokenizer.MinTokenLength, toLowercase = TextTokenizer.ToLowercase,
prependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why these weren't following the defaults in TransmogrifierDefaults in the first place? CleanText is set to true there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! it seems that this was an issue with vectorizer but smart vectorizer fixed it, fixed, thanks a lot!

@@ -541,7 +540,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
val retrieved = SanityCheckerSummary.fromMetadata(summary.getSummaryMetadata())

// Check that all of the hashed text columns (and the null indicator column itself) are thrown away
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the comments to agree with the new behavior too? The text field is detected as categorical and pivoted now instead of being hashed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed! thanks!

@@ -575,7 +574,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {

// Drop the whole hash space but not the null indicator column (it has an indicator group, so does not get
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! fixed! thanks!

val vectorized = Seq(textMap).transmogrify()
it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {

val vectorized = textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line seems redundant?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! let's merge this as it is now.

vector.v.size < TransmogrifierDefaults.DefaultNumOfFeatures + (TransmogrifierDefaults.TopK + 2) * 3 shouldBe true
vector.v.size >= TransmogrifierDefaults.DefaultNumOfFeatures + 6 shouldBe true
vector.v.size < (TransmogrifierDefaults.TopK + 2) * 5 shouldBe true
vector.v.size >= 10 shouldBe true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sxd929 should there be also a computed value instead of 10?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sxd929 also please use a better matcher syntax: vector.v.size should be >= 10, since it surfaces the error better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed! thanks!!

@tovbinm tovbinm merged commit b816e66 into salesforce:master Aug 20, 2018
ericwayman pushed a commit that referenced this pull request Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants