Transmogrify to use smart vectorizer #63

sxd929 · 2018-08-16T22:17:35Z

Related issues
Refer to issue(s) addressed in this pull request from [Issues]
change transmogrify to use smart text vectorizer

Describe the proposed solution
change transmogrify to use smart text vectorizer
add argument cleanKeys to SmartTextMapVectorizer
set MaxCategoricalCardinality to be 30 and use previous default for other settings
fix test that failed due to the change

Describe alternatives you've considered
N/A

Additional context
N/A

salesforce-cla · 2018-08-16T22:17:37Z

Thanks for the contribution! It looks like @sxd929 is an internal user so signing the CLA is not required. However, we need to confirm this.

tovbinm · 2018-08-16T22:22:18Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

@@ -184,12 +184,18 @@ private[op] case object Transmogrifier {
        case t if t =:= weakTypeOf[TextAreaMap] =>
          val (f, other) = castAs[TextAreaMap](g)
          // Explicitly set cleanText to false here in order to match behavior of Text vectorization
-          f.vectorize(shouldPrependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
+          f.smartVectorize(maxCategoricalCardinality = TextTokenizer.MaxCategoricalCardinality,


I think we should add MaxCategoricalCardinality and the rest of the missing defaults to TransmogrifierDefaults

@sxd929 I also meant TextTokenizer.AutoDetectLanguage etc.

wdyt? @leahmcguire

good catch! true, thanks, I think this arg is the only added default

talked to Leah and fixed, thanks!

@tovbinm I have reverted the changes, but it seems that we can also set default in transmogrify as, for example, AutoDetectLanguage = TextTokenizer.AutoDetectLanguage and limit the use to transmogrify and vectorize, what do you think?
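One option discussed in this thread is to keep every default that transmogrify relies on in a single place and re-export the tokenizer settings from there. A minimal sketch of that idea follows. `TokenizerDefaultsSketch` and `TransmogrifierDefaultsSketch` are hypothetical stand-ins, and every value except the cardinality of 30 from the PR description is a placeholder rather than the library's actual default.

```scala
// Hypothetical stand-in for the tokenizer's own defaults (placeholder values).
object TokenizerDefaultsSketch {
  val AutoDetectLanguage: Boolean = false
  val MinTokenLength: Int = 1
  val ToLowercase: Boolean = true
}

// Hypothetical central defaults trait: tokenizer settings are re-exported
// rather than duplicated, so callers only need to look in one place.
trait TransmogrifierDefaultsSketch {
  val MaxCategoricalCardinality: Int = 30 // value stated in the PR description
  val AutoDetectLanguage: Boolean = TokenizerDefaultsSketch.AutoDetectLanguage
  val MinTokenLength: Int = TokenizerDefaultsSketch.MinTokenLength
  val ToLowercase: Boolean = TokenizerDefaultsSketch.ToLowercase
}

// Companion object so the defaults can be referenced without mixing in the trait.
object TransmogrifierDefaultsSketch extends TransmogrifierDefaultsSketch
```

Re-exporting keeps the two sets of defaults from drifting apart, at the cost of coupling the central trait to the tokenizer object.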

jamesward · 2018-08-16T22:32:26Z

@sxd929 I've invited you to the org: https://github.com/salesforce

Once accepted, you can kick the CLA bot: https://cla.salesforce.com/status/salesforce/TransmogrifAI/pull/63

codecov · 2018-08-16T22:47:01Z

Codecov Report

Merging #63 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   85.88%   85.88%   +<.01%     
==========================================
  Files         294      294              
  Lines        9521     9530       +9     
  Branches      320      320              
==========================================
+ Hits         8177     8185       +8     
- Misses       1344     1345       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...n/scala/com/salesforce/op/dsl/RichMapFeature.scala | `78.94% <ø> (-5.27%)` | ⬇️ |
| ...sforce/op/stages/impl/feature/Transmogrifier.scala | `96.68% <100%> (+0.09%)` | ⬆️ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | `89.58% <0%> (+4.16%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c0d6ecb...ed77626.

kinfaikan · 2018-08-17T18:10:58Z

core/src/test/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerTest.scala

-    val vectorized = Seq(textMap).transmogrify()
+  it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {
+
+    val vectorized = textMap.vectorize(trackNulls = TransmogrifierDefaults.TrackNulls,


Minor: You can just do textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText).
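The shorter call works because Scala lets a caller name only the arguments it wants to set and fall back to the declared defaults for the rest. A small sketch of that mechanism follows. The `vectorize` signature, the `Settings` case class and the values in `Defaults` are hypothetical and only assume that `cleanText` is the one parameter without a default.

```scala
object NamedArgsSketch {
  // Placeholder defaults; the real values live in the library and may differ.
  object Defaults {
    val CleanText: Boolean = true
    val CleanKeys: Boolean = false
    val TrackNulls: Boolean = true
  }

  // Records which settings a call ended up with.
  final case class Settings(cleanText: Boolean, cleanKeys: Boolean, trackNulls: Boolean)

  // Assumption: only cleanText lacks a default, so it is the only required argument.
  def vectorize(
    cleanText: Boolean,
    cleanKeys: Boolean = Defaults.CleanKeys,
    trackNulls: Boolean = Defaults.TrackNulls
  ): Settings = Settings(cleanText, cleanKeys, trackNulls)

  // Spelling out every default explicitly ...
  val verbose: Settings = vectorize(
    cleanText = Defaults.CleanText,
    cleanKeys = Defaults.CleanKeys,
    trackNulls = Defaults.TrackNulls
  )

  // ... gives the same result as passing only the required argument.
  val concise: Settings = vectorize(cleanText = Defaults.CleanText)
}
```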

Jauntbox · 2018-08-20T17:43:55Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

+          f.smartVectorize(maxCategoricalCardinality = MaxCategoricalCardinality,
+            numHashes = DefaultNumOfFeatures, autoDetectLanguage = TextTokenizer.AutoDetectLanguage,
+            minTokenLength = TextTokenizer.MinTokenLength, toLowercase = TextTokenizer.ToLowercase,
+            prependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,


Is there a reason why these weren't following the defaults in TransmogrifierDefaults in the first place? CleanText is set to true there.

good point! it seems that this was an issue with vectorizer but smart vectorizer fixed it, fixed, thanks a lot!

Jauntbox · 2018-08-20T17:51:31Z

core/src/test/scala/com/salesforce/op/stages/impl/preparators/BadFeatureZooTest.scala

@@ -541,7 +540,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
    val retrieved = SanityCheckerSummary.fromMetadata(summary.getSummaryMetadata())

    // Check that all of the hashed text columns (and the null indicator column itself) are thrown away


Can you change the comments to agree with the new behavior too? The text field is detected as categorical and pivoted now instead of being hashed.

fixed! thanks!
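The behavior change behind these comment updates can be illustrated with a small sketch. It assumes the smart vectorizer chooses between pivoting and hashing by comparing the number of distinct values against `maxCategoricalCardinality`. The `Strategy` type and `chooseStrategy` function are hypothetical names for illustration, not the library's API.

```scala
object SmartVectorizeSketch {
  sealed trait Strategy
  case object Pivot extends Strategy // treat the column as categorical
  case object Hash extends Strategy  // treat the column as free text

  // Assumption: a column counts as categorical when the number of distinct
  // non-empty values does not exceed the threshold.
  def chooseStrategy(values: Seq[Option[String]], maxCategoricalCardinality: Int): Strategy = {
    val cardinality = values.flatten.distinct.size
    if (cardinality <= maxCategoricalCardinality) Pivot else Hash
  }
}
```

Under this assumption, a test column with only a handful of distinct values is pivoted at a threshold of 30, which is why the comments about a hash space no longer describe what the test exercises.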

Jauntbox · 2018-08-20T17:51:39Z

core/src/test/scala/com/salesforce/op/stages/impl/preparators/BadFeatureZooTest.scala

@@ -575,7 +574,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {

    // Drop the whole hash space but not the null indicator column (it has an indicator group, so does not get


Update comment here too

good catch! fixed! thanks!

tovbinm · 2018-08-20T20:34:01Z

core/src/test/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerTest.scala

-    val vectorized = Seq(textMap).transmogrify()
+  it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {
+
+    val vectorized = textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText)


this line seems redundant?

tovbinm

lgtm! let's merge this as it is now.

tovbinm · 2018-08-20T22:05:25Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/TextTransmogrifyTest.scala

-      vector.v.size < TransmogrifierDefaults.DefaultNumOfFeatures + (TransmogrifierDefaults.TopK + 2) * 3 shouldBe true
-      vector.v.size >= TransmogrifierDefaults.DefaultNumOfFeatures + 6 shouldBe true
+      vector.v.size < (TransmogrifierDefaults.TopK + 2) * 5  shouldBe true
+      vector.v.size >= 10 shouldBe true


@sxd929 should there also be a computed value instead of 10?

@sxd929 also please use a better matcher syntax: vector.v.size should be >= 10, since it surfaces the error better.

fixed! thanks!!
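One way to replace the literal 10 with a computed value is to derive both bounds from the same quantities. The sketch below is a reading of the diff, not the test's actual code. It assumes five pivoted features (matching the `* 5` in the diff) that each contribute at least two columns, and it uses a placeholder for `TopK`.

```scala
object VectorSizeBoundsSketch {
  val TopK: Int = 20              // placeholder, not necessarily the library default
  val NumFeatures: Int = 5        // assumption: matches the "* 5" in the diff
  val MinColsPerFeature: Int = 2  // assumption: e.g. an "other" column and a null indicator

  // Inclusive lower bound, replacing the literal 10.
  val lowerBound: Int = NumFeatures * MinColsPerFeature
  // Exclusive upper bound, as in the diff: (TopK + 2) * 5.
  val upperBound: Int = NumFeatures * (TopK + 2)

  def withinBounds(size: Int): Boolean = size >= lowerBound && size < upperBound
}
```

With ScalaTest matchers, the corresponding checks would read `vector.v.size should be >= lowerBound` and `vector.v.size should be < upperBound`, which report the actual size on failure instead of just `false did not equal true`.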


to smart vectorize and fix test

9788ccb

sxd929 requested review from leahmcguire and tovbinm as code owners August 16, 2018 22:17

salesforce-cla bot added the cla:missing label Aug 16, 2018

sxd929 requested a review from kinfaikan August 16, 2018 22:20

tovbinm reviewed Aug 16, 2018


sxd929 requested a review from Jauntbox August 16, 2018 22:37

salesforce-cla bot removed the cla:missing label Aug 16, 2018

address comments and move defaults to transmogrify

b3cae1b

kinfaikan reviewed Aug 17, 2018


tovbinm and others added 3 commits August 17, 2018 11:34

Merge branch 'master' into xs/transmogrifyToSmartVectorizer

24431b8

Merge branch 'master' into xs/transmogrifyToSmartVectorizer

abaf96d

Merge branch 'master' into xs/transmogrifyToSmartVectorizer

7558891

Jauntbox reviewed Aug 20, 2018


sxd929 and others added 4 commits August 20, 2018 11:34

move text tokenizer default to transmogrifier

fea6720

address comments

f0891d2

Merge branch 'master' into xs/transmogrifyToSmartVectorizer

ec65d94

revert tokenizer default changes

9a7dd41

tovbinm reviewed Aug 20, 2018


tovbinm approved these changes Aug 20, 2018


tovbinm reviewed Aug 20, 2018


tovbinm and others added 4 commits August 20, 2018 15:40

Merge branch 'master' into xs/transmogrifyToSmartVectorizer

8adac31

fix test and add comments

ab575ad

Merge branch 'xs/transmogrifyToSmartVectorizer' of https://github.com…

a643144

…/sxd929/TransmogrifAI into xs/transmogrifyToSmartVectorizer

fix test

ed77626

tovbinm merged commit b816e66 into salesforce:master Aug 20, 2018

ericwayman pushed a commit that referenced this pull request Feb 8, 2019

Transmogrify to use smart vectorizer (#63)

35a03b1

salesforce-cla bot added the cla:signed label Jul 1, 2020

salesforce-cla bot added cla:missing and removed cla:signed labels Oct 28, 2020
