Enable Html stripping #478

michaelweilsalesforce · 2020-05-26T18:26:43Z

Related issues

When engineering features from Text (and Text-like) raw features, we should strip any HTML tags from the text: they add no signal to the existing tokens and even pollute them.

Describe the proposed solution

Enable html stripping via TextTokenizer.AnalyzerHtmlStrip
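
To illustrate why tags pollute the tokens, here is a minimal stdlib-only sketch. It is not the TransmogrifAI or Lucene implementation: the regex-based `stripHtml` is a naive stand-in for an HTML-stripping analyzer, and the names `HtmlStripSketch`, `stripHtml` and `tokenize` are hypothetical.

```scala
object HtmlStripSketch {
  // Naive stand-in for an HTML-stripping analyzer (assumption: a regex is
  // enough for this illustration; a real analyzer handles far more cases).
  private val Tag = "<[^>]*>".r

  // Replace every tag with a space so neighbouring words stay separated.
  def stripHtml(text: String): String = Tag.replaceAllIn(text, " ")

  // Lowercase and split on anything that is not a letter or digit.
  def tokenize(text: String, stripHtmlFirst: Boolean): Seq[String] = {
    val cleaned = if (stripHtmlFirst) stripHtml(text) else text
    cleaned.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq
  }
}
```

With stripping enabled, `<p>Hello <b>world</b></p>` yields only the two word tokens; without it, the tag names `p` and `b` show up as extra tokens.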

michaelweilsalesforce · 2020-05-26T18:27:53Z

This PR doesn't introduce options yet

codecov · 2020-05-26T18:49:45Z

Codecov Report

Merging #478 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master     #478    +/-   ##
========================================
  Coverage   87.00%   87.01%            
========================================
  Files         345      345            
  Lines       11673    11680     +7     
  Branches      388      613   +225     
========================================
+ Hits        10156    10163     +7     
  Misses       1517     1517

Impacted Files	Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala	`67.64% <ø> (ø)`
.../scala/com/salesforce/op/dsl/RichTextFeature.scala	`82.19% <100.00%> (+0.24%)`	⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala	`100.00% <100.00%> (ø)`
...e/op/stages/impl/feature/SmartTextVectorizer.scala	`95.20% <100.00%> (+0.03%)`	⬆️
...esforce/op/stages/impl/feature/TextTokenizer.scala	`97.36% <100.00%> (+0.14%)`	⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala	`98.05% <100.00%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eba38a0...97b9ce8.

leahmcguire · 2020-05-27T16:34:44Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextMapVectorizer.scala

+    val rowHashTokenized = rowHash.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText,
+      analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens })
+    val rowIgnoreTokenized = rowIgnore.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText,
+      analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens })


Rather than just changing the default, please make this a settable parameter.
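
A rough sketch of what a settable flag could look like, assuming a fluent setter in the style of the `setCoveragePct(...).setMaxCardinality(...)` chain used in the tests. `SketchVectorizer`, `setStripHtml`, `getStripHtml` and `analyzerName` are hypothetical names for illustration, not the library's API.

```scala
// Hypothetical vectorizer-like class; not TransmogrifAI's actual API.
final class SketchVectorizer {
  // Off by default, so existing callers see no behaviour change.
  private var stripHtml: Boolean = false

  // Fluent setter: returns this.type so calls can be chained.
  def setStripHtml(value: Boolean): this.type = { stripHtml = value; this }

  def getStripHtml: Boolean = stripHtml

  // The analyzer is chosen from the flag instead of being hard-coded.
  def analyzerName: String = if (stripHtml) "html-strip" else "default"
}
```

Callers who want the new behaviour opt in with `new SketchVectorizer().setStripHtml(true)`; everyone else keeps the previous default.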

leahmcguire · 2020-05-27T16:35:06Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizer.scala

+      val textTokens: Seq[TextList] = textToHash.map(t => tokenize(text = t,
+        analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens)
+      val ignorableTextTokens: Seq[TextList] = textToIgnore.map(t => tokenize(text = t,
+        analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens)


Same here, make this settable

leahmcguire · 2020-05-27T16:37:04Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults {
  val NullString: String = OpVectorColumnMetadata.NullString
  val OtherString: String = OpVectorColumnMetadata.OtherString
  val DefaultNumOfFeatures: Int = 512
-  val MaxNumOfFeatures: Int = 16384
+  val MaxNumOfFeatures: Int = 1638400000


The above number is 2^14; please choose a number in bytes.

I think we don't need to change this number, because I lifted this limit in #477.

leahmcguire · 2020-05-27T16:38:26Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/SmartTextMapVectorizerTest.scala

    val maxCard = 100
-    val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0 - 1e-10).setMaxCardinality(maxCard)
-      .setMinSupport(1)
+    val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0).setMaxCardinality(maxCard).setMinSupport(1)


This seems to defeat the purpose of the test.

TuanNguyen27 · 2020-06-04T00:06:45Z

@leahmcguire @Jauntbox could you take a look at this PR? I'm not sure how to test my changes :(

gerashegalov · 2020-06-05T16:35:57Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/Transmogrifier.scala

@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults {
  val NullString: String = OpVectorColumnMetadata.NullString
  val OtherString: String = OpVectorColumnMetadata.OtherString
  val DefaultNumOfFeatures: Int = 512
-  val MaxNumOfFeatures: Int = 16384
+  val MaxNumOfFeatures: Int = 131072 // 2^17


If 2^17 is important, we can set it explicitly as 1 << 17 instead of using a comment.

It's not really important... I used 2^17 because @leahmcguire wanted it to be a power of 2.

Well, 1 << n is a way to demonstrate that you are definitely using a power of 2.

OK, I'll change it to 1 << n then. @leahmcguire @Jauntbox any objection?
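
For reference, a small sketch of the shift form discussed above, plus a quick check that a value is a power of 2. The object name `PowerOfTwoSketch` and the helper `isPowerOfTwo` are illustrative and not part of the PR.

```scala
object PowerOfTwoSketch {
  // The shift makes the power-of-2 intent explicit: 1 << 17 == 131072.
  val MaxNumOfFeatures: Int = 1 << 17

  // A positive Int is a power of 2 exactly when it has a single bit set.
  def isPowerOfTwo(n: Int): Boolean = n > 0 && (n & (n - 1)) == 0
}
```

By this check the original limit 16384 (1 << 14) is a power of 2, while the interim value 1638400000 is not.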

gerashegalov · 2020-06-05T16:37:37Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizerTest.scala

@@ -137,6 +147,39 @@ class SmartTextVectorizerTest
    smart shouldBe expected
  }

+  it should "detect one categorical and one non-categorical text feature with html data" in {


The test is largely copy and paste. Can we make it a function or a behavior? https://www.scalatest.org/user_guide/sharing_tests

I'm not entirely sure this is the right way to test this function either... @mweilsalesforce @Jauntbox @leahmcguire any advice?

Look at the text tokenizer test: https://github.com/salesforce/TransmogrifAI/blob/master/core/src/test/scala/com/salesforce/op/stages/impl/feature/TextTokenizerTest.scala
It has HTML stripping tests.
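
One way to read the "make it a function" suggestion, sketched with the standard library only: a single shared check parameterised by the input, so the plain-text and HTML cases differ only in their data. The tokenizer here is a naive hypothetical stand-in, and `SharedCheckSketch`, `tokenizesTo` and `cases` are made-up names; inside a ScalaTest spec the same idea would be expressed with the shared-tests mechanism from the linked guide.

```scala
object SharedCheckSketch {
  // Hypothetical tokenizer under test: drops tags, lowercases, splits on whitespace.
  def tokenize(text: String): Seq[String] =
    text.replaceAll("<[^>]*>", " ").toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // The shared check: each test case becomes one call instead of a copied test body.
  def tokenizesTo(input: String, expected: Seq[String]): Boolean =
    tokenize(input) == expected

  // The plain and HTML variants of the same text should produce the same tokens.
  val cases: Seq[(String, Seq[String])] = Seq(
    "plain text here" -> Seq("plain", "text", "here"),
    "<div>plain <i>text</i> here</div>" -> Seq("plain", "text", "here")
  )

  def allPass: Boolean = cases.forall { case (in, exp) => tokenizesTo(in, exp) }
}
```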

gerashegalov · 2020-06-05T16:38:48Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizerTest.scala

+      smartVector -> combined
+    }.unzip
+
+    println(smart)


let us remove println statements

gerashegalov · 2020-06-06T17:54:53Z

Let us actually fill out the form for the PR description to set the context :)

leahmcguire · 2020-06-09T19:23:52Z

core/src/main/scala/com/salesforce/op/dsl/RichTextFeature.scala

    ): FeatureLike[TextList] = {
-
+      // html stripping won't work here due since LuceneRegexTextAnalyzer


If HTML stripping will not work, then please don't put it as an input.

Please remove the set call too.

The set call actually belongs to SmartTextVectorizer, so it's separate from tokenizeRegex (from which I've removed the htmlStripping flag).

leahmcguire

Minor param change and then LGTM

leahmcguire

Just remove the set call on the shortcut that doesn't actually support HTML stripping, and then LGTM.

mweilsalesforce added 4 commits May 20, 2020 13:34

HTML

8805e3e

increasing limit

ee87945

Not that large

20343a7

Not Near 1 but 1

a5f8591

michaelweilsalesforce requested a review from Jauntbox May 26, 2020 18:26

michaelweilsalesforce requested review from gerashegalov, leahmcguire, tovbinm and wsuchy as code owners May 26, 2020 18:26

Merge branch 'master' into HTMLStrip

0f01904

salesforce-cla bot added the cla:signed label May 26, 2020

michaelweilsalesforce added the work in progress label May 26, 2020

leahmcguire requested changes May 27, 2020

TuanNguyen27 added 9 commits June 1, 2020 08:14

Merge branch 'master' into HTMLStrip

59cf931

make html stripping settable

5093c66

Merge branch 'HTMLStrip' of https://github.com/salesforce/TransmogrifAI…

ac0f692

… into HTMLStrip

Update Transmogrifier.scala

7655e2c

bring back the tests

7086050

bring back tests

740b727

scala style

a9f44b2

Update RichTextFeature.scala

262efd4

documentation update

a6e4045

TuanNguyen27 requested a review from leahmcguire June 1, 2020 17:51

TuanNguyen27 added 3 commits June 1, 2020 14:15

try a new test

5b17167

Update SmartTextVectorizerTest.scala

072d9b9

trying some prints..

715e1da

gerashegalov reviewed Jun 5, 2020

TuanNguyen27 added 4 commits June 5, 2020 12:38

Update SmartTextVectorizerTest.scala

2a91be0

Update SmartTextVectorizerTest.scala

f974c39

Update SmartTextVectorizerTest.scala

29a343d

fix whitespace & add some more complicated html

e01d3e2

TuanNguyen27 added 6 commits June 8, 2020 10:45

Update SmartTextVectorizerTest.scala

0983191

Update SmartTextVectorizerTest.scala

9143056

Update SmartTextVectorizerTest.scala

b48717a

enable html stripping via flag

6cf6e96

Update TextTokenizerTest.scala

91a7095

Update TextTokenizerTest.scala

d66f976

TuanNguyen27 reviewed Jun 8, 2020

fix tests

49a81fd

TuanNguyen27 added ready for review and removed work in progress labels Jun 8, 2020

Update RichTextFeature.scala

c5c7f8a

leahmcguire reviewed Jun 9, 2020

leahmcguire requested changes Jun 9, 2020

Update RichTextFeature.scala

97b9ce8

TuanNguyen27 requested a review from leahmcguire June 9, 2020 19:28

leahmcguire approved these changes Jun 10, 2020

TuanNguyen27 merged commit e48831a into master Jun 10, 2020

TuanNguyen27 added a commit that referenced this pull request Jun 10, 2020

Revert "Enable Html stripping (#478)"

c3a4443

This reverts commit e48831a.

nicodv mentioned this pull request Jun 11, 2020

0.7.0 release #481

Merged

tovbinm deleted the HTMLStrip branch June 12, 2020 01:33
