-
Notifications
You must be signed in to change notification settings - Fork 393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Html stripping #478
Conversation
This PR doesn't introduce options yet |
Codecov Report
@@ Coverage Diff @@
## master #478 +/- ##
========================================
Coverage 87.00% 87.01%
========================================
Files 345 345
Lines 11673 11680 +7
Branches 388 613 +225
========================================
+ Hits 10156 10163 +7
Misses 1517 1517
Continue to review full report at Codecov.
|
val rowHashTokenized = rowHash.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText, | ||
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens }) | ||
val rowIgnoreTokenized = rowIgnore.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText, | ||
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than just changing the default please make this a settable parameter
val textTokens: Seq[TextList] = textToHash.map(t => tokenize(text = t, | ||
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens) | ||
val ignorableTextTokens: Seq[TextList] = textToIgnore.map(t => tokenize(text = t, | ||
analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, make this settable
@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults { | |||
val NullString: String = OpVectorColumnMetadata.NullString | |||
val OtherString: String = OpVectorColumnMetadata.OtherString | |||
val DefaultNumOfFeatures: Int = 512 | |||
val MaxNumOfFeatures: Int = 16384 | |||
val MaxNumOfFeatures: Int = 1638400000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the above number is 2^14 please choose a number in bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we don't need to change this number, because i lifted this limit in #477
val maxCard = 100 | ||
val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0 - 1e-10).setMaxCardinality(maxCard) | ||
.setMinSupport(1) | ||
val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0).setMaxCardinality(maxCard).setMinSupport(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these seems to defeat the purpose of the test
@leahmcguire @Jauntbox could you take a look at this PR ? I'm not sure how to test my changes :( |
@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults { | |||
val NullString: String = OpVectorColumnMetadata.NullString | |||
val OtherString: String = OpVectorColumnMetadata.OtherString | |||
val DefaultNumOfFeatures: Int = 512 | |||
val MaxNumOfFeatures: Int = 16384 | |||
val MaxNumOfFeatures: Int = 131072 // 2^17 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if 2^17 is important we can set it explicitly 1 << 17
instead of a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not really important...i used 2^17 because @leahmcguire wanted it to be a power of 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well 1 << n
is a way to demonstrate that you are definitely using a power of 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i'll change it to 1 << n
then, @leahmcguire @Jauntbox any objection ?
@@ -137,6 +147,39 @@ class SmartTextVectorizerTest | |||
smart shouldBe expected | |||
} | |||
|
|||
it should "detect one categorical and one non-categorical text feature with html data" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test is lagerly copy&paste. Can we make it a function or a behavior https://www.scalatest.org/user_guide/sharing_tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure this is the right way to test this function either... @mweilsalesforce @Jauntbox @leahmcguire any advice ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at the text tokenizer test https://github.com/salesforce/TransmogrifAI/blob/master/core/src/test/scala/com/salesforce/op/stages/impl/feature/TextTokenizerTest.scala
it has html stripping tests
smartVector -> combined | ||
}.unzip | ||
|
||
println(smart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let us remove println statements
Let us actually fill out the form for the PR description to set the context :) |
core/src/test/scala/com/salesforce/op/stages/impl/feature/TextTokenizerTest.scala
Outdated
Show resolved
Hide resolved
): FeatureLike[TextList] = { | ||
|
||
// html stripping won't work here due since LuceneRegexTextAnalyzer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if HTML stripping will not work then please dont put it as an input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove the set call too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the set call actually belongs to SmartTextVectorizer
so it's separate from tokenizeRegex
(which i've removed htmlStripping flag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor param change and then LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remove the set call on the shortcut that doesnt actually support HTML stripping an then LGTM
This reverts commit e48831a.
Related issues
When engineering features from a
Text
(andText
-like) raw features, we should strip the text of any html tags, which doesn't add signal to existing tokens (and even pollutes them).Describe the proposed solution
Enable html stripping via
TextTokenizer.AnalyzerHtmlStrip