Enable Html stripping #478

Merged
merged 30 commits into from
Jun 10, 2020
Changes from 5 commits
Commits
8805e3e
HTML
mweilsalesforce May 20, 2020
ee87945
increasing limit
mweilsalesforce May 20, 2020
20343a7
Not that large
mweilsalesforce May 20, 2020
a5f8591
Not Near 1 but 1
mweilsalesforce May 26, 2020
0f01904
Merge branch 'master' into HTMLStrip
michaelweilsalesforce May 26, 2020
59cf931
Merge branch 'master' into HTMLStrip
TuanNguyen27 Jun 1, 2020
5093c66
make html stripping settable
TuanNguyen27 Jun 1, 2020
ac0f692
Merge branch 'HTMLStrip' of https://github.com/salesforce/Transmogrif…
TuanNguyen27 Jun 1, 2020
7655e2c
Update Transmogrifier.scala
TuanNguyen27 Jun 1, 2020
7086050
bring back the tests
TuanNguyen27 Jun 1, 2020
740b727
bring back tests
TuanNguyen27 Jun 1, 2020
a9f44b2
scala style
TuanNguyen27 Jun 1, 2020
262efd4
Update RichTextFeature.scala
TuanNguyen27 Jun 1, 2020
a6e4045
documentation update
TuanNguyen27 Jun 1, 2020
5b17167
try a new test
TuanNguyen27 Jun 1, 2020
072d9b9
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 1, 2020
715e1da
trying some prints..
TuanNguyen27 Jun 1, 2020
2a91be0
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 5, 2020
f974c39
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 5, 2020
29a343d
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 5, 2020
e01d3e2
fix whitespace & add some more complicated html
TuanNguyen27 Jun 5, 2020
0983191
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 8, 2020
9143056
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 8, 2020
b48717a
Update SmartTextVectorizerTest.scala
TuanNguyen27 Jun 8, 2020
6cf6e96
enable html stripping via flag
TuanNguyen27 Jun 8, 2020
91a7095
Update TextTokenizerTest.scala
TuanNguyen27 Jun 8, 2020
d66f976
Update TextTokenizerTest.scala
TuanNguyen27 Jun 8, 2020
49a81fd
fix tests
TuanNguyen27 Jun 8, 2020
c5c7f8a
Update RichTextFeature.scala
TuanNguyen27 Jun 9, 2020
97b9ce8
Update RichTextFeature.scala
TuanNguyen27 Jun 9, 2020
@@ -381,8 +381,10 @@ final class SmartTextMapVectorizerModel[T <: OPMap[String]] private[op]
val keysText = keysHash + keysIgnore // Go algebird!
val categoricalVector = categoricalPivotFn(rowCategorical)

- val rowHashTokenized = rowHash.map(_.value.map { case (k, v) => k -> tokenize(v.toText).tokens })
- val rowIgnoreTokenized = rowIgnore.map(_.value.map { case (k, v) => k -> tokenize(v.toText).tokens })
+ val rowHashTokenized = rowHash.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText,
+   analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens })
+ val rowIgnoreTokenized = rowIgnore.map(_.value.map { case (k, v) => k -> tokenize(text = v.toText,
+   analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens })
Collaborator

Rather than just changing the default, please make this a settable parameter.

val rowTextTokenized = rowHashTokenized + rowIgnoreTokenized // Go go algebird!
val hashVector = hash(rowHashTokenized, keysHash, args.hashingParams)
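
The review comment above asks for HTML stripping to be a settable parameter rather than a changed default. A minimal, self-contained sketch of that shape follows. Every name in it (`SketchTokenizer`, `setStripHtml`) is hypothetical, and the regex is a naive stand-in for `TextTokenizer.AnalyzerHtmlStrip`; it is not the code that was merged.

```scala
// Hypothetical sketch: none of these names are TransmogrifAI APIs.
final class SketchTokenizer {
  // Off by default, so existing behavior only changes when a caller opts in.
  private var stripHtml: Boolean = false

  // Fluent setter, mirroring the builder style used by the vectorizers in this PR.
  def setStripHtml(value: Boolean): this.type = { stripHtml = value; this }

  def tokenize(text: String): Seq[String] = {
    // Naive regex stand-in for a real HTML-stripping analyzer (assumption).
    val cleaned = if (stripHtml) text.replaceAll("<[^>]*>", " ") else text
    cleaned.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
  }
}

object SketchTokenizerDemo {
  def main(args: Array[String]): Unit = {
    val html = "<p>Hello <b>world</b></p>"
    // Default: tag names leak into the token stream.
    println(new SketchTokenizer().tokenize(html).mkString(", "))
    // Opt-in: only the visible text is tokenized.
    println(new SketchTokenizer().setStripHtml(true).tokenize(html).mkString(", "))
  }
}
```

Keeping the flag off by default means existing pipelines are unaffected unless the caller explicitly enables stripping.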

@@ -342,8 +342,10 @@ final class SmartTextVectorizerModel[T <: Text] private[op]
val textToHash = groups.getOrElse(TextVectorizationMethod.Hash, Array.empty).map(_._1)

val categoricalVector: OPVector = categoricalPivotFn(textToPivot)
- val textTokens: Seq[TextList] = textToHash.map(tokenize(_).tokens)
- val ignorableTextTokens: Seq[TextList] = textToIgnore.map(tokenize(_).tokens)
+ val textTokens: Seq[TextList] = textToHash.map(t => tokenize(text = t,
+   analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens)
+ val ignorableTextTokens: Seq[TextList] = textToIgnore.map(t => tokenize(text = t,
+   analyzer = TextTokenizer.AnalyzerHtmlStrip).tokens)
Collaborator

Same here: please make this settable.

val textVector: OPVector = hash[TextList](textTokens, getTextTransientFeatures, args.hashingParams)
val textNullIndicatorsVector = if (args.shouldTrackNulls) {
getNullIndicatorsVector(textTokens ++ ignorableTextTokens)
@@ -53,7 +53,7 @@ private[op] trait TransmogrifierDefaults {
val NullString: String = OpVectorColumnMetadata.NullString
val OtherString: String = OpVectorColumnMetadata.OtherString
val DefaultNumOfFeatures: Int = 512
- val MaxNumOfFeatures: Int = 16384
+ val MaxNumOfFeatures: Int = 1638400000
Collaborator

The number above is 2^14; please choose a number in bytes.

Collaborator

I think we don't need to change this number, because I lifted this limit in #477.

val DateListDefault: DateListPivot = DateListPivot.SinceLast
val ReferenceDate: org.joda.time.DateTime = DateTimeUtils.now()
val TopK: Int = 20
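
For reference, the arithmetic behind the first comment on `MaxNumOfFeatures` can be checked directly. The sketch below confirms that the original limit, 16384, equals 2^14 and that the proposed 1638400000 is not a power of two. `isPowerOfTwo` is a hypothetical helper, and reading the comment as a preference for power-of-two sizing is an assumption; the comment itself does not spell that out.

```scala
object PowerOfTwoCheck {
  // A positive Int is a power of two exactly when it has a single set bit.
  // Hypothetical helper for illustration only.
  def isPowerOfTwo(n: Int): Boolean = n > 0 && (n & (n - 1)) == 0

  def main(args: Array[String]): Unit = {
    println(16384 == (1 << 14))       // original limit is 2^14
    println(isPowerOfTwo(16384))      // original limit
    println(isPowerOfTwo(1638400000)) // proposed limit: 2^19 * 3125
  }
}
```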
@@ -607,10 +607,9 @@ class SmartTextMapVectorizerTest
val transformed = new OpWorkflow().setResultFeatures(output).transform(countryMapDF)
assertVectorLength(transformed, output, TransmogrifierDefaults.TopK + 2, Pivot)
}
- it should "treat the edge case of coverage being near 1" in {
+ it should "treat the edge case of coverage being 1" in {
  val maxCard = 100
- val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0 - 1e-10).setMaxCardinality(maxCard)
-   .setMinSupport(1)
+ val vectorizer = new SmartTextMapVectorizer().setCoveragePct(1.0).setMaxCardinality(maxCard).setMinSupport(1)
Collaborator

This seems to defeat the purpose of the test.

.setTrackTextLen(true).setInput(rawCatCountryMap)
val output = vectorizer.getOutput()
val transformed = new OpWorkflow().setResultFeatures(output).transform(countryMapDF)
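
One possible reason a threshold "near 1" is a different edge case from exactly 1 is floating-point accumulation: a running sum of shares that should total 1.0 can land just below it. The sketch below illustrates only that general effect. `categoriesNeeded` is a hypothetical helper and is an assumption about how a coverage threshold might be applied; it is not how `SmartTextMapVectorizer` computes coverage.

```scala
object CoverageThresholdSketch {
  // Hypothetical helper: given category shares already sorted from most to
  // least frequent, return how many categories are needed before the running
  // total reaches `coveragePct`, or None if the total never reaches it.
  def categoriesNeeded(sortedShares: Seq[Double], coveragePct: Double): Option[Int] = {
    val cumulative = sortedShares.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ >= coveragePct)
    if (idx >= 0) Some(idx + 1) else None
  }

  def main(args: Array[String]): Unit = {
    // Ten equally frequent categories, each with a share of 0.1.
    val shares = Seq.fill(10)(0.1)
    // The running total ends slightly below 1.0 because of rounding.
    println(shares.foldLeft(0.0)(_ + _) < 1.0)
    println(categoriesNeeded(shares, 1.0))         // threshold never reached
    println(categoriesNeeded(shares, 1.0 - 1e-10)) // reached with all ten
  }
}
```

Under these assumptions the two thresholds exercise different code paths, which is consistent with the concern that changing `1.0 - 1e-10` to `1.0` changes what the test covers.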
@@ -288,9 +288,9 @@ class SmartTextVectorizerTest
val transformed = new OpWorkflow().setResultFeatures(output).transform(countryDF)
assertVectorLength(transformed, output, TransmogrifierDefaults.TopK + 2, Pivot)
}
- it should "treat the edge case of coverage being near 1" in {
+ it should "treat the edge case of coverage being 1" in {
  val maxCard = 100
- val vectorizer = new SmartTextVectorizer().setCoveragePct(1.0 - 1e-10).setMaxCardinality(maxCard).setMinSupport(1)
+ val vectorizer = new SmartTextVectorizer().setCoveragePct(1.0).setMaxCardinality(maxCard).setMinSupport(1)
.setTrackTextLen(true).setInput(rawCatCountry)
val output = vectorizer.getOutput()
val transformed = new OpWorkflow().setResultFeatures(output).transform(countryDF)