Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds options for tracking text length in text vectorizers #195

Merged
merged 17 commits into from
Jan 8, 2019

Conversation

Jauntbox
Copy link
Contributor

Related issues
N/A

Describe the proposed solution
This PR adds options for tracking text length in the relevant text vectorizers:
SmartTextVectorizer, SmartTextMapVectorizer, TextMapHashingVectorizer, as well as the vectorize and smartVectorize shortcuts for Text, TextArea, TextMap, and TextAreaMap

Describe alternatives you've considered
N/A

Additional context
This is the second part of #187

@codecov
Copy link

codecov bot commented Dec 14, 2018

Codecov Report

Merging #195 into master will decrease coverage by 0.12%.
The diff coverage is 60.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #195      +/-   ##
==========================================
- Coverage   86.39%   86.26%   -0.13%     
==========================================
  Files         310      310              
  Lines       10019    10058      +39     
  Branches      351      548     +197     
==========================================
+ Hits         8656     8677      +21     
- Misses       1363     1381      +18
Impacted Files Coverage Δ
.../com/salesforce/op/features/TransientFeature.scala 63.15% <0%> (-7.44%) ⬇️
...s/impl/feature/OPCollectionHashingVectorizer.scala 96.47% <100%> (+0.1%) ⬆️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 97.67% <100%> (+0.27%) ⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.79% <100%> (+0.04%) ⬆️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100% <100%> (ø) ⬆️
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 73.23% <14.28%> (-8.3%) ⬇️
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 84.61% <28.57%> (-5.71%) ⬇️
...cala/com/salesforce/op/OpWorkflowModelReader.scala 89.13% <0%> (-2.18%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 421dc9e...d6e17ea. Read the comment docs.

@tovbinm tovbinm changed the title Km/text len defaults Adds options for tracking text length in text vectorizers Dec 15, 2018
case (true, true) =>
val textLengths = new TextMapLenEstimator[TextMap]().setInput(f +: others).getOutput()
val nullIndicators = new TextMapNullEstimator[TextMap]().setInput(f +: others).getOutput()
new VectorsCombiner().setInput(Seq(hashedFeatures, textLengths, nullIndicators): _*).getOutput()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not .setInput(hashedFeatures, textLengths, nullIndicators)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same elsewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no idea why I did that - fixed

@@ -204,6 +204,7 @@ trait RichMapFeature {
blackListKeys: Array[String] = Array.empty,
others: Array[FeatureLike[TextMap]] = Array.empty,
trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
trackTextLen: Boolean = TransmogrifierDefaults.TrackTextLen,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


categoricalVector.combine(textVector, textNullIndicatorsVector: _*)
categoricalVector.combine(textVector, Seq(textLenVector, textNullIndicatorsVector): _*)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categoricalVector.combine(textVector, textLenVector, textNullIndicatorsVector)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

val unseen = Option($(unseenName))

val categoricalColumns = if (categoricalFeatures.nonEmpty) {
makeVectorColumnMetadata(shouldTrackNulls, unseen, smartTextParams.categoricalTopValues, categoricalFeatures)
} else Array.empty[OpVectorColumnMetadata]
val textColumns = if (textFeatures.nonEmpty) {
makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++ textFeatures.map(_.toColumnMetaData(isNull = true))
if (shouldTrackLen) {
makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be DRYed out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, best I can come up with is

val textColumns = if (textFeatures.nonEmpty) {
      
      makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++
        (if (shouldTrackLen) textFeatures.map(_.toColumnMetaData(descriptorValue =
          OpVectorColumnMetadata.TextLenString)) else Array.empty[OpVectorColumnMetadata]) ++
        (if (shouldTrackNulls) textFeatures.map(_.toColumnMetaData(isNull = true))
      else Array.empty[OpVectorColumnMetadata])
      
    } else Array.empty[OpVectorColumnMetadata]

which looks less readable to me...


categoricalVector.combine(textVector, textNullIndicatorsVector: _*)
categoricalVector.combine(textVector, Seq(textLenVector, textNullIndicatorsVector): _*)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same categoricalVector.combine(textVector, textLenVector, textNullIndicatorsVector)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

/**
* Param that decides whether or not lengths of text are tracked during vectorization
*/
trait TrackTextLenParam extends Params {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once this param is added - will existing models fail to load or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes existing models will fail to load. That's why I had to re-generate the old model that we test loading with

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets indicate this in the pr description so we wont forget to include it in our release notes

meta.history.keys shouldBe Set(f1.name, f2.name)
meta.columns.length shouldBe 12
meta.columns.foreach { col =>
if (col.index < 4) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omg, these if/else are horrible. any better ideas?! ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment. An alternative would be to do explicit comparisons on all the array indices, eg.

meta.columns(1).parentFeatureName shouldBe Seq(f1.name)
meta.columns(1).grouping shouldBe None

which I think is even worse. I don't think the if/elses are that bad - they're just checking certain ranges of the feature vector. I think those explicit comparisons need to be there regardless since it's a unit test checking the output of a specific input.

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, see comments

@tovbinm
Copy link
Collaborator

tovbinm commented Jan 7, 2019

@Jauntbox are you planning to address the comments?

@tovbinm tovbinm merged commit e47b042 into master Jan 8, 2019
@tovbinm tovbinm deleted the km/text-len-defaults branch January 8, 2019 21:03
@Jauntbox Jauntbox mentioned this pull request Feb 8, 2019
@tovbinm tovbinm mentioned this pull request Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants