Adds options for tracking text length in text vectorizers #195
…aps), and tests for each
…ut feature vector and tested this
…st loading since we changed the arguments to SmartTextVectorizerModel
…text-len-defaults
Codecov Report
@@            Coverage Diff             @@
##           master     #195     +/-   ##
==========================================
- Coverage   86.39%   86.26%   -0.13%
==========================================
  Files         310      310
  Lines       10019    10058     +39
  Branches      351      548    +197
==========================================
+ Hits         8656     8677     +21
- Misses       1363     1381     +18
case (true, true) =>
  val textLengths = new TextMapLenEstimator[TextMap]().setInput(f +: others).getOutput()
  val nullIndicators = new TextMapNullEstimator[TextMap]().setInput(f +: others).getOutput()
  new VectorsCombiner().setInput(Seq(hashedFeatures, textLengths, nullIndicators): _*).getOutput()
why not .setInput(hashedFeatures, textLengths, nullIndicators)?
same elsewhere
Oh no idea why I did that - fixed
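For reference, the two call forms discussed in this thread are interchangeable whenever the target method takes a varargs parameter. A minimal sketch, using a hypothetical `combine` function as a stand-in (it is assumed here, not confirmed, that `setInput` and `combine` accept varargs in the same way):

```scala
object VarargsDemo {
  // Hypothetical stand-in for a varargs method such as setInput or combine.
  def combine(first: String, rest: String*): Seq[String] = first +: rest

  // Wrapping the trailing arguments in a Seq and expanding them with `: _*` ...
  val splatted: Seq[String] = combine("hashed", Seq("textLen", "nullIndicators"): _*)

  // ... gives the same result as passing them directly.
  val direct: Seq[String] = combine("hashed", "textLen", "nullIndicators")
}
```

The `Seq(...): _*` form is only needed when the arguments already live in a collection.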
@@ -204,6 +204,7 @@ trait RichMapFeature {
    blackListKeys: Array[String] = Array.empty,
    others: Array[FeatureLike[TextMap]] = Array.empty,
    trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
    trackTextLen: Boolean = TransmogrifierDefaults.TrackTextLen,
update doc
done
categoricalVector.combine(textVector, textNullIndicatorsVector: _*)
categoricalVector.combine(textVector, Seq(textLenVector, textNullIndicatorsVector): _*)
categoricalVector.combine(textVector, textLenVector, textNullIndicatorsVector)
yep
val unseen = Option($(unseenName))

val categoricalColumns = if (categoricalFeatures.nonEmpty) {
  makeVectorColumnMetadata(shouldTrackNulls, unseen, smartTextParams.categoricalTopValues, categoricalFeatures)
} else Array.empty[OpVectorColumnMetadata]
val textColumns = if (textFeatures.nonEmpty) {
  makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++ textFeatures.map(_.toColumnMetaData(isNull = true))
  if (shouldTrackLen) {
    makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++
can this be DRYed out?
Hmmm, best I can come up with is
    val textColumns = if (textFeatures.nonEmpty) {
      makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++
        (if (shouldTrackLen) textFeatures.map(_.toColumnMetaData(descriptorValue =
          OpVectorColumnMetadata.TextLenString)) else Array.empty[OpVectorColumnMetadata]) ++
        (if (shouldTrackNulls) textFeatures.map(_.toColumnMetaData(isNull = true))
        else Array.empty[OpVectorColumnMetadata])
    } else Array.empty[OpVectorColumnMetadata]
which looks less readable to me...
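One possible way to keep the single-expression form while hiding the nested `if`/`else` branches is a small helper that yields its columns only when a flag is set. This is a sketch under stated assumptions, not what the PR does: `ColumnMeta`, `when`, and `textColumns` are hypothetical names, `ColumnMeta` is a simplified stand-in for `OpVectorColumnMetadata`, and the `"TextLen"` descriptor is a placeholder value.

```scala
object MetadataColumnsDemo {
  // Hypothetical, simplified stand-in for OpVectorColumnMetadata.
  final case class ColumnMeta(feature: String, descriptor: Option[String] = None, isNull: Boolean = false)

  // Returns the columns only when the flag is set; the by-name parameter
  // means they are not even built when the flag is false.
  def when(flag: Boolean)(columns: => Seq[ColumnMeta]): Seq[ColumnMeta] =
    if (flag) columns else Seq.empty

  def textColumns(features: Seq[String], trackLen: Boolean, trackNulls: Boolean): Seq[ColumnMeta] =
    if (features.isEmpty) Seq.empty
    else {
      // Placeholder for the hashed-feature columns (one per feature here, for brevity).
      val hashed = features.map(f => ColumnMeta(f))
      hashed ++
        when(trackLen)(features.map(f => ColumnMeta(f, descriptor = Some("TextLen")))) ++
        when(trackNulls)(features.map(f => ColumnMeta(f, isNull = true)))
    }
}
```

Whether this reads better than the explicit branches is a judgment call; it mainly helps if more optional column groups are added later.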
categoricalVector.combine(textVector, textNullIndicatorsVector: _*)
categoricalVector.combine(textVector, Seq(textLenVector, textNullIndicatorsVector): _*)
same categoricalVector.combine(textVector, textLenVector, textNullIndicatorsVector)
yep
/**
 * Param that decides whether or not lengths of text are tracked during vectorization
 */
trait TrackTextLenParam extends Params {
once this param is added - will existing models fail to load or not?
Yes, existing models will fail to load. That's why I had to regenerate the old model that we test loading with.
Let's indicate this in the PR description so we won't forget to include it in our release notes.
meta.history.keys shouldBe Set(f1.name, f2.name)
meta.columns.length shouldBe 12
meta.columns.foreach { col =>
  if (col.index < 4) {
omg, these if/else are horrible. any better ideas?! ;)
Not at the moment. An alternative would be to do explicit comparisons on all the array indices, eg.
meta.columns(1).parentFeatureName shouldBe Seq(f1.name)
meta.columns(1).grouping shouldBe None
which I think is even worse. I don't think the if/elses are that bad - they're just checking certain ranges of the feature vector. I think those explicit comparisons need to be there regardless since it's a unit test checking the output of a specific input.
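A third option, somewhere between the if/else chain and per-index comparisons, would be a table that maps index ranges to the expected values and a single loop that checks every column against it. The sketch below is only an illustration: `Column`, `expectedParents`, and `mismatches` are hypothetical names, and the 12-column layout is invented rather than taken from the actual test.

```scala
object RangeChecksDemo {
  // Hypothetical, simplified column metadata: the index and the parent feature names.
  final case class Column(index: Int, parentFeatureName: Seq[String])

  // Each entry maps a half-open index range to the parent feature expected there.
  // The layout is invented for illustration.
  val expectedParents: Seq[(Range, Seq[String])] = Seq(
    (0 until 4) -> Seq("f1"),
    (4 until 8) -> Seq("f2"),
    (8 until 10) -> Seq("f1"),
    (10 until 12) -> Seq("f2")
  )

  // Returns the indices of columns that match no entry in the table
  // (empty when every column is as expected).
  def mismatches(columns: Seq[Column]): Seq[Int] =
    columns.collect {
      case col if !expectedParents.exists { case (range, parent) =>
        range.contains(col.index) && parent == col.parentFeatureName
      } => col.index
    }
}
```

In a ScalaTest suite the check would then reduce to a single assertion that `mismatches(...)` is empty, with the expected layout readable at a glance.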
lgtm, see comments
@Jauntbox are you planning to address the comments?
…rifAI into km/text-len-defaults
Related issues
N/A
Describe the proposed solution
This PR adds options for tracking text length in the relevant text vectorizers:
SmartTextVectorizer, SmartTextMapVectorizer, TextMapHashingVectorizer, as well as the vectorize and smartVectorize shortcuts for Text, TextArea, TextMap, and TextAreaMap
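To illustrate what the new option changes in the output, here is a minimal, self-contained sketch of a hashing vectorizer with optional text-length and null-indicator columns, appended in the same order as in the reviewed code (hashed features, then text length, then null indicator). Everything in it is an assumption made for illustration: `TextLenDemo`, `vectorize`, and `numHashes` are hypothetical names, whitespace tokenization and `hashCode` bucketing stand in for the library's real tokenizer and hash function, and text length is taken to be the summed length of the tokens.

```scala
object TextLenDemo {
  // Hypothetical number of hash buckets.
  val numHashes = 4

  // Assumption: whitespace tokenization; a missing value yields no tokens.
  def tokens(text: Option[String]): Seq[String] =
    text.toSeq.flatMap(_.split("\\s+").toSeq).filter(_.nonEmpty)

  def vectorize(text: Option[String], trackTextLen: Boolean, trackNulls: Boolean): Vector[Double] = {
    val toks = tokens(text)
    // Count tokens per hash bucket.
    val hashed = Array.fill(numHashes)(0.0)
    toks.foreach(t => hashed(java.lang.Math.floorMod(t.hashCode, numHashes)) += 1.0)
    // Optional extra columns: summed token length, then a null indicator.
    val len = if (trackTextLen) Vector(toks.map(_.length).sum.toDouble) else Vector.empty[Double]
    val nul = if (trackNulls) Vector(if (text.isEmpty) 1.0 else 0.0) else Vector.empty[Double]
    hashed.toVector ++ len ++ nul
  }
}
```

With both flags enabled, each text feature contributes two extra columns after its hashed block; with both disabled, the output is the hashed block alone.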
Describe alternatives you've considered
N/A
Additional context
This is the second part of #187