Allow TextStats length distribution to be token-based and refactor for testability #464
Conversation
Codecov Report
@@             Coverage Diff             @@
##           master     #464    +/-   ##
=========================================
  Coverage   86.98%   86.99%
=========================================
  Files         345      345
  Lines       11575    11616      +41
  Branches      376      376
=========================================
+ Hits        10069    10105      +36
- Misses       1506     1511       +5

Continue to review the full report at Codecov.
def tokenize(
  text: Text,
def tokenizeString(
  textString: String,
Now this function can explode with a NullPointerException if textString is null, whereas before that could not happen.
Hmmm, I'm not sure it's possible for the textString argument to be null in practice, though. When this function is used for tokenizing map entries, a value that was originally null will simply not show up as an entry in the map. When it's used for tokenizing a normal Text entry, we should have already safely converted any nulls or missing elements into an Option[String], right?
The actual tokenize call during vectorization is still tokenize(v.toText), where v is the value in a text map. I'd actually argue that it should be changed to tokenizeString(v) to save the time spent converting to Text and back again.

I agree it's technically less safe, but I don't think null checking is necessary at this point in the flow. We should make sure the data gets created in a safe way, which I think we already do. Are there specific edge cases I'm missing?
I think the simplest fix is to add a null check in tokenizeString and return early.
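For illustration, such a guard could look like the following (a minimal, self-contained sketch, not the real tokenizeString, which takes additional parameters not shown here; the whitespace split merely stands in for the actual analyzer call):

    // Hedged sketch: guard against a raw null and return no tokens in that case.
    def tokenizeString(textString: String): Seq[String] =
      if (textString == null) Seq.empty[String]
      else textString.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq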
Now that I think about it more, I'm pretty sure the old tokenize function would also throw an NPE if you fed it a sneaky null value. The SomeValue.unapply function explicitly calls v.isEmpty, which would also fail if v were null.

I put back the old tokenize function as oldTokenize and tried:

    val sneakyStringOpt: Option[String] = null
    val myText = Text(sneakyStringOpt)
    val res = TextTokenizer.oldTokenize(myText)

which did indeed throw an NPE.
We have tests all over the place (e.g. our vectorizer tests and FeatureTypeSparkConverterTest) that make sure we can handle null values in dataframes and safely convert them into our types. I'm not aware of any explicit null checks in our functions elsewhere, so it just feels weird to put one here.

@leahmcguire any opinions on this?
SomeValue.unapply operates on value, which is an Option[String]. The null check is done during the construction of Text, when the values are extracted from the DataFrame/RDD, so a NullPointerException is indeed unlikely to be thrown.
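To make that concrete, the construction-time null handling being described works roughly like this (a simplified stand-in for the actual Text feature type, not its real class hierarchy):

    // Option(value) maps a null String to None, so downstream matches on the
    // Option never see a raw null. Note the sneaky example above sidesteps this
    // by making the Option itself null, which no constructor guard can catch.
    class Text(val value: Option[String]) {
      def isEmpty: Boolean = value.isEmpty
    }
    object Text {
      def apply(value: String): Text = new Text(Option(value)) // null-safe entry point
    }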
The example you provided is not currently possible, and also not a fair one :)
@Jauntbox is this only called from the Option[String] version below? If so, make it private and it's fine.
In fact, please make them both private.
Some questions on length distribution and token filtering.
  case Right(doubleSeq) => doubleSeq.map(_.toString)
}
stringVals.foldLeft(TextStats.empty)((acc, el) => acc + SmartTextVectorizer.computeTextStats(
  Option(el), shouldCleanText = false, maxCardinality = RawFeatureFilter.MaxCardinality)
Should this be shouldCleanText = true instead?
I can change it, but I don't think it matters much here. These values aren't used in SmartTextVectorizer; they're the ones that show up in the ModelInsights.
.foldLeft(Map.empty[Int, Long])(
  (acc, el) => TextStats.additionHelper(acc, Map(el.length -> 1L), maxCardinality)
)
val (valueCounts, lengthCounts) = text match {
When we reach RawFeatureFilter.MaxCardinality for valueCounts, will lengthCounts also stop accumulating?
Never mind, this is taken care of by val newLengthCounts = additionHelper(l.lengthCounts, r.lengthCounts, maxCardinality), so please disregard this comment :D
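For readers following along, a cardinality-capped merge in the spirit of additionHelper can be sketched like this (the generic signature and tie-breaking are illustrative, not the exact implementation):

    // Merge two count maps, but once either side already tracks more than
    // maxCardinality distinct keys, stop accumulating and keep that map as-is.
    def additionHelper[K](l: Map[K, Long], r: Map[K, Long], maxCardinality: Int): Map[K, Long] =
      if (l.size > maxCardinality) l
      else if (r.size > maxCardinality) r
      else l ++ r.map { case (k, v) => k -> (v + l.getOrElse(k, 0L)) }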
.getOrElse(Seq(lowerTxt))
.map { sentence =>
  val tokens = analyzer.analyze(sentence, language)
  tokens.filter(_.length >= minTokenLength).toTextList
Why are we only keeping tokens with length >= minTokenLength?
This was existing behavior. It's a configurable parameter (defaulting to 1), so the filtering isn't required.
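In other words, the filter is a no-op at the default setting; a quick illustration (a standalone sketch, not the TextTokenizer API):

    // With the default minTokenLength = 1 no non-empty token is dropped;
    // raising the threshold filters out short tokens.
    def filterTokens(tokens: Seq[String], minTokenLength: Int = 1): Seq[String] =
      tokens.filter(_.length >= minTokenLength)

    filterTokens(Seq("a", "of", "text"))                     // Seq("a", "of", "text")
    filterTokens(Seq("a", "of", "text"), minTokenLength = 3) // Seq("text")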
lgtm
Just a heads up on a few more commits: adding a toggle for tokenization in text lengths. Tokenization will cause problems with Chinese/Korean text, given our current tokenizers.
…ts and refactored a bit
…lity in SmartTextVectorizer
Ok - ready for a final look. Sorry for the last-minute refactoring, but I realized we needed this toggle exposed for experiments. Final refactoring:
    shouldCleanText = shouldCleanText,
    shouldTokenize = tokenizeForLengths,
    maxCardinality = RawFeatureFilter.MaxCardinality)
  )
}
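For context, the two length-distribution modes the new shouldTokenize flag switches between can be sketched as follows (computeTextStats is simplified here to just the length logic, and the whitespace split stands in for the real tokenizer):

    // Token-based lengths count each token's length; entry-based lengths count
    // the length of the whole raw entry exactly once.
    def lengthCounts(entry: String, shouldTokenize: Boolean): Map[Int, Long] = {
      val lengths =
        if (shouldTokenize) entry.split("\\s+").toSeq.map(_.length)
        else Seq(entry.length)
      lengths.groupBy(identity).map { case (len, occ) => len -> occ.size.toLong }
    }

    lengthCounts("hello world", shouldTokenize = true)  // Map(5 -> 2)
    lengthCounts("hello world", shouldTokenize = false) // Map(11 -> 1)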
private def countStringValues[T](seq: Seq[T]): Map[String, Long] = { |
Not relevant to this PR, but I think countStringValues is no longer used.
Ok, I can remove it then.
@@ -169,7 +169,8 @@ private[filters] object PreparedFeatures {
   case SomeValue(v: DenseVector) => Map((name, None) -> Right(v.toArray.toSeq))
   case SomeValue(v: SparseVector) => Map((name, None) -> Right(v.indices.map(_.toDouble).toSeq))
   case ft@SomeValue(_) => ft match {
-    case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
+    // case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
We are no longer tokenizing text during data prep?
Whoops, that was for testing - forgot to take it out.
lgtm!
…tionModelSelector to see if this speeds up tests significantly
@@ -322,6 +328,8 @@ trait RichMapFeature {
   .setHashSpaceStrategy(hashSpaceStrategy)
   .setHashAlgorithm(hashAlgorithm)
   .setBinaryFreq(binaryFreq)
+  .setTokenizeForLengths(tokenizeForLengths)
Can we make this an enum rather than a boolean? Then we have room to expand in the future.
Let's switch to an enum for the new flag to stem the proliferation of booleans, and then LGTM.
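One possible shape for that enum, sketched with hypothetical names (a sealed trait leaves room for future strategies, e.g. a language-aware mode, without adding yet another boolean):

    sealed trait TextLengthType
    object TextLengthType {
      case object FullEntry extends TextLengthType // lengths of whole entries
      case object Tokens    extends TextLengthType // lengths of individual tokens
    }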
Related issues
n/a
Describe the proposed solution
Tests did not catch that the token length distributions added to TextStats were actually entry length distributions. This PR refactors some of the functions in TextTokenizer, SmartTextVectorizer, and SmartTextMapVectorizer so that they are directly testable. It also adds more robust tests to check desired behavior of the TextStats object.
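As an example of the kind of check the refactor makes possible (a hedged sketch; the exact computeTextStats signature and TextStats fields may differ):

    // A directly testable assertion: with tokenization on, the distribution
    // should contain two 5-character tokens, not one 11-character entry.
    val stats = SmartTextVectorizer.computeTextStats(
      Some("hello world"),
      shouldCleanText = false,
      shouldTokenize = true,
      maxCardinality = 100
    )
    assert(stats.lengthCounts == Map(5 -> 2L))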
Describe alternatives you've considered
n/a
Additional context
n/a