Allow TextStats length distribution to be token-based and refactor for testability #464

Merged · 28 commits · Mar 26, 2020

Conversation

@Jauntbox (Contributor) commented Mar 5, 2020:

Related issues
n/a

Describe the proposed solution
Tests did not catch that the token length distributions added to TextStats were actually entry length distributions. This PR refactors some of the functions in TextTokenizer, SmartTextVectorizer, and SmartTextMapVectorizer so that they are directly testable. It also adds more robust tests to check desired behavior of the TextStats object.
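
To make the bug concrete, here is a simplified sketch (hypothetical code, not the actual TextStats implementation) of the two distributions:

// Simplified illustration; the real TextStats fields differ.
val entry  = "hello world"
val tokens = Seq("hello", "world")

// What was recorded: one count for the whole entry's length.
val entryLengthCounts: Map[Int, Long] = Map(entry.length -> 1L)  // Map(11 -> 1)

// What was intended: a count per token length.
val tokenLengthCounts: Map[Int, Long] =
  tokens.groupBy(_.length).map { case (len, ts) => len -> ts.size.toLong }  // Map(5 -> 2)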

Describe alternatives you've considered
n/a

Additional context
n/a

codecov bot commented Mar 5, 2020:

Codecov Report

Merging #464 into master will increase coverage by 0.00%.
The diff coverage is 92.30%.


@@           Coverage Diff           @@
##           master     #464   +/-   ##
=======================================
  Coverage   86.98%   86.99%           
=======================================
  Files         345      345           
  Lines       11575    11616   +41     
  Branches      376      376           
=======================================
+ Hits        10069    10105   +36     
- Misses       1506     1511    +5     
Impacted Files Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 67.64% <ø> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 81.94% <ø> (ø)
...a/com/salesforce/op/filters/RawFeatureFilter.scala 92.97% <ø> (ø)
...main/scala/com/salesforce/op/test/TestCommon.scala 40.90% <0.00%> (-9.10%) ⬇️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.58% <96.15%> (-0.03%) ⬇️
...om/salesforce/op/filters/FeatureDistribution.scala 98.70% <100.00%> (+0.03%) ⬆️
...sification/BinaryClassificationModelSelector.scala 98.24% <100.00%> (ø)
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (ø)
...esforce/op/stages/impl/feature/TextTokenizer.scala 97.22% <100.00%> (+0.07%) ⬆️
...sforce/op/stages/OpPipelineStageReaderWriter.scala 87.50% <100.00%> (+0.40%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ed4abfd...d78868d.

def tokenize(
text: Text,
def tokenizeString(
textString: String,
@tovbinm (Collaborator) commented Mar 6, 2020:
Now this function can explode with a NullPointerException if textString is null, whereas before that could not happen.

@Jauntbox (Contributor, author) replied Mar 6, 2020:

Hmmm, I'm not sure it's possible for the textString argument to be null in practice, though. When this function is used for tokenizing map entries, a value that was originally null simply won't show up as an entry in the map. When it's used for tokenizing a normal Text entry, we should have already safely converted any nulls or missing elements into an Option[String], right?

The actual tokenize call during vectorization is still tokenize(v.toText) where v is the value in a text map. I'd actually argue that that should be changed to tokenizeString(v) to save time converting it to Text and back again.

I agree it's technically less safe, but I don't think it's necessary to have null checking at this point in the flow. We should make sure the data gets created in a safe way, which I think we already do. Are there some specific edge cases I'm missing?
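
For context, a minimal runnable sketch of the two call shapes under discussion (a whitespace split and a simplified Text wrapper stand in for the real analyzer and feature type):

case class Text(value: Option[String])  // simplified stand-in

def tokenizeString(s: String): Seq[String] = s.split("\\s+").toSeq
def tokenize(t: Text): Seq[String] = t.value.map(tokenizeString).getOrElse(Seq.empty)

val v = "hello world"
tokenize(Text(Some(v)))  // current: String -> Text -> tokens (extra wrapping)
tokenizeString(v)        // proposed: String -> tokens directly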

Collaborator:

I think the simplest is to add a null check to tokenizeString and return.

@Jauntbox (Contributor, author) replied Mar 6, 2020:

Now that I think about it more, I'm pretty sure the old tokenize function would also give an NPE if you fed it a sneaky null value. The SomeValue.unapply function explicitly calls v.isEmpty, which would also fail if v were null.

I put back the old tokenize function as oldTokenize and tried

val sneakyStringOpt: Option[String] = null
val myText = Text(sneakyStringOpt)
val res = TextTokenizer.oldTokenize(myText)

which did indeed throw a NPE.

We have tests all over the place (e.g. our vectorizer tests and FeatureTypeSparkConverterTest) that make sure we can handle null values in dataframes and safely convert them into our types. I'm not aware of explicit null checks in our functions elsewhere, so it feels weird to put one here.

@leahmcguire any opinions on this?

@tovbinm (Collaborator) replied Mar 7, 2020:

SomeValue.unapply operates on value, which is an Option[String]. The null check is done during the construction of Text, when the values are extracted from the DataFrame/RDD. A NullPointerException is indeed unlikely to be thrown.
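
A small sketch of why the null never survives to that point (simplified types; the real conversion happens when raw values are lifted out of the DataFrame):

case class Text(value: Option[String])  // simplified stand-in

val raw: String = null
val text = Text(Option(raw))  // Option(null) == None: the null is normalized here
text.value.map(_.length)      // None; nothing left to dereference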

Collaborator:

The example you provided is not currently possible, and also not a fair one :)

Collaborator:

@Jauntbox is this only called from the Option[String] version below? If so, make it private and it is fine.

Collaborator:

In fact, make them both private, please.

@TuanNguyen27 (Collaborator) left a comment:

Some questions on length distribution and token filtering.

case Right(doubleSeq) => doubleSeq.map(_.toString)
}
stringVals.foldLeft(TextStats.empty)((acc, el) => acc + SmartTextVectorizer.computeTextStats(
Option(el), shouldCleanText = false, maxCardinality = RawFeatureFilter.MaxCardinality)
Collaborator:

Should shouldCleanText = true instead?

Contributor (author):

I can change it, but I don't think it matters much here. These values aren't used in SmartTextVectorizer; they're the ones that show up in ModelInsights.

.foldLeft(Map.empty[Int, Long])(
(acc, el) => TextStats.additionHelper(acc, Map(el.length -> 1L), maxCardinality)
)
val (valueCounts, lengthCounts) = text match {
Collaborator:

When we reach RawFeatureFilter.MaxCardinality for valueCounts, will lengthCounts also stop accumulating?

Collaborator:

Never mind, this is taken care of by val newLengthCounts = additionHelper(l.lengthCounts, r.lengthCounts, maxCardinality); please disregard this comment :D
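
For readers following along, a hedged sketch of the capping contract additionHelper is assumed to satisfy (illustrative body, not the library source): once a counts map holds maxCardinality distinct keys, previously unseen keys are dropped while existing keys keep accumulating.

// Illustrative only; the real additionHelper lives in TextStats.
def additionHelper[K](l: Map[K, Long], r: Map[K, Long], maxCardinality: Int): Map[K, Long] =
  r.foldLeft(l) { case (acc, (k, n)) =>
    if (acc.contains(k) || acc.size < maxCardinality) acc + (k -> (acc.getOrElse(k, 0L) + n))
    else acc  // cap reached: drop keys we haven't seen before
  }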

.getOrElse(Seq(lowerTxt))
.map { sentence =>
val tokens = analyzer.analyze(sentence, language)
tokens.filter(_.length >= minTokenLength).toTextList
Collaborator:

Why are we only keeping tokens with length >= minTokenLength?

Contributor (author):

This was existing behavior. It's a configurable parameter (defaulting to 1), so it is not required.
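
A quick illustration of the filter with a non-default setting (hypothetical values):

val minTokenLength = 2  // default is 1, which keeps all non-empty tokens
Seq("a", "an", "and").filter(_.length >= minTokenLength)  // Seq("an", "and")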

@TuanNguyen27 (Collaborator) left a comment:

lgtm

@Jauntbox (Contributor, author) commented:
Just a heads up on a few more commits - adding a toggle for tokenization in text lengths. It will cause problems with Chinese/Korean text based on our current tokenizers.

@Jauntbox (Contributor, author) commented Mar 19, 2020:

OK - ready for a final look. Sorry for the last-minute refactoring, but I realized we needed this toggle exposed for experiments.

Final refactoring:

  • Removed tokenizeStringOpt method since we can get by without it (should be more readable too)
  • Moved the methods that create TextStats instances out of the SmartTextVectorizer objects and into the TextStats object, since they make more sense there
  • Added a toggle to SmartTextVectorizer and SmartTextMapVectorizer to enable/disable tokenization when calculating length distributions in TextStats (see the sketch after this list)
  • Added tests to check that this toggle does what we want
  • Added logging of derived quantities and vectorization method used in SmartTextVectorizer and SmartTextMapVectorizer
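
A hedged sketch of how the toggle might be wired, based on the setTokenizeForLengths setter visible in the diff below (type parameter and other setters elided; note the review below asks to replace the boolean with an enum, so the merged API may differ):

// Hypothetical usage, not the merged API.
val vectorizer = new SmartTextVectorizer[Text]()
  .setTokenizeForLengths(true)  // true: token-level lengths; false: whole-entry lengths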

shouldCleanText = shouldCleanText,
shouldTokenize = tokenizeForLengths,
maxCardinality = RawFeatureFilter.MaxCardinality)
)
}

private def countStringValues[T](seq: Seq[T]): Map[String, Long] = {
Collaborator:

Not relevant to this PR, but I think countStringValues is no longer used.

Contributor (author):

OK, I can remove it then.

@@ -169,7 +169,8 @@ private[filters] object PreparedFeatures {
case SomeValue(v: DenseVector) => Map((name, None) -> Right(v.toArray.toSeq))
case SomeValue(v: SparseVector) => Map((name, None) -> Right(v.indices.map(_.toDouble).toSeq))
case ft@SomeValue(_) => ft match {
case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
// case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
Collaborator:

We are no longer tokenizing text during data prep?

Contributor (author):

Whoops, that was for testing - forgot to take it out.

@TuanNguyen27 (Collaborator):

lgtm!

@@ -322,6 +328,8 @@ trait RichMapFeature {
.setHashSpaceStrategy(hashSpaceStrategy)
.setHashAlgorithm(hashAlgorithm)
.setBinaryFreq(binaryFreq)
.setTokenizeForLengths(tokenizeForLengths)
Collaborator:

Can we make this an enum rather than a boolean? Then we have room to expand in the future.

@leahmcguire (Collaborator) left a comment:

Let's switch to an enum for the new flag to stem the proliferation of booleans, and then LGTM.
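
A hypothetical shape for such an enum (names assumed, not necessarily what was merged), in the enumeratum style used for other flags in this codebase:

import enumeratum._

sealed trait TextLengthType extends EnumEntry
object TextLengthType extends Enum[TextLengthType] {
  val values = findValues
  case object FullEntry extends TextLengthType  // measure whole-entry lengths
  case object Tokens    extends TextLengthType  // measure token lengths
}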

@Jauntbox changed the title from "Make TextStats length distribution token-based and refactor for testability" to "Allow TextStats length distribution to be token-based and refactor for testability" on Mar 26, 2020
@Jauntbox Jauntbox merged commit da52ad9 into master Mar 26, 2020
@Jauntbox Jauntbox deleted the km/token-lens3 branch March 26, 2020 17:16
@nicodv nicodv mentioned this pull request Jun 11, 2020