[ML][FEATURE] SPARK-5566: RegEx Tokenizer #4504
Conversation
A more complex tokenizer that extracts tokens based on a regex. It also allows turning lowercasing on and off, setting a minimum token length, and supplying a list of stop words to exclude.
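As a rough illustration, here is a minimal usage sketch in Scala. It assumes the setter names the class converged on during review (setPattern, setMinTokenLength); the draft in this PR also carried lowercasing and stop-word parameters, so treat the exact API as approximate:

import org.apache.spark.ml.feature.RegexTokenizer

// Sketch only: assumes the post-review API (pattern, minTokenLength).
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setPattern("\\W+")      // token delimiter: one or more non-word characters
  .setMinTokenLength(1)    // drop zero-length matches

// `df` is assumed to be a DataFrame with a string "sentence" column.
val tokenized = tokenizer.transform(df)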
Can one of the admins verify this patch?
Just FYI, there's some similar code in LDAExample that may be a useful reference.
This is not meant to be a standalone tokenizer but rather part of a pipeline.
Cool, I just didn't see that in the PR description.
Do you mean restricting tokens to a predetermined set of words?
@aborsu985 Please follow the steps in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark to prepare a PR. If this is called
The regex and stop-word parameters are now part of the parameter grid.
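For context, a hedged sketch of what putting tokenizer parameters into the grid might look like; the param names here (pattern, minTokenLength) are illustrative, not necessarily the exact ones in this PR:

import org.apache.spark.ml.tuning.ParamGridBuilder

// Sketch: cross-validate over tokenizer settings alongside other stages.
val paramGrid = new ParamGridBuilder()
  .addGrid(tokenizer.pattern, Array("\\s+", "\\W+"))
  .addGrid(tokenizer.minTokenLength, Array(1, 3))
  .build()
// `paramGrid` can then be passed to CrossValidator.setEstimatorParamMaps(...).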
@mengxr
def setLowercase(value: Boolean) = set(lowerCase, value)
def getLowercase: Boolean = get(lowerCase)

val minLength = new IntParam(this,
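The IntParam construction above is cut off in the diff view; a plausible completion, assuming the Spark 1.3-era Param constructor of (parent, name, doc, defaultValue) and reflecting the review feedback below, would be:

// Assumed shape (name `minTokenLength`, default 1, minimum inclusive):
val minTokenLength = new IntParam(this, "minTokenLength",
  "minimum token length (>= 1)", Some(1))
def setMinTokenLength(value: Int) = set(minTokenLength, value)
def getMinTokenLength: Int = get(minTokenLength)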
Add doc and update code style. What's the case when we match a token with the regex but its length is zero? Should we control it in the regex, e.g., \d{5,}?
Btw, it is not intuitive that the min value is excluded. Could we remove "excluded" and set the default value to 1? And it might be better to call it minTokenLength, if we want to keep it.
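To make the regex-only alternative concrete, a tiny Scala example of encoding a minimum length directly in the pattern:

// \d{5,} matches only runs of five or more digits, so shorter tokens
// never appear and no separate minTokenLength parameter is needed.
val text = "ids: 12 12345 678901"
val longIds = "\\d{5,}".r.findAllIn(text).toList  // List("12345", "678901")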
I removed "excluded" as it is indeed unusual, and set the default value to 1, which is standard.
case Row(tokens: Seq[Any], wantedTokens: Seq[Any]) =>
  assert(tokens === wantedTokens)
case e =>
  throw new SparkException(s"Row $e should contain only tokens and wantedTokens columns")
SparkException should happen on workers. Since data is already collected, we can use fail("..."). For this test, maybe the following is sufficient:

.collect()
  .foreach { case Row(actual, expected) =>
    assert(actual === expected)
  }
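Assembled into a complete check (a sketch; the column names follow the snippet above and `dataset` is an assumed test DataFrame):

tokenizer.transform(dataset)
  .select("tokens", "wantedTokens")
  .collect()
  .foreach { case Row(actual, expected) =>
    assert(actual === expected)  // fails the test on the driver, no exception needed
  }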
I don't know a formatter that can do everything correctly. I use IntelliJ with the default Scala code style (except indent 2), and I have to manually adjust the indentation when chopping down the args.
Plus some cosmetic changes.
Test build #28990 has started for PR 4504 at commit
Test build #28990 has finished for PR 4504 at commit
Test PASSed.
Test build #29059 has started for PR 4504 at commit
Test build #29059 has finished for PR 4504 at commit
Test PASSed.
@aborsu985 I sent you a PR with some updates at https://github.com/aborsu985/spark/pull/1. Please merge the current master and check the diff. Thanks!
Test build #29153 has started for PR 4504 at commit
Test build #29154 has started for PR 4504 at commit
Test build #29153 has finished for PR 4504 at commit
Test PASSed.
Test build #29154 has finished for PR 4504 at commit
Test PASSed.
@mengxr Thank you for your help with the Java unit tests. As you may have guessed, I'm new to both Scala and Java, and I was drowning in it.
LGTM. Merged into master. Thanks for contributing!
Added a regex-based tokenizer for ml.
Currently the regex is fixed, but if I could add a regex-type parameter to the paramMap,
changing the tokenizer regex could be a parameter used in cross-validation.
Also, I wonder what would be the best way to add a stop word list.
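On the stop-word question, one simple approach (a sketch, not necessarily what the PR does) is to filter the token sequence against a Set after tokenizing:

// Hypothetical post-processing step: drop tokens found in a stop-word set.
val stopWords = Set("the", "a", "an", "of")
val tokens = "The quick brown fox".toLowerCase.split("\\s+").toSeq
val filtered = tokens.filterNot(stopWords.contains)  // Seq("quick", "brown", "fox")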