Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML][FEATURE] SPARK-5566: RegEx Tokenizer #4504

Closed
wants to merge 33 commits into from
Closed

[ML][FEATURE] SPARK-5566: RegEx Tokenizer #4504

wants to merge 33 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Feb 10, 2015

Added a Regex based tokenizer for ml.
Currently the regex is fixed but if I could add a regex type paramater to the paramMap,
changing the tokenizer regex could be a parameter used in the crossValidation.
Also I wonder what would be the best way to add a stop word list.

Augustin Borsu added 2 commits February 10, 2015 10:52
A more complex tokenizer that extracts tokens based on a regex. It also allows
to turn lowerCasing on and off, adding a minimum token length and a list of
stop words to exclude.
A more complex tokenizer that extracts tokens based on a regex. It also allows
to turn lowerCasing on and off, adding a minimum token length and a list of
stop words to exclude.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@hhbyyh
Copy link
Contributor

hhbyyh commented Feb 10, 2015

Just FYI, there's some similar code in LDAExample which can be of reference.

@ghost
Copy link
Author

ghost commented Feb 10, 2015

This is not meant to be a standalone tokenizer but rather part of a pipeline.
In that aim it has parameters that can be made to vary in order to decide which are best.
I did tthis in response to this issue.
https://issues.apache.org/jira/browse/SPARK-5566

@hhbyyh
Copy link
Contributor

hhbyyh commented Feb 10, 2015

Cool, just didn't saw that in the PR description.
And if possible, apart from the stopwords, an extra parameter representing the vocab will be handy. In some cases, the vocabulary has been previously settled. Just my thought.

@ghost
Copy link
Author

ghost commented Feb 10, 2015

Do you mean restricting tokens to a predetermined set of words?

@ghost ghost changed the title RegEx Tokenizer for mllib RegEx Tokenizer for ml Feb 10, 2015
@mengxr
Copy link
Contributor

mengxr commented Feb 10, 2015

@aborsu985 Please follow the steps in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark to prepare a PR.

If this is called RegexTokenizer, should the regex used to be configurable?

Augustin Borsu added 2 commits February 11, 2015 09:34
Regex and stopwords parameters are now part of the parametergrid
@ghost ghost changed the title RegEx Tokenizer for ml [ML][FEATURE]RegEx Tokenizer for ml Feb 11, 2015
@ghost ghost changed the title [ML][FEATURE]RegEx Tokenizer for ml [ML][FEATURE]RegEx Tokenizer Feb 11, 2015
@ghost
Copy link
Author

ghost commented Feb 11, 2015

@mengxr
I changed the title to be more specific about the change and enabled the regex to be configurable (as well as the stopwords). There is an issue on the Jira about this (https://issues.apache.org/jira/browse/SPARK-5566) and the code passes the tests when I run it on my machine. The change is quite small, adds no dependencies and should be useful to anyone who wants to validate a NLP model using pipelines.

@ghost ghost changed the title [ML][FEATURE]RegEx Tokenizer [ML][FEATURE] SPARK-5566: RegEx Tokenizer Feb 11, 2015
def setLowercase(value: Boolean) = set(lowerCase, value)
def getLowercase: Boolean = get(lowerCase)

val minLength = new IntParam(this,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add doc and update code style. What's the case when we match a token with regex but its length is zero? Should we control it in the regex, e.g., \d{5,}?

Btw, it is not intuitive that the min value is excluded. Could we remove "excluded" and set the default value to 1? And it might be better to call it minTokenLength, if we want to keep it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed excluded as it is indeed unusual and set the default value to 1 which is standard

case Row(tokens: Seq[Any], wantedTokens: Seq[Any]) =>
assert(tokens === wantedTokens)
case e =>
throw new SparkException(s"Row $e should contain only tokens and wantedTokens columns")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SparkException should happen on workers. Since data is already collected, we can use fail("..."). For this test, maybe the following is sufficient:

.collect()
.foreach { case Row(actual, expected) =>
  assert(actual === expected)
}

@mengxr
Copy link
Contributor

mengxr commented Mar 20, 2015

I don't know a formatter that can do everything correctly. I use intellij and with the default Scala code style (except indent 2). I need to manually adjust the indentation while chopping down the args.

@SparkQA
Copy link

SparkQA commented Mar 23, 2015

Test build #28990 has started for PR 4504 at commit 148126f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 23, 2015

Test build #28990 has finished for PR 4504 at commit 148126f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]
    • class NaiveBayesModel(Saveable, Loader):
    • class SqlParser extends AbstractSparkSQLParser with DataTypeParser
    • case class CombineSum(child: Expression) extends AggregateExpression
    • case class CombineSumFunction(expr: Expression, base: AggregateExpression)
    • protected[sql] class DataTypeException(message: String) extends Exception(message)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28990/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29059 has started for PR 4504 at commit 2338da5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29059 has finished for PR 4504 at commit 2338da5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
    • class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29059/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Mar 24, 2015

@aborsu985 I sent you a PR with some updates at https://github.com/aborsu985/spark/pull/1. Please merge the current master and check the diff. Thanks!

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29153 has started for PR 4504 at commit 5f09434.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29154 has started for PR 4504 at commit 716d257.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29153 has finished for PR 4504 at commit 5f09434.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29153/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29154 has finished for PR 4504 at commit 716d257.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29154/
Test PASSed.

@ghost
Copy link
Author

ghost commented Mar 25, 2015

@mengxr Thank you for your help with the Java unit tests. As you may have guessed, I'm new to both Scala and Java and I was drowning in it.

@asfgit asfgit closed this in 982952f Mar 25, 2015
@mengxr
Copy link
Contributor

mengxr commented Mar 25, 2015

LGTM. Merged into master. Thanks for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants