[ML][FEATURE] SPARK-5566: RegEx Tokenizer #4504
Conversation
A more complex tokenizer that extracts tokens based on a regex. It also allows turning lowercasing on and off, setting a minimum token length, and supplying a list of stop words to exclude.
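As a rough illustration, here is a minimal usage sketch in Scala. It assumes the setter names the class converged on during review (setPattern, setMinTokenLength); the draft in this PR also carried lowercasing and stop-word parameters, so treat the exact API as approximate:

import org.apache.spark.ml.feature.RegexTokenizer

// Sketch only: assumes the post-review API (pattern, minTokenLength).
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setPattern("\\W+")      // token delimiter: one or more non-word characters
  .setMinTokenLength(1)    // drop zero-length matches

// `df` is assumed to be a DataFrame with a string "sentence" column.
val tokenized = tokenizer.transform(df)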
Can one of the admins verify this patch?
Just FYI, there's some similar code in LDAExample that may be a useful reference.
This is not meant to be a standalone tokenizer but rather part of a pipeline.
Cool, I just didn't see that in the PR description.
Do you mean restricting tokens to a predetermined set of words?
@aborsu985 Please follow the steps in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark to prepare a PR. If this is called
The regex and stop-word parameters are now part of the parameter grid.
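For context, a hedged sketch of what putting tokenizer parameters into the grid might look like; the param names here (pattern, minTokenLength) are illustrative, not necessarily the exact ones in this PR:

import org.apache.spark.ml.tuning.ParamGridBuilder

// Sketch: cross-validate over tokenizer settings alongside other stages.
val paramGrid = new ParamGridBuilder()
  .addGrid(tokenizer.pattern, Array("\\s+", "\\W+"))
  .addGrid(tokenizer.minTokenLength, Array(1, 3))
  .build()
// `paramGrid` can then be passed to CrossValidator.setEstimatorParamMaps(...).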
@mengxr
def setLowercase(value: Boolean) = set(lowerCase, value)
def getLowercase: Boolean = get(lowerCase)

val minLength = new IntParam(this,
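The IntParam construction above is cut off in the diff view; a plausible completion, assuming the Spark 1.3-era Param constructor of (parent, name, doc, defaultValue) and reflecting the review feedback below, would be:

// Assumed shape (name `minTokenLength`, default 1, minimum inclusive):
val minTokenLength = new IntParam(this, "minTokenLength",
  "minimum token length (>= 1)", Some(1))
def setMinTokenLength(value: Int) = set(minTokenLength, value)
def getMinTokenLength: Int = get(minTokenLength)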
Add doc and update code style. What's the case when we match a token with the regex but its length is zero? Should we control it in the regex, e.g., \d{5,}?
Btw, it is not intuitive that the min value is excluded. Could we remove "excluded" and set the default value to 1? And it might be better to call it minTokenLength, if we want to keep it.
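To make the regex-only alternative concrete, a tiny Scala example of encoding a minimum length directly in the pattern:

// \d{5,} matches only runs of five or more digits, so shorter tokens
// never appear and no separate minTokenLength parameter is needed.
val text = "ids: 12 12345 678901"
val longIds = "\\d{5,}".r.findAllIn(text).toList  // List("12345", "678901")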
I removed "excluded" as it is indeed unusual, and set the default value to 1, which is standard.
case Row(tokens: Seq[Any], wantedTokens: Seq[Any]) =>
  assert(tokens === wantedTokens)
case e =>
  throw new SparkException(s"Row $e should contain only tokens and wantedTokens columns")
SparkException should happen on workers. Since data is already collected, we can use fail("..."). For this test, maybe the following is sufficient:

.collect()
  .foreach { case Row(actual, expected) =>
    assert(actual === expected)
  }
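Assembled into a complete check (a sketch; the column names follow the snippet above and `dataset` is an assumed test DataFrame):

tokenizer.transform(dataset)
  .select("tokens", "wantedTokens")
  .collect()
  .foreach { case Row(actual, expected) =>
    assert(actual === expected)  // fails the test on the driver, no exception needed
  }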
I don't know a formatter that can do everything correctly. I use IntelliJ with the default Scala code style (except indent 2), and I have to manually adjust the indentation when chopping down the args.
Plus some cosmetic changes.
Test build #28990 has started for PR 4504 at commit
Test build #28990 has finished for PR 4504 at commit
Test PASSed.
Test build #29059 has started for PR 4504 at commit
Test build #29059 has finished for PR 4504 at commit
Test PASSed.
@aborsu985 I sent you a PR with some updates at https://github.com/aborsu985/spark/pull/1. Please merge the current master and check the diff. Thanks!
Test build #29153 has started for PR 4504 at commit
Test build #29154 has started for PR 4504 at commit
Test build #29153 has finished for PR 4504 at commit
Test PASSed.
Test build #29154 has finished for PR 4504 at commit
Test PASSed.
@mengxr Thank you for your help with the Java unit tests. As you may have guessed, I'm new to both Scala and Java, and I was drowning in it.
LGTM. Merged into master. Thanks for contributing!
Added a regex-based tokenizer for ml.
Currently the regex is fixed, but if I could add a regex-type parameter to the paramMap,
changing the tokenizer regex could be a parameter used in cross-validation.
Also, I wonder what would be the best way to add a stop word list.
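On the stop-word question, one simple approach (a sketch, not necessarily what the PR does) is to filter the token sequence against a Set after tokenizing:

// Hypothetical post-processing step: drop tokens found in a stop-word set.
val stopWords = Set("the", "a", "an", "of")
val tokens = "The quick brown fox".toLowerCase.split("\\s+").toSeq
val filtered = tokens.filterNot(stopWords.contains)  // Seq("quick", "brown", "fox")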