[SPARK-8269][SQL]string function: initcap #7208

hujy · 2015-07-03T07:25:43Z

Returns the string with the first letter of each word in uppercase and all other letters in lowercase. Words are delimited by whitespace.
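The behavior described above can be sketched in plain Scala. This is a minimal illustration on `java.lang.String`, not the code in this PR; the object name `InitCapSketch` and the use of `Character.isWhitespace` as the word delimiter test are assumptions made for the example.

```scala
object InitCapSketch {
  // Uppercase the first letter of each whitespace-delimited word and
  // lowercase every other letter; whitespace is copied through unchanged.
  def initCap(s: String): String = {
    val sb = new StringBuilder(s.length)
    var startOfWord = true
    for (c <- s) {
      if (Character.isWhitespace(c)) {
        sb.append(c)
        startOfWord = true
      } else {
        sb.append(if (startOfWord) Character.toUpperCase(c) else Character.toLowerCase(c))
        startOfWord = false
      }
    }
    sb.toString
  }
}
```

For example, `InitCapSketch.initCap("hello wORLD")` yields `"Hello World"`, and an empty input comes back as an empty string.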

rxin · 2015-07-03T07:27:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

+
+  override def inputTypes: Seq[DataType] = Seq(StringType)
+
+  override def eval(input: InternalRow): Any = {


Can you just implement one in UTF8String, operating directly on the bytes?

Probably very difficult; there are some special cases. See java.lang.ConditionalSpecialCasing. Not sure if there is any third-party library that can work with bytes.

I just tried to implement one in UTF8String. The difficulty is that I would need to enumerate all the specific alphabets used in Europe. Although converting to uppercase by ASCII character codes works, I'm not sure I can list all the available alphabets correctly. I've tested in the Scala console that toUpper on a char taken from a StringBuffer converts such characters.

Test case:
scala> val sb = new StringBuffer()
sb: StringBuffer =

scala> val string = 'δ'
string: Char = δ

scala> sb.append(string)
res0: StringBuffer = δ

scala> sb.charAt(0).toUpper
res1: Char = Δ
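The special cases mentioned in this thread can be made concrete with a small sketch. It compares per-character mapping with full string mapping and measures UTF-8 lengths; the object and method names are hypothetical and exist only for this illustration, and the standard library calls used are `Character.toUpperCase`, `String.toUpperCase(Locale)` and `String.getBytes(Charset)`.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Locale

object CaseMappingSketch {
  // Number of bytes the string occupies when encoded as UTF-8.
  def utf8Length(s: String): Int = s.getBytes(UTF_8).length

  // Per-character mapping: one char in, one char out.
  def upperChar(c: Char): Char = Character.toUpperCase(c)

  // Full string mapping: may apply special casing and change the length.
  def upperString(s: String): String = s.toUpperCase(Locale.ROOT)
}
```

With these helpers, `upperChar('δ')` gives `'Δ'`, as in the console session above. `upperString("ß")` gives `"SS"`, while `upperChar('ß')` leaves the character unchanged, so the two mappings disagree. The dotless `'ı'` (two bytes in UTF-8) uppercases to `'I'` (one byte), so an in-place rewrite of the byte array cannot assume the encoded length stays the same.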

tarekbecker · 2015-07-03T18:27:30Z

~~Can you update the title and the soundex Jira ticket name? I guess you have mixed up the branches. This PR contains code of #7115.~~

tarekbecker · 2015-07-06T09:31:13Z

unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

@@ -165,6 +165,20 @@ public UTF8String toLowerCase() {
    return UTF8String.fromString(toString().toLowerCase());
  }

+  public UTF8String initCap(final byte[] byteArr) {
+    if (byteArr == null) return null;


Please don't do the null check here. @chenghao-intel explained it very well in #6804

SparkQA · 2015-07-06T15:22:49Z

Test build #1001 has finished for PR 7208 at commit 1826c16.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):
- class FlumeUtils(object):
- trait ExpectsInputTypes
- abstract class BinaryExpression extends Expression with trees.BinaryNode[Expression]
- abstract class BinaryOperator extends BinaryExpression
- abstract class BinaryArithmetic extends BinaryOperator
- case class CreateNamedStruct(children: Seq[Expression]) extends Expression
- case class Factorial(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class ShiftLeft(left: Expression, right: Expression) extends BinaryExpression
- case class ShiftRight(left: Expression, right: Expression) extends BinaryExpression
- case class UnHex(child: Expression) extends UnaryExpression with Serializable
- case class Md5(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class Sha1(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class Crc32(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class Not(child: Expression) extends UnaryExpression with Predicate with ExpectsInputTypes
- abstract class BinaryComparison extends BinaryOperator with Predicate
- trait StringRegexExpression extends ExpectsInputTypes
- trait CaseConversionExpression extends ExpectsInputTypes
- trait StringComparison extends ExpectsInputTypes
- case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes
- protected[sql] abstract class AtomicType extends DataType
- abstract class NumericType extends AtomicType
- abstract class DataType extends AbstractDataType

tarekbecker · 2015-07-06T15:31:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

+      val sb = new StringBuilder
+      sb.append(str)
+      sb.setCharAt(0, sb(0).toUpper)
+


old comment: This crashes if str = "" (in outdated diff)

scala> val sb = new StringBuilder; sb.append(""); sb.setCharAt(0, sb(0).toUpper)
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.lang.AbstractStringBuilder.charAt(AbstractStringBuilder.java:210)
  at java.lang.StringBuilder.charAt(StringBuilder.java:76)
  at scala.collection.mutable.StringBuilder.apply(StringBuilder.scala:117)
  ... 33 elided

Then you should return an empty UTF8String, not null.

@tarekauel: I think this check should be implemented in the StringBuilder function. On the other hand, the exception is reasonable; we cannot hide all exceptions.
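One way to address the empty-string crash discussed in this thread is to guard before touching index 0. The following is a minimal sketch on plain `String`, not the code that was eventually merged; `CapitalizeFirstSketch` and `capitalizeFirst` are hypothetical names.

```scala
object CapitalizeFirstSketch {
  // Uppercase the first character. An empty input is returned unchanged
  // instead of throwing StringIndexOutOfBoundsException at index 0.
  def capitalizeFirst(str: String): String =
    if (str.isEmpty) str
    else {
      val sb = new StringBuilder(str)
      sb.setCharAt(0, sb(0).toUpper)
      sb.toString
    }
}
```

Returning the empty input as-is matches the suggestion to return an empty string rather than null.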

SparkQA · 2015-07-07T03:42:50Z

Test build #1002 has finished for PR 7208 at commit 30967df.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

tarekbecker · 2015-07-07T03:49:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

+    if (string == null) {
+      null
+    }
+    else if (string.toString.length == 0) {


toString isn't necessary: string.asInstanceOf[UTF8String].length == 0

liancheng · 2015-07-07T07:40:22Z

ok to test

SparkQA · 2015-07-07T07:48:18Z

Test build #36655 has finished for PR 7208 at commit e5c8de0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InSet(child: Expression, hset: Set[Any])
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA · 2015-07-08T03:21:11Z

Test build #36752 has finished for PR 7208 at commit 9f91343.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA · 2015-07-08T10:12:40Z

Test build #36770 has finished for PR 7208 at commit f3b56fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA · 2015-07-08T10:49:13Z

Test build #36775 has finished for PR 7208 at commit 2738b00.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA · 2015-07-09T02:51:59Z

Test build #36864 has finished for PR 7208 at commit 17551ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

tarekbecker · 2015-07-09T08:13:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

@@ -182,8 +184,9 @@ trait StringComparison extends ExpectsInputTypes {
 * A function that returns true if the string `left` contains the string `right`.
 */
 case class Contains(left: Expression, right: Expression)
-    extends BinaryExpression with Predicate with StringComparison {
+  extends BinaryExpression with Predicate with StringComparison {


Why have you removed the spaces? See the style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation
I would say that extends should be indented like a parameter (4 spaces).

@tarekauel Actually, 2 spaces here.

SparkQA · 2015-07-20T02:20:40Z

Test build #37791 has finished for PR 7208 at commit 7dcd57c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-20T04:08:26Z

Test build #37793 has finished for PR 7208 at commit b2590a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ConcatWs(children: Seq[Expression])
- case class InitCap(child: Expression) extends UnaryExpression

SparkQA · 2015-07-22T03:29:36Z

Test build #38010 has finished for PR 7208 at commit 7ce416b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StringFormat(children: Expression*) extends Expression with ImplicitCastInputTypes
- case class InitCap(child: Expression) extends UnaryExpression

chenghao-intel · 2015-07-22T14:10:16Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * @group string_funcs
+   * @since 1.5.0
+   */
+  def initcap(columnName: String): Column = initcap(Column(columnName))


Since @rxin did some cleanup of the DataFrame functions, we'd better remove the columnName version of the API.

SparkQA · 2015-07-23T01:47:05Z

Test build #38145 has finished for PR 7208 at commit c79482d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression

SparkQA · 2015-07-23T05:06:46Z

Test build #38155 has finished for PR 7208 at commit 6a0b958.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression

chenghao-intel · 2015-07-23T05:53:43Z

@rxin, is this OK to be merged? I can create a follow-up PR for the Python API and the codegen.

hujy · 2015-07-23T05:56:52Z

Chenghao, I'll follow up.

SparkQA · 2015-07-24T09:17:17Z

Test build #38343 has finished for PR 7208 at commit 1f5a0ef.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression

hujy · 2015-07-27T01:09:20Z

I verified that the codegen works on my local machine and am asking for a retest. The central build failed at:
Single command with --database *** FAILED *** (1 minute)
[info] java.util.concurrent.TimeoutException: Futures timed out after [1 minute]

rxin · 2015-07-27T01:32:09Z

Jenkins, retest this please.

SparkQA · 2015-07-31T06:50:28Z

Test build #39176 has finished for PR 7208 at commit b616c0e.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression

SparkQA · 2015-07-31T10:01:13Z

Test build #39193 has finished for PR 7208 at commit 2cd43e5.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class InitCap(child: Expression) extends UnaryExpression

SparkQA · 2015-07-31T16:33:25Z

Test build #39221 has finished for PR 7208 at commit 8b2506a.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

rxin reviewed Jul 3, 2015

tarekbecker mentioned this pull request Jul 6, 2015

[SPARK-8271][SQL]string function: soundex #7115

Closed

tarekbecker reviewed Jul 6, 2015

hujy force-pushed the initcap branch from 1826c16 to 30967df Compare July 7, 2015 01:30

tarekbecker reviewed Jul 7, 2015

tarekbecker reviewed Jul 9, 2015

hujy force-pushed the initcap branch from b2590a9 to 7ce416b Compare July 22, 2015 01:42

support initcap rebase code

7ce416b

chenghao-intel reviewed Jul 22, 2015

support soundex

c79482d

add column

6a0b958

hujy added 2 commits July 23, 2015 14:12

Merge branch 'master' of https://github.com/apache/spark into initcap

7e0c604

add codegen

1f5a0ef

add python api

b616c0e

fix python style check

2cd43e5

Update functions.py

8b2506a

davies mentioned this pull request Aug 1, 2015

[SPARK-8269][SQL] string function: initcap #7850

Closed

asfgit closed this in 00cd92f Aug 2, 2015

hujy commented Jul 3, 2015

rxin Jul 3, 2015

chenghao-intel Jul 5, 2015

hujy Jul 13, 2015

tarekbecker commented Jul 3, 2015

tarekbecker Jul 6, 2015

SparkQA commented Jul 6, 2015

tarekbecker Jul 6, 2015

davies Jul 6, 2015

hujy Jul 7, 2015

SparkQA commented Jul 7, 2015

tarekbecker Jul 7, 2015

liancheng commented Jul 7, 2015

SparkQA commented Jul 7, 2015

SparkQA commented Jul 8, 2015

SparkQA commented Jul 8, 2015

SparkQA commented Jul 8, 2015

SparkQA commented Jul 9, 2015

tarekbecker Jul 9, 2015

rxin Jul 9, 2015

SparkQA commented Jul 20, 2015

SparkQA commented Jul 20, 2015

SparkQA commented Jul 22, 2015

chenghao-intel Jul 22, 2015

SparkQA commented Jul 23, 2015

SparkQA commented Jul 23, 2015

chenghao-intel commented Jul 23, 2015

hujy commented Jul 23, 2015

SparkQA commented Jul 24, 2015

hujy commented Jul 27, 2015

rxin commented Jul 27, 2015

SparkQA commented Jul 31, 2015

SparkQA commented Jul 31, 2015

SparkQA commented Jul 31, 2015

