[SPARK-8269][SQL]string function: initcap #7208

Closed
wants to merge 8 commits into from

@hujy hujy commented Jul 3, 2015

Returns the input string with the first letter of each word in uppercase and all other letters in lowercase. Words are delimited by whitespace.
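
For readers who want to try the described behavior outside Spark, here is a minimal sketch on a plain `String`. The object and method names (`InitCapSketch`, `initCap`) are illustrative assumptions, not the code in this PR, and the sketch works one `Char` at a time, so it ignores surrogate pairs and multi-character case mappings.

```scala
object InitCapSketch {
  /** Uppercases the first letter of each whitespace-delimited word and lowercases the rest. */
  def initCap(s: String): String = {
    val sb = new StringBuilder
    var startOfWord = true
    for (c <- s) {
      if (Character.isWhitespace(c)) {
        // Whitespace is copied through and marks the start of the next word.
        sb.append(c)
        startOfWord = true
      } else if (startOfWord) {
        sb.append(Character.toUpperCase(c))
        startOfWord = false
      } else {
        sb.append(Character.toLowerCase(c))
      }
    }
    sb.toString
  }
}
```

For example, `InitCapSketch.initCap("hello wORLD")` yields `"Hello World"`, and runs of whitespace are preserved unchanged.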


override def inputTypes: Seq[DataType] = Seq(StringType)

override def eval(input: InternalRow): Any = {
Contributor

Can you implement this in UTF8String, operating directly on the bytes?

Contributor

Probably very difficult, because there are some special cases. See java.lang.ConditionalSpecialCasing. I'm not sure whether any third-party library can do this on bytes.
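
To make the special cases concrete, the sketch below shows a few JDK case mappings that are not simple one-character swaps. The object name `SpecialCasingDemo` is an assumption for illustration; the calls are standard `java.lang.String` and `java.lang.Character` methods.

```scala
import java.util.Locale

object SpecialCasingDemo {
  // Char-level mapping is one-to-one, so it cannot expand a character.
  val charLevel: Char = Character.toUpperCase('ß')                       // stays 'ß'

  // String-level mapping applies the special-casing rules and may change the length.
  val stringLevel: String = "ß".toUpperCase(Locale.ROOT)                 // "SS"

  // Locale-sensitive rule: Turkish maps 'i' to dotted capital I (U+0130).
  val turkish: String = "i".toUpperCase(Locale.forLanguageTag("tr"))     // "İ"

  // Context-sensitive rule: a capital sigma at the end of a word lowers to final sigma (U+03C2).
  val finalSigma: String = "ΟΔΟΣ".toLowerCase(Locale.ROOT)               // "οδος"
}
```

Because the result can differ in length, depend on the locale, or depend on neighbouring characters, a byte-level implementation would have to reproduce these tables itself.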

Contributor Author

I just tried to implement one in UTF8String. The difficulty is that I would need to enumerate all the non-ASCII letters used in European alphabets. Converting to uppercase by character code works, but I'm not sure I have listed every alphabet correctly. I tested in the Scala console, and calling toUpper on a character taken from a StringBuffer converts such characters correctly.

Test case:
scala> val sb = new StringBuffer()
sb: StringBuffer =

scala> val string = 'δ'
string: Char = δ

scala> sb.append(string)
res0: StringBuffer = δ

scala> sb.charAt(0).toUpper
res1: Char = Δ
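
One possible middle ground between the two positions, sketched here purely as an assumption and not as what this PR does: handle ASCII directly on the bytes and fall back to the JDK's Unicode tables for anything else. `ByteLevelSketch` and `upperFirst` are hypothetical names, and the input is assumed to be valid UTF-8.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ByteLevelSketch {
  /** Uppercases the first character of UTF-8 encoded `bytes` (hypothetical helper). */
  def upperFirst(bytes: Array[Byte]): Array[Byte] = {
    if (bytes.isEmpty) {
      bytes
    } else if (bytes(0) >= 0) {
      // Non-negative lead byte means a single-byte (ASCII) character:
      // only 'a'..'z' change, and the uppercase form is 32 lower.
      val out = bytes.clone()
      if (out(0) >= 'a' && out(0) <= 'z') out(0) = (out(0) - 32).toByte
      out
    } else {
      // Multi-byte lead: decode and defer to Character.toUpperCase on the code point.
      val s = new String(bytes, UTF_8)
      val first = s.codePointAt(0)
      val head = new String(Character.toChars(Character.toUpperCase(first)))
      (head + s.substring(Character.charCount(first))).getBytes(UTF_8)
    }
  }
}
```

The fallback still uses the one-to-one code point mapping, so it does not cover expansions such as ß to SS.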

@tarekbecker
Contributor

Can you update the title and the soundex Jira ticket name? I guess you have mixed up the branches. This PR contains code from #7115.

@@ -165,6 +165,20 @@ public UTF8String toLowerCase() {
return UTF8String.fromString(toString().toLowerCase());
}

public UTF8String initCap(final byte[] byteArr) {
if (byteArr == null) return null;
Contributor

Please don't do the null check here. @chenghao-intel explained it very well in #6804.

SparkQA commented Jul 6, 2015

Test build #1001 has finished for PR 7208 at commit 1826c16.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):
    • class FlumeUtils(object):
    • trait ExpectsInputTypes
    • abstract class BinaryExpression extends Expression with trees.BinaryNode[Expression]
    • abstract class BinaryOperator extends BinaryExpression
    • abstract class BinaryArithmetic extends BinaryOperator
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression
    • case class Factorial(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class ShiftLeft(left: Expression, right: Expression) extends BinaryExpression
    • case class ShiftRight(left: Expression, right: Expression) extends BinaryExpression
    • case class UnHex(child: Expression) extends UnaryExpression with Serializable
    • case class Md5(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Sha1(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Crc32(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Not(child: Expression) extends UnaryExpression with Predicate with ExpectsInputTypes
    • abstract class BinaryComparison extends BinaryOperator with Predicate
    • trait StringRegexExpression extends ExpectsInputTypes
    • trait CaseConversionExpression extends ExpectsInputTypes
    • trait StringComparison extends ExpectsInputTypes
    • case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • protected[sql] abstract class AtomicType extends DataType
    • abstract class NumericType extends AtomicType
    • abstract class DataType extends AbstractDataType

val sb = new StringBuilder
sb.append(str)
sb.setCharAt(0, sb(0).toUpper)

Contributor

Old comment (on an outdated diff): this crashes if str is the empty string.

scala> val sb = new StringBuilder; sb.append(""); sb.setCharAt(0, sb(0).toUpper)
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.lang.AbstractStringBuilder.charAt(AbstractStringBuilder.java:210)
  at java.lang.StringBuilder.charAt(StringBuilder.java:76)
  at scala.collection.mutable.StringBuilder.apply(StringBuilder.scala:117)
  ... 33 elided

Contributor

Then you should return an empty UTF8String, not null.
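
A minimal sketch of that suggestion, using a plain `String` in place of `UTF8String` so it runs without Spark (the object and method names are assumptions): guard the empty input before touching index 0 and return the empty string rather than null.

```scala
object EmptySafeCapitalize {
  /** Uppercases the first character; returns "" for empty input instead of throwing. */
  def capitalizeFirst(str: String): String = {
    if (str.isEmpty) {
      // Nothing to capitalize: hand back the empty string, not null.
      str
    } else {
      val sb = new StringBuilder(str)
      sb.setCharAt(0, sb(0).toUpper)
      sb.toString
    }
  }
}
```

With the guard in place, `EmptySafeCapitalize.capitalizeFirst("")` returns `""` and no `StringIndexOutOfBoundsException` is raised.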

Contributor Author

@tarekauel: I think this check belongs in the StringBuilder function itself. On the other hand, the exception is reasonable; we cannot hide every exception.

SparkQA commented Jul 7, 2015

Test build #1002 has finished for PR 7208 at commit 30967df.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

if (string == null) {
null
}
else if (string.toString.length == 0) {
Contributor

toString isn't necessary: string.asInstanceOf[UTF8String].length == 0

@liancheng
Contributor

ok to test

SparkQA commented Jul 7, 2015

Test build #36655 has finished for PR 7208 at commit e5c8de0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSet(child: Expression, hset: Set[Any])
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA commented Jul 8, 2015

Test build #36752 has finished for PR 7208 at commit 9f91343.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA commented Jul 8, 2015

Test build #36770 has finished for PR 7208 at commit f3b56fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA commented Jul 8, 2015

Test build #36775 has finished for PR 7208 at commit 2738b00.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA commented Jul 9, 2015

Test build #36864 has finished for PR 7208 at commit 17551ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes

@@ -182,8 +184,9 @@ trait StringComparison extends ExpectsInputTypes {
* A function that returns true if the string `left` contains the string `right`.
*/
case class Contains(left: Expression, right: Expression)
extends BinaryExpression with Predicate with StringComparison {
extends BinaryExpression with Predicate with StringComparison {
Contributor

Why did you remove the space? See the style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation
I would say that extends should be indented like a parameter (4 spaces).

Contributor

@tarekauel Actually, it is 2 spaces here.
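
As an illustration of the layout being discussed, here is a small self-contained sketch with the `extends` clause on a continuation line indented 2 spaces. The traits are empty stand-ins declared only so the snippet compiles on its own; they are assumptions, not Spark's real definitions.

```scala
// Stand-in types (assumptions) so the snippet compiles without Spark.
trait Expression
trait Predicate
trait StringComparison
abstract class BinaryExpression extends Expression

// The declaration continues on the next line: `extends` is indented 2 spaces.
case class Contains(left: Expression, right: Expression)
  extends BinaryExpression with Predicate with StringComparison {
  override def toString: String = s"($left contains $right)"
}
```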

SparkQA commented Jul 20, 2015

Test build #37791 has finished for PR 7208 at commit 7dcd57c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 20, 2015

Test build #37793 has finished for PR 7208 at commit b2590a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ConcatWs(children: Seq[Expression])
    • case class InitCap(child: Expression) extends UnaryExpression

SparkQA commented Jul 22, 2015

Test build #38010 has finished for PR 7208 at commit 7ce416b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFormat(children: Expression*) extends Expression with ImplicitCastInputTypes
    • case class InitCap(child: Expression) extends UnaryExpression

* @group string_funcs
* @since 1.5.0
*/
def initcap(columnName: String): Column = initcap(Column(columnName))
Contributor

Since @rxin cleaned up the DataFrame functions, we'd better remove the columnName version of this API.

SparkQA commented Jul 23, 2015

Test build #38145 has finished for PR 7208 at commit c79482d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression

SparkQA commented Jul 23, 2015

Test build #38155 has finished for PR 7208 at commit 6a0b958.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression

@chenghao-intel
Contributor

@rxin, is this OK to merge? I can create a follow-up PR for the Python API and the codegen.

Contributor Author

hujy commented Jul 23, 2015

Chenghao, I'll follow up on that.

SparkQA commented Jul 24, 2015

Test build #38343 has finished for PR 7208 at commit 1f5a0ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression

Contributor Author

hujy commented Jul 27, 2015

I verified that the codegen works on my local machine and am asking for a retest. The central build failed at:
Single command with --database *** FAILED *** (1 minute)
[info] java.util.concurrent.TimeoutException: Futures timed out after [1 minute]

Contributor

rxin commented Jul 27, 2015

Jenkins, retest this please.

SparkQA commented Jul 31, 2015

Test build #39176 has finished for PR 7208 at commit b616c0e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression

SparkQA commented Jul 31, 2015

Test build #39193 has finished for PR 7208 at commit 2cd43e5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InitCap(child: Expression) extends UnaryExpression

SparkQA commented Jul 31, 2015

Test build #39221 has finished for PR 7208 at commit 8b2506a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
