-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8269][SQL]string function: initcap #7208
Conversation
|
||
override def inputTypes: Seq[DataType] = Seq(StringType) | ||
|
||
override def eval(input: InternalRow): Any = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just implement one in UTF8String directly on the bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably very difficult, there are some special cases. See java.lang.ConditionalSpecialCasing
. Not sure if there any third-party library can work with bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i just tried to implement one in UTF8String. The difficulty is i need to enum all specific alphabets used in Europe. Though converting the uppercase by char ascii numbers works, i'm not sure if i list all the available alphabets correctly. i've tested in scala console, the toUpper of stringbuffer can convert the specific char(s).
test case
scala> val sb = new StringBuffer()
sb: StringBuffer =
scala> val string = 'δ'
string: Char = δ
scala> sb.append(string)
res0: StringBuffer = δ
scala> sb.charAt(0).toUpper
res1: Char = Δ
|
@@ -165,6 +165,20 @@ public UTF8String toLowerCase() { | |||
return UTF8String.fromString(toString().toLowerCase()); | |||
} | |||
|
|||
public UTF8String initCap(final byte[] byteArr) { | |||
if (byteArr == null) return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't do the null
check here. @chenghao-intel explained it very well in #6804
Test build #1001 has finished for PR 7208 at commit
|
val sb = new StringBuilder | ||
sb.append(str) | ||
sb.setCharAt(0, sb(0).toUpper) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old comment: This crashes if str = ""
(in outdated diff)
scala> val sb = new StringBuilder; sb.append(""); sb.setCharAt(0, sb(0).toUpper)
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.AbstractStringBuilder.charAt(AbstractStringBuilder.java:210)
at java.lang.StringBuilder.charAt(StringBuilder.java:76)
at scala.collection.mutable.StringBuilder.apply(StringBuilder.scala:117)
... 33 elided
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you should return an empty UTF8String, not null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tarekauel: I think this check should be implemented in stringbuilder function. On the other side, the exception is reasonable. We cannot hide all the exceptions.
Test build #1002 has finished for PR 7208 at commit
|
if (string == null) { | ||
null | ||
} | ||
else if (string.toString.length == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toString
isn't necessary: string.asInstanceOf[UTF8String].length == 0
ok to test |
Test build #36655 has finished for PR 7208 at commit
|
Test build #36752 has finished for PR 7208 at commit
|
Test build #36770 has finished for PR 7208 at commit
|
Test build #36775 has finished for PR 7208 at commit
|
Test build #36864 has finished for PR 7208 at commit
|
@@ -182,8 +184,9 @@ trait StringComparison extends ExpectsInputTypes { | |||
* A function that returns true if the string `left` contains the string `right`. | |||
*/ | |||
case class Contains(left: Expression, right: Expression) | |||
extends BinaryExpression with Predicate with StringComparison { | |||
extends BinaryExpression with Predicate with StringComparison { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have you removed the space? See style-guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation
I would say that extends
should be indented like a parameter (4 spaces)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarekauel actually 2 space here
Test build #37791 has finished for PR 7208 at commit
|
Test build #37793 has finished for PR 7208 at commit
|
Test build #38010 has finished for PR 7208 at commit
|
* @group string_funcs | ||
* @since 1.5.0 | ||
*/ | ||
def initcap(columnName: String): Column = initcap(Column(columnName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @rxin made some clean up for the DF function, we'd better remove the columnName
version of API.
Test build #38145 has finished for PR 7208 at commit
|
Test build #38155 has finished for PR 7208 at commit
|
@rxin, is that OK to be merged? I can create follow up PR for the python API and the codegen. |
Chenghao, I'll follow. |
Test build #38343 has finished for PR 7208 at commit
|
I tested the codegen works from my local computer and ask for a review retest. The central build failed at: |
Jenkins, retest this please. |
Test build #39176 has finished for PR 7208 at commit
|
Test build #39193 has finished for PR 7208 at commit
|
Test build #39221 has finished for PR 7208 at commit
|
Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited by whitespace.