[SPARK-22825][SQL] Fix incorrect results of Casting Array to String #20024
Test build #85118 has finished for PR 20024 at commit
After the approach to solve the issue is fixed, I'll also fix the
Test build #85119 has finished for PR 20024 at commit
    val df1 = sql("SELECT CAST(ARRAY(1, 2, 3, 4) AS STRING)")
    checkAnswer(df1, Row("[1, 2, 3, 4]"))
    val df2 = sql("SELECT CAST(ARRAY(ARRAY(1, 2), ARRAY(3, 4, 5), ARRAY(6, 7)) AS STRING)")
    checkAnswer(df2, Row("[WrappedArray(1, 2), WrappedArray(3, 4, 5), WrappedArray(6, 7)]"))
Hi, @maropu.
Could you put the result after this PR into the PR description? So far, only the "before" result is described.
ok, thanks!
Could you check? Thanks! @gatorsmile @viirya
ping
cc @cloud-fan
I feel it's not the most efficient way to cast array to string by deserializing the catalyst array to a java array. Instead, I think we should have a schema-aware string casting function, i.e. using
ok, I'll brush it up based on your suggestion.
@cloud-fan How about the current impl.? (not finished yet though)
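To illustrate what a "schema-aware string casting function" could look like, here is a minimal sketch in plain Java, with no Spark dependency. Every name in it (`SchemaAwareCastSketch`, `Type`, `IntType`, `ArrayOf`, `castToString`) is hypothetical and only stands in for the Catalyst types discussed above; it is not the code from this PR. The point is that the element type drives the recursion, so nested arrays are rendered in the same bracketed form instead of falling back to a host-language `toString`.

```java
import java.util.List;

public class SchemaAwareCastSketch {
    // Hypothetical stand-ins for a data type: an int leaf, or an array of some element type.
    public interface Type {}

    public static final class IntType implements Type {}

    public static final class ArrayOf implements Type {
        public final Type element;

        public ArrayOf(Type element) {
            this.element = element;
        }
    }

    // Render a value according to its declared type rather than its runtime class.
    public static String castToString(Object value, Type type) {
        if (type instanceof ArrayOf) {
            Type elementType = ((ArrayOf) type).element;
            List<?> items = (List<?>) value;
            StringBuilder sb = new StringBuilder("[");
            for (int i = 0; i < items.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                // Recurse with the element type, so nested arrays reuse the same rule.
                sb.append(castToString(items.get(i), elementType));
            }
            return sb.append("]").toString();
        }
        // Leaf case: an int in this sketch.
        return String.valueOf(value);
    }
}
```

Under these assumptions, a two-level array typed as `ArrayOf(ArrayOf(IntType))` is printed with nested brackets, and no intermediate collection is materialized just to call `toString` on it.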
Test build #85630 has finished for PR 20024 at commit
 */
public class StringBuffer {

  private BufferHolder buffer;
Currently, I reused `BufferHolder` for the string buffer, though we'd probably do better to make another buffer implementation for this purpose?
Or, just use `java.lang.StringBuffer`?
Test build #85631 has finished for PR 20024 at commit
Test build #85632 has finished for PR 20024 at commit
570f13b to 91a60c3
Test build #85633 has finished for PR 20024 at commit
ca6519e to 9e13bb9
Test build #85634 has finished for PR 20024 at commit
Test build #85640 has finished for PR 20024 at commit
3f075d8 to c2e6757
  }

  public void append(String value) {
    append(value.getBytes(StandardCharsets.UTF_8));
This should be `append(UTF8String.fromString(value))`, then we can remove `append(byte[] value)`
    return UTF8String.fromBytes(bytes);
  }

  public int totalSize() {
This doesn't need to be public
  public UTF8String toUTF8String() {
    final int len = totalSize();
    final byte[] bytes = new byte[len];
    Platform.copyMemory(buffer, Platform.BYTE_ARRAY_OFFSET, bytes, Platform.BYTE_ARRAY_OFFSET, len);
why copy? we can do UTF8String.fromBytes(buffer, 0, totalSize)
    cursor += value.length;
  }

  public UTF8String toUTF8String() {
nit: public UTF8String build()
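The review points above (a single `append`, a non-public size, no extra copy, and a `build()` method) can be sketched roughly as follows. This is an illustration only: `Utf8BuilderSketch` and `grow` are hypothetical names, and the sketch uses `java.lang.String` in place of `UTF8String` so that it runs without Spark on the classpath. Unlike `UTF8String.fromBytes(buffer, 0, totalSize)`, the `String` constructor still decodes into its own storage; what the sketch avoids is the separate trimmed `byte[]` copy that the comment on `toUTF8String()` questions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8BuilderSketch {
    private byte[] buffer = new byte[16];
    private int cursor = 0; // number of bytes written so far; kept private

    // Make sure `needed` more bytes fit, at least doubling the buffer when it is too small.
    private void grow(int needed) {
        if (cursor + needed > buffer.length) {
            int newSize = Math.max(buffer.length * 2, cursor + needed);
            buffer = Arrays.copyOf(buffer, newSize);
        }
    }

    // Single public append entry point; callers never deal with raw byte arrays.
    public void append(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        grow(bytes.length);
        System.arraycopy(bytes, 0, buffer, cursor, bytes.length);
        cursor += bytes.length;
    }

    // Build the result from the filled prefix of the buffer, without first
    // copying that prefix into a separate trimmed array.
    public String build() {
        return new String(buffer, 0, cursor, StandardCharsets.UTF_8);
    }
}
```

A caller would append `"["`, the elements, and `"]"` in order and then call `build()` once at the end.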
@@ -206,6 +206,23 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    case DateType => buildCast[Int](_, d => UTF8String.fromString(DateTimeUtils.dateToString(d)))
    case TimestampType => buildCast[Long](_,
      t => UTF8String.fromString(DateTimeUtils.timestampToString(t, timeZone)))
    case ar: ArrayType =>
      buildCast[ArrayData](_, array => {
        val res = new UTF8StringBuilder
nit: builder
@@ -206,6 +206,23 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    case DateType => buildCast[Int](_, d => UTF8String.fromString(DateTimeUtils.dateToString(d)))
    case TimestampType => buildCast[Long](_,
      t => UTF8String.fromString(DateTimeUtils.timestampToString(t, timeZone)))
    case ar: ArrayType =>
nit: case ArrayType(et, _)
             |$bufferClass $bufferTerm = new $bufferClass();
             |$writeArrayToBuffer($c, $bufferTerm);
             |$evPrim = $bufferTerm.toUTF8String();
           """.stripMargin
We can simplify this too
val elementToStringCode = castToStringCode(et, ctx)
val funcName = ctx.freshName("elementToString")
val elementToStringFunc = ctx.addNewFunction(funcName,
  s"""
    private UTF8String $funcName(${ctx.dataType(et)} element) {
      UTF8String elementStr = null;
      ${elementToStringCode("element", "elementStr", null /* resultIsNull won't be touched */)}
      return elementStr;
    }
  """)
...
$bufferClass $bufferTerm = new $bufferClass();
$bufferTerm.append("[");
if ($c.numElements > 0) {
  if (!$c.isNullAt(0)) {
    $buffer.append($elementToStringFunc(${ctx.getValue(array, et, "0")}))
  }
  for (int $loopIndex = 1; $loopIndex < $arTerm.numElements(); $loopIndex++) ...
}
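For readers unfamiliar with the shape of the generated code, the loop structure suggested above can be written out as ordinary Java. This is a sketch under stated assumptions, not the merged implementation: the loop body is elided in the comment, so the `", "` separator and the handling of null elements after index 0 are guesses, and `ArrayLoopSketch`, `arrayToString`, and the `Integer[]` input are hypothetical stand-ins for the generated code and `ArrayData`.

```java
public class ArrayLoopSketch {
    // Stand-in for the generated elementToString helper; note it is called from two places.
    static String elementToString(int element) {
        return Integer.toString(element);
    }

    public static String arrayToString(Integer[] array) {
        StringBuilder builder = new StringBuilder();
        builder.append("[");
        if (array.length > 0) {
            // The first element is handled outside the loop so it gets no leading separator.
            if (array[0] != null) {
                builder.append(elementToString(array[0]));
            }
            for (int i = 1; i < array.length; i++) {
                // Assumption: every later element is preceded by ", ",
                // and a null element contributes nothing after the separator.
                builder.append(", ");
                if (array[i] != null) {
                    builder.append(elementToString(array[i]));
                }
            }
        }
        builder.append("]");
        return builder.toString();
    }
}
```

Handling index 0 separately keeps the separator logic out of the loop condition, at the cost of calling the element conversion in two places.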
Test build #85676 has finished for PR 20024 at commit
import org.apache.spark.unsafe.types.UTF8String;

/**
 * A helper class to write `UTF8String`, `String`, and `byte[]` data into an internal byte buffer
A helper class to write {@link UTF8String}s to an internal buffer and build the concatenated {@link UTF8String} at the end.
@@ -597,6 +619,44 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    """
  }

  private[this] def codegenWriteArrayElemCode(et: DataType, ctx: CodegenContext): String = {
It returns a function to write the array elements, maybe a better name is: writeArrayToStringBuilderFunc
oh wait, the returned function is only called once, I think we don't need to make it a function, but just return the code, e.g.
def writeArrayToStringBuilder(ctx: CodegenContext, et: DataType, arr: String, builder: String): String
`elementToString` needs a function because it's called twice.
ok, I'll update soon.
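The distinction drawn above (inline the array-writing code because it is emitted once, but keep `elementToString` as a registered function because it is called twice) can be illustrated with a toy code generator. This is only a sketch: `CodegenHelperSketch` and its methods are hypothetical and loosely mirror the `ctx.freshName` and `ctx.addNewFunction` calls quoted earlier, and the generated text targets a plain `int[]` rather than Catalyst's `ArrayData`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodegenHelperSketch {
    private int counter = 0;
    private final Map<String, String> functions = new LinkedHashMap<>();

    // Produce a unique identifier so generated names never collide.
    public String freshName(String prefix) {
        return prefix + "_" + (counter++);
    }

    // Register a helper function once; the returned name is used at call sites.
    public String addNewFunction(String name, String source) {
        functions.put(name, source);
        return name;
    }

    public Map<String, String> functions() {
        return functions;
    }

    // Returns inline code (not a function): it is emitted at a single place,
    // while the element helper it calls twice is registered separately.
    public String writeArrayToStringBuilder(String arr, String builder) {
        String func = freshName("elementToString");
        addNewFunction(func,
            "private String " + func + "(int element) { return String.valueOf(element); }");
        String i = freshName("i");
        return builder + ".append(\"[\");\n"
            + "if (" + arr + ".length > 0) {\n"
            + "  " + builder + ".append(" + func + "(" + arr + "[0]));\n"
            + "  for (int " + i + " = 1; " + i + " < " + arr + ".length; " + i + "++) {\n"
            + "    " + builder + ".append(\", \").append(" + func + "(" + arr + "[" + i + "]));\n"
            + "  }\n"
            + "}\n"
            + builder + ".append(\"]\");\n";
    }
}
```

In this sketch the helper's source appears once in `functions()`, while the returned snippet references it at both call sites, which is the trade-off the comments describe.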
@@ -2775,4 +2773,53 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
      }
    }
  }

  test("SPARK-22825 Cast array to string") {
I think the unit test is good enough, we don't need this end-to-end test.
LGTM except a few minor comments
Test build #85681 has finished for PR 20024 at commit
Thanks for the kind reviews at the start of the year; I've addressed all the comments.
Test build #85701 has finished for PR 20024 at commit
retest this please
@@ -608,6 +665,17 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
        val tz = ctx.addReferenceObj("timeZone", timeZone)
        (c, evPrim, evNull) => s"""$evPrim = UTF8String.fromString(
          org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));"""
      case ArrayType(et, _) =>
        (c, evPrim, evNull) => {
          val bufferTerm = ctx.freshName("bufferTerm")
super nit: In codegen we usually don't add a `term` postfix, just call it `buffer`, `array`, etc.
ok
Test build #85710 has finished for PR 20024 at commit
Test build #85714 has finished for PR 20024 at commit
thanks, merging to master/2.3!
as a follow-up, we should do the same thing for struct and map type too.
## What changes were proposed in this pull request?

This pr fixed the issue when casting arrays into strings;
```
scala> val df = spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))
scala> df.write.saveAsTable("t")
scala> sql("SELECT cast(ids as String) FROM t").show(false)
+------------------------------------------------------------------+
|ids                                                               |
+------------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df|
+------------------------------------------------------------------+
```

This pr modified the result into;
```
+------------------------------+
|ids                           |
+------------------------------+
|[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]|
+------------------------------+
```

## How was this patch tested?

Added tests in `CastSuite` and `SQLQuerySuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes #20024 from maropu/SPARK-22825.

(cherry picked from commit 52fc5c1)
Signed-off-by: Wenchen Fan <[email protected]>
Thanks! I'll do that.