support casting Decimal to String #2046

Merged · 8 commits · Apr 5, 2021
2 changes: 2 additions & 0 deletions docs/configs.md
@@ -51,11 +51,13 @@ Name | Description | Default Value
<a name="shuffle.ucx.managementServerHost"></a>spark.rapids.shuffle.ucx.managementServerHost|The host to be used to start the management server|null
<a name="shuffle.ucx.useWakeup"></a>spark.rapids.shuffle.ucx.useWakeup|When set to true, use UCX's event-based progress (epoll) in order to wake up the progress thread when needed, instead of a hot loop.|true
<a name="sql.batchSizeBytes"></a>spark.rapids.sql.batchSizeBytes|Set the target number of bytes for a GPU batch. Splits sizes for input data is covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column.|2147483647
<a name="sql.castDecimalToString.enabled"></a>spark.rapids.sql.castDecimalToString.enabled|When set to true, casting from decimal to string is supported on the GPU. The GPU does NOT produce exactly the same strings as Spark; it produces strings that are semantically equal. For instance, given input BigDecimal(123, -2), the GPU produces "12300" whereas Spark produces "1.23E+4".|false
<a name="sql.castFloatToDecimal.enabled"></a>spark.rapids.sql.castFloatToDecimal.enabled|Casting from floating point types to decimal on the GPU returns results that have tiny differences compared to results returned from the CPU.|false
<a name="sql.castFloatToIntegralTypes.enabled"></a>spark.rapids.sql.castFloatToIntegralTypes.enabled|Casting from floating point types to integral types on the GPU supports a slightly different range of values when using Spark 3.1.0 or later. Refer to the CAST documentation for more details.|false
<a name="sql.castFloatToString.enabled"></a>spark.rapids.sql.castFloatToString.enabled|Casting from floating point types to string on the GPU returns results that have a different precision than the default results of Spark.|false
<a name="sql.castStringToDecimal.enabled"></a>spark.rapids.sql.castStringToDecimal.enabled|When set to true, enables casting from strings to decimal type on the GPU. Currently, string to decimal casts on the GPU may produce results that differ slightly from the correct results when the string represents a number exceeding the maximum precision that CAST_STRING_TO_FLOAT can hold. For instance, the GPU returns 99999999999999987 given the input string "99999999999999999". The divergence arises because strings containing scientific notation cannot be cast directly to decimal, so the string is first cast to float and the float is then cast to decimal; the first step may lose precision.|false
<a name="sql.castStringToFloat.enabled"></a>spark.rapids.sql.castStringToFloat.enabled|When set to true, enables casting from strings to float types (float, double) on the GPU. Currently hex values aren't supported on the GPU. Also note that casting from string to float types on the GPU returns incorrect results when the string represents any number in the ranges "1.7976931348623158E308" <= x < "1.7976931348623159E308" or "-1.7976931348623158E308" >= x > "-1.7976931348623159E308". In both of these cases the GPU returns Double.MaxValue while the CPU returns "+Infinity" and "-Infinity" respectively.|false
<a name="sql.castStringToInteger.enabled"></a>spark.rapids.sql.castStringToInteger.enabled|When set to true, enables casting from strings to integer types (byte, short, int, long) on the GPU. Casting from string to integer types on the GPU returns incorrect results when the string represents a number larger than Long.MaxValue or smaller than Long.MinValue.|false
<a name="sql.castStringToTimestamp.enabled"></a>spark.rapids.sql.castStringToTimestamp.enabled|When set to true, casting from string to timestamp is supported on the GPU. The GPU only supports a subset of formats when casting strings to timestamps. Refer to the CAST documentation for more details.|false
<a name="sql.concurrentGpuTasks"></a>spark.rapids.sql.concurrentGpuTasks|Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors.|1
<a name="sql.csvTimestamps.enabled"></a>spark.rapids.sql.csvTimestamps.enabled|When set to true, enables the CSV parser to read timestamps. The default output format for Spark includes a timezone at the end. Anything except the UTC timezone is not supported. Timestamps after 2038 and before 1902 are also not supported.|false
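The decimal-to-string divergence described in `spark.rapids.sql.castDecimalToString.enabled` above can be illustrated outside of Spark. The sketch below uses Python's `decimal.Decimal` as a stand-in for Java's `BigDecimal` (which Spark uses internally); Python's `str()` happens to follow the same scientific-notation convention for this value, while the plain rendering matches the form the GPU is documented to emit:

```python
from decimal import Decimal

# BigDecimal(123, -2) means unscaled value 123 with scale -2, i.e. 123 x 10^2.
d = Decimal("123E+2")

# str() uses scientific notation here, matching the "1.23E+4" Spark (CPU) produces.
print(str(d))          # 1.23E+4

# The plain, non-scientific rendering matches the "12300" the GPU produces.
print(format(d, "f"))  # 12300

# The two renderings are semantically equal as decimal values.
print(Decimal("1.23E+4") == Decimal("12300"))  # True
```

This is why the config is documented as "semantically equal" rather than byte-for-byte identical output.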
4 changes: 2 additions & 2 deletions docs/supported_ops.md
@@ -18049,7 +18049,7 @@ and the accelerator produces the same result.
<td><b>NS</b></td>
<td> </td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td>S</td>
<td>S*</td>
<td> </td>
<td> </td>
@@ -18453,7 +18453,7 @@ and the accelerator produces the same result.
<td><b>NS</b></td>
<td> </td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td>S</td>
<td>S*</td>
<td> </td>
<td> </td>
@@ -139,7 +139,7 @@ class Spark311Shims extends Spark301Shims {

// stringChecks are the same
// binaryChecks are the same
override val decimalChecks: TypeSig = none
override val decimalChecks: TypeSig = DECIMAL + STRING
override val sparkDecimalSig: TypeSig = numeric + BOOLEAN + STRING

// calendarChecks are the same
@@ -409,6 +409,9 @@ case class GpuCast(
castDecimalToDecimal(inputVector, from, to)
}

case (_: DecimalType, StringType) =>
input.castTo(DType.STRING)

case _ =>
input.castTo(GpuColumnVector.getNonNestedRapidsType(dataType))
}
18 changes: 18 additions & 0 deletions sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
@@ -582,6 +582,22 @@ object RapidsConf {
.booleanConf
.createWithDefault(false)

val ENABLE_CAST_STRING_TO_INTEGER = conf("spark.rapids.sql.castStringToInteger.enabled")
.doc("When set to true, enables casting from strings to integer types (byte, short, " +
"int, long) on the GPU. Casting from string to integer types on the GPU returns incorrect " +
"results when the string represents a number larger than Long.MaxValue or smaller than " +
"Long.MinValue.")
.booleanConf
.createWithDefault(false)

val ENABLE_CAST_DECIMAL_TO_STRING = conf("spark.rapids.sql.castDecimalToString.enabled")
.doc("When set to true, casting from decimal to string is supported on the GPU. The GPU " +
"does NOT produce exactly the same strings as Spark; it produces strings that are " +
"semantically equal. For instance, given input BigDecimal(123, -2), the GPU produces " +
"\"12300\" whereas Spark produces \"1.23E+4\".")
.booleanConf
.createWithDefault(false)

val ENABLE_CSV_TIMESTAMPS = conf("spark.rapids.sql.csvTimestamps.enabled")
.doc("When set to true, enables the CSV parser to read timestamps. The default output " +
"format for Spark includes a timezone at the end. Anything except the UTC timezone is not " +
@@ -1189,6 +1205,8 @@ class RapidsConf(conf: Map[String, String]) extends Logging {

lazy val isCastFloatToIntegralTypesEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_INTEGRAL_TYPES)

lazy val isCastDecimalToStringEnabled: Boolean = get(ENABLE_CAST_DECIMAL_TO_STRING)

lazy val isCsvTimestampEnabled: Boolean = get(ENABLE_CSV_TIMESTAMPS)

lazy val isParquetEnabled: Boolean = get(ENABLE_PARQUET)
@@ -772,7 +772,7 @@ class CastChecks extends ExprChecks {
val binaryChecks: TypeSig = none
val sparkBinarySig: TypeSig = STRING + BINARY

val decimalChecks: TypeSig = DECIMAL
val decimalChecks: TypeSig = DECIMAL + STRING
val sparkDecimalSig: TypeSig = numeric + BOOLEAN + TIMESTAMP + STRING

val calendarChecks: TypeSig = none
14 changes: 14 additions & 0 deletions tests/src/test/scala/com/nvidia/spark/rapids/AnsiCastOpSuite.scala
@@ -381,6 +381,20 @@ class AnsiCastOpSuite extends GpuExpressionTestSuite {
comparisonFunc = Some(compareStringifiedFloats))
}

test("ansi_cast decimal to string") {
val sqlCtx = SparkSession.getActiveSession.get.sqlContext
sqlCtx.setConf("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
sqlCtx.setConf("spark.rapids.sql.castDecimalToString.enabled", "true")

Seq(10, 15, 18).foreach { precision =>
Seq(-precision, -5, 0, 5, precision).foreach { scale =>
testCastToString(DataTypes.createDecimalType(precision, scale),
ansiMode = true,
comparisonFunc = Some(compareStringifiedDecimalsInSemantic))
}
}
}

private def castToStringExpectedFun[T]: T => Option[String] = (d: T) => Some(String.valueOf(d))

private def testCastToString[T](dataType: DataType, ansiMode: Boolean,
15 changes: 15 additions & 0 deletions tests/src/test/scala/com/nvidia/spark/rapids/CastOpSuite.scala
@@ -228,6 +228,19 @@ class CastOpSuite extends GpuExpressionTestSuite {
testCastToString[Double](DataTypes.DoubleType, comparisonFunc = Some(compareStringifiedFloats))
}

test("cast decimal to string") {
val sqlCtx = SparkSession.getActiveSession.get.sqlContext
sqlCtx.setConf("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
sqlCtx.setConf("spark.rapids.sql.castDecimalToString.enabled", "true")

Seq(10, 15, 18).foreach { precision =>
Seq(-precision, -5, 0, 5, precision).foreach { scale =>
testCastToString(DataTypes.createDecimalType(precision, scale),
comparisonFunc = Some(compareStringifiedDecimalsInSemantic))
}
}
}
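The nested `foreach` loops in the tests above sweep a grid of (precision, scale) pairs, including negative scales (which is why the tests enable `spark.sql.legacy.allowNegativeScaleOfDecimal`). A quick sketch of the combinations covered, written in Python for brevity:

```python
# Reproduce the (precision, scale) grid exercised by the test above:
# precisions 10, 15, 18 crossed with scales -precision, -5, 0, 5, precision.
pairs = [(p, s) for p in (10, 15, 18) for s in (-p, -5, 0, 5, p)]

print(len(pairs))  # 15 combinations
print(pairs[:5])   # [(10, -10), (10, -5), (10, 0), (10, 5), (10, 10)]
```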

private def testCastToString[T](
dataType: DataType,
comparisonFunc: Option[(String, String) => Boolean] = None) {
@@ -481,6 +494,7 @@ class CastOpSuite extends GpuExpressionTestSuite {
customRandGenerator = Some(new scala.util.Random(1234L)))
testCastToDecimal(DataTypes.createDecimalType(18, 2),
scale = 2,
ansiEnabled = true,
customRandGenerator = Some(new scala.util.Random(1234L)))

// fromScale > toScale
@@ -489,6 +503,7 @@
customRandGenerator = Some(new scala.util.Random(1234L)))
testCastToDecimal(DataTypes.createDecimalType(18, 10),
scale = 2,
ansiEnabled = true,
customRandGenerator = Some(new scala.util.Random(1234L)))
testCastToDecimal(DataTypes.createDecimalType(18, 18),
scale = 15,
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
* Copyright (c) 2020-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -16,7 +16,7 @@

package com.nvidia.spark.rapids

import org.apache.spark.sql.types.{DataType, DataTypes, DecimalType, StructType}
import org.apache.spark.sql.types.{DataType, DataTypes, Decimal, DecimalType, StructType}

abstract class GpuExpressionTestSuite extends SparkQueryCompareTestSuite {

@@ -172,6 +172,11 @@ abstract class GpuExpressionTestSuite extends SparkQueryCompareTestSuite {
}
}

def compareStringifiedDecimalsInSemantic(expected: String, actual: String): Boolean = {
(expected == null && actual == null) ||
(expected != null && actual != null && Decimal(expected) == Decimal(actual))
}
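A minimal Python analogue of the `compareStringifiedDecimalsInSemantic` helper above may make its contract clearer: both sides null counts as equal, one side null does not, and otherwise the strings are compared as parsed decimal values rather than text. The function name here is just an illustrative translation, and `decimal.Decimal` stands in for Spark's `Decimal`:

```python
from decimal import Decimal

def compare_stringified_decimals_in_semantic(expected, actual):
    """Mirror the Scala comparator: equal when both strings are null,
    or when both parse to numerically equal decimals."""
    if expected is None and actual is None:
        return True
    if expected is None or actual is None:
        return False
    return Decimal(expected) == Decimal(actual)

print(compare_stringified_decimals_in_semantic("1.23E+4", "12300"))  # True
print(compare_stringified_decimals_in_semantic(None, None))          # True
print(compare_stringified_decimals_in_semantic("12300", None))       # False
```

This is what lets the tests accept the GPU's "12300" against Spark's "1.23E+4".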

private def getAs(column: RapidsHostColumnVector, index: Int, dataType: DataType): Option[Any] = {
if (column.isNullAt(index)) {
None