
[BUG] CSV reading null inconsistent between spark.rapids.sql.format.csv.enabled=true&false #1986

Closed
mattf opened this issue Mar 22, 2021 · 4 comments · Fixed by #4790
Labels: bug (Something isn't working)

Comments


mattf commented Mar 22, 2021

Describe the bug

input...

empty-0space,,end
empty-0space-quoted,"",end
empty-1space, ,end
empty-1space-quoted," ",end
empty-2space,  ,end
empty-2space-quoted,"  ",end
no-null,3.14,end

schema...

StructType([StructField("firstField", StringType()),
            StructField("dblField", DoubleType(), True),
            StructField("lastField", StringType())])

config spark.rapids.sql.format.csv.enabled = true ...

+-------------------+--------+---------+
|         firstField|dblField|lastField|
+-------------------+--------+---------+
|       empty-0space|    null|      end|
|empty-0space-quoted|    null|      end|
|       empty-1space|     0.0|      end|
|empty-1space-quoted|    null|      end|
|       empty-2space|     0.0|      end|
|empty-2space-quoted|    null|      end|
|            no-null|    3.14|      end|
+-------------------+--------+---------+

config spark.rapids.sql.format.csv.enabled = false ...

+-------------------+--------+---------+
|         firstField|dblField|lastField|
+-------------------+--------+---------+
|       empty-0space|    null|      end|
|empty-0space-quoted|    null|      end|
|       empty-1space|    null|      end|
|empty-1space-quoted|    null|      end|
|       empty-2space|    null|      end|
|empty-2space-quoted|    null|      end|
|            no-null|    3.14|      end|
+-------------------+--------+---------+

Expected behavior
The results should be the same regardless of the spark.rapids.sql.format.csv.enabled setting.

Environment details (please complete the following information)

  • Environment location: local
  • Spark configuration settings related to the issue: 0.4 w/ libcudf 0.18.1
mattf added the "? - Needs Triage" and "bug" labels Mar 22, 2021

mattf commented Mar 22, 2021

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# the backslash after the opening quotes avoids writing an empty first line into the CSV
INPUT = """\
empty-0space,,end
empty-0space-quoted,"",end
empty-1space, ,end
empty-1space-quoted," ",end
empty-2space,  ,end
empty-2space-quoted,"  ",end
no-null,3.14,end
"""

with open("repro-data.csv", "w") as fp:
    fp.write(INPUT)

schema = StructType([StructField("firstField", StringType()),
                     StructField("dblField", DoubleType(), True),
                     StructField("lastField", StringType())])

spark = SparkSession.builder.config("spark.plugins", "com.nvidia.spark.SQLPlugin").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

print("input...")
print(INPUT)
print("schema:", schema)

print("spark.rapids.sql.format.csv.enabled =", spark.conf.get("spark.rapids.sql.format.csv.enabled"))
spark.read.csv("repro-data.csv", schema=schema).show()

sameerz removed the "? - Needs Triage" label Mar 23, 2021

revans2 commented Mar 23, 2021

So just to clarify a bit: by default Spark treats an empty string as null. When Spark parses " " or "  " it sees a string value with one or two spaces in it. It then tries to parse that string as a floating point value, which fails because it does not match the pattern for a float, so Spark inserts a null.

cudf is not doing this. We turned on CSV parsing knowing that there are a lot of incompatibilities in it, but because it is so much faster on the GPU than on the CPU we wanted to do it anyway. I think in the short term we are going to disable CSV parsing by default, and longer term look at enabling it selectively for types where we know that cudf either parses correctly or we can work around the differences after the fact.
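The difference described above can be sketched as a tiny pure-Python model. Note that cpu_parse_double and gpu_parse_double are hypothetical illustrations of the two parsers' handling of unquoted fields, not actual Spark or cudf code:

```python
def cpu_parse_double(field: str):
    # Model of Spark's CPU behavior described above: an empty field is
    # null by default, and a field of spaces fails the float pattern,
    # so it also becomes null.
    if field == "":
        return None
    try:
        return float(field)  # float(" ") raises ValueError -> null
    except ValueError:
        return None


def gpu_parse_double(field: str):
    # Model of the cudf behavior observed in this bug for unquoted
    # fields: whitespace appears to be stripped first, so a field that
    # held only spaces comes back as 0.0, while a truly empty field is
    # still null.
    if field == "":
        return None
    stripped = field.strip()
    if stripped == "":
        return 0.0  # " " and "  " showed up as 0.0 in the GPU output
    try:
        return float(stripped)
    except ValueError:
        return None


for field in ["", " ", "  ", "3.14"]:
    print(repr(field), cpu_parse_double(field), gpu_parse_double(field))
```

This reproduces the two dblField columns in the bug report: both models agree on "" and "3.14", and disagree only on whitespace-only unquoted fields.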


sameerz commented Mar 23, 2021

Adding to 0.5 to set spark.rapids.sql.format.csv.enabled to false by default.
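Until that change lands, the flag can also be pinned explicitly per job. A sketch of what that might look like with spark-submit (the script name repro.py is a placeholder; adjust for your deployment):

```
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.format.csv.enabled=false \
  repro.py
```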


sameerz commented Apr 13, 2021

The PR to disable CSV by default is #2072. There are other longer-term fixes required here, so leaving this open.
