
[BUG] CSV reading null inconsistent between spark.rapids.sql.format.csv.enabled=true&false #1986

Closed
mattf opened this issue Mar 22, 2021 · 4 comments · Fixed by #4790
Labels: bug (Something isn't working)

Comments


mattf commented Mar 22, 2021

Describe the bug

input...

empty-0space,,end
empty-0space-quoted,"",end
empty-1space, ,end
empty-1space-quoted," ",end
empty-2space,  ,end
empty-2space-quoted,"  ",end
no-null,3.14,end

schema...

StructType([StructField("firstField", StringType()),
            StructField("dblField", DoubleType(), True),
            StructField("lastField", StringType())])

config spark.rapids.sql.format.csv.enabled = true ...

+-------------------+--------+---------+
|         firstField|dblField|lastField|
+-------------------+--------+---------+
|       empty-0space|    null|      end|
|empty-0space-quoted|    null|      end|
|       empty-1space|     0.0|      end|
|empty-1space-quoted|    null|      end|
|       empty-2space|     0.0|      end|
|empty-2space-quoted|    null|      end|
|            no-null|    3.14|      end|
+-------------------+--------+---------+

config spark.rapids.sql.format.csv.enabled = false ...

+-------------------+--------+---------+
|         firstField|dblField|lastField|
+-------------------+--------+---------+
|       empty-0space|    null|      end|
|empty-0space-quoted|    null|      end|
|       empty-1space|    null|      end|
|empty-1space-quoted|    null|      end|
|       empty-2space|    null|      end|
|empty-2space-quoted|    null|      end|
|            no-null|    3.14|      end|
+-------------------+--------+---------+

Expected behavior
The results should be the same regardless of the spark.rapids.sql.format.csv.enabled setting.

Environment details (please complete the following information)

  • Environment location: local
  • Spark configuration settings related to the issue: 0.4 w/ libcudf 0.18.1
mattf added the "? - Needs Triage" and "bug" labels Mar 22, 2021

mattf commented Mar 22, 2021

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# the backslash after the opening quotes avoids writing an empty first line into the CSV
INPUT = """\
empty-0space,,end
empty-0space-quoted,"",end
empty-1space, ,end
empty-1space-quoted," ",end
empty-2space,  ,end
empty-2space-quoted,"  ",end
no-null,3.14,end
"""

with open("repro-data.csv", "w") as fp:
    fp.write(INPUT)

schema = StructType([StructField("firstField", StringType()),
                     StructField("dblField", DoubleType(), True),
                     StructField("lastField", StringType())])

spark = SparkSession.builder.config("spark.plugins", "com.nvidia.spark.SQLPlugin").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

print("input...")
print(INPUT)
print("schema:", schema)

print("spark.rapids.sql.format.csv.enabled =", spark.conf.get("spark.rapids.sql.format.csv.enabled"))
spark.read.csv("repro-data.csv", schema=schema).show()

sameerz removed the "? - Needs Triage" label Mar 23, 2021

revans2 commented Mar 23, 2021

So just to clarify a bit: by default Spark treats an empty string as null. When Spark parses " " or "  " it sees a string value with one or two spaces in it. It then tries to parse that string as a floating point value, which fails because it does not match the pattern for a float, so Spark inserts a null.

cudf is not doing this. We turned on CSV parsing knowing that there are a lot of incompatibilities in it, but because it is so much faster on the GPU than on the CPU we wanted to do it anyway. I think in the short term we are going to disable CSV parsing by default, and longer term look at enabling it selectively for types where we know that cudf either parses correctly or we can work around the differences after the fact.
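The difference described above can be sketched as a tiny pure-Python model. Note that cpu_parse_double and gpu_parse_double are hypothetical illustrations of the two parsers' handling of unquoted fields, not actual Spark or cudf code:

```python
def cpu_parse_double(field: str):
    # Model of Spark's CPU behavior described above: an empty field is
    # null by default, and a field of spaces fails the float pattern,
    # so it also becomes null.
    if field == "":
        return None
    try:
        return float(field)  # float(" ") raises ValueError -> null
    except ValueError:
        return None


def gpu_parse_double(field: str):
    # Model of the cudf behavior observed in this bug for unquoted
    # fields: whitespace appears to be stripped first, so a field that
    # held only spaces comes back as 0.0, while a truly empty field is
    # still null.
    if field == "":
        return None
    stripped = field.strip()
    if stripped == "":
        return 0.0  # " " and "  " showed up as 0.0 in the GPU output
    try:
        return float(stripped)
    except ValueError:
        return None


for field in ["", " ", "  ", "3.14"]:
    print(repr(field), cpu_parse_double(field), gpu_parse_double(field))
```

This reproduces the two dblField columns in the bug report: both models agree on "" and "3.14", and disagree only on whitespace-only unquoted fields.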


sameerz commented Mar 23, 2021

Adding to 0.5 to set spark.rapids.sql.format.csv.enabled to false by default.
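Until that change lands, the flag can also be pinned explicitly per job. A sketch of what that might look like with spark-submit (the script name repro.py is a placeholder; adjust for your deployment):

```
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.format.csv.enabled=false \
  repro.py
```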


sameerz commented Apr 13, 2021

The PR to disable CSV by default is #2072. There are other longer-term fixes required here, so leaving this open.
