[BUG] Plugin does not support CSV Header Checking #2862

revans2 · 2021-07-02T13:27:51Z

Describe the bug
When the schema supplied for a CSV read does not match the header in the file, Spark logs a warning:

scala> val schema = StructType(Seq(StructField("INPUT", StringType), StructField("INPUT1", StringType), StructField("MORE", StringType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(INPUT,StringType,true), StructField(INPUT1,StringType,true), StructField(MORE,StringType,true))
scala> val df = spark.read.option("header", true).schema(schema).csv("duplicate.csv")
df: org.apache.spark.sql.DataFrame = [INPUT: string, INPUT1: string ... 1 more field]
scala> df.show
21/07/02 13:02:08 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: INPUT, INPUT, MORE
 Schema: INPUT, INPUT1, MORE
Expected: INPUT1 but found: INPUT
CSV file: file:///home/roberte/src/rapids-plugin-4-spark/duplicate.csv
+-----+------+----+
|INPUT|INPUT1|MORE|
+-----+------+----+
|    1|     2|   3|
|    1|     2|   3|
|    1|     2|   3|
+-----+------+----+

But when the plugin is enabled there is no warning from CSVHeaderChecker.

Steps/Code to reproduce bug
Create a CSV file whose header differs from the schema passed to the reader, then read it with the plugin enabled. Local mode is preferable: the warning is logged by the process that reads the file, so on a cluster it will not easily get back to the end user.
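A minimal sketch for creating the input file. The contents are an assumption reconstructed from the header and rows shown in the log output above; the original `duplicate.csv` was not attached. `DuplicateCsvRepro` is a hypothetical helper name.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object DuplicateCsvRepro {
  // Assumed contents: a header with a duplicated INPUT column and three
  // identical rows, matching what df.show printed in the report.
  val contents: String =
    """INPUT,INPUT,MORE
      |1,2,3
      |1,2,3
      |1,2,3
      |""".stripMargin

  // Writes the file (creating or overwriting it) and returns the path.
  def write(path: Path): Path =
    Files.write(path, contents.getBytes(StandardCharsets.UTF_8))
}
```

From spark-shell, calling `DuplicateCsvRepro.write(java.nio.file.Paths.get("duplicate.csv"))` and then running the `spark.read.option("header", true).schema(schema).csv("duplicate.csv")` call from the description should reproduce the CPU-side warning.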

Expected behavior
The plugin also outputs a warning.
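To make the expected behavior concrete, here is a rough sketch of a positional header-versus-schema comparison that yields a message shaped like the one in the log above. This is not Spark's `CSVHeaderChecker` and not plugin code; `HeaderCheckSketch`, `headerMismatch`, the `caseSensitive` default, and the column-count message are all hypothetical.

```scala
object HeaderCheckSketch {
  // Hypothetical helper: compares header names to schema names by position
  // and returns Some(warning text) on the first mismatch, None otherwise.
  def headerMismatch(
      header: Seq[String],
      schema: Seq[String],
      source: String,
      caseSensitive: Boolean = false): Option[String] = {
    def norm(name: String): String = if (caseSensitive) name else name.toLowerCase
    val summary =
      s" Header: ${header.mkString(", ")}\n Schema: ${schema.mkString(", ")}\n"

    if (header.length != schema.length) {
      // Assumed wording for a column-count mismatch.
      Some(s"CSV header has ${header.length} columns but the schema has " +
        s"${schema.length}.\n${summary}CSV file: $source")
    } else {
      header.zip(schema).collectFirst {
        case (found, expected) if norm(found) != norm(expected) =>
          "CSV header does not conform to the schema.\n" +
            s"${summary}Expected: $expected but found: $found\nCSV file: $source"
      }
    }
  }
}
```

With the header `INPUT, INPUT, MORE` and the schema `INPUT, INPUT1, MORE`, the first mismatch is at the second column, so the message contains `Expected: INPUT1 but found: INPUT`.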

Additional context
This was found because the cudf team asked us about requirements for duplicate header names. Apparently Pandas and Spark create different unique header names when there are duplicates. They were in the process of making cudf do the right thing for the Pandas case, and wanted to be sure it would not cause issues with Spark. When we do implement this feature, we need to test it with duplicate column names to be sure that cudf does the right thing for the warnings.
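As an illustration of how the two naming schemes can diverge, the sketch below encodes my understanding of each: Spark appends the column position to every duplicated name, while Pandas leaves the first occurrence alone and appends `.1`, `.2`, and so on to repeats. Both rules are assumptions that should be checked against the Spark and Pandas versions in use, and `DedupSketch` is a hypothetical name. Neither function handles empty names or collisions with existing columns.

```scala
import scala.collection.mutable

object DedupSketch {
  // Assumed Spark-style rule: every duplicated name gets its 0-based
  // column position appended; unique names are left unchanged.
  def sparkStyle(header: Seq[String]): Seq[String] = {
    val duplicated = header.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }.toSet
    header.zipWithIndex.map { case (name, index) =>
      if (duplicated(name)) s"$name$index" else name
    }
  }

  // Assumed Pandas-style rule: the first occurrence keeps its name and
  // each later occurrence gets ".N", where N counts earlier occurrences.
  def pandasStyle(header: Seq[String]): Seq[String] = {
    val seen = mutable.Map.empty[String, Int]
    header.toList.map { name =>
      val count = seen.getOrElse(name, 0)
      seen(name) = count + 1
      if (count == 0) name else s"$name.$count"
    }
  }
}
```

Under these assumptions, the header `INPUT, INPUT, MORE` would become `INPUT0, INPUT1, MORE` in the Spark style and `INPUT, INPUT.1, MORE` in the Pandas style, which is the kind of difference the tests for this feature should cover.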


revans2 added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels Jul 2, 2021

Salonijain27 removed the "? - Needs Triage" (Need team to review and classify) label Jul 6, 2021

revans2 mentioned this issue Oct 27, 2022

[BUG] Fix CSV Parsing #2063

Open

38 tasks
