Describe the bug
When the schema supplied for a CSV file does not match the headers in the file, Spark logs a warning:
scala> val schema = StructType(Seq(StructField("INPUT", StringType), StructField("INPUT1", StringType), StructField("MORE", StringType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(INPUT,StringType,true), StructField(INPUT1,StringType,true), StructField(MORE,StringType,true))
scala> val df = spark.read.option("header", true).schema(schema).csv("duplicate.csv")
df: org.apache.spark.sql.DataFrame = [INPUT: string, INPUT1: string ... 1 more field]
scala> df.show
21/07/02 13:02:08 WARN CSVHeaderChecker: CSV header does not conform to the schema.
Header: INPUT, INPUT, MORE
Schema: INPUT, INPUT1, MORE
Expected: INPUT1 but found: INPUT
CSV file: file:///home/roberte/src/rapids-plugin-4-spark/duplicate.csv
+-----+------+----+
|INPUT|INPUT1|MORE|
+-----+------+----+
|    1|     2|   3|
|    1|     2|   3|
|    1|     2|   3|
+-----+------+----+
But when the plugin is enabled, there is no warning from CSVHeaderChecker.
Steps/Code to reproduce bug
Create a CSV file whose header differs from the schema passed in, then read it with the plugin enabled. Local mode is preferable: the warning is logged by the process that reads the file, so outside local mode it will not easily get back to the end user.
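A minimal sketch of the reproduction, meant to be pasted into spark-shell in local mode (where `spark` is already defined). The file contents are an assumption reconstructed from the header and rows shown in the transcript above; the schema and the read call are the ones from the report.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed file contents: a header that repeats INPUT, followed by three rows.
val csv = "INPUT,INPUT,MORE\n1,2,3\n1,2,3\n1,2,3\n"
Files.write(Paths.get("duplicate.csv"), csv.getBytes("UTF-8"))

// Schema from the report; its second name (INPUT1) differs from the header (INPUT).
val schema = StructType(Seq(
  StructField("INPUT", StringType),
  StructField("INPUT1", StringType),
  StructField("MORE", StringType)))

val df = spark.read.option("header", true).schema(schema).csv("duplicate.csv")

// The header is checked when the file is actually read, so trigger a read.
df.show()
```

Run it once without the plugin and once with it enabled, and compare the console output for the CSVHeaderChecker warning.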
Expected behavior
The plugin also outputs a warning.
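As a rough illustration of the kind of check that would produce such a warning, here is a small standalone sketch in plain Scala. It is not Spark's CSVHeaderChecker and not the plugin's code: `HeaderCheckSketch` and `headerMismatch` are hypothetical names, the case-insensitive default is an assumption, and the column-count message is made up for the sketch. Only the mismatch message follows the wording of the warning shown above.

```scala
object HeaderCheckSketch {
  /** Returns a warning message when the header names do not line up
    * positionally with the schema names, or None when they conform.
    * Hypothetical helper for illustration only.
    */
  def headerMismatch(
      header: Seq[String],
      schema: Seq[String],
      caseSensitive: Boolean = false): Option[String] = {
    // Assumption: names are compared case-insensitively unless told otherwise.
    def norm(name: String): String = if (caseSensitive) name else name.toLowerCase

    val summary =
      s"Header: ${header.mkString(", ")}\nSchema: ${schema.mkString(", ")}"

    if (header.length != schema.length) {
      // Wording invented for this sketch.
      Some(s"Header has ${header.length} columns but schema has " +
        s"${schema.length} fields.\n$summary")
    } else {
      // Report the first position where the names differ.
      header.zip(schema).collectFirst {
        case (found, expected) if norm(found) != norm(expected) =>
          "CSV header does not conform to the schema.\n" +
            s"$summary\nExpected: $expected but found: $found"
      }
    }
  }
}

// The header and schema from this report differ at the second column.
HeaderCheckSketch
  .headerMismatch(Seq("INPUT", "INPUT", "MORE"), Seq("INPUT", "INPUT1", "MORE"))
  .foreach(msg => println(s"WARN $msg"))
```

A check like this only needs the header row and the requested schema, so it could in principle run wherever the header is first parsed.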
Additional context
This was found because the cudf team asked us about requirements for duplicate header names. Apparently Pandas and Spark generate different unique header names when there are duplicates. They were in the process of making cudf do the right thing for the Pandas case and wanted to be sure it would not cause issues with Spark. When we implement this feature, we need to test it with duplicate column names so that we are sure cudf is doing the right thing for the warnings.