[SPARK-45562][SQL] XML: Make 'rowTag' a required option #43389

sandip-db · 2023-10-16T22:22:43Z

What changes were proposed in this pull request?

Users can specify the rowTag option, which names the XML element that maps to a DataFrame Row. A rowTag that matches no element infers no schema and generates no DataFrame rows. Currently, omitting the rowTag option falls back to its default value of ROW, which won't match a real XML element in most scenarios. The result is an empty DataFrame, which confuses new users.

This PR makes rowTag a required option for both read and write. The XML built-in functions (from_xml/schema_of_xml) ignore the rowTag option.
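The required-option behavior described above can be illustrated with a minimal, Spark-free sketch. The class name `RowTagCheck`, the method `requireRowTag`, and the error message are hypothetical and chosen only for illustration; this is not the code the PR adds. The lookup here is deliberately a plain, case-sensitive `Map.get`.

```java
import java.util.Map;

/** Hypothetical stand-in for a required-option check; not Spark's actual code. */
final class RowTagCheck {
    private RowTagCheck() {}

    /**
     * Returns the configured rowTag, or fails fast when it is missing.
     * Assumption: a file-based XML read or write would call this before doing any work,
     * while from_xml/schema_of_xml would simply never call it.
     */
    static String requireRowTag(Map<String, String> options) {
        String rowTag = options.get("rowTag");
        if (rowTag == null) {
            // Illustrative message only; the real error text may differ.
            throw new IllegalArgumentException("'rowTag' option is required for XML read and write.");
        }
        return rowTag;
    }
}
```

With this shape, a caller that forgets the option gets an immediate error instead of silently falling back to a default tag and producing an empty result.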

Why are the changes needed?

See above

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

HyukjinKwon · 2023-10-17T11:37:53Z

Merged to master.

beliefer · 2023-10-17T12:17:59Z

sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java

@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException {
    public void testXmlParser() {
        Map<String, String> options = new HashMap<>();
        options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);


Why change this line?

Switched to the shorter form for better code readability.

This change is unrelated to the purpose of this PR, so I would normally recommend leaving it out. It does simplify the code a bit, though, so it is OK.

beliefer · 2023-10-17T12:18:10Z

sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java

@@ -92,7 +92,7 @@ public void testXmlParser() {
    public void testLoad() {
        Map<String, String> options = new HashMap<>();
        options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);


Same as above.

…sensitive

### What changes were proposed in this pull request?

[PR 43389](#43389) made the `rowTag` option required for XML read and write. However, the option check was done in a case-sensitive manner. This PR makes the check case-insensitive.

### Why are the changes needed?

Options are case-insensitive.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43416 from sandip-db/xml-rowTagCaseInsensitive.

Authored-by: Sandip Agarwala <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
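A case-insensitive version of the option check, as the follow-up #43416 describes it, could look roughly like the sketch below. Spark has its own case-insensitive option map; this example instead uses the JDK's `TreeMap` with `String.CASE_INSENSITIVE_ORDER` so that it stays self-contained. The class name `CaseInsensitiveRowTag`, the method name, and the error message are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a case-insensitive required-option lookup; not Spark's actual code. */
final class CaseInsensitiveRowTag {
    private CaseInsensitiveRowTag() {}

    /** Returns the rowTag value regardless of how the key was capitalized, or fails if absent. */
    static String requireRowTag(Map<String, String> options) {
        // Copy the options into a map whose key comparison ignores case,
        // so "rowTag", "rowtag", and "ROWTAG" all resolve to the same entry.
        Map<String, String> caseInsensitive = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        caseInsensitive.putAll(options);

        String rowTag = caseInsensitive.get("rowTag");
        if (rowTag == null) {
            // Illustrative message only; the real error text may differ.
            throw new IllegalArgumentException("'rowTag' option is required for XML read and write.");
        }
        return rowTag;
    }
}
```

A plain `HashMap.get("rowTag")` would miss an option supplied as `rowtag`, which is the kind of false "missing option" error the follow-up avoids.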

…ption

### What changes were proposed in this pull request?

#43389 makes `rowTag` a required option. But the xml API (please see https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1) is unrelated to `rowTag`. This PR also improves some code and removes one line of unused code.

### Why are the changes needed?

Restore the test case so that it does not require the `rowTag` option.

### Does this PR introduce _any_ user-facing change?

'No'.

### How was this patch tested?

Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?

'No'.

Closes #43455 from beliefer/SPARK-45562_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

[SPARK-45562][SQL] XML: Make 'rowTag' a required option

65f8d31

github-actions bot added the SQL label Oct 16, 2023

HyukjinKwon approved these changes Oct 17, 2023

Fix unit tests and move the 'rowTag' check to XmlFileFormat as rowTag is not needed for from_xml/schema_of_xml.

c9a62cc

HyukjinKwon closed this in 4d63ca6 Oct 17, 2023

beliefer reviewed Oct 17, 2023

sandip-db mentioned this pull request Oct 18, 2023

[SPARK-45562][SQL][FOLLOW-UP] XML: Make 'rowTag' option check case insensitive #43416

Closed

beliefer mentioned this pull request Oct 19, 2023

[SPARK-45562][SQL][FOLLOWUP] Restore test case not require rowTag option. #43455

Closed

sandip-db mentioned this pull request Nov 8, 2023

[SPARK-45562][SQL][FOLLOW-UP] XML: Add SQL error class for missing rowTag option #43710

Closed

winningsix mentioned this pull request Jan 29, 2024

[AUDIT][Spark 4.0]XML: Make 'rowTag' a required option NVIDIA/spark-rapids#10311

Closed
