Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45562][SQL] XML: Make 'rowTag' a required option #43389

Closed
wants to merge 2 commits into from

Conversation

sandip-db
Copy link
Contributor

@sandip-db sandip-db commented Oct 16, 2023

What changes were proposed in this pull request?

User can specify rowTag option that is the name of the XML element that maps to a DataFrame Row. A non-existent rowTag will not infer any schema or generate any DataFrame rows. Currently, not specifying rowTag option results in picking up its default value of ROW, which won't match a real XML element in most scenarios. This results in an empty dataframe and confuse new users.

This PR makes rowTag a required option for both read and write. XML built-in functions (from_xml/schema_of_xml) ignore rowTag option.

Why are the changes needed?

See above

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Oct 16, 2023
@HyukjinKwon
Copy link
Member

Merged to master.

@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException {
public void testXmlParser() {
Map<String, String> options = new HashMap<>();
options.put("rowTag", booksFileTag);
Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
Dataset<Row> df = spark.read().options(options).xml(booksFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to shortened version for better code readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this change is not related to the theme, it is not recommended to modify it. But it does simplify the code a bit and is also OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@@ -92,7 +92,7 @@ public void testXmlParser() {
public void testLoad() {
Map<String, String> options = new HashMap<>();
options.put("rowTag", booksFileTag);
Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
Dataset<Row> df = spark.read().options(options).xml(booksFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

HyukjinKwon pushed a commit that referenced this pull request Oct 19, 2023
…sensitive

### What changes were proposed in this pull request?
[PR 43389](#43389) made `rowTag` option required for XML read and write. However, the option check was done in a case sensitive manner. This PR makes the check case-insensitive.

### Why are the changes needed?
Options are case-insensitive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43416 from sandip-db/xml-rowTagCaseInsensitive.

Authored-by: Sandip Agarwala <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Oct 20, 2023
…ption

### What changes were proposed in this pull request?
#43389 makes `rowTag` a required option. But the xml API (please see https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1) is unrelated to `rowTag`.

This PR also improves some code and remove one line of unused code.

### Why are the changes needed?
Restore test case not require `rowTag` option.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Exists test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43455 from beliefer/SPARK-45562_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants