-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-45562][SQL] XML: Make 'rowTag' a required option #43389
Conversation
… is not needed for from_xml/schema_of_xml.
Merged to master. |
@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException { | |||
public void testXmlParser() { | |||
Map<String, String> options = new HashMap<>(); | |||
options.put("rowTag", booksFileTag); | |||
Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile); | |||
Dataset<Row> df = spark.read().options(options).xml(booksFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to shortened version for better code readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this change is not related to the theme, it is not recommended to modify it. But it does simplify the code a bit and is also OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
@@ -92,7 +92,7 @@ public void testXmlParser() { | |||
public void testLoad() { | |||
Map<String, String> options = new HashMap<>(); | |||
options.put("rowTag", booksFileTag); | |||
Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile); | |||
Dataset<Row> df = spark.read().options(options).xml(booksFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
…sensitive ### What changes were proposed in this pull request? [PR 43389](#43389) made `rowTag` option required for XML read and write. However, the option check was done in a case sensitive manner. This PR makes the check case-insensitive. ### Why are the changes needed? Options are case-insensitive. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43416 from sandip-db/xml-rowTagCaseInsensitive. Authored-by: Sandip Agarwala <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ption ### What changes were proposed in this pull request? #43389 makes `rowTag` a required option. But the xml API (please see https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1) is unrelated to `rowTag`. This PR also improves some code and remove one line of unused code. ### Why are the changes needed? Restore test case not require `rowTag` option. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Exists test cases. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #43455 from beliefer/SPARK-45562_followup. Authored-by: Jiaan Geng <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
User can specify
rowTag
option that is the name of the XML element that maps to aDataFrame Row
. A non-existentrowTag
will not infer any schema or generate anyDataFrame
rows. Currently, not specifyingrowTag
option results in picking up its default value ofROW
, which won't match a real XML element in most scenarios. This results in an empty dataframe and confuse new users.This PR makes
rowTag
a required option for both read and write. XML built-in functions (from_xml/schema_of_xml) ignorerowTag
option.Why are the changes needed?
See above
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests
Was this patch authored or co-authored using generative AI tooling?
No