Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-8453: Add XSD Support to XML Reader (Part 1) #2824

Merged
merged 19 commits into from
Aug 29, 2023

Conversation

cgivre
Copy link
Contributor

@cgivre cgivre commented Aug 22, 2023

DRILL-8453: Add XSD Support to XML Reader (Part 1)

Description

This PR is a part of a series to add better support for reading XML data to Drill. One of the main challenges is that XML data does not have a way of inferring data types, nor does it have a way of detecting arrays.
The only way to do this really well is to have a schema. Some XML files link a schema definition file to the data. This PR adds the capability for Drill to map XSD schema files into Drill schemas.

The current plan is as follows:

  1. Part 1 of this PR simply adds the reader but adds no new user detectable functionality.
  2. Part 2 will include the actual integration with the XML reader.
  3. Part 3 will include the ability to read arrays in the actual XML reader.

Documentation

No user facing changes.

Testing

Added new unit tests.

@cgivre cgivre requested a review from jnturton August 22, 2023 05:25
@cgivre cgivre self-assigned this Aug 22, 2023
@cgivre
Copy link
Contributor Author

cgivre commented Aug 22, 2023

@mbeckerle

@mbeckerle
Copy link
Contributor

Sorry bogged down. Will review soon.

@mbeckerle
Copy link
Contributor

I'm ok with merging this. It's still a bit of a work-in-progress (hence the Part 1)

Some TODOs in here are mine. I do intend to get to them, but no reason to hold up this change set for that.

I highly recommend that you squash these 15 commits together into one coherent commit rather than commit all 15 as is.

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cgivre
Copy link
Contributor Author

cgivre commented Aug 28, 2023

@jnturton Are we good to go?

@cgivre
Copy link
Contributor Author

cgivre commented Aug 28, 2023

@mbeckerle We always squash commits for Drill PRs :-)
I think the TODOs are ok here since this is part 1.

cgivre and others added 15 commits August 27, 2023 22:46
The test was not constructing the right expectedSchema.
Test enhanced to use attributes and global attributeGroup definition and reference as well as local attribute declarations.

Does not detect, nor handle,  the fact that an attribute is allowed to have the exact same name as an element in XML Schema.

Currently does not distinguish optional/required attributes (they're all addNullable), and does not implement XSD default/fixed.

Note that attribute definitions in XSD come last in the element definition lexically; however, they come out first in Drill field order.
This is consistent with the way XML text looks where the element tag carries attributes first, in the opening tag, then child elements follow after the opening tag.
Added test of complex element with only attributes.

Added numerous assertion checks to be sure invariants are being maintained.
pom.xml Outdated Show resolved Hide resolved
Copy link
Contributor

@jnturton jnturton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@cgivre
Copy link
Contributor Author

cgivre commented Aug 29, 2023

@jnturton I fixed imports.
@mbeckerle I added one exception which removed a TODO.

@cgivre cgivre merged commit aa52f05 into apache:master Aug 29, 2023
8 checks passed
@cgivre cgivre deleted the xsd_reader branch August 29, 2023 15:44
cgivre added a commit to cgivre/drill that referenced this pull request Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants