Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-756: Add Union Logical type #44

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

julienledem
Copy link
Member

No description provided.

* A Union type
* This type annotates data stored as a Group
* this shows the intent to have heterogenous types under the same field name
* the names of the fields in the annotated Group are not important in such a case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should include a few more details for implementing the Union type. For example, all fields of a union group must be optional and for each value, writers must set all but one option/field to null. It should also state that if more than one option is set, only one will be returned and which one is implementation-specific (because this is affected by column projection).

As we talked about in the sync-up, we should state that whether a union value can be null is determined by the union group's repetition: optional union groups can contain null and required union groups cannot. That way, we can always distinguish between a null value in the union (group level is null) and a non-null is a union option that wasn't projected.

Could you also update the logical types documentation?

@julienledem
Copy link
Member Author

@rdblue updated

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should clarify the handling of null values a bit more. Otherwise this is looking good.

A Union can not contain null but can be null itself if in an optional field.

// Union<String, Integer, Boolean> (nullable union of either String, Integer or Boolean)
optional group my_union (Union) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Union should be UNION

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

If more than one is defined the behavior is undefined and may changed depending on the projection applied.
A Union can not contain null but can be null itself if in an optional field.

// Union<String, Integer, Boolean> (nullable union of either String, Integer or Boolean)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more clear to use "Union of null, String, Integer, or Boolean" because the intent is not to have a Union container that may itself be null, although that is how it is stored. If we're going with Java-ish types for clarity, then maybe it's more accurate to use "Void | String | Integer | Boolean".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike Avro, in Parquet Null is not a type. Just like Avro Records do not have optional fields.
So I think that this spec should not treat Null as a type.
It is clearer to have one-to-one mapping between the types here and the fields in the Group since this describes how it is stored.

If we want to clarify this, I can add the following:
Mapping to Avro types:

  • an Avro Union that contains Null and at least 2 other types will map to an optional Parquet Union (of the remaining types).
  • an Avro Union that does not contain null will map to a required Parquet Union.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not suggesting we treat null as a type. I just want this to be clear that the group is not exposed. If the group level is defined, then one (and only one) branch is non-null and that is the value. If the group level is not defined, then the value is null.

optional boolean bool;
}

// Union<String, Integer, Boolean> (required union of either String, Integer or Boolean)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change the wording here, too. The union isn't required, one of the branches is required to be non-null.

* The names of the fields in the annotated Group are not important in such a case.
* All fields of the Group must be optional and exactly one is defined for each instance of the group.
* If more than one is defined the behavior is undefined and may changed depending on the projection applied.
* A Union can not contain null but can be null itself if it's an optional field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this say "Union groups that are required must contain at least one non-null field. Union groups that are optional can be null, which is used to encode the case where none of the branches of the union are non-null."? I want to be very clear how the union group's definition level is used to encode a null value, and how implementations should use that bit to distinguish between a null union value and a missing non-null branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "at least one non-null field" ? It should be exactly one non-null field.
"the case where none of the branches of the union are non-null." => the case where all of the branches of the union are null.
I will clarify.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it should be one non-null field, not at least one.

@julienledem
Copy link
Member Author

@rdblue I have updated the spec per your comments.


- If the union is nullable then at most one field is non-null and the field containing the union is optional
```
// Optional<Union<String, Integer, Boolean>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more clear if this was Union<String, Integer, Boolean> and noted "the value of the union may be null" rather than adding the Optional level. That makes it clear that it isn't actually wrapped, it is just that the value can be null in this case and that is encoded by the definition level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

optional boolean bool;
}
```
The union field is used to differentiate a null value (the field was null to start with) from a projection that excludes the non-null field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "the definition level of the UNION group is used to differentiate a null value (the union was null to start with) . . .". I'd just avoid using the term "field" to refer to the whole group for clarity.

Same thing in the next couple of sentences, I think it is more clear to remove "field" and refer to whether the group is optional or required: "If the union group is null, then the value was null" / "If the union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@isnotinvain
Copy link
Contributor

I think we might need more discussion of how we want projects into unions to work.

I don't think all object models can return an "empty" union (eg thrift). Additionally, because we don't store group data, only data for each primitive column, in practice in order to do projections into unions you have to do things like pick an arbitrary child field of the union and use it's definition levels to figure out if the union "wing" is null or not. We've got a lot of these workarounds in the thrift implementation, but it'd be good to iron out how we really want that to work.

Another option would be to add a push down filter for any "branch" of a union that hasn't been selected in a projection.

@rdblue
Copy link
Contributor

rdblue commented Nov 3, 2016

@isnotinvain, is the solution to the empty union simply that thrift can't project a subset of the union's branches?

I don't really like the idea of filtering rows that don't have non-null values, but I do see the problem. What about requiring the repetition of a union to be optional when projecting a subset of the branches?

@isnotinvain
Copy link
Contributor

isnotinvain commented Nov 4, 2016

I'm not sure if I'm being clear, let me use an example of the issues we've seen with thrift union support.

  1. Selecting only columns from 1 "wing" of a union
    Lets say we have schema:
union Animal {
  1: optional Dog dog
  2: optional Cat cat
  3: optional Turtle turtle
}

struct Dog {
  1: required string name
  2: required string bark
}

struct Cat {
  1: required string name
  2: required int numLives
}

struct Turtle {
  1: required string name
}

And now, the user uses projection to select only the columns animal.cat.numLives
In any of the type safe implementations of parquet, we still have to return the user an instance of Animal. In the case of a record that is a dog, what should we return them?

One option is to return an 'empty' Dog. This has two issues:

  1. Dog's fields are required -- thrift doesn't really let you get away with not setting those fields
  • this is solved by either putting null-like values (0, null, false, etc) in those fields
  • or by throwing an exception on access
  1. When parquet encounters this record it is unknown whether this record is a Dog or a Turtle. All we know is that it's not a Cat (it's cat field is known to be null because animal.cat.numLives's definition levels tell us that. But we didn't load up any Dog/Turtle columns so we don't know if the Animal's optional dog / turtle fields are null or not (we don't store group level data, it's only in the primitive columns). So what we've done here in the past is pick at least one arbitrary column from each "wing" of the union so that we can look at the definition levels of all "wings"

Does that make sense?

@isnotinvain
Copy link
Contributor

Oh, and, that's why filtering the "unknowns" (the is this a dog or a turtle case) away sort of makes sense? Like, the user didn't ask for any dog/turtle columns so I guess they don't want those records? (probably more confusing than helpful)

@isnotinvain
Copy link
Contributor

isnotinvain commented Nov 4, 2016

One more thing I should probably clarify -- this would be pretty easy if thrift represented unions the way they look in the above schema definition, we could just return something like:

Animal x = new Animal()
x.dog = null
x.cat = null
x.turtle = null
return x

The problem in thrift is there will be no Animal concrete class. Just an interface called Animal and 3 subclasses. There is an "unknown" subclass -- but that is usually reserved for when we know, based on the data on disk, that we don'k know what kind of union this is (because our schema is older than the writer's schema). Turning a known record into an unknown based on projections seems incorrect too.

@rdblue
Copy link
Contributor

rdblue commented Nov 4, 2016

When selecting Cat in your example, we shouldn't create an empty Dog or Turtle, but instead return null because Cat was null. That's why I think we may want to require the union to be optional when projecting some of its branches: we will need to fill in with null for branches that aren't projected. For Thrift -- and Avro when there is no NULL option -- I think the consequence is not allowing the user to project a subset of the union's branches because it violates the contract of the requested schema to return null.

It is better to fail early in this projection case than to fake the other branches. Like you said, we have a problem if we don't know which other union branch was non-null. We also can't add a filter. Say I want to know the percentage of PetOwners in California that have cats. Then I filter PetOwner on state = 'CA' and project pet.cat. The null values (not cats) are relevant.

In terms of the spec for a UNION logical type, I think we end up with this: "When projecting a proper subset of the union's branches, the union itself must be optional. The union value should be null for branches that are not projected."

@julienledem
Copy link
Member Author

Where do we want to document this?
I'd separate what is defined by the format and what is specific to each model integration.
To me it sounds like we'd want to formalize the details in parquet-thrift and parquet-avro that have their own specificities in the object model (and their could be more than one way of doing it per model).
In LogicalTypes.md I'd add the following statements:

  • To know if a union is null you need to keep at least one column from one the branches.
  • if you project out some branches of the union the type of the union will be "unknown" for those at read time. The way to expose this is dependent on the object model integration (avro, thrift, ...). To know that type you need to keep at least one column from each branch.
  • Filtering "unknown" things out is defined by the model as well.
    And add the details about Thrift and Avro in their respective directory (possibly link from here)

@julienledem
Copy link
Member Author

I rebased and addressed @rdblue latest feedback before the union projection discussion.
I'm planning to add the details I mention in #44 (comment)
pending feedback from @rdblue and @isnotinvain .

@rdblue
Copy link
Contributor

rdblue commented Nov 4, 2016

Implications in the object model should go in each model. Avro should document that it can't project unless the requested schema includes a NULL branch. However, the overall requirement that Parquet won't project a subset of the branches unless the union group is optional should be documented in the spec since that's a general requirement for how Parquet projection works.

@julienledem
Copy link
Member Author

@rdblue Agreed except for the last bit: "Parquet won't project a subset of the branches unless the union group is optional"
I don't think it should. Users should be able to project whatever they want and we should not add artificial constraints like this.
If the union is required then a missing value just means "in one of the branches that you projected out". This is a totally valid use case. Typically you could want to read all cats and filter out all other branches of the union and you don't care whether the rows you ignored were turtles or dogs.

@julienledem
Copy link
Member Author

@rdblue but this constraint can be defined in avro and thrift. Just not as a global rule.

@rdblue
Copy link
Contributor

rdblue commented Nov 4, 2016

Users should be able to project whatever they want and we should not add artificial constraints like this.

I think you're right. I wasn't thinking that it was an artificial constraint, but there's no reason for Parquet to require this because its structure is clear.

@isnotinvain
Copy link
Contributor

I'm not at a computer now but I will reply tonight. We can't return NULL
for an option union even, as that's not really correct -- that union wasn't
null it was a present Cat/Turtle -- claiming it was null is not really
accurate I think.

On Friday, November 4, 2016, Ryan Blue [email protected] wrote:

Users should be able to project whatever they want and we should not add
artificial constraints like this.

I think you're right. I wasn't thinking that it was an artificial
constraint, but there's no reason for Parquet to require this because its
structure is clear.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#44 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAWw6SOg3g-JADRVXTFIqP1SDhqnFApGks5q65crgaJpZM4KhhQQ
.

@rdblue
Copy link
Contributor

rdblue commented Nov 4, 2016

I think it's for the object model to decide how to handle the projection, but that null is a reasonable option. The object model could expose the union as individual branches, or could create objects like Dog and Turtle with no fields, or have a "not project" sentinel. It's really up to the model implementers to determine. I think for Avro, we'll go with null and document that null will be returned when non-null branches aren't projected.

@julienledem
Copy link
Member Author

julienledem commented Nov 8, 2016

@rdblue @isnotinvain I have updated the doc with more details about projecting unions.
Could you create each a PR for the following?

thank you.

@isnotinvain
Copy link
Contributor

I still have some reservations about this. I think an expected contract (but maybe not an explicit contract) that we have is that changing your projection shouldn't change your results, only your efficiency?

Lets say a user's query is something like:

long dogs = 0;
long cats = 0;
long others = 0;
Animal a = parquetData.next();

if (a.isDog()) {
  dogs++;
} else if (a.isCat()) {
  cats++;
} else {
  others++;
}

If the users selects all the columns, they'll get an accurate count of dogs and cats, and others.
If they select only some columns from Dog, then a.isCat() will return false, and that cat will get counted as an 'other'

This seems pretty surprising to me, and is why in the thrift integration we went through a lot of hoops to try to avoid it. I think if parquet is going to support unions as a first class concept, instead of pushing these complicated decisions to each object model, it'd be nice if parquet could handle this for us for all object models.

We could for example, write in parquet-core an efficient way to read on the definition levels of one child of each union branch, so that we can tell each object model which wing of the union an object belongs to. Then the object model can do w/e it wants with that info, but I think handeling this belongs in parquet-core ideally, and isn't super difficult nor inefficient (assuming we can read definition levels only).

@isnotinvain
Copy link
Contributor

Another thing to consider about adding first class support for unions is efficiency.
We've found that parquet pays a pretty high cost reading nulls, especially in large schemas made up of lots of unions (in this case, each record is fairly small, but the schema is very large because there are so may fields total, but each record only populates a small number of them)

If we take advantage of knowing about unions in parquet-core, the read state machine / record assembly could skip a huge amount of asking column readers "are you null?" when it knows that they will all be null due to the fact that they are children of a union and that only one of the branches will ever contain non-nulls. Does that make sense?

```
// Union<String, Integer, Boolean> (where the value of the union may be null)
// at most one of either String, Integer or Boolean is non-null
// if they are all null then the field my_union itself must be null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already stored in the definition levels of each branch as well though right? So there's no need to check even.

```
The definition level of the UNION group is used to differentiate a null value (the union was null to start with) from a projection that excludes the non-null field.
If the Union group is null then the value was null.
If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I understand now, yes we can use this to tell the difference, but unfortunately we can't really tell the user what kind of branch this was, which I don't think is great.

If the Union group is null then the value was null.
If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected.

- If - despite the spec - a group instance contains more than one non-null field the behavior is undefined and may change depending on the projection applied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we define this as it will throw an exception in parquet-core? It should be completely detectable, any reason to leave it undefined instead of explicitly fatal?

lekv pushed a commit to lekv/parquet-format that referenced this pull request Jul 31, 2017
This PR implements
1) PARQUET-505: Column reader should automatically handle large data pages
2) Adds support for Serialization
3) Test case for Serialization and Deserialization
4) Test case for SerializedPageReader and PARQUET-505

Author: Deepak Majeti <[email protected]>

Closes apache#44 from majetideepak/PARQUET-505 and squashes the following commits:

4f754ba [Deepak Majeti] changed type of page header size defaults
4345812 [Deepak Majeti] PARQUET-505: Column reader should automatically handle large data pages
@dbtsai
Copy link
Member

dbtsai commented Jan 25, 2019

@rdblue @julienledem We're looking to use union type in my company, and I found this JIRA. Wondering the status of this PR and why it's not merged in the end? Thanks!

@rdblue
Copy link
Contributor

rdblue commented Jan 25, 2019

@dbtsai, the problem with Union is that its behavior isn't well-defined. It is difficult to decide what is correct, and support in processing engines is bad. I don't think it is a good idea to add them when you can instead get predictable behavior using optional objects. (Also, see the discussion on the Iceberg spec.)

@matlarsen
Copy link

Hi @rdblue @julienledem, I'm working with @dbtsai on this feature. We are scoping out support of UNION through AVRO -> Spark -> Parquet and its interesting to read this spec and the concerns that come out of it.

  • Some concerns here seem to revolve around the user-understood semantics of querying fields in UNIONS; I'd suggest that these semantics may already be solidified via other paradigms; for instance a SparkSQL selector that selects fields from an optional nested type will return null if the parent type does not exist.
  • It seems a little chicken-and-egg in terms of tooling support; in relation to the above could these behaviors be codified in the spec? It seems to me that the starting point of adding UNION support in the tooling would be to ensure that the formats support it. In terms of Parquet and interop with other formats this behavior is already undefined somewhat (in that the conversion code decides the embedding of UNIONS from one format to another); moving this behavior into the spec/format I think would help to close the gap on this slightly undefined behavior.

Please let me know your thoughts!

@wesm
Copy link
Member

wesm commented Feb 1, 2019

I'd be interested in having unions in the Parquet format. It would have to annotate a STRUCT having a first field that is an INT32 indicating which of the subsequent fields should be selected for each row in the dataset. Other fields can simply be null when they are unselected in the union

@julienledem
Copy link
Member Author

I have not been active on this recently. If someone wants to push this to the finish line they should feel free to take over this PR

@codeinred
Copy link

What needs to be done to push it over the finish line?

@Bernolt
Copy link

Bernolt commented Jun 8, 2024

Hello @julienledem @xhochy @isnotinvain @rdblue,
Are there any plans to integrate this change?

@raj-nimble
Copy link

raj-nimble commented Aug 6, 2024

Any estimate of when this might merge? In general, having Union types available would be very useful, especially coming from libraries that are Rust-based, as the Rust enum type with variants containing data is very common, and right now none of the crates leveraging parquet can write out structures containing enums.

They all link to apache/arrow-rs#73 which then links here.

If there is anything I can do to assist in getting this feature merged, I would be happy to help.

@wgtmac
Copy link
Member

wgtmac commented Aug 6, 2024

@Bernolt @raj-nimble Sorry for late reply. It seems this proposal has been sleeping for years. As a convention, the Parquet community requires a formal vote on the [email protected] with two reference implementations (parquet-java and another, usually parquet-cpp) to move forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.