[BEAM-6701] Add logical types to schema #7865

reuvenlax · 2019-02-17T18:59:50Z

Beam schemas define a limited number of fundamental types, and often users of schemas have slightly different needs for field types. An example is Beam SQL: SQL has many different date/time/timestamp types, while Beam has only the one DATETIME field type. Today SQL stuffs magic strings into the FieldType metadata to identify which type it is really using, which is quite ad hoc.

Logical types introduce a principled way of doing this. A new LogicalType class can be defined that uses one of the fundamental Beam field types as storage. The user can then add this logical type to their schema, and Beam will use the underlying field type and record the logical type in the field as well. Logical types have globally unique identifiers that are understood by the system, so custom types in schemas no longer depend on ad hoc metadata strings.

This PR adds LogicalType and converts Avro and Beam SQL over to the new framework. Follow-on PRs will add CoderLogicalType (so any existing Coder can be used as a Schema field) as well as adding logical-type support to POJOs.
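
A rough sketch of the idea described above, not the Beam interface itself: a logical type names itself with an identifier and converts between the user-facing Java type and the Java type of the underlying storage field. The names `SketchLogicalType` and `EpochDayDate`, the identifier string, and the choice of `java.time.LocalDate` stored as a `Long` are all assumptions made for illustration.

```java
import java.time.LocalDate;

// Hypothetical stand-in for the LogicalType described in this PR; not the Beam API.
interface SketchLogicalType<InputT, BaseT> {
  // Identifier that is meant to be globally unique for this logical type.
  String getIdentifier();

  // Converts the user-facing value to the value stored in the base field.
  BaseT toBaseType(InputT input);

  // Converts the stored base value back to the user-facing value.
  InputT toInputType(BaseT base);
}

// Illustrative logical type: a calendar date stored as a 64-bit day count.
final class EpochDayDate implements SketchLogicalType<LocalDate, Long> {
  @Override
  public String getIdentifier() {
    return "example:epoch_day_date"; // made-up identifier
  }

  @Override
  public Long toBaseType(LocalDate input) {
    return input.toEpochDay();
  }

  @Override
  public LocalDate toInputType(Long base) {
    return LocalDate.ofEpochDay(base);
  }
}
```

The two conversion methods are inverses of each other, so a value written through `toBaseType` and read back through `toInputType` is unchanged.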

reuvenlax · 2019-02-17T19:02:02Z

R: @amaliujia

ryan-williams · 2019-02-17T19:50:17Z

I think we are seeing the same Java_Examples_Dataflow failures: #1753 (on my #7823) and #1754 here.

> Task :beam-runners-google-cloud-dataflow-java-examples-streaming:preCommit

org.apache.beam.examples.WordCountIT > testE2EWordCount FAILED
    java.lang.RuntimeException at WordCountIT.java:69

> Task :beam-runners-google-cloud-dataflow-java-examples:preCommitLegacyWorker

org.apache.beam.examples.WindowedWordCountIT > testWindowedWordCountInStreamingStaticSharding FAILED
    java.lang.RuntimeException at WindowedWordCountIT.java:188

Build scan doesn't appear to have captured anything useful

Think it's broken independently of our PRs?


reuvenlax · 2019-02-17T20:25:00Z

It appears that an exception is being thrown from StreamingDataflowWorker.java:1952. The following line is failing with NullPointerException: this.transformUserNameToStateFamily = ImmutableMap.copyOf(transformUserNameToStateFamily); I'm adding drieber who authored this commit. Reuven


reuvenlax · 2019-02-17T20:27:23Z

The commit that added this appears to be 6571c83#diff-3129049a776d7e086c73789d6772a9b6


ryan-williams · 2019-02-17T20:51:20Z

(cc @drieber, cf. above, any ideas?)

ryan-williams · 2019-02-17T21:11:28Z

btw @reuvenlax where did you get this info?

It appears that an exception is being thrown from StreamingDataflowWorker.java:1952.
The following line is failing with NullPointerException:
this.transformUserNameToStateFamily = ImmutableMap.copyOf(transformUserNameToStateFamily);

drieber · 2019-02-18T00:44:27Z

Sorry about this. My PR did not properly handle the case when the proto field is not set. I will send a fix within an hour or so.

amaliujia

The implementation generally looks good. Can we have some tests on logical type construction and equality checking?

amaliujia · 2019-02-19T09:23:20Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java

    /** Returns a copy of the descriptor with metadata set. */
-    public FieldType withMetadata(@Nullable byte[] metadata) {
-      return toBuilder().setMetadata(metadata).build();
+    public FieldType withMetadata(String key, byte[] metadata) {


As metadata is a Map now, can we have a withMetadata(Map<String, byte[]>) as well?

Are some changes not pushed? I'm not seeing the withMetadata(Map) implementation, though I can see the comment is updated.

Sorry, it got messed up. Fixed now.
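
For readers following this thread, here is a minimal sketch of what a keyed `withMetadata` plus a whole-map overload could look like on an immutable value. It is not the code from this PR: `FieldTypeSketch` and `getMetadata` are hypothetical names, and the real `FieldType` uses a builder rather than a plain constructor.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable holder for tagged metadata; not Beam's FieldType.
final class FieldTypeSketch {
  private final Map<String, byte[]> metadata;

  FieldTypeSketch() {
    this(Collections.emptyMap());
  }

  private FieldTypeSketch(Map<String, byte[]> metadata) {
    // Copy the map so later changes to the caller's map do not leak in.
    // The byte arrays themselves are not copied in this sketch.
    this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
  }

  // Returns a copy with one metadata entry added or replaced.
  FieldTypeSketch withMetadata(String key, byte[] value) {
    Map<String, byte[]> copy = new HashMap<>(metadata);
    copy.put(key, value);
    return new FieldTypeSketch(copy);
  }

  // Returns a copy whose metadata is exactly the given map.
  FieldTypeSketch withMetadata(Map<String, byte[]> newMetadata) {
    return new FieldTypeSketch(newMetadata);
  }

  // Returns the value stored under the key, or null if absent.
  byte[] getMetadata(String key) {
    return metadata.get(key);
  }
}
```

Both overloads return a new object, so the receiver is never modified.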

amaliujia · 2019-02-19T09:36:33Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java

+   * #toBaseType} and {@link #toInputType} should convert back and forth between the Java type for
+   * the LogicalType (InputT) and the Java type appropriate for the underlying base type (BaseT).
+   *
+   * <p>{@link #getIdentifier} must define a globally unique identifier for this LogicalType. A


It seems that there is no verification of the global uniqueness of identifiers. What is the effect if identifiers are not unique? (It seems to me that there is no effect.)

Good question: I originally had a registry for LogicalTypes (every type had to be registered, and registration checked uniqueness). I removed it from this PR because I was trying to keep the PR smaller. If you think it's better I can add it back in, or I could keep it for a future PR.

Since the identifier is the identity of the logical type, Beam code is allowed to check the identifier to see what the type is; if multiple types had the same identifier, this logic would break. In fact SQL does this in several places, and if a different LogicalType had the same identifier as the ones registered in CalciteUtils, things would break in strange ways.

Thanks. I didn't notice that the identifiers are used as keys in a Map. That explains the uniqueness requirement.

Leaving it to a future PR is definitely OK.
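
The registration step that was left out of this PR could look roughly like the following. This is only a sketch under assumptions: `LogicalTypeRegistrySketch`, `register`, and `lookup` are hypothetical names, and logical types are represented as plain `Object` values to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical registry that rejects duplicate identifiers; not part of this PR.
final class LogicalTypeRegistrySketch {
  private final Map<String, Object> typesByIdentifier = new HashMap<>();

  // Registers a logical type, failing if the identifier is already taken.
  void register(String identifier, Object logicalType) {
    Objects.requireNonNull(identifier, "identifier");
    Objects.requireNonNull(logicalType, "logicalType");
    Object previous = typesByIdentifier.putIfAbsent(identifier, logicalType);
    if (previous != null) {
      throw new IllegalArgumentException(
          "Duplicate logical type identifier: " + identifier);
    }
  }

  // Returns the registered type, or null if the identifier is unknown.
  Object lookup(String identifier) {
    return typesByIdentifier.get(identifier);
  }
}
```

With a check like this, code that switches on an identifier can rely on it naming exactly one type.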

amaliujia · 2019-02-19T09:55:01Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/LogicalTypes.java

+    }
+
+    @Override
+    public byte[] toInputType(byte[] base) {


I don't quite understand the intention of toBaseType and toInputType in LogicalType, especially since this implementation allows a partial copy while toBaseType checks that input.length equals byteArraySize.

Why does toInputType allow base.length <= byteArraySize rather than strictly checking base.length == byteArraySize?

So, the main point is to actually convert types. For example, imagine we got rid of DATETIME as a primitive schema type and replaced it with a LogicalType (which we may do). In this case the underlying Beam type would be INT64, but we want users to be able to use Joda DateTime objects. In that case, the logical type would have the following signature:

class DateTimeType implements LogicalType<DateTime, Long>

But since the user is passing in DateTime objects, we need to know how to convert them back and forth to the Long object that Row expects for an INT64. So you would also override:

Long toBaseType(DateTime dateTime) {
  return dateTime.getMillis();
}

DateTime toInputType(Long millis) {
  return new DateTime(millis);
}

In this particular case we use toBaseType to verify the input, since the Java types are the same. However, we also allow passing in an array that is smaller than the fixed size, in which case we extend it (possibly we should do this in both code paths, though).

Gotcha. Thanks for the explanation. So these two functions are utility functions for conversions.
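
The fixed-size byte-array behavior discussed in this thread can be sketched as follows. This is an illustration of the described semantics rather than the code under review: `FixedBytesSketch` is a hypothetical name, and padding with trailing zero bytes is an assumption about how a shorter array is extended.

```java
import java.util.Arrays;

// Hypothetical fixed-size bytes type; input and base Java types are both byte[].
final class FixedBytesSketch {
  private final int byteArraySize;

  FixedBytesSketch(int byteArraySize) {
    this.byteArraySize = byteArraySize;
  }

  // Only verifies the input, since no Java type conversion is needed.
  byte[] toBaseType(byte[] input) {
    if (input.length != byteArraySize) {
      throw new IllegalArgumentException(
          "Expected " + byteArraySize + " bytes, got " + input.length);
    }
    return input;
  }

  // Accepts a shorter array and extends it to the fixed size with zeros.
  byte[] toInputType(byte[] base) {
    if (base.length > byteArraySize) {
      throw new IllegalArgumentException(
          "Expected at most " + byteArraySize + " bytes, got " + base.length);
    }
    return base.length == byteArraySize ? base : Arrays.copyOf(base, byteArraySize);
  }
}
```

The asymmetry between the two methods is the point raised in the review: one path is strict, the other pads.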

reuvenlax · 2019-02-19T17:32:57Z

Also added unit test coverage for schema equality in the case of logical types.

amaliujia · 2019-02-19T18:23:10Z

LGTM assuming tests pass

kennknowles · 2019-02-20T14:36:04Z

Looks like this broke SQL postcommits

apilloud · 2019-02-22T22:23:01Z

...va/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java

      Expression field = Expressions.call(expression, getter, Expressions.constant(index));
-      if (fromType.getTypeName().isDateType()) {
+      if (fromType.getTypeName().isLogicalType()) {
        field = Expressions.call(field, "getMillis");


This isn't going to work on just any LogicalType, only on date types. Something is broken here.

I think this should be inside the following if clauses (or alternatively wrapped with CalciteUtils.isDateTimeType). Do we have no tests that cover the CHAR type?
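
The guard suggested here can be illustrated with a small sketch. It is not the actual fix and does not reproduce CalciteUtils: `MillisUnwrapSketch`, `isDateTimeIdentifier`, `shouldCallGetMillis`, and the identifier strings are all hypothetical, and the sketch covers only the logical-type branch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical guard: unwrap to millis only for date/time logical types.
final class MillisUnwrapSketch {
  // Made-up identifiers standing in for the SQL date/time logical types.
  private static final Set<String> DATE_TIME_IDENTIFIERS =
      new HashSet<>(Arrays.asList("example:date", "example:time", "example:timestamp"));

  // True if the identifier names one of the date/time logical types.
  static boolean isDateTimeIdentifier(String identifier) {
    return DATE_TIME_IDENTIFIERS.contains(identifier);
  }

  // A logical type such as CHAR must not be routed through getMillis.
  static boolean shouldCallGetMillis(boolean isLogicalType, String identifier) {
    return isLogicalType && isDateTimeIdentifier(identifier);
  }
}
```

Under this check a CHAR logical type falls through to the other branches instead of being treated as a date.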

reuvenlax added 8 commits February 17, 2019 12:23

Change schema field metadata to be tagged.

f0aaa5b


Move AvroSchema over to logical types.

332dc0b

More stuff.

c1e03e0

Convert custom SQL date/time types to logical types.

9e974d9

Add CHAR type, and fix handling of Map keys.

83531cc

Add argument and javadoc.

383d3aa

Fix checks.

7d6edf7

Fix CheckStyle.

8911133

ryan-williams mentioned this pull request Feb 17, 2019

Make use of per-computation maps of transform username to state family #7846

Merged

3 tasks

ryan-williams mentioned this pull request Feb 18, 2019

Fix NPE in ComputationState constructor introduced by PR/7846 #7869

Merged

3 tasks

amaliujia reviewed Feb 19, 2019


Address comments.

7a7a88c

reuvenlax force-pushed the schema_logical_type branch from 50c5f8f to 7a7a88c Compare February 19, 2019 17:32

reuvenlax added 2 commits February 19, 2019 10:39

Fix javadoc.

73a8284

Add back withMetadata.

477145b

reuvenlax merged commit b2fa119 into apache:master Feb 19, 2019

apilloud reviewed Feb 22, 2019


Juta pushed a commit to Juta/beam that referenced this pull request Feb 25, 2019

Merge pull request apache#7865: [BEAM-6701] Add logical types to schema

fcfadca

This was referenced Sep 26, 2022

Support DECIMAL logical type in python SDK #23014

Merged

[Feature Request]: Design and Implement CoderLogicalType #23374

Open
