Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-2359: Support Logical Types in C# #492

Merged
merged 15 commits into from
Feb 5, 2020

Conversation

timjroberts
Copy link
Contributor

@timjroberts timjroberts commented Mar 28, 2019

This PR brings in C# support for logical types.

The Schema, Readers and CodeGen elements have been completed, and the "date" logical type has an implementation. As a reference, the logical types implemented in the Java implementation will be moved across.

@mhowlett
Copy link

mhowlett commented Apr 3, 2019

On first glance, this is looking great to me. I'm going to review it carefully over in the @confluentinc fork confluentinc#14 and get it merged there as we see a lot of demand for this.

side note: @confluentinc started a fork of this project since it allowed us to move fast on avro support in the .net kafka client. the big changes we've made relate to changing all the project files / references / etc for modern .NET, which would cause problems for any existing users of this library. That said, I believe it would be best to get our fork merged up into the official avro repo for the greatest net benefit and also ideally, @confluentinc would prefer to be not maintaining a separate fork (we'd prefer to contribute back to the official repo).

@timjroberts
Copy link
Contributor Author

timjroberts commented Apr 4, 2019

Fantastic. I think the code around the Schema changes and the reader/writer side is OK. I think the implementation of the logical types themselves will need more careful review because I may be a little naive with the implementations compared to what the 1.8 spec says.

I've just noticed that I haven't done duration so I'll try and get that one done too because it's just an encoded TimeSpan. I notice that the Java Avro implementation also has uuid based on string. Could also possibly bring that in, but the spec doesn't define it.

@nandorKollar
Copy link
Contributor

This sounds like a great addition, however I have almost zero C# experience, so I'm afraid I can't provide meaningful feedback about the implementation. Unfortunately I don't know who'd be able to review this PR from the Avro community, who is maintaining the C# part. Would you mind sending an email to the dev list to (hopefully) get more attention and call for review? Also, I see that there are several test failures, which looks related to this PR, could you please have a look at those?

@timjroberts
Copy link
Contributor Author

timjroberts commented Apr 24, 2019

Thanks @nandorKollar. These changes are now being reviewed via the Confluent fork: confluentinc#14
They maintain the C# side of the Kafka and Avro libraries.

@CurtHagenlocher
Copy link

FWIW, not all Avro durations can be represented as C# TimeSpan because the duration includes a number of months.

@timjroberts
Copy link
Contributor Author

You're right @CurtHagenlocher. The Avro specification for a duration does include months and days. The specification also implies that it is relative to a 'moment in time from which the duration is measured'. That would make mapping it difficult because you would need to know what that relation is. However, maybe we could include a new AvroDuration type into the library to capture the same intent.

@adprit
Copy link

adprit commented Oct 22, 2019

@timjroberts is the intention still to merge this PR?

@timjroberts
Copy link
Contributor Author

I'd like to see it merged @adprit.

@Fokko Fokko requested a review from blachniet October 23, 2019 10:42
@shawnallen85
Copy link

Any update on supporting this?

@Fokko
Copy link
Contributor

Fokko commented Oct 23, 2019

I'm open to merge this, but currently there are merge conflicts that need to be resolved. Also, I have very little experience with C# or .Net, so I'd like to have @blachniet his opinion on this :-)

Also, it would be great to give the code a try and check if it works for you as is.

@shawnallen85
Copy link

So I'm not a GitHub expert, but I believe I reconciled the differences here: https://github.com/shawnallen85/avro ... what needs to be done to bring them together?

@shawnallen85
Copy link

To provide some additional bits -- the differences were mainly around syntactic sugar, XML documentation, and the Logical schema implementation. The repo I listed above takes this PR, updates the documentation bits and syntactic sugar.

@blachniet
Copy link
Member

This will be a great addition to the C# library! I'll start taking a look at this during the coming weekend.

@timjroberts
Copy link
Contributor Author

I'd look at confluentinc#14. This is where I completed the logical type work after realizing that we were using the Confluent fork of Avro. It is my understanding that Confluent are now looking at merging their fork back into Apache's and presumably that would include my PR.

@mhowlett
Copy link

@timjroberts update on this - @confluentinc would ideally like to migrate to the official apache avro packages, and discontinue our fork. this is under discussion here: confluentinc/confluent-kafka-dotnet#1033

@timjroberts
Copy link
Contributor Author

I may some time over the weekend. I'll pull down the latest Avro branch, and then look at replaying the commits that were made in the Confluent fork over the top. Hopefully as suggested, any conflicts will be easily resolvable.

@timjroberts
Copy link
Contributor Author

This PR is now identical to confluentinc#14.

@Fokko
Copy link
Contributor

Fokko commented Oct 28, 2019

Thanks @timjroberts for the contribution. Great to see Confluent merging back to Avro master.

However, the CI is still sad. It looks like there is a file without a (proper) license:

[WARNING] Files with unapproved licenses:
  lang/csharp/src/apache/main/AvroDecimal.cs

@timjroberts
Copy link
Contributor Author

I've added the missing license. However, it looks like the solution has 'Warnings as Errors' turned on. I'll need to go through the AvroDecimal class and add the the XML Docs as well as address the other warnings that are being picked up.

@timjroberts timjroberts force-pushed the AVRO-2359-logical-types branch 2 times, most recently from 9f2fbe9 to 29397bf Compare October 29, 2019 12:23
/// </summary>
/// <param name="value">The <see cref="AvroDecimal"/>.</param>
/// <returns>A byte.</returns>
public static byte ToByte(AvroDecimal value)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why expose all these static methods (in addition to the cast operators). Is there any precedent in the std lib? Doesn't seem consistent with how other numeric types work afaict. It seems to me it would be better to consolidate all of these into just the cast operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these are the implementation of IConvertible. VB.NET doesn't invoke the conversion operators and instead relies on the type implementing this interface.

/// </summary>
/// <param name="value">The byte.</param>
/// <returns>An <see cref="AvroDecimal"/>.</returns>
public static AvroDecimal FromByte(byte value)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the same with the froms (just have the cast operators), unless there's a compelling reason to have them I can't think of. as a general rule, smaller api surface area is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. If I remove them you get CA2225.

Happy to suppress?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to you i think.

@mhowlett
Copy link

mhowlett commented Feb 4, 2020

this is looking good to me now. the last time I looked through this i did it quite closely and I think that feedback has all been covered off now. what i'm going to do next (hopefully get it done over the next few days) is to integrate this with the confluent kafka serdes, and do some integration tests with java. i'm just going to do this on an adhoc basis since multiple languages are going to make it difficult to set up properly. after that, i'll give my thumbs up, fwiw.

@Fokko
Copy link
Contributor

Fokko commented Feb 4, 2020

Would be great @mhowlett, please keep us updated!

{ TimeMillisecond.LogicalTypeName, new TimeMillisecond() },
{ TimeMicrosecond.LogicalTypeName, new TimeMicrosecond() },
{ TimestampMillisecond.LogicalTypeName, new TimestampMillisecond() },
{ TimestampMicrosecond.LogicalTypeName, new TimestampMicrosecond() }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pity there's no support for fixed types (but the gaps shouldn't prevent this pr being merged).

public override object ConvertToBaseValue(object logicalValue, LogicalSchema schema)
{
var date = ((DateTime)logicalValue).ToUniversalTime();
return (long)((date - UnixEpochDateTime).TotalMilliseconds * 1000);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looses more precision than required. Better to use Ticks in this calculation which have 0.1ms precision. It's a pity using DateTime necessitates loosing microsecond precision, but if someone really wants that they can use the underlying type.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry! the problem is not here, it's in ConvretToLogicalValue.

/// <inheritdoc/>
public override object ConvertToLogicalValue(object baseValue, LogicalSchema schema)
{
var noMs = (long)baseValue / 1000;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looses precision. the way to do this i think is also take baseValue module 1000, work out the fractional ms from that and then add this to noMs. if you use double there may be precision issues.

if (time > _maxTime)
throw new ArgumentOutOfRangeException(nameof(logicalValue), "A 'time-millis' value can only have the range '00:00:00' to '23:59:59'.");

return (int)(time - UnixEpochDateTime.TimeOfDay).TotalMilliseconds;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the time-millis logical type represents a time of day, with no reference to a particular calendar, time zone or date.

by subtracting unix time-of-day you're implying both that:

  1. the stored time of day is in reference to universal time.
  2. the passed in timespan is in reference to the timezone.

I think subtracting unix time of day here is both very confusing and misinterpreting the spec.

I assume/hope the java implementation doesn't work like this!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that UnixEpochDateTime.TimeOfDay is 00:00:00 so the value won't be incorrect, but the logic is.

if (time > _maxTime)
throw new ArgumentOutOfRangeException(nameof(logicalValue), "A 'time-micros' value can only have the range '00:00:00' to '23:59:59'.");

return (long)(time - UnixEpochDateTime.TimeOfDay).TotalMilliseconds * 1000;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment in TimeMillisecond.

/// <summary>
/// The date and time of the Unix Epoch.
/// </summary>
protected static readonly DateTime UnixEpochDateTime = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be DateTimeKind.Unspecified.

The spec says:

The timestamp-millis logical type represents an instant on the global timeline, independent of a particular time zone or calendar,

by specifying a kind of UTC, the returned values are explicitly not independent of a particular timezone.

the spec does go on to say:

where the long stores the number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC

(i.e. includes time zone in the epoch), but I don't think that is intended or what we want.

@mhowlett
Copy link

mhowlett commented Feb 5, 2020

I've now sanity checked this by reading data written with java (via kafka) and vice versa for each of the logical types (including some edge cases). A few more comments related to time types to address.

It seems using the 'fixed' underlying type is not supported yet. I suspect people will want to use the Duration logical type in particular, so it would be good if this could be added to this PR, but I don't see that as a prerequisite for merging.

In the interests of time, I didn't set my tests up in a reproducible way, but I'm satisfied enough that we'll be happy to link to this in Confluent.Kafka instead of our fork once released.

Logical type support in .NET would have to be the most requested feature we see at Confluent - if there's any chance of getting this in 1.9.2 we'd be very happy :-).

@Fokko
Copy link
Contributor

Fokko commented Feb 5, 2020

Thanks for the effort @timjroberts. Let's move this PR forward as is then. Thanks for the testing and review @mhowlett.

@Fokko Fokko merged commit 75e8f58 into apache:master Feb 5, 2020
Fokko added a commit that referenced this pull request Feb 5, 2020
* AVRO-2359: Support Logical Types in CSharp

* fix more warnings

* remove timezone change in TestTimestamp tests

* update logical Date and Time tests to be local time zone aware

* ensure TimeZone is set relative to a source TimeZone

* reparse Dates assuming local if they're not UTC

* try setting specific TimeZone offset

* actions for code review comments

* seperate the Timestamp* tests

* add missing license to AvroDecimalTests

* modify LogicalType to return .NET type rather than string

* add Register method to LogicalTypeFactory

* Fixup code-review comments

* Make Sign internal and add xmldocs

Co-authored-by: Fokko Driesprong <[email protected]>
@Fokko
Copy link
Contributor

Fokko commented Feb 5, 2020

I've cherry-picked this onto the 1.9.2 branch @blachniet @RyanSkraba

@mhowlett
Copy link

mhowlett commented Feb 5, 2020

thanks @Fokko !

@timjroberts - if you want to do a PR addressing the last round of comments, that'd be great (else I'm happy to do it). i think they should be addressed, but don't need to get them in 1.9.2.

@Fokko
Copy link
Contributor

Fokko commented Feb 6, 2020

Ryan published the first RC of 1.9.2: http://mail-archives.apache.org/mod_mbox/avro-dev/202002.mbox/browser

Would be great if y'all could check the .Net artifacts :)

@timjroberts
Copy link
Contributor Author

@mhowlett I'm happy for you to pick those final pieces up if it'll move things along quicker. I can't commit to things in the evenings at the moment. Please do let me know if you need anything from me though.

Thanks for all the help in getting this. Much appreciated.

@mhowlett
Copy link

mhowlett commented Feb 6, 2020

can confirm I've successfully serialized and deserialized a record with a logical type field using the nuget package from the above reference.

re: additional changes, I'll open a PR in due course.

@JDLessl
Copy link

JDLessl commented Apr 2, 2020

Is there some trick to using logical types as part of a record? The testing works when the entire schema consists of, for instance, a int/date type, but when it's just one field of many, the schema parser consistently ignores the logicalType annotation and treats the field as a pure int.

Looked in the code, should Schema.cs#206 be passing jtok instead of jtype to LogicalSchema.NewInstance()?

@DannyBoyk
Copy link

DannyBoyk commented Dec 15, 2023

EDIT - False alarm, sorry! Regenerated our classes with v1.11.3 and then somehow reverted the nuget update, so I was trying to serialize with v1.9.1 writer.

@mhowlett - I'm trying to update to Apache.Avro 1.11.3 in our C# project and am getting serialization issues with the logical type changes. A record that uses date now has its C# generated class with a DateTime instead of an int. When I go to serialize it, I get the error Avro.AvroException: 'System.Int32 required to write against Int schema but found System.DateTime'.

I feel like I might be missing something, but I'm not sure what. The generated schema has:
{ "type":"int", "logicalType":"date"}
So, I assumed the serialization code would handle the conversion itself. We are using a SpecificDatumWriter, in case that makes a difference.

Any thoughts or help is appreciated.

@KalleOlaviNiemitalo
Copy link
Contributor

@DannyBoyk can you post a sample program that hits the exception?

@DannyBoyk
Copy link

DannyBoyk commented Dec 15, 2023

Well, have to apologize, this was a PEBKAC error. I somehow managed to revert the new Avro nuget update after I did a code gen with the new package, so I was trying to serialize v1.11.3 C# Avro objects with a v1.9.1 writer. After reverting the revert, everything is working as it should. Sorry for the false alarm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build C++ Pull Requests for C++ binding C# website
Projects
None yet
Development

Successfully merging this pull request may close these issues.