Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

AVRO-2359: Support Logical Types in C# #14

Closed
wants to merge 9 commits into from
Closed

AVRO-2359: Support Logical Types in C# #14

wants to merge 9 commits into from

Conversation

timjroberts
Copy link

@timjroberts timjroberts commented Apr 2, 2019

This PR brings in C# support for logical types.

The code was originally based on the apache/avro repository (apache#492), but I've also applied the patch here for your consideration.

@mhowlett
Copy link

thanks - i should be able to look at this soon after we get 1.0 of the kafka client out the door (I have been focussed on that). I notice that there are more than 32000 lines of code though, most of which shouldn't be here, so at the very least i'll need to tidy it up and open again (which would loose attribution to you). is that ok, or do you want to tidy it up?

Also, I noticed we don't have the contributor license agreement bot set up on this repo, I think you can sign it by going here: http://clabot.confluent.io/cla (that just allows us to confidently do what we want with the code, probably ultimately merging upstream to the apache foundation).

@timjroberts
Copy link
Author

It looks like most of those 32K lines are project.assets.json files. Perhaps they're not in the .gitignore file. I'll take a look at tidying that up.

@timjroberts
Copy link
Author

I've also just complete the CLA.

@timjroberts
Copy link
Author

Around ~1K lines now. That seems more like it. :-)

@mhowlett
Copy link

great, thanks! i hope I can get this in 1.0.1.

@timjroberts
Copy link
Author

There were a few missing switch branches for testing the processing of a logicalType. These have now been added.

/// <param name="logicalValue">The logical value to convert.</param>
/// <param name="schema">The schema that represents the target of the conversion.</param>
/// <returns>An object representing the encoded value of the base type.</returns>
public override object ConvertToBaseValue(object logicalValue, LogicalSchema schema)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timjroberts I've been diving into avro decimal serialisation in a recent project, and I believe your conversion to/from decimals & bytes doesn't correctly account for 1) endianness and 2) the specific nature of 'valBits[3]' in the decimal type's data structure (according to the docs https://docs.microsoft.com/en-us/dotnet/api/system.decimal it contains a sign flag & the scale).

I've implemented some conversion code here: https://github.com/petersilverwood/avro-decimal-tools/tree/master/lang/csharp with some test fixtures to verify the same serialisation in C# and in java. There are probably some edge cases to handle and I haven't tested performance, but it should be good starting point.

Is this useful to integrate into your PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the implementations of the logical types were naive and likely not fully meeting with the current Avro specification. It was the one area where I needed more focused review. I'll take a look at your referenced. Many thanks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have some scripts in a local branch for bringing up an integration test environment in docker (multi broker kafka cluster, schema registry, monitoring etc). once that is in, it will be easy to get java <-> .net integration tests going. i have been thinking that is a pre-requisite for getting this PR merged. this is somewhere high on my priority list, but not at the top.

@timjroberts
Copy link
Author

@mhowlett - have you had any time to look at the interop scenarios that you mentioned above? Is there anything you need me to look at? I'm planning on looking at @petersilverwood's Decimal implementation this week - it's the logical type implementation that I'm most naive with.

@timjroberts
Copy link
Author

I've now reworked the Decimal logical type so that it takes into account the big-endian requirement for encoding, and also to support negative values.

cc/ @petersilverwood

[TestCase("123456789123456789.56")]
[TestCase("-123456789123456789.56")]
[TestCase("000000000000000001.01")]
[TestCase("-000000000000000001.01")]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding test cases for Decimal.MinValue and Decimal.MaxValue


private static byte[] GetUnscaledDecimalValue(decimal value, out int scale)
{
var buffer = new byte[12];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest making this new byte[13]. The internals of the decimal are an unsigned integer value, but BigInteger constructor is expecting a signed value. Therefore, easiest way to get BigInteger to interpret the resultant byte[] from this method is to ensure the final bye is 0x00. This should fix the Decimal.MinValue and Decimal.MaxValue test cases.


var paddedBuffer = new byte[12];

Buffer.BlockCopy(buffer, 0, paddedBuffer, 0, Math.Min(buffer.Length, 12));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If buffer.Length > 12 we can't correctly read the avro data into a decimal - so would it be better to raise an exception in that case?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timjroberts - thinking on this point some more - would it make sense to create a new type AvroDecimal which wraps a BigInteger and scale? This class could then include fromDecimal() and toDecimal() methods to bridge to the native decimal type for the majority of use cases, but still allow use with arbitrary size decimals as per the spec?

BitConverter.ToInt32(paddedBuffer, 0),
BitConverter.ToInt32(paddedBuffer, 4),
BitConverter.ToInt32(paddedBuffer, 8)
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure of this ground, but this is an area I think we have to treat big-endian architectures differently, e.g.:

          if (BitConverter.IsLittleEndian)
            {
                // as above
                
            }
            else
            {
                // On a BigEndian architecture, we need the byte order to be bigendian also before
                // interpreting as an int32. However, the array of int32[] passed to the constructor
                // of decimal should still be least significant byte first
                Array.Reverse(paddedBuffer);
                valBits = new int[]
                {
                    BitConverter.ToInt32(paddedBuffer, 8),
                    BitConverter.ToInt32(paddedBuffer, 4),
                    BitConverter.ToInt32(paddedBuffer, 0)                    
                };                
            }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be correct. Given where these bytes are now coming from their endianess will need to be considered. Good catch. Will update.

@timjroberts
Copy link
Author

I've added a new AvroDecimal type which encapsulates a big decimal. It's a wrapper around .NET's BigInteger and a scale. The 'decimal' logical type has then been refactored to use instances of this type instead of managing byte buffers directly. The logic was based on the 'decimal' logical type in the Java codebase, so they should work the same. However, saying that, I'm not sure about the two's-compliment side of things which the spec says the encoded decimal should be in.

/cc @petersilverwood

@timjroberts
Copy link
Author

So after a little reading, I think that because .NET byte is unsigned, we're already getting the two's compliment encoding of the decimal bytes. So I think we're fine. A Java byte is signed so I think it has to be converted before.

@dcrdev
Copy link

dcrdev commented Jul 24, 2019

Quick observation - this doesn't look to support schemas of mixed primitive and logical types, is that intended?

@timjroberts
Copy link
Author

@dcrdev - not quite sure what you mean, but you can define Records that use a mix of types (primitive, complex, and now logical). For example, the following schema uses a mix of primitive and the decimal logical type:

{
    "type": "record",
    "namespace": "WordCountEntities",
    "name": "WordCount",
    "fields": [
        {
            "name": "TotalWordCount",
            "type": "int"
        },
        {
            "name": "WordCounts",
            "type": { "type": "map", "values": "int" }
        },
        {
            "name": "TopWord",
            "type": "string"
        },
        {
            "name": "ExecutionCost",
            "type": {
                "type": "bytes",
                "logicalType": "decimal",
                "precision": 8,
                "scale": 2
            }
        }
    ]
}

@timjroberts
Copy link
Author

Any planned movement on this inclusion?

@mhowlett
Copy link

there's plenty of demand for this (it came up again yesterday with a customer), but it's not at the top of the list.

note: there's movement on getting rid of our fork, so this work will be done against the apache repo: confluentinc/confluent-kafka-dotnet#1033

@shawnallen85
Copy link

Any update on supporting this?

@mhowlett
Copy link

mhowlett commented Nov 1, 2019

@shawnallen85 - there's movement on this over here apache#492

@maxfield-riley
Copy link

maxfield-riley commented May 6, 2020

Do we have any movement on this tool?
Currently the avrogen tool generates
"type": { "logicalType": "date", "type": "int" }
as
"type": "int"
which leads to schema registry issues of
Schema mismatch. Reader: [schema] writer: [schema]

@thufailmattankil
Copy link

thufailmattankil commented Jul 20, 2020

Confluent.Apache.Avro.AvroGen 1.7.7.5 generates schema file (.avsc) data type based on "type" instead of "logicaltype". Below example generates datatype as Nullable int instead of Nullable DateTime . Could you please help.
"type": ["null", { "type": "int", "logicalType": "date" } ]

@alexander-karnadi
Copy link

hi, is there any plan to include this in a formal release?

@andrewegel
Copy link

This is causing our Jenkins CI to break due to the "unknown repository" - This needs to be closed to get around that. If this change is still required, please re-do it in a fork or branch.

@andrewegel andrewegel closed this Jan 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants