C-extension does allow customizing serialization #192

vamega · 2023-11-15T20:42:03Z

Things to check first

I have searched the existing issues and didn't find my bug already reported there
I have checked that my bug is still present in the latest release

cbor2 version

5.5.1

Python version

3.11

What happened?

The behaviour of the serializer differs between the C extension and the Python implementation of this library.

The Python implementation allows customizing the by updating the value of the _encoders map for known types. The C-implementation as an optimization does not use this map and defines it's own implementations of encoders for the known types. This comment on the original issue adding the C implementation shows that it was intended that the C implementation allow for the canonical parameter to be set to a value of 2 to opt into the slower codepath where the Python defined encoders were used.

The two changes below appear to be the cause of this:

Fix C module on big-endian problems #77 enforced that enc_style is only ever 0 or 1 (0 when initialized, 1 if the input is exactly 1).
This commit changed the parsing of the field to parse it as a predicate and not a byte.

A few questions here:

I recognize that modifying _encoders is not necessarily a supported interface for the library given how the variable is named? If that's not the case, is there no supported way to perform this substitution?
If allowing the C implementation to use the Python defined encoders, how should this be exposed? Go back to overloading the canonical parameter? Introduce an Enum as a value for that field?

How can we reproduce the bug?

The following test case should show the difference.
The python implementation will pass, the C implementation will fail.

import calendar

def test_encoder_replacement(impl):
    def to_epoch(dt: datetime) -> float:
        millis = int(dt.microsecond / 1000.0) / 1000.0
        return calendar.timegm(dt.astimezone(timezone.utc).timetuple()) + millis


    def encode_datetime(self, obj: datetime):
        self.encode_float(to_epoch(obj))


    dt = datetime(2013, 3, 21, 20, 4, 0, tzinfo=timezone.utc)
    expected = unhexlify("fb41d452d9ec000000")
    with BytesIO() as fp:
        encoder = impl.CBOREncoder(fp, canonical=2)
        encoder._encoders[datetime] = encode_datetime
        encoder.encode(dt)
        serialized = fp.getvalue()

The text was updated successfully, but these errors were encountered:

vamega · 2023-11-15T20:44:44Z

Curiously enough, this could have been missed in version 5.4.3 if people were accessing the encoder via cbor2.encoder instead of the top level module. The C extension didn't shadow namespaces other than the top level one.

agronholm · 2023-11-16T10:35:56Z

None of the serialization libraries I know of allow overriding the handling of built-in types. This was never supported, as you correctly deduced from the naming of the _encoders private variable. What's your use case?

vamega added the bug label Nov 15, 2023

vamega mentioned this issue Nov 15, 2023

POC: Allow customizing encoding of known types in the C-extension. Fixes #192 #193

Open

alexrashed mentioned this issue Jul 3, 2024

fix CBOR timestamp parsing localstack/localstack#11133

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C-extension does allow customizing serialization #192

C-extension does allow customizing serialization #192

vamega commented Nov 15, 2023

vamega commented Nov 15, 2023

agronholm commented Nov 16, 2023

C-extension does allow customizing serialization #192

C-extension does allow customizing serialization #192

Comments

vamega commented Nov 15, 2023

Things to check first

cbor2 version

Python version

What happened?

How can we reproduce the bug?

vamega commented Nov 15, 2023

agronholm commented Nov 16, 2023