Fix Protobuf msgidx serialization and added use.deprecated.format
edenhill committed Jan 6, 2022
1 parent f44d6ce commit 2ac0d72
Showing 6 changed files with 221 additions and 54 deletions.
38 changes: 38 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,9 @@

v1.8.2 is a maintenance release with the following fixes and enhancements:

- **IMPORTANT**: Added the mandatory `use.deprecated.format` configuration
property to `ProtobufSerializer` and `ProtobufDeserializer`.
See **Upgrade considerations** below for more information.
- **Python 2.7 binary wheels are no longer provided.**
Users still on Python 2.7 will need to build confluent-kafka from source
and install librdkafka separately, see [README.md](README.md#Prerequisites)
@@ -25,6 +28,41 @@ for a complete list of changes, enhancements, fixes and upgrade considerations.
**Note**: There were no v1.8.0 and v1.8.1 releases.


## Upgrade considerations

### Protobuf serialization format changes

Prior to this version, the confluent-kafka-python client had a bug where
nested Protobuf schema indexes were incorrectly serialized, causing
incompatibility with other Schema Registry Protobuf consumers and producers.

This has now been fixed, but since the old defective serialization and the new
correct serialization are mutually incompatible, users of
confluent-kafka-python need to explicitly choose which serialization format
to use during a transitional phase while old producers and consumers are
upgraded.
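
As an illustration of the incompatibility, here is a minimal standalone
sketch that mirrors the varint logic of this change. It is not the library's
code: the names `write_varint` and `encode_msg_index` are hypothetical, and
the nested message index `[1, 0]` is an assumed example value.

```python
from io import BytesIO


def write_varint(buf, val, zigzag=True):
    """Write val as a base-128 varint, optionally zig-zag encoding it first."""
    if zigzag:
        # Zig-zag maps signed ints onto unsigned ones: 0->0, -1->1, 1->2, 2->4
        val = (val << 1) ^ (val >> 63)
    while (val & ~0x7f) != 0:
        buf.write(bytes([(val & 0x7f) | 0x80]))
        val >>= 7
    buf.write(bytes([val]))


def encode_msg_index(indexes, zigzag=True):
    """Encode a message index as a length prefix followed by each index."""
    buf = BytesIO()
    if indexes == [0]:
        # The root message is a single zero byte in both formats.
        buf.write(b"\x00")
        return buf.getvalue()
    write_varint(buf, len(indexes), zigzag=zigzag)
    for idx in indexes:
        write_varint(buf, idx, zigzag=zigzag)
    return buf.getvalue()


nested = [1, 0]  # assumed example: a nested message type
print(encode_msg_index(nested, zigzag=False).hex())  # deprecated format: 020100
print(encode_msg_index(nested, zigzag=True).hex())   # corrected format: 040200
```

Only the root message index `[0]` encodes identically in both formats, which
is why the defect affects nested or non-first message types.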

The `ProtobufSerializer` and `ProtobufDeserializer` constructors now
both take a configuration dictionary that (for the time being) requires
the `use.deprecated.format` configuration property to be explicitly set.

Producers should be upgraded first. As long as there are old (<=v1.7.0)
Python consumers reading from the topics being produced to, the new (>=v1.8.2)
Python producers must be configured with `use.deprecated.format` set to `True`.

When all existing messages in the topic have been consumed by older consumers,
the consumers should be upgraded, and both the new producers and the new
consumers must set `use.deprecated.format` to `False`.


The requirement to explicitly set `use.deprecated.format` will be removed
in a future version and the setting will then default to `False` (new format).
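
The two rollout phases described above can be sketched as follows. The
helper `protobuf_format_conf` and its `old_clients_remaining` flag are
hypothetical names used only for illustration; the only real setting
involved is `use.deprecated.format`.

```python
def protobuf_format_conf(old_clients_remaining):
    """Build the conf dict for ProtobufSerializer / ProtobufDeserializer.

    old_clients_remaining (hypothetical flag): True while any <=v1.7.0
    Python producer or consumer is still using the topic.
    """
    return {'use.deprecated.format': bool(old_clients_remaining)}


# Phase 1: upgraded producers keep writing the old format for old consumers.
phase1_conf = protobuf_format_conf(True)

# Phase 2: every client is upgraded, so both sides switch to the new format.
phase2_conf = protobuf_format_conf(False)

print(phase1_conf)
print(phase2_conf)

# With confluent-kafka-python >=1.8.2 installed, the dict is passed as the
# last constructor argument, as in the updated examples of this release:
#   ProtobufSerializer(user_pb2.User, schema_registry_client, phase1_conf)
#   ProtobufDeserializer(user_pb2.User, phase2_conf)
```

Omitting the property entirely raises a `RuntimeError` in this release, so
the dictionary must always be passed explicitly.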






## v1.7.0

v1.7.0 is a maintenance release with the following fixes and enhancements:
6 changes: 4 additions & 2 deletions examples/protobuf_consumer.py
@@ -41,7 +41,8 @@
def main(args):
topic = args.topic

protobuf_deserializer = ProtobufDeserializer(user_pb2.User)
protobuf_deserializer = ProtobufDeserializer(user_pb2.User,
{'use.deprecated.format': False})
string_deserializer = StringDeserializer('utf_8')

consumer_conf = {'bootstrap.servers': args.bootstrap_servers,
@@ -62,7 +63,8 @@ def main(args):

user = msg.value()
if user is not None:
print("User record {}: name: {}\n"
print("User record {}:\n"
"\tname: {}\n"
"\tfavorite_number: {}\n"
"\tfavorite_color: {}\n"
.format(msg.key(), user.name,
7 changes: 4 additions & 3 deletions examples/protobuf_producer.py
@@ -74,7 +74,8 @@ def main(args):
schema_registry_client = SchemaRegistryClient(schema_registry_conf)

protobuf_serializer = ProtobufSerializer(user_pb2.User,
schema_registry_client)
schema_registry_client,
{'use.deprecated.format': False})

producer_conf = {'bootstrap.servers': args.bootstrap_servers,
'key.serializer': StringSerializer('utf_8'),
@@ -93,9 +94,9 @@ def main(args):
user = user_pb2.User(name=user_name,
favorite_color=user_favorite_color,
favorite_number=user_favorite_number)
producer.produce(topic=topic, key=str(uuid4()), value=user,
producer.produce(topic=topic, partition=0, key=str(uuid4()), value=user,
on_delivery=delivery_report)
except KeyboardInterrupt:
except (KeyboardInterrupt, EOFError):
break
except ValueError:
print("Invalid input, discarding record...")
171 changes: 147 additions & 24 deletions src/confluent_kafka/schema_registry/protobuf.py
@@ -19,6 +19,7 @@
import sys
import base64
import struct
import warnings
from collections import deque

from google.protobuf.message import DecodeError
@@ -110,11 +111,6 @@ def _create_msg_index(msg_desc):
if not found:
raise ValueError("MessageDescriptor not found in file")

# The root element at the 0 position does not need a length prefix.
if len(msg_idx) == 1 and msg_idx[0] == 0:
return [0]

msg_idx.appendleft(len(msg_idx))
return list(msg_idx)


@@ -169,6 +165,17 @@ class ProtobufSerializer(object):
| | | Schema Registry subject names for Schema References |
| | | Defaults to reference_subject_name_strategy |
+-------------------------------------+----------+------------------------------------------------------+
| ``use.deprecated.format`` | bool | Specifies whether the Protobuf serializer should |
| | | serialize message indexes without zig-zag encoding. |
| | | This option must be explicitly configured as older |
| | | and newer Protobuf producers are incompatible. |
| | | If the consumers of the topic being produced to are |
| | | using confluent-kafka-python <1.8 then this property |
| | | must be set to True until all old consumers |
| | | have been upgraded. |
| | | Warning: This configuration property will be removed |
| | | in a future version of the client. |
+-------------------------------------+----------+------------------------------------------------------+
Schemas are registered to namespaces known as Subjects which define how a
schema may evolve over time. By default the subject name is formed by
@@ -208,17 +215,27 @@ class ProtobufSerializer(object):
__slots__ = ['_auto_register', '_use_latest_version', '_skip_known_types',
'_registry', '_known_subjects',
'_msg_class', '_msg_index', '_schema', '_schema_id',
'_ref_reference_subject_func', '_subject_name_func']
'_ref_reference_subject_func', '_subject_name_func',
'_use_deprecated_format']
# default configuration
_default_conf = {
'auto.register.schemas': True,
'use.latest.version': False,
'skip.known.types': False,
'subject.name.strategy': topic_subject_name_strategy,
'reference.subject.name.strategy': reference_subject_name_strategy
'reference.subject.name.strategy': reference_subject_name_strategy,
'use.deprecated.format': False,
}

def __init__(self, msg_type, schema_registry_client, conf=None):

if conf is None or 'use.deprecated.format' not in conf:
raise RuntimeError(
"ProtobufSerializer: the 'use.deprecated.format' configuration "
"property must be explicitly set due to backward incompatibility "
"with older confluent-kafka-python Protobuf producers and consumers. "
"See the release notes for more details")

# handle configuration
conf_copy = self._default_conf.copy()
if conf is not None:
@@ -238,6 +255,19 @@ def __init__(self, msg_type, schema_registry_client, conf=None):
if not isinstance(self._skip_known_types, bool):
raise ValueError("skip.known.types must be a boolean value")

self._use_deprecated_format = conf_copy.pop('use.deprecated.format')
if not isinstance(self._use_deprecated_format, bool):
raise ValueError("use.deprecated.format must be a boolean value")
if self._use_deprecated_format:
warnings.warn("ProtobufSerializer: the 'use.deprecated.format' "
"configuration property, and the ability to use the "
"old incorrect Protobuf serializer heading format "
"introduced in confluent-kafka-python v1.4.0, "
"will be removed in an upcoming release in 2022 Q2. "
"Please migrate your Python Protobuf producers and "
"consumers to 'use.deprecated.format':False as "
"soon as possible")

self._subject_name_func = conf_copy.pop('subject.name.strategy')
if not callable(self._subject_name_func):
raise ValueError("subject.name.strategy must be callable")
@@ -263,20 +293,46 @@ def __init__(self, msg_type, schema_registry_client, conf=None):
schema_type='PROTOBUF')

@staticmethod
def _encode_uvarints(buf, ints):
def _write_varint(buf, val, zigzag=True):
"""
Writes val to buf, either using zigzag or uvarint encoding.
Args:
buf (BytesIO): buffer to write to.
val (int): integer to be encoded.
zigzag (bool): whether to encode in zigzag or uvarint encoding
"""

if zigzag:
val = (val << 1) ^ (val >> 63)

while (val & ~0x7f) != 0:
buf.write(_bytes((val & 0x7f) | 0x80))
val >>= 7
buf.write(_bytes(val))

@staticmethod
def _encode_varints(buf, ints, zigzag=True):
"""
Encodes each int as a varint onto buf, using zigzag or uvarint encoding.
Args:
buf (BytesIO): buffer to write to.
ints ([int]): ints to be encoded.
zigzag (bool): whether to encode in zigzag or uvarint encoding
"""

assert len(ints) > 0
# The root element at the 0 position does not need a length prefix.
if ints == [0]:
buf.write(_bytes(0x00))
return

ProtobufSerializer._write_varint(buf, len(ints), zigzag=zigzag)

for value in ints:
while (value & ~0x7f) != 0:
buf.write(_bytes((value & 0x7f) | 0x80))
value >>= 7
buf.write(_bytes(value))
ProtobufSerializer._write_varint(buf, value, zigzag=zigzag)

def _resolve_dependencies(self, ctx, file_desc):
"""
@@ -361,7 +417,8 @@ def __call__(self, message_type, ctx):
# (big endian)
fo.write(struct.pack('>bI', _MAGIC_BYTE, self._schema_id))
# write the record index to the buffer
self._encode_uvarints(fo, self._msg_index)
self._encode_varints(fo, self._msg_index,
zigzag=not self._use_deprecated_format)
# write the record itself
fo.write(message_type.SerializeToString())
return fo.getvalue()
@@ -374,28 +431,82 @@ class ProtobufDeserializer(object):
Args:
message_type (GeneratedProtocolMessageType): Protobuf Message type.
conf (dict): Configuration dictionary.
ProtobufDeserializer configuration properties:
+-------------------------------------+----------+------------------------------------------------------+
| Property Name | Type | Description |
+-------------------------------------+----------+------------------------------------------------------+
| ``use.deprecated.format`` | bool | Specifies whether the Protobuf deserializer should |
| | | deserialize message indexes without zig-zag encoding.|
| | | This option must be explicitly configured as older |
| | | and newer Protobuf producers are incompatible. |
| | | If Protobuf messages in the topic to consume were |
| | | produced with confluent-kafka-python <1.8 then this |
| | | property must be set to True until all old messages |
| | | have been processed and producers have been upgraded.|
| | | Warning: This configuration property will be removed |
| | | in a future version of the client. |
+-------------------------------------+----------+------------------------------------------------------+
See Also:
`Protobuf API reference <https://googleapis.dev/python/protobuf/latest/google/protobuf.html>`_
"""
__slots__ = ['_msg_class', '_msg_index']
__slots__ = ['_msg_class', '_msg_index', '_use_deprecated_format']

# default configuration
_default_conf = {
'use.deprecated.format': False,
}

def __init__(self, message_type, conf=None):

# Require use.deprecated.format to be explicitly configured
# during a transitionary period since old/new format are
# incompatible.
if conf is None or 'use.deprecated.format' not in conf:
raise RuntimeError(
"ProtobufDeserializer: the 'use.deprecated.format' configuration "
"property must be explicitly set due to backward incompatibility "
"with older confluent-kafka-python Protobuf producers and consumers. "
"See the release notes for more details")

# handle configuration
conf_copy = self._default_conf.copy()
if conf is not None:
conf_copy.update(conf)

self._use_deprecated_format = conf_copy.pop('use.deprecated.format')
if not isinstance(self._use_deprecated_format, bool):
raise ValueError("use.deprecated.format must be a boolean value")
if self._use_deprecated_format:
warnings.warn("ProtobufDeserializer: the 'use.deprecated.format' "
"configuration property, and the ability to use the "
"old incorrect Protobuf serializer heading format "
"introduced in confluent-kafka-python v1.4.0, "
"will be removed in an upcoming release in 2022 Q2. "
"Please migrate your Python Protobuf producers and "
"consumers to 'use.deprecated.format':False as "
"soon as possible")

def __init__(self, message_type):
descriptor = message_type.DESCRIPTOR
self._msg_index = _create_msg_index(descriptor)
self._msg_class = MessageFactory().GetPrototype(descriptor)

@staticmethod
def _decode_uvarint(buf):
def _decode_varint(buf, zigzag=True):
"""
Decodes a single uvarint from a buffer.
Decodes a single varint from a buffer.
Args:
buf (BytesIO): buffer to read from
zigzag (bool): decode as zigzag or uvarint
Returns:
int: decoded uvarint
int: decoded varint
Raises:
EOFError: if buffer is empty
@@ -410,7 +521,12 @@ def _decode_uvarint(buf):
value |= (i & 0x7f) << shift
shift += 7
if not (i & 0x80):
return value
break

if zigzag:
value = (value >> 1) ^ -(value & 1)

return value

except EOFError:
raise EOFError("Unexpected EOF while reading index")
@@ -432,7 +548,7 @@ def _read_byte(buf):
return ord(i)

@staticmethod
def _decode_index(buf):
def _decode_index(buf, zigzag=True):
"""
Extracts message index from Schema Registry Protobuf formatted bytes.
@@ -443,10 +559,17 @@ def _decode_index(buf):
list of int: Protobuf Message index.
"""
size = ProtobufDeserializer._decode_uvarint(buf)
msg_index = [size]
size = ProtobufDeserializer._decode_varint(buf, zigzag=zigzag)
if size < 0 or size > 100000:
raise DecodeError("Invalid Protobuf msgidx array length")

if size == 0:
return [0]

msg_index = []
for _ in range(size):
msg_index.append(ProtobufDeserializer._decode_uvarint(buf))
msg_index.append(ProtobufDeserializer._decode_varint(buf,
zigzag=zigzag))

return msg_index

@@ -486,7 +609,7 @@ def __call__(self, value, ctx):

# Protobuf Messages are self-describing; no need to query schema
# Move the reader cursor past the index
_ = ProtobufDeserializer._decode_index(payload)
_ = self._decode_index(payload, zigzag=not self._use_deprecated_format)
msg = self._msg_class()
try:
msg.ParseFromString(payload.read())
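
To see why the deserializer needs the same setting as the producer, the
decoding side of this change can be sketched standalone. This is an
illustration under stated assumptions, not the library's code: `read_varint`
and `decode_msg_index` are hypothetical names, `ValueError` stands in for
protobuf's `DecodeError` to keep the sketch stdlib-only, and the input bytes
are the old-format encoding of an assumed nested index `[1, 0]`.

```python
from io import BytesIO


def read_varint(buf, zigzag=True):
    """Read one base-128 varint, optionally undoing zig-zag encoding."""
    value = 0
    shift = 0
    while True:
        b = buf.read(1)
        if not b:
            raise EOFError("Unexpected EOF while reading index")
        i = b[0]
        value |= (i & 0x7f) << shift
        shift += 7
        if not (i & 0x80):
            break
    if zigzag:
        value = (value >> 1) ^ -(value & 1)
    return value


def decode_msg_index(buf, zigzag=True):
    """Read the length prefix, then that many indexes."""
    size = read_varint(buf, zigzag=zigzag)
    if size < 0 or size > 100000:
        raise ValueError("Invalid Protobuf msgidx array length")
    if size == 0:
        return [0]
    return [read_varint(buf, zigzag=zigzag) for _ in range(size)]


old_bytes = b"\x02\x01\x00"  # [1, 0] written in the deprecated format

# Matching reader: the index is recovered and the buffer is fully consumed.
print(decode_msg_index(BytesIO(old_bytes), zigzag=False))  # [1, 0]

# Mismatched reader: wrong index, and a stray byte is left in front of
# the message payload.
buf = BytesIO(old_bytes)
print(decode_msg_index(buf, zigzag=True))  # [-1]
print(buf.read())                          # b'\x00'
```

In the mismatched case the reader cursor stops one byte early, so the
leftover index byte would be handed to `ParseFromString` as part of the
message.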