Protobuf lacks framed stream of messages #54
Comments
To be clear, protobuf does support a framed stream of messages. Are you talking specifically about protobuf in Python? In C++ and Java, you can use CodedInputStream/CodedOutputStream to read/write varints or any other protobuf wire-format data.
Yes, Python. Can we get a stable, officially blessed interface to the underlying operations in Python?
On 15 October 2014 09:20, Brian Olson [email protected] wrote:
Doesn't CodedOutputStream already provide that?
It does. Have you seen the Python implementation?
CodedOutputStream is in the C++ library. In Python, I think what I want is buried in google.protobuf.internal.encoder._EncodeVarint and google.protobuf.internal.decoder._DecodeVarint.
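The internal helpers named above implement base-128 varints. For readers unfamiliar with the encoding, here is a minimal stdlib-only sketch of equivalent encode/decode logic (the function names are my own, not the protobuf API):

```python
import io

def encode_varint(value: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (least-significant group first)."""
    out = bytearray()
    while True:
        bits = value & 0x7F
        value >>= 7
        if value:
            out.append(bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(bits)
            return bytes(out)

def decode_varint(stream: io.BufferedIOBase) -> int:
    """Decode one varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("truncated varint")
        byte = b[0]
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: this was the last byte
            return result
        shift += 7

# 300 encodes as two bytes, 0xAC 0x02, per the protobuf wire format
assert encode_varint(300) == b"\xac\x02"
assert decode_varint(io.BytesIO(b"\xac\x02")) == 300
```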
I recently implemented a Bazel persistent worker in Python. The lack of varint-delimited reading/writing APIs was an obstacle, and I worked around it by using the private APIs. This is a case of two Google products not working well together. Would it be possible to publish these APIs?
I would like to revive this thread by sharing our experience. We have a distributed system that communicates using protobuf messages. Since its aggregation points aggregate a LOT of protobufs, it's important for the aggregation to be fast. We've tried different ways of creating a list of protobufs and found that the following way is the fastest.
Here's our benchmark and results: my_proto.proto:
benchmark.py:
Results for a list of 1M protos:
As you can see, with pure-Python protos the stream code not only produces smaller output but is also almost twice as fast. We need to do even better, though, so we use the C++-backed protos. With CPP Python protos:
Here you can see that the stream code is more than 1.5 times slower :(. Can we please get streams as part of this package? It seems that doing it any other way will not give us good enough speed.
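For context on what "the stream code" pattern looks like: each serialized message is written as a varint length prefix followed by its bytes, and read back with a generator. This is my own dependency-free sketch, not the benchmark's actual code; opaque byte strings stand in for serialized protos:

```python
import io
from typing import BinaryIO, Iterator

def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write one length-prefixed record: varint(len(payload)) followed by payload."""
    n = len(payload)
    while True:
        bits = n & 0x7F
        n >>= 7
        stream.write(bytes([bits | 0x80 if n else bits]))
        if not n:
            break
    stream.write(payload)

def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each length-prefixed record until a clean EOF at a record boundary."""
    while True:
        size = 0
        shift = 0
        while True:
            b = stream.read(1)
            if not b:
                if shift:
                    raise EOFError("truncated length prefix")
                return  # clean EOF between records
            size |= (b[0] & 0x7F) << shift
            if not (b[0] & 0x80):
                break
            shift += 7
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated payload")
        yield payload

buf = io.BytesIO()
for blob in (b"first", b"second", b"third"):
    write_delimited(buf, blob)
buf.seek(0)
assert list(read_delimited(buf)) == [b"first", b"second", b"third"]
```

A generator like this lets a consumer process millions of records without ever holding the whole stream in memory, which is the appeal of the streaming layout over one giant wrapper message.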
Now that Python is implemented on top of upb, this has become a upb issue. First up is to implement a proto text parser, which is something I am doing now. The initial implementation will be limited to contiguous buffers; after that, we will look into adding support for stream I/O. I can't say yet when (or even whether) this may float to the top of the work queue, but it is definitely on my radar, so I am reassigning this to myself.
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment. This issue is labeled
Python is officially getting a public API for length-prefixed streams of messages: #16965. It is not released yet, but it will be included in the next minor version. If you have any performance issues with this API, please open a separate issue for it.
Lots of applications want a stream of protobuf messages in a file or a network stream.
It could be as simple as exposing the internal utility functions to write a varint to a stream. An application could then write a varint length prefix and then the blob of serialized protobuf.
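The proposal above can be sketched end to end. Assuming the serialized message is just an opaque byte string, nothing beyond the stdlib is needed (the function names here are hypothetical, chosen only for illustration):

```python
import io

def write_framed(stream, blob: bytes) -> None:
    """Frame one serialized message: varint length prefix, then the raw bytes."""
    n = len(blob)
    prefix = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        prefix.append(bits | 0x80 if n else bits)
        if not n:
            break
    stream.write(bytes(prefix))
    stream.write(blob)

def read_framed(stream) -> bytes:
    """Read one framed message back: decode the varint, then read that many bytes."""
    size = 0
    shift = 0
    while True:
        byte = stream.read(1)[0]
        size |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            break
        shift += 7
    return stream.read(size)

buf = io.BytesIO()
write_framed(buf, b"serialized-proto-bytes")
buf.seek(0)
assert read_framed(buf) == b"serialized-proto-bytes"
```

In real use, `blob` would be the result of `message.SerializeToString()`, and the bytes returned by `read_framed` would be passed to `message.ParseFromString()`; the framing itself never needs to inspect the payload.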