Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python generated code is no longer compatible with Cython since 3.20 #10800

Closed
vthib opened this issue Oct 20, 2022 · 6 comments
Closed

Python generated code is no longer compatible with Cython since 3.20 #10800

vthib opened this issue Oct 20, 2022 · 6 comments
Assignees

Comments

@vthib
Copy link
Contributor

vthib commented Oct 20, 2022

Version: 4.21.8
Language: Python

Before the 3.20 update, the generated code was statically defining all the messages, and could be given to cython without issues.
Since the 3.20 update, those messages are dynamically generated, leading to a cython error.

For example, given this proto file

syntax = "proto3";

message Foo {
    bool a = 1;
}

The codegen gives this:

$ protoc --python_out=gen proto/a.proto

# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler.  DO NOT EDIT!
# source: proto/a.proto
"""Generated protocol buffer code."""
from google.protobuf.internal import builder as _builder
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)

_sym_db = _symbol_database.Default()




DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\rproto/a.proto\"\x10\n\x03\x46oo\x12\t\n\x01\x61\x18\x01 \x01(\x08\x62\x06proto3')

_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'proto.a_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:

  DESCRIPTOR._options = None
  _FOO._serialized_start=17
  _FOO._serialized_end=33
# @@protoc_insertion_point(module_scope)

This can no longer be compiled by cython:

$ cython gen/proto/a_pb2.py
Error compiling Cython file:
------------------------------------------------------------
...
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'proto.a_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:

  DESCRIPTOR._options = None
  _FOO._serialized_start=17
 ^
------------------------------------------------------------

gen/proto/a_pb2.py:23:2: undeclared name not builtin: _FOO

Cython does not dynamically check variables from modifications of globals(), and thinks the variable is not set.

This isn't really a regression since I suppose you do not support Cython. However, a small change could make this work with native python and cython:

Instead of generating this:

_FOO._serialized_start=17

generating this would fix the issue:

globals()["_FOO"]._serialized_start=17

This is not an easy fix to do with some post processing of the generated files with regexes and seds and stuff like this, but shouldn't be too hard to do in the code generator I suppose.

Would you be OK with such a change? That would be really useful for cython users. I can try to make this change if needed.

@vthib vthib added the untriaged auto added to all issues by default when created. label Oct 20, 2022
@zhangskz zhangskz added question python and removed untriaged auto added to all issues by default when created. labels Oct 20, 2022
@haberman
Copy link
Member

What is the use case for this? What is the expected behavior if you let Cython compile the generated code?

@vthib
Copy link
Contributor Author

vthib commented Oct 20, 2022

The expected behavior is to have the exact same behavior as without the Cython pass. Cython is usually used to improve performances, but it can also be used for obfuscation and protecting the source code.

@haberman
Copy link
Member

I see. That seems reasonable. We reference globals() several times, I could see rewriting this to:

_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'proto.a_pb2', _globals)

if _descriptor._USE_C_DESCRIPTORS == False:

  DESCRIPTOR._options = None
  _globals["_FOO"]._serialized_start=17

We could use _globals for any case where we are referencing a variable we did not directly assign.

I would worry some about the performance, except that _descriptor._USE_C_DESCRIPTORS will always be true unless we are using the pure-Python library, which is slow anyway. So I think we are ok there.

@vthib
Copy link
Contributor Author

vthib commented Oct 21, 2022

Yes that would work well!

@vthib
Copy link
Contributor Author

vthib commented Nov 14, 2022

Are you working on this @haberman ? I can try to make a PR for it otherwise

@haberman
Copy link
Member

I am not working on this currently. PRs welcome.

copybara-service bot pushed a commit that referenced this issue Dec 5, 2022
Replace access to variables created in globals() by explicit access to the globals() array. This avoids static analysis of the code to error out on those unknown variables, and make the files cythonable.

Fixes #10800

Closes #11011

FUTURE_COPYBARA_INTEGRATE_REVIEW=#11011 from vthib:10800-cython-compatibility 705da40
PiperOrigin-RevId: 493055381
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants