chore: transition the library to the new microgenerator #158
Conversation
Force-pushed a18ea36 to 1ecc48f.
The argument is no longer supported in the new generated client.
This configuration is no longer supported by the generated subscriber client.
Force-pushed 1ecc48f to 2b01cc4.
I believe this PR is now ready to be reviewed. The things that still fail are related to incompatible changes in the client itself, for example:
We need to figure out how to handle these. Since this PR is quite complex, every pair of eyes would be beneficial. 🙂
Let's not rush and take the time to review this thoroughly, thanks!
I don't think I understand the benchmark result very well; are the two different throughput numbers for the publisher and the subscriber?
Also, just had a thought that may be relevant: is the manual layer invoking the asynchronous client or the synchronous one? I see a lot of task/future names and semantics, which made me ask. The asynchronous surface has had basically no optimization focus.
I did some publisher profiling of the message construction:

```python
from google.pubsub_v1 import types as gapic_types
...
# Create the Pub/Sub message object.
message = gapic_types.PubsubMessage(
    data=data, ordering_key=ordering_key, attributes=attrs
)
```

FWIW, I also profiled the current released version. Is instantiating a `PubsubMessage` expected to be this much more expensive with the microgenerator?
Yes, constructing messages is more expensive with the microgenerator because of its use of proto-plus. Can you send me the raw data in some way? I'd like to take a closer look at which part of construction is expensive. E.g.:

```python
vanilla_pb = gapic_types.PubsubMessage.pb(data=data, ordering_key=ordering_key, attributes=attrs)
proto_plus_pb = gapic_types.PubsubMessage.wrap(vanilla_pb)
```
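To illustrate why the `wrap()` trick above helps, here is a self-contained toy model of the pattern. Note that `RawMessage` and `WrappedMessage` are illustrative stand-ins I made up for this sketch, not the real protobuf or proto-plus classes: constructing through the wrapper pays a per-field translation cost in `__init__`, while `wrap()` skips `__init__` entirely and just attaches the already-built raw message.

```python
class RawMessage:
    """Stand-in for the generated protobuf message class (illustrative only)."""
    __slots__ = ("data", "ordering_key", "attributes")

    def __init__(self, data=b"", ordering_key="", attributes=None):
        self.data = data
        self.ordering_key = ordering_key
        self.attributes = attributes if attributes is not None else {}


class WrappedMessage:
    """Stand-in for the proto-plus wrapper class (illustrative only)."""

    def __init__(self, **kwargs):
        # Normal construction: every keyword is validated and copied into
        # the underlying raw message, which costs time on every field.
        raw = RawMessage()
        for name, value in kwargs.items():
            if name not in RawMessage.__slots__:
                raise TypeError(f"unknown field: {name}")
            setattr(raw, name, value)
        self._pb = raw

    @classmethod
    def wrap(cls, raw):
        # The fast path: bypass __init__ and just attach the raw message.
        instance = cls.__new__(cls)
        instance._pb = raw
        return instance


vanilla = RawMessage(data=b"A" * 250, ordering_key="", attributes={"clientId": "0"})
wrapped = WrappedMessage.wrap(vanilla)
print(wrapped._pb.data[:1])  # b'A'
```

The design point is that `wrap()` trades safety (no field validation) for speed, which is acceptable when the raw message is known to be well-formed.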
This actually improved things considerably - the benchmark showed a publisher throughput of ~57 MB/s, which is only ~11% worse than the current stable version.
The following is roughly how the messages look in the benchmark and in my local script for profiling:

```python
data = b"A" * 250
ordering_key = ""
attrs = {'clientId': '0', 'sendTime': '1598046901869', 'sequenceNumber': '0'}
...
message = gapic_types.PubsubMessage(
    data=data, ordering_key=ordering_key, attributes=attrs
)
```

If using the hack, instantiation is considerably faster, but there is still quite a lot of time spent elsewhere. I'll also see if the same can be used to improve subscriber performance.
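A sketch of how such a profile can be collected with the standard library's `cProfile`/`pstats`. The `build_messages` helper is a hypothetical stand-in using plain dicts so the sketch stays dependency-free; the real profile above ran against `gapic_types.PubsubMessage` construction.

```python
import cProfile
import io
import pstats

def build_messages(n):
    # Hypothetical stand-in for repeated PubsubMessage construction;
    # plain dicts keep this sketch free of google-cloud dependencies.
    data = b"A" * 250
    attrs = {"clientId": "0", "sendTime": "1598046901869", "sequenceNumber": "0"}
    return [
        {"data": data, "ordering_key": "", "attributes": dict(attrs)}
        for _ in range(n)
    ]

profiler = cProfile.Profile()
profiler.enable()
build_messages(10_000)
profiler.disable()

# Print the five most expensive call sites by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("build_messages" in report)  # True
```

Sorting by `cumulative` surfaces the constructor call chain, which is where the wrapper overhead shows up in the real profiles.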
Correct. The benchmark framework spins up separate compute instances for publisher and subscriber and then measures how well they perform individually.
It's the synchronous client.
I have some good news - circumventing the wrapper classes around the raw protobuf messages to speed up instantiation and attribute access seems to help a lot. Running the benchmarks with the two most recent experimental commits produces the following:
The performance hit is now only around 10% (publisher ~12%, subscriber ~8%), which might actually be acceptable.
Profiling shows that the speed of creating a new Pub/Sub message, and of accessing the message's attributes, significantly affects the throughput of the publisher and subscriber. This commit makes things faster by circumventing the wrapper class around the raw protobuf messages where possible.
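The attribute-access half of this can be modeled with a toy wrapper, again using made-up `Raw`/`Wrapper` classes rather than the real proto-plus types: every read through the wrapper takes an extra `__getattr__` indirection, which adds up in hot publish/consume loops.

```python
import timeit

class Raw:
    """Stand-in for a raw protobuf message (illustrative only)."""
    __slots__ = ("data",)

    def __init__(self, data):
        self.data = data


class Wrapper:
    """Stand-in for the proto-plus wrapper; attribute reads fall
    through to __getattr__ and then to the underlying raw message."""

    def __init__(self, pb):
        object.__setattr__(self, "_pb", pb)

    def __getattr__(self, name):
        # Called because Wrapper itself has no 'data' attribute.
        return getattr(self._pb, name)


raw = Raw(b"payload")
wrapped = Wrapper(raw)

# Time direct access vs. access through the wrapper layer.
direct = timeit.timeit(lambda: raw.data, number=200_000)
indirect = timeit.timeit(lambda: wrapped.data, number=200_000)
print(wrapped.data == raw.data)  # True
```

Circumventing the wrapper in hot code paths means operating on `raw` directly, exactly as the commit above does for the real protobuf messages.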
Force-pushed 30dce66 to c29d7f8.
My bad, when I said "the raw data" above, I meant the profiling data. I don't particularly want to set up the benchmark environment, but I do want to try looking through the data to see what the hot spots are and whether they could be optimized.
I have a special, optimized-but-not-reviewed-or-production-ready tarball of proto-plus. Is it possible to run the benchmark suite using it? My own benchmarking has these changes saving about 10-15 milliseconds out of about 500 milliseconds. YMMV, but if it improves the throughput enough it could be worth putting the proto-plus changes up for review.
Just to clarify, would you like me to run the benchmark using the tip of this PR branch, i.e. the version that already includes the optimizations that try to bypass the protobuf wrapper classes? Or on the version that excludes this last commit and uses the wrapper classes more heavily?
No worries, sent you an email with the data just now.
The tip of this PR branch. Changes are also available in this branch of my fork: https://github.com/software-dov/proto-plus-python/tree/hackaday
First run, using proto-plus 1.7.1 on the tip of this PR branch. The results seem more or less similar to the previous benchmark:
I also did another run; the measured throughput was around 55 MB/s - a bit worse, but within the normal variance of results (the typical difference appears to be somewhere in the 1-2 MB/s range).
Okay. In light of those numbers I'm inclined not to merge the proto-plus changes. Is there an approval process for accepting the throughput regression?
Please just mind that the proto-plus optimizations might not have a noticeable effect here, as the optimizations in this PR actively try to circumvent proto-plus; other libraries might still see benefits. I'm not aware of any formal processes, but we do have weekly Pub/Sub meetings on Thursdays where we discuss these things. I added the question to the agenda whether a -10% performance hit is acceptable in the new major release.
Update: Confirmed, -10% is good enough for now, although we should strive to improve this further in the mid-term.
This looks good to me. Sorry this took so long. Had to go commit by commit.
@cguardia That's fine, appreciated. I will re-generate and benchmark the code again now, just in case, and see if we are ready for the major release.
After re-generating the code yet again, performance still appears to be in line with the previous benchmarks:
Merging. 🎉
Edit: Ah, @kamalaboulhosn expressed a wish to review the UPGRADING guide, will wait with merging a bit more.
The upgrade guide looks good.
Closes #131.
Closes #168.
This PR replaces the old code generator; the generated parts of the code now use the microgenerator. The latter also implies dropping support for Python 2.7 and 3.5.
There are a lot of changes and the transition itself was far from smooth, so it's probably best to review this commit by commit (I tried to make the commits self-contained, each containing a single change/fix).
Things to focus on in reviews

- Regenerating the code overrides some of the URLs in the samples README. Seems like a synthtool issue.
- The clients do not support the `client_config` argument anymore. At least the ordering keys feature used that to change the timeout to "infinity". We need to see if that's crucial, and if the same can be set in a different way (maybe through the retry policy...).
- The `SERVICE_ADDRESS` and `_DEFAULT_SCOPES` constants in clients might be obsolete. Let's see if there are more modern alternatives (they are currently still injected into the generated clients).
- Regenerating the code out of the box fails, because the Bazel tool incorrectly tries to use Python 2, resulting in syntax errors (could be just a problem on my machine, but it is a known Bazel issue). Workaround:
  - Adjust the `google/pubsub/v1/BUILD.bazel` file in the local googleapis clone and point `synthtool` to it.
  - Adjust the local `synthtool` installation: add the two Bazel arguments to the `_generate_code()` method in `synthtool/gcp/gapic_bazel.py` (lines 177-178).

  The workaround should convince Bazel to use Python 3, as this is the Python version in the configs.
Things left to do

- Double check that the new client performance is adequate. There have been reports of possibly degraded performance with the new microgenerator. (Resolved: the ~10% performance hit was declared acceptable, not a release blocker anymore.)
- Some methods fail, e.g. `get_iam_policy()`, or do not support all config options anymore, e.g. `create_subscription()`. Adjust or delete?
- Determine a replacement for the now-unsupported `client_config` argument to the client constructor. Or does the code generator need an update? (Resolved: `client_config` has been replaced by custom retry settings passed to the GAPIC client's `publish()`. If we want to support custom retries, we must update the user-facing client's `publish()` method.)

PR checklist
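As a rough illustration of what replaces the old `client_config` timeout tuning, here is a minimal, dependency-free sketch of retry semantics in the spirit of `google.api_core.retry.Retry` (exponential backoff with an overall deadline). The `retry` decorator and `flaky_publish` function are made-up stand-ins, not the real API.

```python
import time

def retry(initial=0.1, multiplier=2.0, maximum=2.0, deadline=10.0,
          predicate=lambda exc: isinstance(exc, ConnectionError)):
    """Toy model of Retry semantics: retry predicate-matched errors
    with exponential backoff until an overall deadline expires."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            delay = initial
            start = time.monotonic()
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if not predicate(exc):
                        raise  # non-retryable error
                    if time.monotonic() - start > deadline:
                        raise  # out of time
                    time.sleep(min(delay, maximum))
                    delay *= multiplier
        return wrapper
    return decorator


calls = {"count": 0}

@retry(initial=0.01, deadline=5.0)
def flaky_publish():
    # Hypothetical RPC that fails transiently twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient")
    return "message-id-1"

print(flaky_publish())  # message-id-1
```

In the real library, an equivalent policy object would be passed to the GAPIC client's `publish()` call instead of being configured up front via `client_config`.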