[Bug]: grpcio==1.48.0 causes Java expansion service to hang on startup in Python XLang pipelines. #22533
Comments
@chamikaramj @ihji Who would be a good person to investigate this? |
Will check. Thanks. |
Per @Abacn this is due to grpcio 1.48.0. |
Thanks for looking @Abacn. |
@kileys making this a release blocker in case we have to update Python or x-lang wrappers somehow for automatic expansion service startup to not hang. |
I can reproduce this locally. With grpcio 1.48.0 the "subprocess.Popen" invocation below gets stuck.
cmd is: ['java' '-jar' '/beam/sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.42.0-SNAPSHOT.jar' '55010' '--filesToStage=/beam/sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.42.0-SNAPSHOT.jar'] It works fine with grpcio 1.47.0. @tvalentyn any idea how to proceed here? This sounds to me like a bug in the new grpcio. Should we try to pin the version? |
I think we need to reach out to the maintainers of grpcio to clarify. |
Are there any additional details as to what step is failing ? |
cc: @veblush from grpcio team. |
I would consider trying to install grpcio from sources, and try to do a bisection via |
@gnossen for gRPC Python |
@tvalentyn bisecting would be helpful and callstack when it's hanging would be also helpful if possible. |
Update: we could reproduce locally. @tvalentyn is running a bisect on the grpcio repo to identify the culprit. |
Bisection notes. To rebuild grpcio between bisection iterations:
|
bisection points to grpc/grpc@977ebbe |
cc: @ctiller |
Detailed instructions to reproduce using an existing Apache Beam release.
Install Apache Beam 2.40.0:
$ mkdir test_env
pip freeze (if the gRPC version is not 1.48.0, uninstall it and install version 1.48.0 to test, or build from source).
Run the test:
$ export BOOTSTRAP_SERVER="dummy_broker"
$ python -m apache_beam.examples.kafkataxi.kafka_taxi
You'll notice that the test gets stuck at the "subprocess.Popen" call at the location below (right after the log line "Starting service with...").
If you try the same test with gRPC version 1.47.0 the test will proceed (but note that the test will still fail at a later location since we provided dummy parameters above). |
Offending release was yanked. |
@chamikaramj We have a prospective fix here. Any chance you'd be willing to pull it down and test it out. We believe we have an automated test that has been fixed by this change, but better safe than sorry. |
Yeah, I can try. Lemme know when it's ready to be tested (or I can do it when the PR is merged). |
It's ready to be tested right now. We're hoping to get verification from your end before merging the PR. |
Hi @gnossen, unfortunately I'm still running into the issue both on grpc/grpc#30473 and grpcio HEAD. @tvalentyn will you be able to check as well ? Happy to check again if needed. |
@chamikaramj Thanks for the feedback. We'll keep digging and come back to you when we have another candidate to test. I'll also keep on trying to create our own in-repo reproduction. |
@chamikaramj I believe you're running into a deadlock with Popen using I'm not sure what would have changed in python 1.48 to precipitate this, but presumably: more stdout/stderr than there was in v1.47. See https://docs.python.org/3.8/library/subprocess.html#subprocess.Popen |
Agreed. We've seen this many times in our codebase. We tend to use TemporaryFiles to avoid the issue. |
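For readers unfamiliar with the TemporaryFile approach mentioned above, here is a minimal sketch of it. It is not Beam's or gRPC's code; the helper name `run_with_tempfile` and the noisy child command are made up for illustration. The child's stdout and stderr go to a temporary file, so the child cannot block on a full pipe while the parent waits.

```python
import subprocess
import sys
import tempfile


def run_with_tempfile(cmd):
    """Run cmd with stdout and stderr redirected to a temporary file.

    Hypothetical helper for illustration only. Because the output goes to a
    file rather than a pipe, wait() cannot deadlock on a full pipe buffer.
    """
    with tempfile.TemporaryFile() as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        returncode = proc.wait()
        log.seek(0)  # rewind before reading what the child wrote
        return returncode, log.read().decode(errors="replace")


# A stand-in child that prints far more than a typical pipe buffer holds.
noisy_child = [sys.executable, "-c", "print('x' * 200000)"]
returncode, output = run_with_tempfile(noisy_child)
```

The trade-off is that the output is only available after the fact (or by tailing the file), which may not suit a service whose logs should be streamed while it runs.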
This will unfortunately result in a breakage for our released SDKs. Also, we can very reliably reproduce this for gRPC 1.48.0 and all released Beam SDKs will automatically pick up the change when it's released. Is there a way to update [1] (suspected PR that introduced the change in behavior) so that it does not result in a breakage for existing users ? Also, I suspect other gRPC users might run into similar issues when the change is released. cc: @aaltay @tvalentyn |
Tentatively re-opening to fix the ongoing release. |
I'm still running into this even with the fix suggested in #22533 (comment) So something is off. |
@kileys - this still looks like a release blocker. |
Update: @drfloob confirmed that he can no longer verify the fix either. So possibly something was not set up correctly when the fix was tried the first time? Also, after reading the subprocess docs, it seems the deadlock scenario is the following.
But that's not what Beam is doing. Beam does the following.
From the documentation: subprocess.STDOUT: Relevant code is here:
So I think Beam's implementation is correct and should not result in the deadlock scenario mentioned in the documentation. |
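To make the distinction concrete, here is a sketch of the non-deadlocking pattern: stderr is merged into stdout with `subprocess.STDOUT`, and a background thread keeps reading the pipe so the child never blocks on a full buffer. This is an illustration under that assumption, not a copy of Beam's implementation; `start_and_drain` and the child command are hypothetical.

```python
import subprocess
import sys
import threading


def start_and_drain(cmd, sink):
    """Start cmd and continuously drain its merged stdout/stderr into sink.

    Hypothetical helper for illustration only. The deadlock described in the
    subprocess docs needs a pipe that nobody reads; here a thread reads it.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

    def drain():
        # Iterating the pipe ends when the child closes it (usually at exit).
        for raw in proc.stdout:
            sink.append(raw.decode(errors="replace").rstrip())

    reader = threading.Thread(target=drain, daemon=True)
    reader.start()
    return proc, reader


lines = []
child_code = (
    "import sys\n"
    "for i in range(5000):\n"
    "    print('out', i)\n"
    "    print('err', i, file=sys.stderr)\n"
)
proc, reader = start_and_drain([sys.executable, "-c", child_code], lines)
proc.wait()    # safe here because the reader thread keeps the pipe empty
reader.join()
```

If a parent using this pattern still hangs inside the `Popen` call itself, as reported in this thread, the pipe-buffer deadlock is unlikely to be the cause, since no output has been produced at that point.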
@chamikaramj What version are you testing with? Do you have a stacktrace of where the test is currently timing out? |
I tried both master and pull/30473/head with stdout and stderr set to None. The job gets stuck in both cases. I don't have a stack trace since it just hangs until I kill the process. |
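One way to get at least the Python-level stacks from a process that hangs is the standard library's `faulthandler` module. The sketch below is only an illustration: the timeout, the `time.sleep` stand-in for the hanging call, and the helper name `dump_stacks_if_stuck` are assumptions, not part of Beam or gRPC.

```python
import faulthandler
import tempfile
import time


def dump_stacks_if_stuck(timeout_s, trace_file):
    """Arm a watchdog that writes all thread stacks after timeout_s seconds.

    Hypothetical helper for illustration only. exit=False keeps the process
    alive so it can still be inspected with other tools.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=trace_file)


trace_file = tempfile.TemporaryFile()
dump_stacks_if_stuck(0.2, trace_file)
time.sleep(0.6)  # stand-in for the call that hangs
faulthandler.cancel_dump_traceback_later()  # disarm once the call returns

trace_file.seek(0)
report = trace_file.read().decode()
```

This only shows Python frames. For a hang inside native code, a native debugger attached to the stuck process would still be needed to see the C-level stack.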
I opened #22654 to remove usage of subprocess.PIPE but it doesn't fix the expansion service hanging with unreleased gRPC as mentioned above. |
After rebuilding my beam dev environment (I lost some state when shutting down the docker container), I was unable to make the fix work again reliably. It's unclear to me how it got into a working state before. Apologies.
I can confirm I'm seeing the same behavior. I added enough debug logging in gRPC to establish that the pthread pre-fork handlers are not being run on |
We've been playing around in your docker dev environment and have come up with another fix to try. Can you please give this PR a shot? grpc/grpc#30572 |
@kileys - Is this not a release blocker? We have an RC but this issue is still open? |
@gnossen I have a stack trace from the hung spot with symbols:
apache-beam 2.40.0 grpcio installed via
|
I realized it might be too late for that stacktrace, it might help alternative timelines of me find this issue in the future if the timelines happen to merge. Thanks for the fix! |
What happened?
Symptoms
Workaround:
pip install grpcio==1.47.0
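A small sketch of how a test harness might detect the affected release before starting a cross-language pipeline. The function `is_affected_grpcio` is hypothetical and only encodes what this issue reports (1.48.0 hangs, 1.47.0 works); it makes no claim about other versions.

```python
from importlib import metadata


def is_affected_grpcio(version):
    """Return True only for the 1.48.0 release reported in this issue."""
    return version.split(".")[:3] == ["1", "48", "0"]


def installed_grpcio_version():
    """Return the installed grpcio version string, or None if not installed."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None


found = installed_grpcio_version()
if found is not None and is_affected_grpcio(found):
    print("grpcio", found, "is the release reported to hang; "
          "consider: pip install grpcio==1.47.0")
```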
Affected test suites
https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/2949/
https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/5697/
https://ci-beam.apache.org/job/beam_PostCommit_XVR_PythonUsingJavaSQL_Dataflow/715/
https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/4014/
https://ci-beam.apache.org/job/beam_PostCommit_Python37/5528/
https://ci-beam.apache.org/job/beam_PostCommit_Python38/2939/
https://ci-beam.apache.org/job/beam_PostCommit_Python39/661/
https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/5931/
Issue Priority
Priority: 1
Issue Component
Component: cross-language