
[Bug]: grpcio==1.48.0 causes Java expansion service to hang on startup in Python XLang pipelines. #22533

Closed
apilloud opened this issue Jul 29, 2022 · 42 comments · Fixed by #22539

Comments

@apilloud
Member

apilloud commented Jul 29, 2022

What happened?

Symptoms

22:32:02 apache_beam/transforms/sql_test.py::SqlTransformTest::test_filter 
22:32:02 -------------------------------- live log call ---------------------------------
22:32:02 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:239 Using pre-built snapshot at /home/jenkins/jenkins-slave/workspace/beam_PostCommit_XVR_Direct/src/sdks/java/extensions/sql/expansion-service/build/libs/beam-sdks-java-extensions-sql-expansion-service-2.42.0-SNAPSHOT.jar
22:32:02 INFO     root:external.py:773 Starting a JAR-based expansion service from JAR /home/jenkins/jenkins-slave/workspace/beam_PostCommit_XVR_Direct/src/sdks/java/extensions/sql/expansion-service/build/libs/beam-sdks-java-extensions-sql-expansion-service-2.42.0-SNAPSHOT.jar 
22:32:02 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:116 Starting service with ['java' '-jar' '/home/jenkins/jenkins-slave/workspace/beam_PostCommit_XVR_Direct/src/sdks/java/extensions/sql/expansion-service/build/libs/beam-sdks-java-extensions-sql-expansion-service-2.42.0-SNAPSHOT.jar' '56083' '--filesToStage=/home/jenkins/jenkins-slave/workspace/beam_PostCommit_XVR_Direct/src/sdks/java/extensions/sql/expansion-service/build/libs/beam-sdks-java-extensions-sql-expansion-service-2.42.0-SNAPSHOT.jar']
22:32:02 Build timed out (after 100 minutes). Marking the build as aborted.
23:50:24 Build was aborted

Workaround:
pip install grpcio==1.47.0

Affected test suites

https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/2949/
https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/5697/
https://ci-beam.apache.org/job/beam_PostCommit_XVR_PythonUsingJavaSQL_Dataflow/715/
https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/4014/
https://ci-beam.apache.org/job/beam_PostCommit_Python37/5528/
https://ci-beam.apache.org/job/beam_PostCommit_Python38/2939/
https://ci-beam.apache.org/job/beam_PostCommit_Python39/661/
https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/5931/

Issue Priority

Priority: 1

Issue Component

Component: cross-language

@apilloud
Member Author

@chamikaramj @ihji Who would be a good person to investigate this?

@chamikaramj chamikaramj self-assigned this Jul 29, 2022
@chamikaramj
Contributor

Will check. Thanks.

@apilloud
Member Author

Per @Abacn this is due to grpcio 1.48.0.

@chamikaramj chamikaramj removed their assignment Jul 29, 2022
@chamikaramj
Contributor

Thanks for looking @Abacn.

@chamikaramj chamikaramj added this to the 2.41.0 Release milestone Jul 29, 2022
@chamikaramj
Contributor

@kileys making this a release blocker in case we have to update Python or x-lang wrappers somehow so that automatic expansion service startup does not hang.

@chamikaramj chamikaramj self-assigned this Jul 29, 2022
@chamikaramj
Contributor

I can reproduce this locally. With grpcio 1.48.0 the "subprocess.Popen" invocation below gets stuck.

self._process = subprocess.Popen(

cmd is: ['java' '-jar' '/beam/sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.42.0-SNAPSHOT.jar' '55010' '--filesToStage=/beam/sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.42.0-SNAPSHOT.jar']

It works fine with grpcio 1.47.0.

@tvalentyn any idea how to proceed here? Sounds to me like a bug in the new grpcio. Should we try to pin the version?
(That would still break old SDKs though.)

@tvalentyn
Contributor

I think we need to reach out to the maintainers of grpcio to clarify.

@tvalentyn
Contributor

Are there any additional details as to what step is failing?

@tvalentyn
Contributor

cc: @veblush from grpcio team.

@tvalentyn
Contributor

I would consider installing grpcio from source and doing a bisection via git bisect within the grpcio repo. There are instructions on building from source at https://pypi.org/project/grpcio/1.48.0/

@veblush
Contributor

veblush commented Jul 29, 2022

@gnossen for gRPC Python

@veblush
Contributor

veblush commented Jul 29, 2022

@tvalentyn bisecting would be helpful, and a call stack from when it's hanging would also be helpful if possible.

@chamikaramj
Contributor

Update: we could reproduce locally. @tvalentyn is running a bisect on the grpcio repo to identify the culprit.

@chamikaramj chamikaramj assigned tvalentyn and unassigned chamikaramj Jul 30, 2022
@tvalentyn
Contributor

Bisection notes. To rebuild grpcio between bisection iterations:

  • deactivate and recreate a venv
  • install apache-beam[gcp]
  • uninstall grpcio
  • clean up leftovers in the grpcio clone from the previous build iteration and prepare for the new build: git clean -dfx; git submodule update --init; pip install -r requirements.txt
  • install grpcio: GRPC_PYTHON_BUILD_WITH_CYTHON=1 pip install .
  • stash local changes with git stash
  • try to reproduce the issue
  • continue the bisection

@tvalentyn
Contributor

Bisection points to grpc/grpc@977ebbe

@tvalentyn
Contributor

cc: @ctiller

@tvalentyn tvalentyn changed the title [Bug]: Python expansion service hanging in tests [Bug]: Update to grpcio==1.48.0 causes Java expansion service to hang in Python XLang pipelines. Jul 30, 2022
@chamikaramj
Contributor

chamikaramj commented Aug 1, 2022

Detailed instructions to reproduce using an existing Apache Beam release.

Install Apache Beam 2.40.0

$ mkdir test_env
$ python -m venv test_env/
$ . ./test_env/bin/activate
$ pip install apache-beam[gcp]==2.40.0

pip freeze

(If the grpcio version is not 1.48.0, uninstall it and install version 1.48.0 to test, or build from source.)

Run the test

$ export BOOTSTRAP_SERVER="dummy_broker"
$ export PROJECT=dummyproject
$ export TEMP_LOCATION=gs://dummybucket/tmp
$ export REGION="us-central1"
$ export JOB_NAME="kafka-taxi-$(date +%Y%m%d-%H%M%S)"
$ export NUM_WORKERS="5"

$ python -m apache_beam.examples.kafkataxi.kafka_taxi \
  --runner DataflowRunner \
  --temp_location $TEMP_LOCATION \
  --project $PROJECT \
  --region $REGION \
  --num_workers $NUM_WORKERS \
  --job_name $JOB_NAME \
  --bootstrap_servers $BOOTSTRAP_SERVER

You'll notice that the test gets stuck at the "subprocess.Popen" call at the location below (right after the "Starting service with..." log line).

self._process = subprocess.Popen(

If you try the same test with gRPC version 1.47.0, the test will proceed (but note that it will still fail at a later point since we provided dummy parameters above).

@tvalentyn
Contributor

Offending release was yanked.

@gnossen

gnossen commented Aug 8, 2022

@chamikaramj We have a prospective fix here. Any chance you'd be willing to pull it down and test it out? We believe we have an automated test that is fixed by this change, but better safe than sorry.

@chamikaramj
Contributor

Yeah, I can try. Lemme know when it's ready to be tested (or I can do it when the PR is merged).

@gnossen

gnossen commented Aug 8, 2022

It's ready to be tested right now. We're hoping to get verification from your end before merging the PR.

@chamikaramj
Contributor

Hi @gnossen, unfortunately I'm still running into the issue both on grpc/grpc#30473 and grpcio HEAD. @tvalentyn will you be able to check as well?
(Note that I had to manually re-install protobuf 3.20.1.)

Happy to check again if needed.

@gnossen

gnossen commented Aug 8, 2022

@chamikaramj Thanks for the feedback. We'll keep digging and come back to you when we have another candidate to test.

I'll also keep on trying to create our own in-repo reproduction.

@drfloob

drfloob commented Aug 9, 2022

@chamikaramj I believe you're running into a deadlock with Popen using stdout=subprocess.PIPE. I confirmed that the gRPC fork handlers aren't running at all when your example hangs, which means this isn't likely a threading issue. And by changing that one line in subprocess_server.py to subprocess.Popen(cmd, stdout=None, stderr=None), Popen no longer hangs. Please try reproducing my results.
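The suggested one-line change can be sketched as follows. This is a minimal, runnable stand-in, not the actual subprocess_server.py source: the command is a hypothetical placeholder for the `java -jar ...` invocation.

```python
import subprocess
import sys

# Placeholder for the expansion-service command; the real call uses java -jar.
cmd = [sys.executable, "-c", "print('expansion service started')"]

# Before (can hang if the child fills the pipe and nothing drains it):
#   process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# After, per the suggestion above: inherit the parent's stdout/stderr so the
# child can never block on a full OS pipe buffer.
process = subprocess.Popen(cmd, stdout=None, stderr=None)
rc = process.wait()
```

The tradeoff is that the parent loses the ability to capture and relay the child's output, which Beam's subprocess_server relies on for logging.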

I'm not sure what would have changed in grpcio 1.48 to precipitate this, but presumably: more stdout/stderr output than there was in v1.47.

See https://docs.python.org/3.8/library/subprocess.html#subprocess.Popen

@gnossen

gnossen commented Aug 9, 2022

more stdout/stderr than there was in v1.47.

Agreed. We've seen this many times in our codebase. We tend to use TemporaryFiles to avoid the issue.
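The TemporaryFile pattern mentioned above can be sketched like this (a generic illustration, not code from either repo): the temp file, unlike a pipe, has no fixed-size kernel buffer, so the child can write arbitrarily much output without blocking.

```python
import subprocess
import sys
import tempfile

with tempfile.TemporaryFile() as out:
    # Child output goes to the temp file; stderr is merged into stdout.
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('hello from child')"],
        stdout=out,
        stderr=subprocess.STDOUT)
    # Safe to wait() here: the file absorbs all output, so no pipe can fill.
    returncode = proc.wait()
    out.seek(0)
    output = out.read()
```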

@chamikaramj
Contributor

chamikaramj commented Aug 9, 2022

This will unfortunately result in a breakage for our released SDKs.

Also, we can very reliably reproduce this for gRPC 1.48.0 and all released Beam SDKs will automatically pick up the change when it's released. Is there a way to update [1] (suspected PR that introduced the change in behavior) so that it does not result in a breakage for existing users ?

Also, I suspect other gRPC users might run into similar issues when the change is released.

cc: @aaltay @tvalentyn

[1] grpc/grpc@977ebbe

@chamikaramj chamikaramj reopened this Aug 9, 2022
@chamikaramj
Contributor

Tentatively re-opening to fix the ongoing release.

@chamikaramj
Contributor

I'm still running into this even with the fix suggested in #22533 (comment)

So something is off.

@aaltay
Member

aaltay commented Aug 9, 2022

@kileys - this still looks like a release blocker.

@chamikaramj
Contributor

chamikaramj commented Aug 10, 2022

Update:

@drfloob confirmed that he is no longer able to reproduce the fix either. So possibly something was not set up correctly when the fix was tried the first time?

Also, after reading the subprocess docs, the deadlock scenario seems to be the following.

  • Do subprocess.Popen() to start a process while using stdout=PIPE or stderr=PIPE.
  • Do a wait() on that process.
  • This could deadlock if the child process generates enough output to block waiting for the OS buffer to clear.
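The documented deadlock pattern looks like this in miniature. The child here writes only a little output, so this completes; if it wrote more than the OS pipe buffer (typically 64 KiB on Linux), wait() would block forever because nothing drains stdout.

```python
import subprocess
import sys

child = subprocess.Popen(
    # Small output: safe here. A child writing more than the pipe buffer
    # would block on write(), while the parent blocks in wait() below.
    [sys.executable, "-c", "print('x' * 100)"],
    stdout=subprocess.PIPE)

rc = child.wait()          # the deadlock-prone step when output is large
data = child.stdout.read()
```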

But that's not what Beam is doing. Beam does the following.

  • Do a subprocess.Popen() to start a process while using stdout=subprocess.PIPE and stderr=subprocess.STDOUT.
  • Start a separate thread to read lines from the child process's stdout and print those lines.
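Beam's pattern, as described above, can be sketched like this (a simplified stand-in for subprocess_server.py, with a list standing in for the logger): the dedicated reader thread keeps the pipe drained, which is exactly what prevents the documented deadlock.

```python
import subprocess
import sys
import threading

proc = subprocess.Popen(
    [sys.executable, "-c", "print('line1'); print('line2')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT)  # merge stderr into the same pipe

captured = []

def log_stdout():
    # Continuously drain the child's stdout; the real code logs each line.
    for line in iter(proc.stdout.readline, b""):
        captured.append(line)

t = threading.Thread(target=log_stdout, daemon=True)
t.start()
rc = proc.wait()  # safe: the reader thread keeps the pipe from filling
t.join()
```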

From the documentation (https://docs.python.org/3.8/library/subprocess.html#subprocess.Popen):

subprocess.PIPE:
Special value that can be used as the stdin, stdout or stderr argument to Popen and indicates that a pipe to the standard stream should be opened. Most useful with Popen.communicate().

subprocess.STDOUT:
Special value that can be used as the stderr argument to Popen and indicates that standard error should go into the same handle as standard output.

Relevant code is here:

t = threading.Thread(target=log_stdout)

So I think Beam's implementation is correct and should not result in the deadlock scenario mentioned in the documentation.

@drfloob and @gnossen do you agree ?

@gnossen

gnossen commented Aug 10, 2022

@chamikaramj What version are you testing with? master? I expect you'll need both grpc/grpc#30473 and the change to remove PIPE.

Do you have a stacktrace of where the test is currently timing out?

@chamikaramj
Contributor

I tried both master and pull/30473/head with stdout and stderr set to None. The job gets stuck in both cases.

I don't have a stacktrace since it just hangs until I kill the process.

@chamikaramj
Contributor

I opened #22654 to remove usage of subprocess.PIPE but it doesn't fix the expansion service hanging with unreleased gRPC as mentioned above.

@drfloob

drfloob commented Aug 10, 2022

I tried both master and pull/30473/head with stdout and stderr set to None. The job gets stuck in both cases.

After rebuilding my beam dev environment (I lost some state when shutting down the docker container), I was unable to make the fix work again reliably. It's unclear to me how it got into a working state before. Apologies.

I don't have a stacktrace since it just hangs until I kill the process.

I can confirm I'm seeing the same behavior. I added enough debug logging in gRPC to establish that the pthread pre-fork handlers are not being run on subprocess.Popen, so I'm not sure where exactly this is hanging yet.

@chamikaramj chamikaramj removed this from the 2.41.0 Release milestone Aug 10, 2022
@gnossen

gnossen commented Aug 12, 2022

We've been playing around in your docker dev environment and have come up with another fix to try. Can you please give this PR a shot? grpc/grpc#30572

@aaltay
Member

aaltay commented Aug 13, 2022

@kileys - Is this not a release blocker? We have an RC but this issue is still open?

@chamikaramj
Contributor

@gnossen and @drfloob yes, that seems to have fixed it. The test doesn't get stuck on pull/30572/head but gets stuck on gRPC master. Thanks a lot for investigating.

@aaltay and @kileys seems like the issue is on the gRPC side. So no need to block on this. I'll resolve this Beam-side issue.

@joar
Contributor

joar commented Aug 30, 2022

Do you have a stacktrace of where the test is currently timing out?

@gnossen I have a stacktrace from the hung spot with symbols:

#0  0x00007ffff7da056d in syscall () from /usr/lib/libc.so.6
#1  0x00007ffff479e496 in absl::lts_20220623::synchronization_internal::FutexImpl::WaitUntil (t=..., val=0, v=0x7ffff0db0340) at third_party/abseil-cpp/absl/synchronization/internal/futex.h:104
#2  absl::lts_20220623::synchronization_internal::Waiter::Wait (this=this@entry=0x7ffff0db0340, t=t@entry=...) at third_party/abseil-cpp/absl/synchronization/internal/waiter.cc:93
#3  0x00007ffff479e3d1 in AbslInternalPerThreadSemWait_lts_20220623 (t=...) at third_party/abseil-cpp/absl/synchronization/internal/per_thread_sem.cc:89
#4  0x00007ffff42c38b8 in absl::lts_20220623::synchronization_internal::PerThreadSem::Wait (t=...) at third_party/abseil-cpp/absl/synchronization/internal/per_thread_sem.h:107
#5  absl::lts_20220623::Mutex::DecrementSynchSem (t=..., w=<optimized out>, mu=0x555556fd7c38) at third_party/abseil-cpp/absl/synchronization/mutex.cc:579
#6  absl::lts_20220623::CondVar::WaitCommon (this=0x555556fd7c40, mutex=0x555556fd7c38, t=...) at third_party/abseil-cpp/absl/synchronization/mutex.cc:2577
#7  0x00007ffff45221df in gpr_cv_wait (cv=0x555556fd7c40, mu=0x555556fd7c38, abs_deadline=...) at src/core/lib/gpr/sync_abseil.cc:85
#8  0x00007ffff452375b in grpc_core::internal::ThreadState::AwaitThreads (this=0x555556fd7c30) at src/core/lib/gprpp/fork.cc:145
#9  grpc_core::Fork::AwaitThreads () at src/core/lib/gprpp/fork.cc:232
#10 0x00007ffff4540016 in grpc_prefork () at src/core/lib/iomgr/fork_posix.cc:84
#11 0x00007ffff7d72a17 in ?? () from /usr/lib/libc.so.6
#12 0x00007ffff7d720fe in fork () from /usr/lib/libc.so.6
#13 0x00007ffff7fb2066 in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220812191842.128421/Python-3.9.13/Modules/_posixsubprocess.c:844

apache-beam 2.40.0

grpcio installed via

  1. git checkout v1.48.0 (in https://github.com/grpc/grpc)
  2. This diff (no idea why _grpcio_metadata.py lost its copyright notice; the diff is included in the highly unlikely event it is the culprit)
    diff --git a/src/python/grpcio/commands.py b/src/python/grpcio/commands.py
    index d93b6c7039..936e4df4db 100644
    --- a/src/python/grpcio/commands.py
    +++ b/src/python/grpcio/commands.py
    @@ -194,6 +194,7 @@ def try_cythonize(extensions, linetracing=False, mandatory=True):
                 include_dir for extension in extensions
                 for include_dir in extension.include_dirs
             ] + [CYTHON_STEM],
    +        gdb_debug=True,
             compiler_directives=cython_compiler_directives)
     
     
    diff --git a/src/python/grpcio/grpc/_grpcio_metadata.py b/src/python/grpcio/grpc/_grpcio_metadata.py
    index a1e53d7e81..d5d9fe287a 100644
    --- a/src/python/grpcio/grpc/_grpcio_metadata.py
    +++ b/src/python/grpcio/grpc/_grpcio_metadata.py
    @@ -1,17 +1 @@
    -# Copyright 2017 gRPC authors.
    -#
    -# Licensed under the Apache License, Version 2.0 (the "License");
    -# you may not use this file except in compliance with the License.
    -# You may obtain a copy of the License at
    -#
    -#     http://www.apache.org/licenses/LICENSE-2.0
    -#
    -# Unless required by applicable law or agreed to in writing, software
    -# distributed under the License is distributed on an "AS IS" BASIS,
    -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    -# See the License for the specific language governing permissions and
    -# limitations under the License.
    -
    -# AUTO-GENERATED FROM `$REPO_ROOT/templates/src/python/grpcio/grpc/_grpcio_metadata.py.template`!!!
    -
    -__version__ = """1.48.0"""
    +__version__ = """1.48.0"""
    \ No newline at end of file
  3. GRPC_PYTHON_BUILD_WITH_CYTHON=1 pip install --ignore-installed .

@joar
Contributor

joar commented Aug 30, 2022

I realize it might be too late for that stacktrace, but it might help alternative timelines of me find this issue in the future, if the timelines happen to merge. Thanks for the fix!
