[Bug]: apache_beam.runners.portability.portable_runner_test.PortableRunnerTestWithSubprocesses is flaky #22115

Closed
Abacn opened this issue Jun 30, 2022 · 7 comments
Labels
bug direct done & done Issue has been reviewed after it was closed for verification, followups, etc. flake P1 python runners

Comments

@Abacn
Contributor

Abacn commented Jun 30, 2022

What happened?

Timeout exception:

self = <apache_beam.runners.portability.portable_runner_test.PortableRunnerTestWithSubprocessesAndMultiWorkers testMethod=test_assert_that>

    def test_assert_that(self):
      # TODO: figure out a way for fn_api_runner to parse and raise the
      # underlying exception.
      with self.assertRaisesRegex(Exception, 'Failed assert'):
        with self.create_pipeline() as p:
>         assert_that(p | beam.Create(['a', 'b']), equal_to(['a']))
E         AssertionError: "Failed assert" does not match "Pipeline timed out waiting for job service subprocess."

This was previously tracked in BEAM-9118 and resolved in #12633, but it is happening again. It seems to occur more frequently when Jenkins is busy.

Issue Priority

Priority: 2

Issue Component

Component: runner-py-direct

@Abacn
Contributor Author

Abacn commented Jun 30, 2022

@ryanthompson591
Contributor

.take-issue

@ryanthompson591
Contributor

I've hopefully fixed this in #23696.

However, I had two other theories about what could be going on that I'm not sure about.

I'm going to note them here in case they turn out to be true:

  1. It's possible _pick_unused_ports is hitting a race condition, picking a port that is already in use, and timing out. Ideally this would raise a different exception.
  2. It's possible that multiple processes running on the same machine are colliding when they start subprocesses on similar ports.

I think both of these are unlikely; the most likely explanation is that these tests are simply slow (I was seeing unit tests take as long as 15 seconds on my local machine). The PR above is likely the solution.
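
For readers unfamiliar with theory 1, here is a minimal sketch of how a "pick a free port, then use it later" pattern can race. This is not Beam's actual `_pick_unused_ports` implementation; the helpers `pick_unused_port` and `try_listen` are hypothetical names used only for illustration. It assumes the picker binds to port 0, reads the assigned port, and releases it before the real server starts.

```python
import socket


def pick_unused_port():
    """Hypothetical helper: ask the OS for a free port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def try_listen(port):
    """Hypothetical helper: listen on the port, or return None if it is taken."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        s.listen(1)
        return s
    except OSError:
        s.close()
        return None


port = pick_unused_port()
# Race window: the port was released above, so anything can claim it now.
rival = try_listen(port)  # stands in for an unrelated process
late = try_listen(port)   # stands in for the server we meant to start
print("rival got the port:", rival is not None)
print("intended server got the port:", late is not None)
```

In this sketch the intended server fails to bind straight away. If a subprocess fails like that and the parent only polls for readiness, the parent would plausibly see a timeout and not a bind error, which matches the wish above for a different exception.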

@Abacn
Contributor Author

Abacn commented Oct 18, 2022

Thanks @ryanthompson591. Case 1 is likely. I observe that in most cases the time needed for setup is small, much less than 30 seconds. If it were just a performance issue, the setup times should be spread over some distribution.
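
One way to check the distribution argument is to time the setup step repeatedly and summarize the results. The following is only a sketch: `time_setup` and `summarize` are hypothetical helpers, the `setup` callable stands in for whatever starts the job service, and the 30-second default assumes that the figure mentioned above is the timeout.

```python
import statistics
import time


def time_setup(setup, runs=20):
    """Call the (hypothetical) setup function repeatedly and record durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        setup()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations, timeout=30.0):
    """Report the median, the maximum, and the share of slow runs."""
    slow = sum(d > 0.5 * timeout for d in durations)
    return {
        "median": statistics.median(durations),
        "max": max(durations),
        "slow_fraction": slow / len(durations),
    }


# Placeholder setup that does nothing; swap in the real startup call.
print(summarize(time_setup(lambda: None, runs=5)))
```

If slowness alone were the cause, one would expect a noticeable `slow_fraction` and a maximum that creeps toward the timeout. A tight cluster of short runs with occasional outright timeouts would point more toward a port collision.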

@kennknowles
Member

@ryanthompson591 are you actively working on this?

@kennknowles
Member

Haven't seen this flake in a while. Is it disabled or is it green now?

@Abacn
Contributor Author

Abacn commented Mar 23, 2023

It is green now. It has been migrated to https://ci-beam.apache.org/job/beam_PreCommit_Python_Runners_Cron/ and I checked that the test is still running. We can close this now.

@Abacn Abacn closed this as completed Mar 23, 2023
@github-actions github-actions bot added this to the 2.47.0 Release milestone Mar 23, 2023
@tvalentyn tvalentyn added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Mar 28, 2023