[Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit #25966
Comments
@Abacn are you working on this?
Hi @AnandInguva, do you think this is a reasonable request? If so, I would like to work on this. Also CC: @tvalentyn
I think replicating 81af13c in Dataflow-owned containers would reduce the time to build a wheel to 20 seconds. (Installation from a wheel might still be much faster, perhaps around 3 seconds.) Note: We are also planning to switch Dataflow Python to use Beam-provided containers next quarter.
Thanks Yi for the suggestion. Building a wheel for a Dataflow integration test suite that builds the SDK once and runs a suite of pipelines makes sense and can save some time, although the above suggestion is a shorter path to achieve more of the gain. We should have patched that commit earlier. Care should be taken to build the wheel for the correct target platform and the correct Python version. If the build happens on Jenkins/GH Actions, the build environment is somewhat predetermined. However, forcing users to build the wheel locally every time they run a Gradle task may add some friction (needs extra dependencies, increases the build time, will it work well on Macs or are different deps needed?). Note there are also slight differences in the wheel naming pattern between py37 and py38.
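For illustration, a minimal sketch of building the SDK wheel from the sdks/python directory of a Beam checkout, and of the py37/py38 naming difference mentioned above; the version and platform tags are placeholders, and the actual filenames depend on the build environment:

```sh
# Build a wheel for the current interpreter (output directory is illustrative).
pip wheel --no-deps -w dist/ .

# The ABI tag differs between Python versions: CPython 3.7 wheels carry the
# "m" (pymalloc) suffix, CPython 3.8+ wheels do not, e.g.:
#   apache_beam-2.46.0-cp37-cp37m-manylinux2014_x86_64.whl
#   apache_beam-2.46.0-cp38-cp38-manylinux2014_x86_64.whl
```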
Thanks @tvalentyn. Given that, there is no need to change Dataflow-owned containers at this moment; we can come back when the switch is done to see whether this task still has value.
Using the Beam-provided container image (jobId: ), based on this experiment #25970 could still save 2.5 minutes per test for the Beam-provided image (which uses ccache).
I see, thanks for the correction. I also did my own tests before replying above; they looked like the following:
I am not sure where the discrepancy comes from. Repeating the same shows a much slower installation:
I don't reproduce the fast behavior on
On my machine it took: real 1m24.951s
which, as you mention, is still faster than 3 min.
Subsequent uninstallation-and-installation is faster. A possible explanation: cache contents matter, and when there are recent changes in the cythonized codepath, installation is slower.
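A minimal sketch of how that cache effect could be checked, assuming ccache is installed; the compiler wrapping and repetition are illustrative, not the exact setup used in these experiments:

```sh
# Route C/C++ compilation of the Cython extensions through ccache.
export CC="ccache gcc"
export CXX="ccache g++"

# First source install: compiles the extensions and populates the cache.
time pip install .

# Reinstall with a warm cache and no changes to the cythonized codepath;
# this run should be noticeably faster.
pip uninstall -y apache-beam
time pip install .
```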
Hi @tvalentyn, thanks for sharing the experiments. My experiments done in #25966 (comment) checked out the latest master, installed it in a local virtual env, and also packaged it into a tarball. So the difference between the
Yes, the number of cores likely matters. Also, if the sibling_sdk_worker experiment is not enabled for the project, performance would be worse, since compilation would happen in each container separately. This experiment is in the process of being rolled out at the moment.
What needs to happen?
Currently, the Python PostCommit suites use --sdk_location to upload the Python SDK under test to Dataflow. Building the wheel for the SDK is very slow: it takes 4 minutes to install Apache Beam from source, of which 3.5 minutes is spent building the wheel. Instead, the wheel should be built locally whenever possible (whenever the host machine can generate a manylinux wheel, which is the case on Jenkins). This would cut the running time of the Python PostCommit on Dataflow roughly in half.
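A hedged sketch of the change proposed here: build the wheel once on the host and point --sdk_location at it instead of at the source tarball. The pipeline script, Beam version, and wheel filename below are placeholders:

```sh
# Current behavior: a source tarball is staged, and the wheel is built from
# source during installation on the Dataflow side.
python setup.py sdist
python my_pipeline.py --runner=DataflowRunner \
  --sdk_location=dist/apache-beam-2.46.0.tar.gz

# Proposed: build the wheel once on the host (a manylinux wheel on Jenkins)
# and stage it, so only a prebuilt wheel needs to be installed.
pip wheel --no-deps -w dist/ .
python my_pipeline.py --runner=DataflowRunner \
  --sdk_location=dist/apache_beam-2.46.0-cp38-cp38-manylinux2014_x86_64.whl
```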
Issue Priority
Priority: 2 (default / most normal work should be filed as P2)
Issue Components