
[Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit #25966

Closed
1 of 15 tasks
Abacn opened this issue Mar 24, 2023 · 14 comments · Fixed by #25970
Labels
done & done (Issue has been reviewed after it was closed for verification, followups, etc.), P2, python, task

Comments

@Abacn
Contributor

Abacn commented Mar 24, 2023

What needs to happen?

Currently, the Python PostCommit suites use --sdk_location to upload the Python SDK under test to Dataflow as a source tarball. It turns out that building the wheel from that tarball on the worker is very slow:

2023/03/24 15:31:16 Executing: /usr/local/bin/pip install --disable-pip-version-check /var/opt/google/staged/dataflow_python_sdk.tar[gcp]
2023/03/24 15:31:16 Processing /var/opt/google/staged/dataflow_python_sdk.tar
2023/03/24 15:31:17 Preparing metadata (setup.py): started
2023/03/24 15:31:35 Preparing metadata (setup.py): finished with status 'done'
2023/03/24 15:31:37 Building wheels for collected packages: apache-beam
2023/03/24 15:31:37 Building wheel for apache-beam (setup.py): started
2023/03/24 15:32:42 Building wheel for apache-beam (setup.py): still running...
2023/03/24 15:33:44 Building wheel for apache-beam (setup.py): still running...
2023/03/24 15:34:49 Building wheel for apache-beam (setup.py): still running...
2023/03/24 15:35:04 Building wheel for apache-beam (setup.py): finished with status 'done'
2023/03/24 15:35:04 Successfully built apache-beam
2023/03/24 15:35:04 Installing collected packages: apache-beam
2023/03/24 15:35:07 Successfully installed apache-beam-2.47.0.dev0

It takes about 4 minutes to install Apache Beam from source, of which three and a half minutes are spent building the wheel. Whenever possible, we should build the wheel locally instead (i.e. when the host machine can generate a manylinux wheel, which is the case on Jenkins). This could cut the running time of the Python PostCommit suites on Dataflow roughly in half.
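If the wheel is built on the host, the build tooling would then need to select the artifact matching the worker interpreter before passing it via --sdk_location. A minimal, hypothetical sketch of that selection step (illustrative only, not actual Beam build code; the function name and file names are made up):

```python
from typing import List, Optional

def pick_sdk_wheel(dist_files: List[str], py_tag: str) -> Optional[str]:
    """Return the first manylinux wheel whose tag matches py_tag (e.g. 'cp38')."""
    for name in dist_files:
        if name.endswith(".whl") and py_tag in name and "manylinux" in name:
            return name
    return None  # no matching wheel; caller would fall back to the tarball

files = [
    "apache_beam-2.47.0.dev0.tar.gz",
    "apache_beam-2.47.0.dev0-cp38-cp38-manylinux2014_x86_64.whl",
]
print(pick_sdk_wheel(files, "cp38"))
# apache_beam-2.47.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
```

The path returned by such a helper would be what gets handed to the test pipeline as --sdk_location, so the worker installs a prebuilt wheel rather than compiling the tarball.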

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@AnandInguva
Contributor

@Abacn are you working on this?

@Abacn
Contributor Author

Abacn commented Mar 24, 2023

Hi @AnandInguva, do you think this is a reasonable request? If so, I would like to work on this. Also CC: @tvalentyn

@tvalentyn
Contributor

I think replicating 81af13c in Dataflow-owned containers would reduce the time to build a wheel to 20 seconds. (installation from a wheel might still be much faster, perhaps around 3 seconds).

Note: We are also planning to switch Dataflow Python to use Beam-provided containers next quarter.

@tvalentyn
Contributor

tvalentyn commented Mar 24, 2023

Thanks Yi for the suggestion. Building a wheel for a Dataflow integration test suite that builds the SDK once and runs a suite of pipelines makes sense and can save some time, although the suggestion above is a shorter path to most of the gain. We should have patched that commit earlier.

Care should be taken to build the wheel for the correct target platform and the correct Python version. If the build happens on Jenkins or a GitHub Action, the build environment is somewhat predetermined. However, forcing users to build the wheel locally every time they run a Gradle task may add some friction (it needs extra dependencies, increases build time, and may not work well on Macs, or may need different dependencies there).

Note there are also slight differences in the wheel naming pattern between py37 and py38.
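For context on that naming difference: CPython 3.8 dropped the "m" (pymalloc) suffix from the ABI tag, so py37 and py38 wheels are named differently. A small sketch deriving the tag (a hypothetical helper, not part of any packaging library):

```python
def cpython_abi_tag(major: int, minor: int) -> str:
    # CPython <= 3.7 appends "m" (the pymalloc flag) to the ABI tag;
    # 3.8+ dropped it, which changes the wheel filename pattern.
    flag = "m" if (major, minor) <= (3, 7) else ""
    return f"cp{major}{minor}-cp{major}{minor}{flag}"

print(cpython_abi_tag(3, 7))  # cp37-cp37m
print(cpython_abi_tag(3, 8))  # cp38-cp38
```

So a py37 wheel looks like apache_beam-...-cp37-cp37m-manylinux...whl while a py38 wheel looks like apache_beam-...-cp38-cp38-manylinux...whl, and any filename matching needs to account for both patterns.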

@Abacn
Contributor Author

Abacn commented Mar 27, 2023

Thanks @tvalentyn. Given the planned switch to Beam-provided containers, there is no need to change the Dataflow-owned containers at this moment. We could come back when the switch is done to see whether this task is still worthwhile.

@Abacn
Contributor Author

Abacn commented Mar 27, 2023

Using the Beam-provided container image (--sdk_container_image=gcr.io/apache-beam-testing/beam-sdk/beam_python3.10_sdk:latest), re-installation of the SDK is somewhat faster,
but it still takes 2 min 30 s (1 minute faster than the default, Dataflow-provided image). Building the wheel takes 2 min 20 s.

jobId: 2023-03-27_14_11_55-1971041453326172975

2023-03-27 17:15:14.903 EDT 2023/03/27 21:15:14 Found artifact: dataflow_python_sdk.tar
2023-03-27 17:15:14.903 EDT 2023/03/27 21:15:14 Installing setup packages ...
2023-03-27 17:15:15.444 EDT Processing /var/opt/google/staged/dataflow_python_sdk.tar
2023-03-27 17:15:15.937 EDT Preparing metadata (setup.py): started
2023-03-27 17:15:19.149 EDT Preparing metadata (setup.py): finished with status 'done'
...
2023-03-27 17:15:20.042 EDT Building wheels for collected packages: apache-beam
2023-03-27 17:15:20.043 EDT Building wheel for apache-beam (setup.py): started
2023-03-27 17:16:40.346 EDT Building wheel for apache-beam (setup.py): still running...
2023-03-27 17:17:39.535 EDT Building wheel for apache-beam (setup.py): finished with status 'done'
...
2023-03-27 17:17:39.559 EDT Successfully built apache-beam
2023-03-27 17:17:40.178 EDT Installing collected packages: apache-beam
...
2023-03-27 17:17:42.247 EDT Successfully installed apache-beam-2.47.0.dev0

Based on this experiment, #25970 could still save 2.5 minutes per test even for the Beam-provided image (which uses ccache).

@tvalentyn
Contributor

tvalentyn commented Mar 28, 2023

I see, thanks for the correction. I also did my own tests before replying above; they looked like the following:

docker run --rm -it --entrypoint=/bin/bash apache/beam_python3.7_sdk:2.45.0
root@577f14daaa3f:/# pip uninstall apache-beam
root@577f14daaa3f:/# wget https://files.pythonhosted.org/packages/09/07/a8cef9d9193a65f7d7a35d72b46c97cc3684eea7b7728a89f5accbb5f297/apache-beam-2.45.0.zip
root@577f14daaa3f:/# time pip install ./apache-beam-2.45.0.zip 

Successfully installed apache-beam-2.45.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: pip install --upgrade pip

real	0m13.258s
user	0m11.930s
sys	0m2.265s

@tvalentyn
Contributor

tvalentyn commented Mar 28, 2023

I am not sure where the discrepancy comes from.

Repeating the same with docker run --rm -it --entrypoint=/bin/bash gcr.io/cloud-dataflow/v1beta3/python310:beam-master-20230322

shows a much slower installation:

real	2m43.945s
user	2m36.349s
sys	0m7.157s

@tvalentyn
Contributor

I can't reproduce the fast behavior on gcr.io/apache-beam-testing/beam-sdk/beam_python3.10_sdk:latest.

@tvalentyn
Contributor

on my machine it took:

real 1m24.951s
user 1m19.878s
sys 0m5.468s

@tvalentyn
Contributor

which, as you mention, is still faster than 3 min.

@tvalentyn
Contributor

Subsequent uninstallation-and-reinstallation is faster. A possible explanation: the cache contents matter, and when there are recent changes in the cythonized codepath, installation is slower.
Regardless, I am open to the idea of installing wheels in tests, with the caveats mentioned in #25966 (comment).
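The cache behavior described above can be illustrated abstractly: a compile cache keyed on source contents (the way ccache keys on preprocessed source) only saves time when the cythonized sources are unchanged. A toy model of that behavior, not ccache itself:

```python
import hashlib

class CompileCache:
    """Toy model of a ccache-style cache: recompile only when the source changes."""
    def __init__(self):
        self._store = {}
        self.misses = 0

    def compile(self, source: str) -> str:
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self._store:
            self.misses += 1          # cache miss: the real (slow) compilation
            self._store[key] = f"obj({key[:8]})"
        return self._store[key]       # cache hit: fast

cache = CompileCache()
cache.compile("cythonized module v1")   # miss: slow first build
cache.compile("cythonized module v1")   # hit: fast rebuild
cache.compile("cythonized module v2")   # source changed: slow again
print(cache.misses)  # 2
```

This matches the observation: a rebuild right after a build is fast, but a build shortly after changes land in the cythonized codepath pays the full compilation cost again.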

@Abacn
Contributor Author

Abacn commented Mar 28, 2023

Hi @tvalentyn, thanks for sharing the experiments. My experiments in #25966 (comment) checked out the latest master, installed it into a local virtual env, and also packaged it into a tarball, so the difference between the --sdk_location source and the :latest image was minimal (<6 h apart). And afaik there was no Cython change today. For some reason the Dataflow VM builds wheels notably slower than a local machine; maybe its smaller number of cores (2) matters?

@tvalentyn
Contributor

Yes, likely the number of cores matters. Also, if the sibling_sdk_worker experiment is not enabled for the project, performance would be worse, since compilation would happen in each container separately. That experiment is in the process of being rolled out at the moment.

@github-actions bot added this to the 2.48.0 Release milestone Apr 6, 2023
@damccorm added the done & done label Apr 11, 2023