
[Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit #25966

Closed
1 of 15 tasks
Abacn opened this issue Mar 24, 2023 · 14 comments · Fixed by #25970
Labels
done & done (Issue has been reviewed after it was closed for verification, followups, etc.), P2, python, task

Comments

@Abacn
Contributor

Abacn commented Mar 24, 2023

What needs to happen?

Currently, the Python PostCommit suites use --sdk_location to upload the Python SDK under test to Dataflow as a source tarball. It turns out that building the wheel from that tarball on the worker is very slow:

2023/03/24 15:31:16 Executing: /usr/local/bin/pip install --disable-pip-version-check /var/opt/google/staged/dataflow_python_sdk.tar[gcp]
2023/03/24 15:31:16 Processing /var/opt/google/staged/dataflow_python_sdk.tar
2023/03/24 15:31:17 Preparing metadata (setup.py): started
2023/03/24 15:31:35 Preparing metadata (setup.py): finished with status 'done'
2023/03/24 15:31:37 Building wheels for collected packages: apache-beam
2023/03/24 15:31:37 Building wheel for apache-beam (setup.py): started
2023/03/24 15:32:42 Building wheel for apache-beam (setup.py): still running...
2023/03/24 15:33:44 Building wheel for apache-beam (setup.py): still running...
2023/03/24 15:34:49 Building wheel for apache-beam (setup.py): still running...
2023/03/24 15:35:04 Building wheel for apache-beam (setup.py): finished with status 'done'
2023/03/24 15:35:04 Successfully built apache-beam
2023/03/24 15:35:04 Installing collected packages: apache-beam
2023/03/24 15:35:07 Successfully installed apache-beam-2.47.0.dev0

It takes about 4 minutes to install Apache Beam from source, of which three and a half minutes are spent building the wheel. Whenever possible, we should build the wheel locally instead (i.e. when the host machine can generate a manylinux wheel, which is the case on Jenkins). This could cut the running time of the Python PostCommit suites on Dataflow roughly in half.
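If the wheel is built on the host, the build tooling would then need to select the artifact matching the worker interpreter before passing it via --sdk_location. A minimal, hypothetical sketch of that selection step (illustrative only, not actual Beam build code; the function name and file names are made up):

```python
from typing import List, Optional

def pick_sdk_wheel(dist_files: List[str], py_tag: str) -> Optional[str]:
    """Return the first manylinux wheel whose tag matches py_tag (e.g. 'cp38')."""
    for name in dist_files:
        if name.endswith(".whl") and py_tag in name and "manylinux" in name:
            return name
    return None  # no matching wheel; caller would fall back to the tarball

files = [
    "apache_beam-2.47.0.dev0.tar.gz",
    "apache_beam-2.47.0.dev0-cp38-cp38-manylinux2014_x86_64.whl",
]
print(pick_sdk_wheel(files, "cp38"))
# apache_beam-2.47.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
```

The path returned by such a helper would be what gets handed to the test pipeline as --sdk_location, so the worker installs a prebuilt wheel rather than compiling the tarball.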

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@AnandInguva
Contributor

@Abacn are you working on this?

@Abacn
Contributor Author

Abacn commented Mar 24, 2023

Hi @AnandInguva, do you think this is a reasonable request? If so, I would like to work on this. Also CC: @tvalentyn

@tvalentyn
Contributor

I think replicating 81af13c in Dataflow-owned containers would reduce the time to build a wheel to 20 seconds. (installation from a wheel might still be much faster, perhaps around 3 seconds).

Note: We are also planning to switch Dataflow Python to use Beam-provided containers next quarter.

@tvalentyn
Contributor

tvalentyn commented Mar 24, 2023

Thanks Yi for the suggestion. Building a wheel for a Dataflow integration test suite that builds the SDK once and runs a suite of pipelines makes sense and can save some time, although the suggestion above is a shorter path to most of the gain. We should have patched that commit earlier.

Care should be taken to build the wheel for the correct target platform and the correct Python version. If the build happens on Jenkins or a GitHub Action, the build environment is somewhat predetermined. However, forcing users to build the wheel locally every time they run a Gradle task may add some friction (it needs extra dependencies, increases build time, and may not work well on Macs, or may need different dependencies there).

Note there are also slight differences in the wheel naming pattern between py37 and py38.
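For context on that naming difference: CPython 3.8 dropped the "m" (pymalloc) suffix from the ABI tag, so py37 and py38 wheels are named differently. A small sketch deriving the tag (a hypothetical helper, not part of any packaging library):

```python
def cpython_abi_tag(major: int, minor: int) -> str:
    # CPython <= 3.7 appends "m" (the pymalloc flag) to the ABI tag;
    # 3.8+ dropped it, which changes the wheel filename pattern.
    flag = "m" if (major, minor) <= (3, 7) else ""
    return f"cp{major}{minor}-cp{major}{minor}{flag}"

print(cpython_abi_tag(3, 7))  # cp37-cp37m
print(cpython_abi_tag(3, 8))  # cp38-cp38
```

So a py37 wheel looks like apache_beam-...-cp37-cp37m-manylinux...whl while a py38 wheel looks like apache_beam-...-cp38-cp38-manylinux...whl, and any filename matching needs to account for both patterns.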

@Abacn
Contributor Author

Abacn commented Mar 27, 2023

Thanks @tvalentyn. Given the planned switch to Beam-provided containers, there is no need to change the Dataflow-owned containers at this moment. We could come back when the switch is done to see whether this task is still worthwhile.

@Abacn
Contributor Author

Abacn commented Mar 27, 2023

Using the Beam-provided container image (--sdk_container_image=gcr.io/apache-beam-testing/beam-sdk/beam_python3.10_sdk:latest), re-installation of the SDK is somewhat faster,
but it still takes 2 min 30 s (1 minute faster than the default, Dataflow-provided image). Building the wheel takes 2 min 20 s.

jobId: 2023-03-27_14_11_55-1971041453326172975

2023-03-27 17:15:14.903 EDT 2023/03/27 21:15:14 Found artifact: dataflow_python_sdk.tar
2023-03-27 17:15:14.903 EDT 2023/03/27 21:15:14 Installing setup packages ...
2023-03-27 17:15:15.444 EDT Processing /var/opt/google/staged/dataflow_python_sdk.tar
2023-03-27 17:15:15.937 EDT Preparing metadata (setup.py): started
2023-03-27 17:15:19.149 EDT Preparing metadata (setup.py): finished with status 'done'
...
2023-03-27 17:15:20.042 EDT Building wheels for collected packages: apache-beam
2023-03-27 17:15:20.043 EDT Building wheel for apache-beam (setup.py): started
2023-03-27 17:16:40.346 EDT Building wheel for apache-beam (setup.py): still running...
2023-03-27 17:17:39.535 EDT Building wheel for apache-beam (setup.py): finished with status 'done'
...
2023-03-27 17:17:39.559 EDT Successfully built apache-beam
2023-03-27 17:17:40.178 EDT Installing collected packages: apache-beam
...
2023-03-27 17:17:42.247 EDT Successfully installed apache-beam-2.47.0.dev0

Based on this experiment, #25970 could still save 2.5 minutes per test even for the Beam-provided image (which uses ccache).

@tvalentyn
Contributor

tvalentyn commented Mar 28, 2023

I see, thanks for the correction. I also did my own tests before replying above; they looked like the following:

docker run --rm -it --entrypoint=/bin/bash apache/beam_python3.7_sdk:2.45.0
root@577f14daaa3f:/# pip uninstall apache-beam
root@577f14daaa3f:/# wget https://files.pythonhosted.org/packages/09/07/a8cef9d9193a65f7d7a35d72b46c97cc3684eea7b7728a89f5accbb5f297/apache-beam-2.45.0.zip
root@577f14daaa3f:/# time pip install ./apache-beam-2.45.0.zip 

Successfully installed apache-beam-2.45.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: pip install --upgrade pip

real	0m13.258s
user	0m11.930s
sys	0m2.265s

@tvalentyn
Contributor

tvalentyn commented Mar 28, 2023

I am not sure where the discrepancy comes from.

Repeating the same with docker run --rm -it --entrypoint=/bin/bash gcr.io/cloud-dataflow/v1beta3/python310:beam-master-20230322

shows a much slower installation:

real	2m43.945s
user	2m36.349s
sys	0m7.157s

@tvalentyn
Contributor

I can't reproduce the fast behavior on gcr.io/apache-beam-testing/beam-sdk/beam_python3.10_sdk:latest.

@tvalentyn
Contributor

on my machine it took:

real 1m24.951s
user 1m19.878s
sys 0m5.468s

@tvalentyn
Contributor

which, as you mention, is still faster than 3 min.

@tvalentyn
Contributor

Subsequent uninstallation-and-reinstallation is faster. A possible explanation: the cache contents matter, and when there are recent changes in the cythonized codepath, installation is slower.
Regardless, I am open to the idea of installing wheels in tests, with the caveats mentioned in #25966 (comment).
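The cache behavior described above can be illustrated abstractly: a compile cache keyed on source contents (the way ccache keys on preprocessed source) only saves time when the cythonized sources are unchanged. A toy model of that behavior, not ccache itself:

```python
import hashlib

class CompileCache:
    """Toy model of a ccache-style cache: recompile only when the source changes."""
    def __init__(self):
        self._store = {}
        self.misses = 0

    def compile(self, source: str) -> str:
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self._store:
            self.misses += 1          # cache miss: the real (slow) compilation
            self._store[key] = f"obj({key[:8]})"
        return self._store[key]       # cache hit: fast

cache = CompileCache()
cache.compile("cythonized module v1")   # miss: slow first build
cache.compile("cythonized module v1")   # hit: fast rebuild
cache.compile("cythonized module v2")   # source changed: slow again
print(cache.misses)  # 2
```

This matches the observation: a rebuild right after a build is fast, but a build shortly after changes land in the cythonized codepath pays the full compilation cost again.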

@Abacn
Contributor Author

Abacn commented Mar 28, 2023

Hi @tvalentyn, thanks for sharing the experiments. My experiments in #25966 (comment) checked out the latest master, installed it into a local virtual env, and also packaged it into a tarball, so the difference between the --sdk_location source and the :latest image was minimal (<6 h apart). And afaik there was no Cython change today. For some reason the Dataflow VM builds wheels notably slower than a local machine; maybe its smaller number of cores (2) matters?

@tvalentyn
Contributor

Yes, likely the number of cores matters. Also, if the sibling_sdk_worker experiment is not enabled for the project, performance would be worse, since compilation would happen in each container separately. That experiment is in the process of being rolled out at the moment.

@github-actions bot added this to the 2.48.0 Release milestone Apr 6, 2023
@damccorm added the done & done label Apr 11, 2023