GPU CI Setup #1045

Merged: 13 commits into AcademySoftwareFoundation:master from gpu_ci_setup on Sep 3, 2020

michdolan
Collaborator

Initial work for getting GPU CI up and running via AWS CodeBuild. The CodeBuild project doesn't exist yet, so this will fail currently, but the PR can be used for setup testing purposes.

I also updated all the GH Actions jobs to detect the number of available threads on the system when running the CMake build (which will be even easier with the parallel build support added in CMake 3.12). The AWS GPU instances have 32 CPU threads (if I'm reading the spec correctly), so we can also leverage that to build much more quickly. I use a 24-thread machine at home and can build OCIO in about 1 minute. Our GH Actions VMs all have 2 threads and build in about 10 minutes on Linux.
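As a rough illustration of the thread detection described above, here is a minimal Python sketch that assembles a CMake build command using one job per available CPU thread. The helper name `cmake_build_command` is hypothetical and not part of this PR, and the `-jN` argument assumes the underlying build tool is make or Ninja. On CMake 3.12 and later, `cmake --build <dir> --parallel N` should achieve the same thing without forwarding a tool-specific flag.

```python
import os


def cmake_build_command(build_dir=".", jobs=None):
    """Return an argv list that builds with one job per available CPU thread.

    Hypothetical helper for illustration only; the PR itself does this
    inside the GitHub Actions workflow steps.
    """
    if jobs is None:
        # os.cpu_count() may return None if the count cannot be determined.
        jobs = os.cpu_count() or 1
    # Everything after "--" is forwarded to the native build tool
    # (assumed here to be make or ninja, which both accept -jN).
    return ["cmake", "--build", build_dir, "--", f"-j{jobs}"]


if __name__ == "__main__":
    print(" ".join(cmake_build_command()))
```

On a 2-thread GH Actions VM this would produce a `-j2` build, and on a 32-thread GPU instance a `-j32` build, with no per-machine configuration.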

Signed-off-by: Michael Dolan <[email protected]>
@jfpanisset
Contributor

If I understand this correctly, there's a pre-created CodeBuild project called OpenColorIO_GPU_CI which specifies the "environment type" as LINUX_GPU_CONTAINER as per https://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref-compute-types.html and possibly points to one of the aswf-docker containers?

The aws command line tool has an aws codebuild create-project command, so it might be possible to script the creation and deletion of the OpenColorIO_GPU_CI project on demand instead of relying on a pre-created project with configuration that lives outside of the repo?

@michdolan
Collaborator Author

michdolan commented Jun 25, 2020

If I understand this correctly, there's a pre-created CodeBuild project called OpenColorIO_GPU_CI which specifies the "environment type" as LINUX_GPU_CONTAINER as per https://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref-compute-types.html and possibly points to one of the aswf-docker containers?

Yes, that describes the intended current approach.

The aws command line tool has an aws codebuild create-project command, so it might be possible to script the creation and deletion of the OpenColorIO_GPU_CI project on demand instead of relying on a pre-created project with configuration that lives outside of the repo?

Good point. @tykeal, any thoughts on @jfpanisset's suggested workflow?

buildspec.yml Outdated
-DOCIO_BUILD_DOCS=OFF \
-DOCIO_BUILD_TESTS=ON \
-DOCIO_BUILD_GPU_TESTS=ON \
-DOCIO_BUILD_PYTHON=ON \
Member

This file is dedicated to the AWS GPU build, so the Python build and tests could be disabled.

@hodoulp
Member

hodoulp commented Jun 25, 2020

[...] I also updated all the GH Actions jobs to detect available threads [...]

That's great.

@tykeal

tykeal commented Jun 25, 2020

If I understand this correctly, there's a pre-created CodeBuild project called OpenColorIO_GPU_CI which specifies the "environment type" as LINUX_GPU_CONTAINER as per https://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref-compute-types.html and possibly points to one of the aswf-docker containers?

The aws command line tool has an aws codebuild create-project command, so it might be possible to script the creation and deletion of the OpenColorIO_GPU_CI project on demand instead of relying on a pre-created project with configuration that lives outside of the repo?

We haven't created the CodeBuild project yet. We'll need several different parameters while defining it. As for using aws codebuild create-project, the IAM that was provided for CodeBuild jobs does not have the rights to create or destroy CodeBuild projects. It has the barest minimum rights needed to run jobs. Specifically, it has the following rights:

  • codebuild:StartBuild
  • codebuild:BatchGetBuilds
  • logs:GetLogEvents
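To make the three rights listed above concrete, here is a minimal Python sketch that builds an IAM policy document granting only those actions. The actual policy attached to the OCIO credentials is not shown in this thread, so the single-statement layout and the `"*"` resource are assumptions for illustration. A real policy would most likely scope `Resource` to the specific CodeBuild project and log group ARNs.

```python
import json

# The three actions named in the comment above.
CI_ACTIONS = [
    "codebuild:StartBuild",
    "codebuild:BatchGetBuilds",
    "logs:GetLogEvents",
]


def minimal_ci_policy(resource="*"):
    """Build a least-privilege policy document for running CodeBuild jobs.

    Hypothetical helper: the resource default of "*" is a placeholder,
    not the value used for the OCIO project.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(CI_ACTIONS),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(minimal_ci_policy(), indent=2))
```

Note that nothing here allows `codebuild:CreateProject` or `codebuild:DeleteProject`, which is why on-demand project creation would require the rights to be extended first.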

Take a look at https://docs.aws.amazon.com/codebuild/latest/userguide/create-project.html#create-project-cli to see all the configuration that is in the CLI template for setting one of these up!

If y'all feel that we should allow creation and destruction of the CodeBuild projects on the fly, I can ask that the IAM be updated to include the needed rights. The issue I see with that is that if multiple PRs end up triggering the build at the same time while we're doing a create / destroy, things are rather likely to fail somewhere.

@michdolan
Collaborator Author

I'm OK with using one pre-built CodeBuild project to start. We can always make it more dynamic in the future if needed. @tykeal, do you know if CodeBuild supports a dynamic docker container path? If we could specify the docker tag, I don't anticipate there being a lot else to configure. If that's not possible for now, it could be something to investigate in the future. What else do you need to get the CodeBuild project set up? Happy to provide that info.

Signed-off-by: Michael Dolan <[email protected]>
@tykeal

tykeal commented Jul 13, 2020

@michdolan I need to know what container image you need to use as it needs to be specified. I can use any container in the Amazon ECR or a different registry. I'm assuming you want to use the images created by @aloysbaillet but I need to know the container coordinates to be able to get the basic project up.

@michdolan
Collaborator Author

Do you know if there is a way to pass the container to CodeBuild via the buildspec.yml file? That would be an ideal setup if it were supported, long term at least.

@tykeal

tykeal commented Jul 14, 2020

I'm not aware of a way to do it in the buildspec. I'm not using the sha itself, just aswf/ci-ocio:2020; from my reading of the docs, that should be all I need. It should do the standard container thing and follow the label as it moves.

@jfpanisset
Contributor

As per other discussions, adding:

-DOCIO_USE_HEADLESS

to the build options in buildspec.yml and merging the fixes for CMake GLEW detection from PR #1112 should allow the GPU code to build and run on CodeBuild.

@michdolan
Collaborator Author

@jfpanisset I ran a build on gpu_ci_test branch running in headless mode: https://github.com/AcademySoftwareFoundation/OpenColorIO/runs/1019879003?check_suite_focus=true

GPU tests error with:
EGL could not be initialized.

@jfpanisset
Contributor

That's disappointing. A couple of differences I can see with the tests I ran:

  • I was building directly from CodeBuild, pointing to my fork of OCIO: https://github.com/jfpanisset/OpenColorIO rather than (correctly) going through GitHub Actions like you are doing. More specifically I had explicitly set the environment type to LINUX_GPU_CONTAINER so it would be good to verify that this is indeed the case for the CodeBuild environment created for OCIO.
  • It looks like the version of buildspec.yml I used is based on an older version of your branch, and specifically I don't set the DISPLAY environment variable:
env:
  variables:
    CXX: g++
    CC: gcc
  exported-variables:
    - CXX
    - CC

whereas if I'm looking at the right version:

https://github.com/AcademySoftwareFoundation/OpenColorIO/blob/07d9f9c9b76629f0d80222e7a394f9235427ff47/buildspec.yml

it seems you may be setting DISPLAY=:0

env:
  variables:
    CXX: g++
    CC: gcc
    DISPLAY: ':0'
  exported-variables:
    - CXX
    - CC
    - DISPLAY

https://www.khronos.org/registry/EGL/extensions/EXT/EGL_EXT_platform_x11.txt

says:

To obtain an EGLDisplay backed by an X11 screen, call
eglGetPlatformDisplayEXT with <platform> set to EGL_PLATFORM_X11_EXT. The
<native_display> parameter specifies the X11 display connection to use, and
must point to a valid X11 Display or be NULL.

My guess would be that removing the DISPLAY environment variable should prevent EGL from trying to "obtain an EGLDisplay backed by an X11 screen" and should hopefully allow EGL to work without an X11 server present.
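The guess above can be sketched in Python as follows: build a copy of the process environment with `DISPLAY` removed and hand it to whatever launches the GPU tests. The helper name `headless_env` is hypothetical. Whether EGL then initializes still depends on the driver and container setup, so this is an illustration of the idea rather than a guaranteed fix.

```python
import os


def headless_env(base_env=None):
    """Return a copy of the environment without the DISPLAY variable.

    Hypothetical helper: with DISPLAY unset, the expectation is that EGL
    will not try to open a connection to an X11 server.
    """
    env = dict(os.environ if base_env is None else base_env)
    # pop() with a default is a no-op when DISPLAY was never set.
    env.pop("DISPLAY", None)
    return env


if __name__ == "__main__":
    cleaned = headless_env({"CXX": "g++", "CC": "gcc", "DISPLAY": ":0"})
    print(sorted(cleaned))
```

A test runner could then be started with something like `subprocess.run(["ctest", "--output-on-failure"], env=headless_env())`. In buildspec.yml the equivalent is simply leaving `DISPLAY` out of both `variables` and `exported-variables`, as in the first snippet above.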

@tykeal

tykeal commented Aug 24, 2020

I can definitively state that the CodeBuild environment type for OCIO is LINUX_GPU_CONTAINER. It's the only option for GPU when setting up a CodeBuild environment.

@michdolan
Collaborator Author

Good catch on DISPLAY, @jfpanisset. I left that in by mistake after some earlier experimentation. I'll try removing it to restore the NULL value.

@michdolan
Collaborator Author

It worked! Removing DISPLAY solved the EGL issue. Thanks again for catching that JF.

https://github.com/AcademySoftwareFoundation/OpenColorIO/runs/1024540874?check_suite_focus=true

GPU CI will need to be run as part of a nightly build due to permissions, but the tests ran successfully and passed.

@jfpanisset
Contributor

That's great news indeed. It turns out that the DISPLAY environment variable and its interaction with EGL were discussed in the original PR #1047 that added EGL support. It would probably be worth documenting what's necessary to get a working GPU CI setup for OCIO. I'll try to capture some of this for the ASWF Sample Project.

@michdolan michdolan marked this pull request as ready for review August 25, 2020 02:55
@michdolan
Collaborator Author

michdolan commented Aug 25, 2020

Instead of doing a nightly build, I have the GPU CI job running on every commit to any OpenColorIO (v2) branch (this does not work with pull request CI jobs that originate from forks). That should keep our AWS usage minimal while getting continuous validation following merge. Hopefully we can find a solution to get this working in PRs in the future.
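The trigger rule described above can be expressed as a small predicate. This is only a sketch: the function name `should_run_gpu_ci` is hypothetical, and the PR itself configures this in the GitHub Actions workflow rather than in a script. It relies on the standard `GITHUB_EVENT_NAME` and `GITHUB_REPOSITORY` environment variables that GitHub Actions sets for each run, and it takes the upstream repository name from the URLs earlier in this thread.

```python
import os

# Upstream repository, as it appears in the URLs in this thread.
UPSTREAM_REPO = "AcademySoftwareFoundation/OpenColorIO"


def should_run_gpu_ci(event_name, repository, upstream=UPSTREAM_REPO):
    """Run the GPU job only for pushes to branches of the upstream repo.

    Pull request events are skipped, which covers PRs from forks, where
    the credentials needed to start a CodeBuild job are presumably not
    available to the workflow.
    """
    return event_name == "push" and repository == upstream


if __name__ == "__main__":
    event = os.environ.get("GITHUB_EVENT_NAME", "")
    repo = os.environ.get("GITHUB_REPOSITORY", "")
    print("run GPU CI" if should_run_gpu_ci(event, repo) else "skip GPU CI")
```

Restricting the job to push events also keeps AWS usage proportional to merges rather than to every PR update.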

Signed-off-by: Michael Dolan <[email protected]>
@hodoulp hodoulp merged commit 27599db into AcademySoftwareFoundation:master Sep 3, 2020
@michdolan michdolan deleted the gpu_ci_setup branch October 21, 2020 12:43