Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: conda-smithy v3.31 mismatches cuda_compiler & image, breaks zip_keys #1847

Closed
h-vetinari opened this issue Feb 23, 2024 · 15 comments · Fixed by #1851
Closed

BUG: conda-smithy v3.31 mismatches cuda_compiler & image, breaks zip_keys #1847

h-vetinari opened this issue Feb 23, 2024 · 15 comments · Fixed by #1851

Comments

@h-vetinari
Copy link
Member

h-vetinari commented Feb 23, 2024

I'm getting the following rerender in conda-forge/arrow-cpp-feedstock#1325, which mismatches the cuda_compiler with the image, and then predictably fails.

--- a/.azure-pipelines/azure-pipelines-linux.yml
+++ b/.azure-pipelines/azure-pipelines-linux.yml
@@ -11,7 +11,7 @@ jobs:
       linux_64_cuda_compiler_version11.2:                          # CUDA
         CONFIG: linux_64_cuda_compiler_version11.2
         UPLOAD_PACKAGES: 'True'
-        DOCKER_IMAGE: quay.io/condaforge/linux-anvil-cuda:11.2     # CUDA
+        DOCKER_IMAGE: quay.io/condaforge/linux-anvil-cos7-x86_64   # !! no CUDA !!
       linux_64_cuda_compiler_versionNone:
         CONFIG: linux_64_cuda_compiler_versionNone
         UPLOAD_PACKAGES: 'True'
--- a/.ci_support/linux_64_cuda_compiler_version11.2.yaml
+++ b/.ci_support/linux_64_cuda_compiler_version11.2.yaml           # CUDA
@@ -7,15 +7,15 @@ bzip2:
 cuda_compiler:
-- nvcc                                                             # CUDA
+- None                                                             # !! no CUDA !!
 cuda_compiler_version:
 - '11.2'                                                           # CUDA
 cuda_compiler_version_min:
@@ -23,23 +23,23 @@ cuda_compiler_version_min:
 docker_image:
-- quay.io/condaforge/linux-anvil-cuda:11.2
+- quay.io/condaforge/linux-anvil-cos7-x86_64
 gflags:
--- a/.ci_support/linux_64_cuda_compiler_versionNone.yaml
+++ b/.ci_support/linux_64_cuda_compiler_versionNone.yaml           # no CUDA
@@ -9,13 +9,13 @@ c_compiler:
 cuda_compiler:
-- None                                                             # no CUDA
+- nvcc                                                             # !! CUDA !!
 cuda_compiler_version:
 - None                                                             # no CUDA
 cuda_compiler_version_min:

In fact, for 4/4 builds on the arrow feedstock, the rerender randomly mismatches at least two jobs, seemingly by messing up the zip_keys. Not sure what's happening here or whether #1815 had a bug or uncovered another problem.

@mbargull @beckermr
CC @conda-forge/core

@beckermr
Copy link
Member

I'm away from my key board. Should be mark it broke? If so can you submit a PR?

@h-vetinari
Copy link
Member Author

It depends a bit what's happening elsewhere and how widespread the breakage is. I'm leaning towards yes, however I'm not 100% that this wouldn't completely break rerendering, because smithy does not rely on the resolver for determining the newest available conda-smithy version, but queries the data directly somehow.

I remember hitting this a while ago (#1491), where rerenders wouldn't work anymore, because smithy would see the new build (even if broken), and fail the version check unless --no-check-uptodate is specified. I'm not sure if the conda-forge-admin bot uses that argument, but if it doesn't, marking 3.31 as broken might stop rerenders happening everywhere?

@h-vetinari
Copy link
Member Author

Just tried jaxlib, where a local rerender also messed up python-vs-numpy versions. However, after deleting lines with cuda_compiler_version in the meta.yaml and rerendering, the result looked alright (aside from removing all the CUDA builds, obviously).

So at least for CUDA feedstocks, 3.31.0 seems to be pretty badly broken.

@jakirkham
Copy link
Member

I remember hitting this a while ago (#1491), where rerenders wouldn't work anymore, because smithy would see the new build (even if broken), and fail the version check unless --no-check-uptodate is specified. I'm not sure if the conda-forge-admin bot uses that argument, but if it doesn't, marking 3.31 as broken might stop rerenders happening everywhere?

The re-rendering bot disables the up-to-date check

There are probably other places that need to disable update checks

That said, we may want to rethink how that check works for this reason. Opened issue: #1848

@jakirkham
Copy link
Member

In any event, we should feel comfortable marking it broken if it is not behaving and worry about fixing fallout afterwards

@h-vetinari
Copy link
Member Author

Thank you! I opened conda-forge/admin-requests#947

@h-vetinari
Copy link
Member Author

As another corner-case that's broken in 3.31 (but probably too marginal for a separate issue?) - feedstocks that explicitly set conda_build_tool: mambabuild in conda-forge.yml (for whatever reason, example: ray), cannot even pass the CI setup because updating the conda installation runs into conda-forge/boa-feedstock#79.

@mbargull
Copy link
Member

The rendering times for arrow-cpp-feedstock with its multiple outputs is... something.
Rendering conda-forge/arrow-cpp-feedstock@fd8da32 took more than 10 minutes for me locally.

I've distilled this down to

package:
  name: test_conda_smithy_v3.31.0
  version: 1.0

build:
  skip: true  # [not (linux and x86_64) or cuda_compiler_version not in ("None", cuda_compiler_version_min)]

requirements:
  build:
    - {{ compiler("c") }}
    - {{ compiler("cuda") }}  # [cuda_compiler_version != "None"]

outputs:
  - name: test_conda_smithy_v3.31.0-output
    requirements:
      build:
        - {{ compiler("c") }}
        - {{ compiler("cuda") }}  # [cuda_compiler_version != "None"]

about: {}

which (indeterministically) fails after 85482fd .

I'll take a look later today -- if I can't figure this out right away, then we can just revert that commit for now.
But we should add the above test case in any case, of course.


Thanks for reporting this quickly and handling the mark as broken stuff!

@h-vetinari
Copy link
Member Author

h-vetinari commented Feb 23, 2024

Awesome that you reduced this, thanks for tackling this so quickly!

The rendering times for arrow-cpp-feedstock with its multiple outputs is... something.
Rendering conda-forge/arrow-cpp-feedstock@fd8da32 took more than 10 minutes for me locally.

Yeah... Local rerenders take up to 25min for me now - not fun. 😑

@beckermr
Copy link
Member

PYTHONHASHSEED=4 will cause the error for those trying this out.

@beckermr
Copy link
Member

PR #1851 appears to fix things but needs specific tests and some cleanup possibly

@mbargull
Copy link
Member

PYTHONHASHSEED=4 will cause the error for those trying this out.

Must be something magical with that seed; IIRC I also used that one for gh-1815 😆 .

PR #1851 appears to fix things but needs specific tests and some cleanup possibly

Thanks very much for tackling this!

@mbargull
Copy link
Member

FWIW, I tried your changes from gh-1851 on conda-forge/python-feedstock@73873d8 (which failed with 3.30.4 but works with 3.31.0; i.e., the reason for gh-1815) and it still works fine for that case, too 👍.

@mbargull
Copy link
Member

The rendering times for arrow-cpp-feedstock with its multiple outputs is... something.
Rendering conda-forge/arrow-cpp-feedstock@fd8da32 took more than 10 minutes for me locally.

Yeah... Local rerenders take up to 25min for me now - not fun. 😑

I've profiled this coarsely and quickly worked on some changes at the end of last month and but only got to put in conda/conda-build#5224 today. (conda-build=24.3 release processes is starting now, so that would only be available in >=24.5.)
With those changes, it is still very far from how quick it should be, but at least it should hopefully avoid >10 minute re-rendering times.
Re-rendering arrow-cpp-feedstock went down to about 3 minutes and ctng-compilers-feedstock to about 4 minutes for me -- still very bad, IMO; but I had no idea how atrociously slow these things are currently.
(Please do file bug reports for such things!)

@beckermr
Copy link
Member

The compilers runs ~30 minutes on GHA so a factor of 4 would be a big improvement no matter what. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants