Move image downloads into resource downloader containers #564
Conversation
@@ -69,7 +69,6 @@ sudo yum -y install \
ansible \
bind-utils \
jq \
libguestfs-tools \
Is this from something else you were doing? Not needed here, is it?
It's removing the requirement as we're no longer doing qemu-img convert, I think?
We used to need it for some of the image customisation we did, but it is no longer needed. Actually I think we might have removed the requirement when we stopped calling virt-filesystems or something.
Nice, thanks for looking into this! :) So I guess we could then convert the local bind-mount to a persistent volume in the cluster-deployed Ironic case, and provided the bootstrap VM is up we can download from there (although we'd need some way to prevent bootstrap-complete happening or the installer could delete the VM :) cc @imain
Yes this could work. Currently I'm copying them off the host into the shared volume in the pod... this is probably better and could be used to get it off the internet too, right? Is there a timing issue here with the images arriving vs when ironic is ready to do work? I've seen ironic fail to boot a node because of a partially transferred image.
A couple of things need to be developed here if this looks like it's going in the right direction:
Yup, if we move ironic to the bootstrap VM then we'd need to either prevent it from going away too soon
if the CACHEURL env variable isn't set then they get downloaded from the internet
Yup, we'll need to allow for that
Yeah I like the CACHEURL setup!
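To illustrate the idea being discussed, here is a minimal sketch of how a get-resource.sh-style script might fall back when CACHEURL isn't set; the FILENAME and UPSTREAM_URL values below are placeholders, not the actual script's variables.

#!/bin/bash
# Sketch only: try the local cache first if CACHEURL is set, otherwise
# (or if the cache fetch fails) download straight from the internet.
FILENAME="rhcos-ootpa-latest.qcow2"                 # placeholder filename
UPSTREAM_URL="https://example.com/${FILENAME}"      # placeholder upstream location
if [ -n "${CACHEURL:-}" ] && curl -fL -o "/tmp/${FILENAME}" "${CACHEURL}/${FILENAME}"; then
  echo "fetched ${FILENAME} from cache ${CACHEURL}"
else
  curl -fL -o "/tmp/${FILENAME}" "${UPSTREAM_URL}"
fi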
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/673/
Actually I think if we just 'pivot' the files, e.g. download them somewhere temporary (on the same filesystem) and then 'mv' them into place, that would solve any real race issues. PXE boot just keeps retrying until the image is available so that part is solved.
The containers already download to a tmpdir and move when finished, so we'd be ok there; we'd have to see if there is a similar retry when IPA downloads the CoreOS image.
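For reference, the download-to-a-tmpdir-then-mv pattern being described looks roughly like this; the paths are illustrative (assuming /shared is where the containers mount $IRONIC_DATA_DIR) and IMAGE_URL is a placeholder.

# Sketch of the atomic "pivot": download somewhere temporary on the same
# filesystem, then mv (rename) into place so readers never see a partial file.
DEST=/shared/html/images/image.qcow2                # illustrative final path
TMPDIR=$(mktemp -d -p /shared/html/images)          # same filesystem as DEST
curl -fL -o "${TMPDIR}/image.qcow2" "${IMAGE_URL}"  # IMAGE_URL is a placeholder
mv "${TMPDIR}/image.qcow2" "${DEST}"                # rename is atomic within a filesystem
rmdir "${TMPDIR}"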
Force-pushed from 90ee23b to 457d262
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/699/
Ok this seems to work well, so I guess the next step is to get the resource-downloader images into the metal3 repos and quay org? Something like:
Does that sound reasonable @derekhiggins? If you can do 1, 2, 3, I'm working on 4, and @imain can help with 5?
We also have to figure out the update/upgrade strategy: on deployment the installer knows the RHCOS version it's tied to, so we can work out the URL, but post-deploy the cluster may get upgraded, so we'll need some way to read that version and download any new image. Similarly for IPA, do we have the script loop checking for new versions periodically, or somehow restart the container to trigger a re-check?
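If the periodic-check option were chosen, the downloader script could end in a simple loop along these lines. This is purely a sketch of one of the options being asked about: check_and_download is a hypothetical helper and the interval is arbitrary.

# Sketch only: keep the container running and periodically re-check
# whether a newer image is available, re-downloading when it changes.
while true; do
  check_and_download   # hypothetical: compare the upstream version/checksum and re-fetch if newer
  sleep 3600           # arbitrary re-check interval
done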
@@ -75,6 +55,12 @@ sudo podman run -d --net host --privileged --name ironic --pod ironic-pod \
--env MARIADB_PASSWORD=$mariadb_password \
-v $IRONIC_DATA_DIR:/shared ${IRONIC_IMAGE}

sudo podman run -d --net host --privileged --name ipa-downloader --pod ironic-pod \
-v $IRONIC_DATA_DIR:/shared ${IPA_DOWNLOADER_IMAGE} /usr/local/bin/get-resource.sh
Do these actually need to be privileged? I forget why we needed that, but it'd be good to ensure only those containers that really need it have it specified?
Similarly with host networking, I guess we need that for the containers which run a server, but probably not for those that just download and write to a shared volume?
I tested the containers in a pod via podman play on my host without either and it worked fine (although podman play on the RHCOS image seems to segfault, so we'll have to stick to the manual script approach for starting the static-pod on the bootstrap node).
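For comparison, the same downloader invocation with --privileged and --net host dropped would be roughly the following; this is a sketch of the suggested cleanup, not what the PR currently does.

sudo podman run -d --name ipa-downloader --pod ironic-pod \
  -v $IRONIC_DATA_DIR:/shared ${IPA_DOWNLOADER_IMAGE} /usr/local/bin/get-resource.sh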
Yes I can do 5. I also added new containers that will have to go into the org.
@derekhiggins Ok this works well for me, and I created the upstream images - would you like to rebase this, then we can iterate on the remaining aspects of getting this working on the bootstrap VM and cluster in other PRs?
Force-pushed from 457d262 to fbddf63
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/730/
common.sh
export RHCOS_IMAGE_FILENAME_LATEST="rhcos-ootpa-latest.qcow2"

# Ironic vars
export IRONIC_IMAGE=${IRONIC_IMAGE:-"quay.io/metal3-io/ironic:master"}
export IRONIC_INSPECTOR_IMAGE=${IRONIC_INSPECTOR_IMAGE:-"quay.io/metal3-io/ironic-inspector:master"}
export IPA_DOWNLOADER_IMAGE=${IPA_DOWNLOADER_IMAGE:-"quay.io/higginsd/resource-downloader-ipa:master"}
export COREOS_DOWNLOADER_IMAGE=${COREOS_DOWNLOADER_IMAGE:-"quay.io/higginsd/resource-downloader-coreos:master"}
These should be updated to point to the metalkube images
doh, sorry.
Force-pushed from fbddf63 to d3f8cbb
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/736/
Hmm, the CI failure looks real, but we don't save the install-config contents in the logs, so it's hard to see what's up; trying locally to see if there are clues.
Ok this is working for me locally so I'm confused. |
So I think this could work really well as an init container, but init containers are expected to exit when complete. The sleep here at the end means it won't work properly in that way. |
We can remove the sleep; I had it there because I was testing the container as part of the BMO pod and it needed to stick around to avoid being continuously restarted.
Sure, removing the sleep sounds fine, but it'd be good to have an easy way to check that the files are fully downloaded. Perhaps we could remove the md5sum at the start of any download, then re-calculate it at the end (and add similar checksums for the IPA images?). The reason I need this is so we can make terraform wait for the images to be available before attempting master deployment - I assume we'll need similar logic in the operator so we spin waiting on the images before any worker deployment gets triggered.
Also, any ideas on the CI failure? Worth a rebase to re-try?
Force-pushed from d3f8cbb to 9b0398a
Can you elaborate on this? I'm not sure what you mean here.
I've rebased and added some logging of the downloader containers to see if I can find out what's happening
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/756/
Force-pushed from 9b0398a to d683115
We need to wait until the images are all downloaded before trying to deploy (in both the terraform and BMO case), e.g. see openshift-metal3/kni-installer@e26cd91. Initially I thought we downloaded the IPA kernel/ramdisk directly, but I see now it's a tarball with symlinks, so probably what I have there is OK, and checking the RHCOS image md5sum exists will prove it's fully downloaded. I was also worried about the case where we re-run the *downloader containers on upgrade, and in that case I guess we want to prevent any deployments starting until the new images are downloaded, so we could e.g. remove the IPA image symlinks at the start of the get-resource.sh script, and likewise remove the RHCOS image md5sum at the start to signal download is in progress?
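A rough sketch of that signalling scheme: the downloader removes the completion marker before fetching and re-creates it once the image is in place, and the consumer (terraform or the operator) waits for it. The .md5sum file name and paths below are assumptions for illustration, not the actual script's layout.

# Downloader side (sketch): signal "download in progress" by removing the marker,
# then re-create it after the image has been moved into place.
IMG=/shared/html/images/rhcos-ootpa-latest.qcow2
rm -f "${IMG}.md5sum"
# ... download to a tmpdir and mv the image to ${IMG}, as above ...
md5sum "${IMG}" > "${IMG}.md5sum"

# Consumer side (sketch): wait for the marker before starting master/worker deployment.
until [ -f "${IMG}.md5sum" ]; do
  sleep 5
done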
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/757/
Force-pushed from d683115 to cdff373
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/758/
Ok, now I understand what you're talking about: in both containers we are downloading into a tmpdir which is moved into place when the download is complete, so you shouldn't need to worry about partial downloads - each file is either present or not.
Ya, removing the symlinks would be best here to stop the old images being used. BTW: the CI problem has been fixed in the latest version; the CI file caching was getting in the way of the new logic. The latest failure was a timeout, I'll retrigger.
This should just go straight to OpenShift. I don't think OpenShift-specific images under openshift-metal3 buy us any value except for short-term dev and prototyping. We should ship things through OpenShift properly ASAP.
Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/759/
Ok this lgtm now - @derekhiggins are you ready for this to land?
Good point, we already got the build working in the openshift-metal3 org for now, but we should follow up with moving the rhcos-downloader container to the openshift org - any links on the process we need to follow?
It looks good to go. I want to add back in the caching of the RHCOS image (for CI), and I've just noticed your comment about some containers not needing to be privileged; I can follow up with those.
Note that I think we may need to change the CACHEURL path slightly to check the subdirectory, assuming the new *downloader containers are used to create the cached files - see comment on openshift-metal3/rhcos-downloader#1. Re the privileged containers, I think we can remove that (and possibly net=host) from both downloader containers, but any containers that bind a port on the host via net=host do need it.
Ok we discussed the CACHEURL concern on openshift-metal3/rhcos-downloader#1, and the --privileged cleanups can be a follow-up, so let's merge this one to make progress.
In openshift-metal3#564 we removed the mkdir like:
$ git show cdff373 | grep mkdir
-mkdir -p "$IRONIC_DATA_DIR/html/images"
But we still need IRONIC_DATA_DIR to exist, e.g. to bind into the containers. We didn't catch this in CI or local testing because the images dir is cached.
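The fix being described is essentially just re-adding the directory creation before anything bind-mounts it, along these lines (a sketch of the follow-up, not the exact commit):

# Ensure the host-side data directory exists before the podman run commands
# bind-mount $IRONIC_DATA_DIR into the containers.
mkdir -p "$IRONIC_DATA_DIR/html/images"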
Currently a POC of one potential way to handle image downloads
Move the download of the IPA and CoreOS images into containers.
These containers can be used to download the images for ironic
running on the bootstrap and master nodes.
When downloading images on the master nodes the container can be
instructed to get the image from the bootstrap http server by setting
CACHEURL=http://172.22.0.1/images
Source code for the container images here
https://github.com/derekhiggins/resource-downloader
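For illustration, running one of the downloader containers against the bootstrap cache might look like the following. This is a sketch based on the ipa-downloader command above; whether the CoreOS downloader uses the same get-resource.sh entrypoint is an assumption.

# Sketch: point the downloader at the bootstrap HTTP server instead of the internet.
sudo podman run -d --name coreos-downloader --pod ironic-pod \
  --env CACHEURL=http://172.22.0.1/images \
  -v $IRONIC_DATA_DIR:/shared ${COREOS_DOWNLOADER_IMAGE} /usr/local/bin/get-resource.sh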