
Move image downloads into resource downloader containers #564

Merged
merged 1 commit into from
Jun 19, 2019

Conversation

derekhiggins
Collaborator

@derekhiggins derekhiggins commented May 23, 2019

Currently a POC of one potential way to handle image downloads

Move the download of the IPA and CoreOS images into
containers. These containers can be used to download the images
for ironic running on the bootstrap and master nodes.

When downloading images on the master nodes the container can be
instructed to get the image from the bootstrap http server by setting
CACHEURL=http://172.22.0.1/images

Source code for the container images is here:
https://github.com/derekhiggins/resource-downloader
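As a concrete illustration of the above, here is roughly how the IPA downloader would be started with and without the cache (image name and script path are the ones used in this PR; they moved to the metal3-io/openshift-metal3 orgs later, so treat this as a sketch, not the current invocation):

```shell
# On the bootstrap host: fetch directly from the internet.
sudo podman run -d --name ipa-downloader \
  -v "${IRONIC_DATA_DIR}:/shared" \
  quay.io/higginsd/resource-downloader-ipa:master /usr/local/bin/get-resource.sh

# On a master node: fetch from the bootstrap HTTP server instead.
sudo podman run -d --name ipa-downloader \
  --env CACHEURL=http://172.22.0.1/images \
  -v "${IRONIC_DATA_DIR}:/shared" \
  quay.io/higginsd/resource-downloader-ipa:master /usr/local/bin/get-resource.sh
```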

@derekhiggins derekhiggins added the CI check this PR with CI label May 23, 2019
@@ -69,7 +69,6 @@ sudo yum -y install \
ansible \
bind-utils \
jq \
libguestfs-tools \
Contributor

Is this from something else you were doing? Not needed here is it?


It's removing the requirement as we're no longer doing qemu-img convert, I think?

Collaborator Author

We used to need it for some of the image customisation we did, but it is no longer needed. Actually, I think we removed the requirement when we stopped calling virt-filesystems or something.

@hardys

hardys commented May 23, 2019

Nice, thanks for looking into this! :)

So I guess we could then convert the local bind-mount to a persistent volume in the cluster-deployed Ironic case, and provided the bootstrap VM is up we can download from there (although we'd need some way to prevent bootstrap-complete happening, or the installer could delete the VM :)

cc @imain

@imain
Contributor

imain commented May 23, 2019

Yes, this could work. Currently I'm copying them off the host into the shared volume in the pod. This is probably better, and could be used to get it off the internet too, right?

Is there a timing issue here with the images arriving vs when ironic is ready to do work? I've seen ironic fail to boot a node because of a partially transferred image.

@derekhiggins
Collaborator Author

derekhiggins commented May 23, 2019

A couple of things need to be developed here if this looks like it's going in the right direction.

So I guess we could then convert the local bind-mount to a persistent volume in the cluster-deployed Ironic case, and provided the bootstrap VM is up we can download from there (although we'd need some way to prevent bootstrap-complete happening, or the installer could delete the VM :)

Yup, if we move ironic to the bootstrap VM then we'd need to either prevent it from going away too soon,
or maybe start the download containers and http server on the virt host just as a cache that both the bootstrap and cluster could initially download from.

Yes this could work. Currently I'm copying them off the host into the shared volume in the pod.. this is probably better and could be used to get it off the internet too right?

If the CACHEURL env variable isn't set, then they get downloaded from the internet.

Is there a timing issue here with the images arriving vs when ironic is ready to do work? I've seen ironic fail to boot a node because of a partially transferred image.

Yup, we'll need to allow for that.
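The CACHEURL fallback described above can be sketched roughly like this; `UPSTREAM_URL` and the `resource_url` helper are placeholders for illustration, not the actual contents of get-resource.sh:

```shell
#!/bin/bash
# Prefer the bootstrap cache when CACHEURL is set, otherwise fall back to
# the upstream location on the internet.
# UPSTREAM_URL is a hypothetical stand-in for the real release location.
UPSTREAM_URL="https://releases.example.com/images"

resource_url() {
  local filename="$1"
  if [ -n "${CACHEURL:-}" ]; then
    echo "${CACHEURL}/${filename}"
  else
    echo "${UPSTREAM_URL}/${filename}"
  fi
}

unset CACHEURL
resource_url rhcos-latest.qcow2
# -> https://releases.example.com/images/rhcos-latest.qcow2

export CACHEURL=http://172.22.0.1/images
resource_url rhcos-latest.qcow2
# -> http://172.22.0.1/images/rhcos-latest.qcow2
```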

@imain
Contributor

imain commented May 23, 2019

Yeah I like the CACHEURL setup!

@derekhiggins
Collaborator Author

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/673/

@imain
Contributor

imain commented May 24, 2019

Actually, I think if we just 'pivot' the files, e.g. download them somewhere temporary (on the same filesystem) and then 'mv' them into place, that would solve any real race issues. PXE boot just keeps retrying until the image is available, so that part is solved.
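A minimal sketch of that pivot pattern; a local write stands in for the real download, and temp dirs stand in for the real /shared paths:

```shell
#!/bin/bash
set -euo pipefail

# Stand-in for the served images directory (really /shared/html/images).
IMAGES_DIR=$(mktemp -d)
# Temp download dir created *inside* the target dir so both are on the
# same filesystem, which is what makes the final mv an atomic rename(2).
TMPDL=$(mktemp -d "${IMAGES_DIR}/tmp.XXXXXX")

# "Download" into the temp dir (a plain write stands in for curl here)...
printf 'image-bytes' > "${TMPDL}/image.qcow2"

# ...then pivot into place: readers see either no file or the whole file,
# never a partially transferred one.
mv "${TMPDL}/image.qcow2" "${IMAGES_DIR}/image.qcow2"
rmdir "${TMPDL}"
```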

@derekhiggins
Collaborator Author

Actually, I think if we just 'pivot' the files, e.g. download them somewhere temporary (on the same filesystem) and then 'mv' them into place, that would solve any real race issues. PXE boot just keeps retrying until the image is available, so that part is solved.

The containers already download to a tmpdir and move when finished, so we'd be OK there; we'd have to see if there is a similar retry when IPA downloads the CoreOS image.

@derekhiggins derekhiggins force-pushed the download-containers branch from 90ee23b to 457d262 Compare June 4, 2019 09:43
@derekhiggins
Collaborator Author

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/699/

@hardys

hardys commented Jun 7, 2019

Ok this seems to work well, so I guess the next step is to get the resource-downloader images into the metal3 repos and quay org?

Something like:

  1. Move the IPA downloader to the metal3-io org, so it gets published to https://quay.io/organization/metal3-io (this image is common to upstream testing via metal3-dev-env and downstream coreos based deployments)

  2. Move the coreos downloader to openshift-metal3, so it gets published via https://quay.io/organization/openshift-metal3

  3. Update dev-scripts so we no longer download images manually, we probably need to parameterize the URL so that can be easily changed each kni-installer rebase

  4. Stop running the ironic service containers on the deployment host, instead run them via a staticpod on the bootstrap VM (optionally with cacheurl set to download from the dev-scripts cached versions for dev/test use)

  5. Update BMO to use the new containers, optionally using cacheurl from the host, we can defer any caching via the bootstrap VM for now, since there's a ton of duplicate image container downloads anyway.

Does that sound reasonable @derekhiggins? If you can do 1,2,3, I'm working on 4, and @imain can help with 5?

@hardys

hardys commented Jun 7, 2019

We also have to figure out the update/upgrade strategy, on deployment the installer knows the RHCOS version it's tied to, so we can work out the URL, but post-deploy the cluster may get upgraded, so we'll need some way to read that version and download any new image.

Similarly for IPA, do we have the script loop checking for new versions periodically, or somehow restart the container to trigger a re-check?

@@ -75,6 +55,12 @@ sudo podman run -d --net host --privileged --name ironic --pod ironic-pod \
--env MARIADB_PASSWORD=$mariadb_password \
-v $IRONIC_DATA_DIR:/shared ${IRONIC_IMAGE}

sudo podman run -d --net host --privileged --name ipa-downloader --pod ironic-pod \
-v $IRONIC_DATA_DIR:/shared ${IPA_DOWNLOADER_IMAGE} /usr/local/bin/get-resource.sh

Do these actually need to be privileged? I forget why we needed that, but it'd be good to ensure only those containers that really need it have it specified?

Similarly with host networking, I guess we need that for the containers which run a server, but probably not for those that just download and write to a shared volume?

I tested the containers in a pod via podman play on my host without either and it worked fine (although podman play on the RHCOS image seems to segfault, so we'll have to stick to the manual script approach for starting the static-pod on the bootstrap node).

@imain
Contributor

imain commented Jun 10, 2019

Ok this seems to work well, so I guess the next step is to get the resource-downloader images into the metal3 repos and quay org?

Something like:

  1. Move the IPA downloader to the metal3-io org, so it gets published to https://quay.io/organization/metal3-io (this image is common to upstream testing via metal3-dev-env and downstream coreos based deployments)
  2. Move the coreos downloader to openshift-metal3, so it gets published via https://quay.io/organization/openshift-metal3
  3. Update dev-scripts so we no longer download images manually, we probably need to parameterize the URL so that can be easily changed each kni-installer rebase
  4. Stop running the ironic service containers on the deployment host, instead run them via a staticpod on the bootstrap VM (optionally with cacheurl set to download from the dev-scripts cached versions for dev/test use)
  5. Update BMO to use the new containers, optionally using cacheurl from the host, we can defer any caching via the bootstrap VM for now, since there's a ton of duplicate image container downloads anyway.

Does that sound reasonable @derekhiggins? If you can do 1,2,3, I'm working on 4, and @imain can help with 5?

Yes, I can do 5. I also added a new container that will have to go into the org.

https://github.com/imain/static-ip-manager

@hardys

hardys commented Jun 11, 2019

@derekhiggins Ok this works well for me, and I created the upstream images - would you like to rebase this, then we can iterate on the remaining aspects of getting this working on the bootstrap VM and cluster in other PRs?

@derekhiggins derekhiggins force-pushed the download-containers branch from 457d262 to fbddf63 Compare June 11, 2019 16:19
@derekhiggins
Collaborator Author

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/730/

common.sh Outdated
export RHCOS_IMAGE_FILENAME_LATEST="rhcos-ootpa-latest.qcow2"

# Ironic vars
export IRONIC_IMAGE=${IRONIC_IMAGE:-"quay.io/metal3-io/ironic:master"}
export IRONIC_INSPECTOR_IMAGE=${IRONIC_INSPECTOR_IMAGE:-"quay.io/metal3-io/ironic-inspector:master"}
export IPA_DOWNLOADER_IMAGE=${IPA_DOWNLOADER_IMAGE:-"quay.io/higginsd/resource-downloader-ipa:master"}
export COREOS_DOWNLOADER_IMAGE=${COREOS_DOWNLOADER_IMAGE:-"quay.io/higginsd/resource-downloader-coreos:master"}

These should be updated to point to the metalkube images

Collaborator Author

doh, sorry.

@derekhiggins derekhiggins force-pushed the download-containers branch from fbddf63 to d3f8cbb Compare June 12, 2019 09:49
@derekhiggins
Collaborator Author

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/736/

@hardys

hardys commented Jun 12, 2019

Hmm the CI failure looks real, but we don't save the install-config contents in the logs, so it's hard to see what's up, trying locally to see if there are clues.

@hardys

hardys commented Jun 12, 2019

Ok this is working for me locally so I'm confused.

@imain
Contributor

imain commented Jun 14, 2019

So I think this could work really well as an init container, but init containers are expected to exit when complete. The sleep here at the end means it won't work properly in that way.

@derekhiggins
Collaborator Author

So I think this could work really well as an init container, but init containers are expected to exit when complete. The sleep here at the end means it won't work properly in that way.

We can remove the sleep; I had it there because I was testing the container as part of the BMO pod, and it needed to stick around to avoid being continuously restarted.

@hardys

hardys commented Jun 17, 2019

Sure, removing the sleep sounds fine, but it'd be good to have an easy way to check that the files are fully downloaded. Perhaps we could remove the md5sum at the start of any download, then re-calculate it at the end (and add similar checksums for the IPA images?)

The reason I need this is so we can make terraform wait for the images to be available before attempting master deployment - I assume we'll need similar logic in the operator so we spin waiting on the images before any worker deployment gets triggered.
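One way the consumer-side wait could look, assuming the md5sum file is used as the completion marker; the marker convention, paths, and function name here are assumptions for illustration, not the merged behaviour:

```shell
#!/bin/bash
# Poll for a marker file (e.g. the image's .md5sum) that the downloader
# only creates once the download has completed and been verified.
wait_for_image() {
  local marker="$1" tries="${2:-60}" interval="${3:-5}"
  local i
  for ((i = 0; i < tries; i++)); do
    [ -f "${marker}" ] && return 0
    sleep "${interval}"
  done
  return 1
}

# Usage (path is illustrative):
# wait_for_image /shared/html/images/rhcos-latest.qcow2.md5sum 120 5 \
#   || { echo "timed out waiting for RHCOS image" >&2; exit 1; }
```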

@hardys

hardys commented Jun 17, 2019

Also any ideas on the CI failure, worth a rebase to re-try?

@derekhiggins derekhiggins force-pushed the download-containers branch from d3f8cbb to 9b0398a Compare June 17, 2019 09:59
@derekhiggins
Collaborator Author

Sure, removing the sleep sounds fine, but it'd be good to have an easy way to check that the files are fully downloaded. Perhaps we could remove the md5sum at the start of any download, then re-calculate it at the end (and add similar checksums for the IPA images?)

Can you elaborate on this? I'm not sure what you mean here.

Also any ideas on the CI failure, worth a rebase to re-try?

I've rebased and added some logging of the downloader containers to see if I can find out what's happening.

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/756/

@derekhiggins derekhiggins force-pushed the download-containers branch from 9b0398a to d683115 Compare June 17, 2019 11:20
@hardys

hardys commented Jun 17, 2019

Can you elaborate on this? I'm not sure what you mean here.

We need to wait until the images are all downloaded before trying to deploy (in both the terraform and BMO case), e.g. see openshift-metal3/kni-installer@e26cd91

Initially I thought we downloaded the IPA kernel/ramdisk directly, but I see now it's a tarball with symlinks, so probably what I have there is OK, and checking the RHCOS image md5sum exists will prove it's fully downloaded.

I was also worried about the case where we re-run the *downloader containers on upgrade; in that case I guess we want to prevent any deployments starting until the new images are downloaded, so we could e.g. remove the IPA image symlinks at the start of the get_resource.sh script, and likewise remove the RHCOS image md5sum at the start, to signal download is in progress?

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/757/

Move the download of the IPA and CoreOS images into
containers. These containers can be used to download the images
for ironic running on the bootstrap and master nodes.

When downloading images on the master nodes the container can be
instructed to get the image from the bootstrap http server by setting
CACHEURL=http://172.22.0.1/images
@derekhiggins derekhiggins force-pushed the download-containers branch from d683115 to cdff373 Compare June 17, 2019 13:33
@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/758/

@derekhiggins
Collaborator Author

Can you elaborate on this? I'm not sure what you mean here.

We need to wait until the images are all downloaded before trying to deploy (in both the terraform and BMO case), e.g. see openshift-metal3/kni-installer@e26cd91

Initially I thought we downloaded the IPA kernel/ramdisk directly, but I see now it's a tarball with symlinks, so probably what I have there is OK, and checking the RHCOS image md5sum exists will prove it's fully downloaded.

Ok, now I understand what you're talking about. In both containers we are downloading into a tmpdir, which is moved into place when the download is complete, so you shouldn't need to worry about partial downloads; each file is either present or not.

I was also worried about the case where we re-run the *downloader containers on upgrade, and in that case I guess we want to prevent any deployments starting until the new images are downloaded, so we could e.g remove the IPA image symlinks at the start of the get_resource.sh script, and likewise remove the RHCOS image md5sum at the start to signal download is in progress?

Ya, removing the symlinks would be best here, to stop the old images being used.

BTW: the CI problem has been fixed in the latest version; the CI file caching was getting in the way of the new logic. The latest failure was a timeout, so I'll retrigger.

@russellb
Member

  1. Move the coreos downloader to openshift-metal3, so it gets published via https://quay.io/organization/openshift-metal3

This should just go straight to OpenShift. I don't think OpenShift-specific images under openshift-metal3 buy us any value except for short-term dev and prototyping. We should ship things through OpenShift properly ASAP.

@metal3ci

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/759/

@hardys

hardys commented Jun 19, 2019

Ok this lgtm now - @derekhiggins are you ready for this to land?

@hardys

hardys commented Jun 19, 2019

  1. Move the coreos downloader to openshift-metal3, so it gets published via https://quay.io/organization/openshift-metal3

This should just go straight to OpenShift. I don't think OpenShift-specific images under openshift-metal3 buy us any value except for short-term dev and prototyping. We should ship things through OpenShift properly ASAP.

Good point, we already got the build working in the openshift-metal3 org for now, but we should follow up with moving the rhcos-downloader container to the openshift org - any links on the process we need to follow?

@derekhiggins
Collaborator Author

derekhiggins commented Jun 19, 2019

Ok this lgtm now - @derekhiggins are you ready for this to land?

It looks good to go; I want to add back in the caching of the RHCOS image (for CI), and I've just noticed your comment about some containers not needing to be privileged. I can follow up with those.

@derekhiggins derekhiggins changed the title WIP: Move image downloads into resource downloader containers Move image downloads into resource downloader containers Jun 19, 2019
@hardys

hardys commented Jun 19, 2019

Ok this lgtm now - @derekhiggins are you ready for this to land?

It looks good to go; I want to add back in the caching of the RHCOS image (for CI), and I've just noticed your comment about some containers not needing to be privileged. I can follow up with those.

Note that I think we may need to change the CACHEURL path slightly to check the subdirectory, assuming the new *downloader containers are used to create the cached files - see comment on openshift-metal3/rhcos-downloader#1

Re the privileged containers, I think we can remove that (and possibly net=host) from both downloader containers, but any containers that bind a port on the host via net=host do need it.

@hardys

hardys commented Jun 19, 2019

Ok, we discussed the CACHEURL concern on openshift-metal3/rhcos-downloader#1 and the --privileged cleanups can be a follow-up, so let's merge this one to make progress.

@hardys hardys merged commit 7868a02 into openshift-metal3:master Jun 19, 2019
hardys pushed a commit to hardys/dev-scripts that referenced this pull request Jun 19, 2019
In openshift-metal3#564 we removed the mkdir like:

$ git show cdff373 | grep mkdir
-mkdir -p "$IRONIC_DATA_DIR/html/images"

But we still need IRONIC_DATA_DIR to exist, e.g. to bind into the containers

We didn't catch this in CI or local testing because the images dir is cached.
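The follow-up fix described here amounts to recreating the directory before anything bind-mounts it; the temp-dir default below is only there to keep the snippet self-contained (dev-scripts sets the real IRONIC_DATA_DIR in common.sh):

```shell
#!/bin/bash
set -euo pipefail

# Stand-in default so the snippet runs anywhere; not the real path.
IRONIC_DATA_DIR="${IRONIC_DATA_DIR:-$(mktemp -d)}"

# podman's -v bind mount needs the host path to exist, so recreate the
# directory tree the removed mkdir used to provide.
mkdir -p "${IRONIC_DATA_DIR}/html/images"
```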