
machines/libvirt/worker.machineset.yaml: Drop /var/lib/libvirt/images #70

Closed · wants to merge 1 commit

Conversation

@wking (Member) commented Sep 23, 2018

The actuator looks up the baseVolumeID in the configured pool. That lookup works with both the volume name and full volume path:

$ virsh -c qemu:///system vol-info --vol coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

$ virsh -c qemu:///system vol-info --vol /home/trking/VirtualMachines/coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

But it fails if you use the wrong full path:

$ virsh -c qemu:///system vol-info --vol /var/lib/libvirt/images/coreos_base --pool default
error: failed to get vol '/var/lib/libvirt/images/coreos_base'
error: Storage volume not found: no storage vol with matching path '/var/lib/libvirt/images/coreos_base'
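
The pattern in these transcripts is that a bare name is always looked up inside the pool, while a full path only resolves when it matches the pool's real target directory. A minimal shell sketch of that rule (not the actuator's code; the pool path is hard-coded here for illustration):

```shell
#!/bin/sh
# Sketch of the resolution rule the transcripts above demonstrate.
# pool_path stands in for the pool's actual <target><path>.
pool_path=/home/trking/VirtualMachines

resolves() {
  id="$1"
  case "$id" in
    */*) [ "$(dirname "$id")" = "$pool_path" ] ;;  # full paths must live in the pool
    *)   true ;;                                   # bare names are looked up in the pool
  esac
}

resolves coreos_base && echo "name: ok"
resolves "$pool_path/coreos_base" && echo "pool path: ok"
resolves /var/lib/libvirt/images/coreos_base || echo "wrong path: rejected"
```

This is why `/var/lib/libvirt/images/coreos_base` fails above: that directory is simply not where this pool lives.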

My default pool happens to be in my home directory:

$ virsh -c qemu:///system pool-dumpxml default
<pool type='dir'>
  <name>default</name>
  <uuid>c20a2154-aa60-44cf-bf37-cd8b7818a4e4</uuid>
  <capacity unit='bytes'>105554829312</capacity>
  <allocation unit='bytes'>44134699008</allocation>
  <available unit='bytes'>61420130304</available>
  <source>
  </source>
  <target>
    <path>/home/trking/VirtualMachines</path>
    <permissions>
      <mode>0777</mode>
      <owner>114032</owner>
      <group>114032</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
  </target>
</pool>
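
Because the target path can be anywhere, tooling that needs the pool location should read it from the pool XML rather than assuming /var/lib/libvirt/images. A sketch (the XML is inlined here; in practice it would be piped from `virsh pool-dumpxml default`):

```shell
#!/bin/sh
# Extract the pool's target path from its XML instead of hard-coding it.
# The inlined XML stands in for: virsh -c qemu:///system pool-dumpxml default
xml='<pool type="dir"><target><path>/home/trking/VirtualMachines</path></target></pool>'

pool_path=$(printf '%s' "$xml" | sed -n 's:.*<path>\(.*\)</path>.*:\1:p')
echo "$pool_path"   # /home/trking/VirtualMachines
```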

This commit allows configurations like mine by dropping our opinions about the default pool location and just using the volume names:

$ virsh -c qemu:///system vol-list --pool default
 Name                 Path
------------------------------------------------------------------------------
 bootstrap            /home/trking/VirtualMachines/bootstrap
 bootstrap.ign        /home/trking/VirtualMachines/bootstrap.ign
 coreos_base          /home/trking/VirtualMachines/coreos_base
 master-0.ign         /home/trking/VirtualMachines/master-0.ign
 master0              /home/trking/VirtualMachines/master0
 worker.ign           /home/trking/VirtualMachines/worker.ign
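
The machineset change itself then reduces to dropping the directory prefix from the volume reference. A hedged sketch of the field (the `baseVolumeID` name comes from the lookup described above; the surrounding YAML structure is assumed, not copied from the repo):

```yaml
# machines/libvirt/worker.machineset.yaml (sketch)
# before:
#   baseVolumeID: /var/lib/libvirt/images/coreos_base
# after: just the volume name, resolved inside the configured pool
baseVolumeID: coreos_base
```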

Longer-term, it would be nice to pull both the pool and volume names from information pushed by the installer. But I'm punting on that for this commit.

Reported by @mrogers950.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 23, 2018
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: vikaschoudhary16

If they are not already assigned, you can assign the PR to them by writing /assign @vikaschoudhary16 in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The commit message repeats the description above and adds:

We've been using the full-path approach since the templates landed in
2522d0f (add libvirt support, 2018-08-30, openshift#35), but there's no
discussion there about why the path approach was chosen instead of the
name approach I'm switching to here.

Reported-by: Matt Rogers <[email protected]>

The lookup in question lives at
https://github.com/openshift/cluster-api-provider-libvirt/blob/2e5a516afc704c6c94d7b7cde74e78c43bbfeaa5/cloud/libvirt/actuators/machine/utils/volume.go#L174
and the installer-pushed machine-pool information at
https://github.com/openshift/installer/blob/dc4764dc603cea5da0e54f575b7ae1a2c26d3102/pkg/types/machinepools.go#L53-L58
@wking force-pushed the drop-var-lib-libvirt-images branch from 1934183 to ea4a450 on September 23, 2018 07:19
@enxebre (Member) commented Sep 24, 2018

@wking thanks a lot! lgtm. @vikaschoudhary16 / @bison, can you please have a look and merge at your convenience?

@mrogers950
@wking thanks for fixing this!

@wking (Member, Author) commented Sep 24, 2018

Thoughts on how I'd test this? I built an image from this PR using the Dockerfile in the repo root, and pushed it to quay.io/wking/machine-api-operator:pr-70. Then I patched my installer to use that image:

$ git diff
diff --git a/modules/bootkube/resources/manifests/machine-api-operator.yaml b/modules/bootkube/resources/manifests/machine-api-operator.yaml
index 125b870..5c72b5a 100644
--- a/modules/bootkube/resources/manifests/machine-api-operator.yaml
+++ b/modules/bootkube/resources/manifests/machine-api-operator.yaml
@@ -19,7 +19,7 @@ spec:
     spec:
       containers:
       - name: machine-api-operator
-        image: quay.io/coreos/machine-api-operator:b6a04c2
+        image: quay.io/wking/machine-api-operator:pr-70
         command:
         - "/machine-api-operator"
         resources:
diff --git a/pkg/asset/manifests/content/bootkube/machine-api-operator.go b/pkg/asset/manifests/content/bootkube/machine-api-operator.go
index 48e4765..fd78535 100644
--- a/pkg/asset/manifests/content/bootkube/machine-api-operator.go
+++ b/pkg/asset/manifests/content/bootkube/machine-api-operator.go
@@ -24,7 +24,7 @@ spec:
     spec:
       containers:
       - name: machine-api-operator
-        image: quay.io/coreos/machine-api-operator:b6a04c2
+        image: quay.io/wking/machine-api-operator:pr-70
         command:
         - "/machine-api-operator"
         resources:

But when I try to launch a cluster, the machine-api-operator pod dies complaining about the invocation:

[core@trking-6d200-master-0 ~]$ sudo crictl logs 42396a6f9bd9b
Run Cluster API Controller

Usage:
  machine-api-operator [command]

Available Commands:
  help        Help about any command
  start       Starts Machine API Operator
  version     Print the version number of Machine API Operator

Flags:
      --alsologtostderr                  log to standard error as well as files
      --config string                    path to the mao config (default "/etc/mao-config/config")
  -h, --help                             help for machine-api-operator
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

Use "machine-api-operator [command] --help" for more information about a command.

Is there a different Dockerfile I should be using? Do I need to update the invocation to be more than /machine-api-operator? Something else?

@wking (Member, Author) commented Sep 24, 2018

Looks like I need to add start (and possibly other things) to catch up with #67.

@wking (Member, Author) commented Sep 24, 2018

Ok, I'm farther along with:

diff --git a/modules/bootkube/resources/manifests/machine-api-operator.yaml b/modules/bootkube/resources/manifests/machine-api-operator.yaml
index 125b870..a11046e 100644
--- a/modules/bootkube/resources/manifests/machine-api-operator.yaml
+++ b/modules/bootkube/resources/manifests/machine-api-operator.yaml
@@ -19,9 +19,11 @@ spec:
     spec:
       containers:
       - name: machine-api-operator
-        image: quay.io/coreos/machine-api-operator:b6a04c2
+        image: quay.io/wking/machine-api-operator:pr-70
         command:
         - "/machine-api-operator"
+        args:
+        - "start"
         resources:
           limits:
             cpu: 20m
@@ -51,4 +53,3 @@ spec:
           items:
           - key: mao-config
             path: config
-
diff --git a/pkg/asset/manifests/content/bootkube/machine-api-operator.go b/pkg/asset/manifests/content/bootkube/machine-api-operator.go
index 48e4765..e4e4156 100644
--- a/pkg/asset/manifests/content/bootkube/machine-api-operator.go
+++ b/pkg/asset/manifests/content/bootkube/machine-api-operator.go
@@ -24,9 +24,11 @@ spec:
     spec:
       containers:
       - name: machine-api-operator
-        image: quay.io/coreos/machine-api-operator:b6a04c2
+        image: quay.io/wking/machine-api-operator:pr-70
         command:
         - "/machine-api-operator"
+        args:
+        - "start"
         resources:
           limits:
             cpu: 20m

Now I'm hitting:

[core@trking-359a0-master-0 ~]$ sudo crictl logs --tail 2 8a523d18e0014
E0924 23:01:50.276908       1 leaderelection.go:228] error initially creating leader election record: namespaces "openshift-machine-api-operator" not found
E0924 23:02:25.979902       1 leaderelection.go:228] error initially creating leader election record: namespaces "openshift-machine-api-operator" not found

@wking (Member, Author) commented Sep 24, 2018

E0924 23:01:50.276908       1 leaderelection.go:228] error initially creating leader election record: namespaces "openshift-machine-api-operator" not found

Looks like #68 and 205721c (#67) changed the namespace from the previous tectonic-system.

@enxebre (Member) commented Sep 25, 2018

Hey @wking, thanks for looking into it; your steps are right. We just rewrote the implementation logic to drop appVersion and x-operator. This is all that needs to be satisfied by the installer: https://github.com/openshift/machine-api-operator#manual-deployment-for-kubernetes-cluster, so the openshift-machine-api-operator namespace would need to be precreated as well. See also https://github.com/openshift/machine-api-operator/tree/master/tests/e2e/manifests
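
A sketch of the precreation step implied by the leader-election errors above (the namespace name is taken from the error message; whether a bare Namespace object suffices is an assumption):

```yaml
# Precreate the namespace the operator uses for leader election,
# e.g. `kubectl apply -f` this before the operator starts.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-machine-api-operator
```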

@wking (Member, Author) commented Oct 2, 2018

It sounds like we're comfortable with the changes I'm proposing here, and the only remaining issues are getting the installer bumped to adjust to other changes that have landed since the last time the installer bumped its pinned operator version. Is there anything blocking this PR, or anything I can do to help get it landed?

@enxebre (Member) commented Oct 8, 2018

Hey @wking, sorry for the delay. For this to work we'll also need to update how we create the volume for each domain:
https://github.com/openshift/cluster-api-provider-libvirt/blob/2e5a516afc704c6c94d7b7cde74e78c43bbfeaa5/cloud/libvirt/actuators/machine/utils/domain.go#L23
https://github.com/openshift/cluster-api-provider-libvirt/blob/2e5a516afc704c6c94d7b7cde74e78c43bbfeaa5/cloud/libvirt/actuators/machine/utils/volume.go#L222
Tracked here: https://jira.coreos.com/browse/CLOUD-217
There is also WIP for adding testing on libvirt, and refactoring so we can reliably make changes, which might slow this PR down. I'll be more than happy to get this in after openshift/cluster-api-provider-libvirt#38 lands.

@paulfantom (Contributor)

/retest

@openshift-ci-robot (Contributor)

@wking: The following test failed, say /retest to rerun them all:

Test name        Commit   Details  Rerun command
ci/prow/e2e-aws  ea4a450  link     /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2018
@enxebre (Member) commented Nov 26, 2018

Closing this, as it's only relevant to the https://github.com/openshift/cluster-api-provider-libvirt repo now.

@enxebre enxebre closed this Nov 26, 2018
ingvagabund pushed a commit to ingvagabund/machine-api-operator that referenced this pull request Jul 11, 2019
Makefile: add Go debug support for local binaries