This repository has been archived by the owner on May 12, 2021. It is now read-only.

device: Accept the PCIAddress in the BDF format for block devices #823

Closed
likebreath opened this issue Aug 28, 2020 · 15 comments
Assignees: likebreath
Labels: enhancement (Improvement to an existing feature), needs-forward-port (Changes need to be applied to a newer branch / repository), needs-review (Needs to be assessed by the team)

Comments

@likebreath (Contributor)

The kata-agent currently accepts the PCIAddress from the kata-runtime only in the "bridgeAddr/deviceAddr" format, which is tied to the device hotplug interface in qemu. We should extend the kata-agent to also accept an input PCIAddress in the standard BDF format, which is more generic and can support different hypervisors, e.g. cloud-hypervisor.
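
For concreteness, here is a minimal sketch of the two formats, with a hypothetical Go parser for the standard BDF form (editor's illustration; the type and function names are assumptions, not the actual kata-agent code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// BDF holds a standard PCI address such as "0000:00:03.0"
// (domain:bus:device.function). Hypothetical type, for illustration.
type BDF struct {
	Domain, Bus, Device, Function uint16
}

// parseBDF parses "DDDD:BB:DD.F" (hex fields). Illustrative only.
func parseBDF(s string) (BDF, error) {
	var b BDF
	parts := strings.Split(s, ":")
	if len(parts) != 3 {
		return b, fmt.Errorf("bad BDF address %q", s)
	}
	devFn := strings.Split(parts[2], ".")
	if len(devFn) != 2 {
		return b, fmt.Errorf("bad device.function in %q", s)
	}
	dests := []*uint16{&b.Domain, &b.Bus, &b.Device, &b.Function}
	for i, field := range []string{parts[0], parts[1], devFn[0], devFn[1]} {
		v, err := strconv.ParseUint(field, 16, 16)
		if err != nil {
			return b, err
		}
		*dests[i] = uint16(v)
	}
	return b, nil
}

func main() {
	// qemu-style hotplug address: "bridgeAddr/deviceAddr", e.g.
	// bridge slot 02, device slot 03.
	fmt.Println("bridgeAddr/deviceAddr:", "02/03")
	// Standard BDF, as a VMM like cloud-hypervisor could report it.
	fmt.Println(parseBDF("0000:00:03.0"))
}
```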

likebreath added the enhancement, needs-review, and needs-forward-port labels on Aug 28, 2020
likebreath self-assigned this on Aug 28, 2020
likebreath added a commit to likebreath/agent that referenced this issue Aug 28, 2020
…evices"

This reverts commit c01192e.
The above commit was a temporary (and unreliable) workaround to support
the container rootfs on a hotplugged block device when using
cloud-hypervisor (clh). As the updated version of clh now returns the
PCI BDF information for all hotplugged devices, we no longer need to
rely on the predicted device path in the guest (a.k.a. 'VmPath'). This
patch reverts the above commit to remove the special code path for
supporting cloud-hypervisor.

Fixes: kata-containers#823

Signed-off-by: Bo Chen <[email protected]>
likebreath added a commit to likebreath/agent that referenced this issue Aug 28, 2020
The kata-agent currently accepts a PCIAddress from the kata-runtime only
in the "bridgeAddr/deviceAddr" format, which is tied to the device
hotplug interface in qemu. This patch extends the kata-agent to also
accept a PCIAddress in the standard BDF format, which is more generic
and can support different hypervisors, e.g. cloud-hypervisor.

Fixes: kata-containers#823

Depends-on: github.com/kata-containers/runtime#2909

Signed-off-by: Bo Chen <[email protected]>
dgibson (Contributor) commented Sep 18, 2020

I don't think this is a good idea. To be useful as part of the runtime<->agent protocol, any address we use has to be meaningful to both host and guest. A (guest) BDF address is meaningful to the guest kernel, but in general the host can only guess at it. The BB (bus) part depends on the guest's enumeration of the bus. Even if we count on the guest firmware to enumerate it in a particular way, the bus numbers don't even have to remain static for the lifetime of the guest.

The domain (DDDD part) has its own set of issues. On PC-like machines it will generally be consistent, because I think it's used as a parameter in one of the controller registers (although nearly all PC-like machines only use domain 0000 anyway). However, PCI domains by definition have no PCI-level connection to each other - they're independent host bridges - so there can be no PCI-standard way of numbering them. On other platforms (e.g. IBM POWER) the bridges are identified by other means and the domain numbers are just allocated dynamically by the guest kernel (and on POWER it's common to have many domains).

Although the "bridgeAddr/slotAddr" format may have originated with qemu, it has real meaning in PCI space which both the host and guest can interpret, regardless of hypervisor - if CLH can generate (a guess at) guest BDF addresses, it can certainly generate the bridge and slot addresses (the "D" part of the BDF will be the device slot).

I would, however, suggest extending the bridgeAddr/slotAddr format to a more general "PCI path" (a sketch follows the list below) by:

  • Allowing any number of entries, to allow both for devices plugged directly into the root bus and for devices under several layers of bridges.
  • Allowing each component to be SS.F rather than just SS, to allow for multifunction devices.
  • Adding a (necessarily machine-type-specific) way of specifying which domain/host bridge to start the search from.
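
A rough sketch of what that generalized PCI path could look like in Go (hypothetical names; this illustrates the proposal, it is not the eventual implementation):

```go
package pcipath

import (
	"fmt"
	"strconv"
	"strings"
)

// Slot is one step in a PCI path: a slot (device) number plus function.
type Slot struct {
	Slot     uint8 // 0x00-0x1f
	Function uint8 // 0-7
}

// Path lists every bridge leading to the device, then the device
// itself, e.g. "02.0/03.0/1c.2", or just "03" for a device directly
// on the root bus.
type Path []Slot

// Parse accepts "SS[.F]" components separated by "/". A missing
// function defaults to 0, so the legacy bridgeAddr/slotAddr form
// stays valid as a two-entry path.
func Parse(s string) (Path, error) {
	var p Path
	for _, comp := range strings.Split(s, "/") {
		slotStr, fnStr := comp, "0"
		if i := strings.IndexByte(comp, '.'); i >= 0 {
			slotStr, fnStr = comp[:i], comp[i+1:]
		}
		slot, err := strconv.ParseUint(slotStr, 16, 8)
		if err != nil || slot > 0x1f {
			return nil, fmt.Errorf("bad slot in component %q", comp)
		}
		fn, err := strconv.ParseUint(fnStr, 16, 8)
		if err != nil || fn > 7 {
			return nil, fmt.Errorf("bad function in component %q", comp)
		}
		p = append(p, Slot{Slot: uint8(slot), Function: uint8(fn)})
	}
	return p, nil
}
```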

likebreath (Contributor, Author) commented:

@dgibson Thanks a lot for the detailed comments and explanations. This is a really good discussion on defining/improving the agent-runtime protocol for locating (block) devices. Let's bring this topic to the broader users in our community. Please feel free to add more people here.

WDYT? @kata-containers/architecture-committee @amshinde @egernst @fidencio @sboeuf @rbradford @kata-containers/agent

sboeuf commented Sep 21, 2020

I'd keep it simple. Since bridgeAddr/slotAddr is useful for some cases, as explained by @dgibson, I would simply have the agent check the format it receives. By default the agent would expect bridgeAddr/slotAddr, but for VMMs with the ability to determine the full BDF, leave open the possibility of passing a BDF format instead.
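
For illustration, that check could be as simple as matching the two shapes (a sketch with hypothetical names, not the actual agent code):

```go
package agent

import "regexp"

var (
	// Legacy form, e.g. "02/03": bridge slot / device slot, in hex.
	bridgeSlotRe = regexp.MustCompile(`^[0-9a-fA-F]{1,2}/[0-9a-fA-F]{1,2}$`)
	// Standard BDF, e.g. "0000:00:03.0": domain:bus:device.function.
	bdfRe = regexp.MustCompile(`^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$`)
)

// addressKind reports which format an incoming PCIAddress uses.
func addressKind(addr string) string {
	switch {
	case bridgeSlotRe.MatchString(addr):
		return "bridge/slot"
	case bdfRe.MatchString(addr):
		return "bdf"
	default:
		return "unknown"
	}
}
```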

dgibson (Contributor) commented Sep 21, 2020

@sboeuf see... I don't think that is keeping it simple. By allowing both options, you're increasing the complexity of the protocol. And increasing the complexity of the protocol is a bigger deal than increasing the complexity of individual implementations (e.g. having the runtime convert one or more BDFs from the VMM into bridgeAddr/slotAddr form).

My view on that is coloured by the fact that I don't think "VMMs with the ability to determine the full BDF" is really a thing. There might be VMMs that think they can predict the BDF, but there's really no way that can be robust and reliable: even if it works now for the simple cases they support, it could well break in future.

sboeuf commented Sep 21, 2020

> And increasing the complexity of the protocol is a bigger deal than increasing the complexity of individual implementations (e.g. having the runtime convert one or more BDFs from the VMM into bridgeAddr/slotAddr form).

I'm definitely okay with that. If this gives slightly more work to the runtime so that the agent's protocol can stay simple, it sounds like a good compromise. The part where I'm skeptical is how the agent is going to find the device in the guest based on the bridgeAddr/slotAddr. Because the agent will expect a bridge to be there, the way it finds the device might not be appropriate for the simple use case where there's no PCI-to-PCI bridge. I'll let @likebreath investigate and give his feedback on this, but I don't think the current agent implementation would work with CH.

> even if it works now for the simple cases they support, it could well break in future.

Fair enough.

dgibson (Contributor) commented Sep 21, 2020

Sorry, I should clarify. I think we should stick to just the "PCI path" form, of which bridgeAddr/slotAddr is a special case (and the only one supported so far). I do think we should extend that to allow a single entry (no bridge) or more than two entries (multiple bridges). That's a strict generalization of the existing format, though, rather than an entirely new one.

Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI-to-PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or a PCI-E switch, which introduces at least two virtual P2P bridges). The only way to have no P2P bridges at all is to restrict yourself to vanilla PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol.

sboeuf commented Sep 21, 2020

> I think we should stick to just the "PCI path" form

What do you mean by PCI path?

> Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI-to-PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or a PCI-E switch, which introduces at least two virtual P2P bridges). The only way to have no P2P bridges at all is to restrict yourself to vanilla PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol.

When hotplugging through ACPI, no bridge is needed. That's what we use with CH; it's the simplest way of performing PCI hotplug.

dgibson (Contributor) commented Sep 21, 2020

> > I think we should stick to just the "PCI path" form
>
> What do you mean by PCI path?

I mean listing the slot+function of every bridge leading to the device, then the device itself. bridgeAddr/slotAddr is a PCI path with the restriction that there's exactly one bridge, and we only use function 0.
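
Building on the Path sketch above, one plausible way an agent could resolve such a path against the guest's /sys tree (again hypothetical, assuming each path component appears as a nested "DDDD:BB:SS.F" directory under its parent bridge):

```go
package pcipath

import (
	"fmt"
	"path/filepath"
)

// SysfsDevice walks /sys/devices/pci<rootBus> (e.g. rootBus
// "0000:00"), matching each component of the path by slot and
// function, and returns the sysfs directory of the final device,
// e.g. "/sys/devices/pci0000:00/0000:00:02.0/0000:01:03.0".
func (p Path) SysfsDevice(rootBus string) (string, error) {
	dir := filepath.Join("/sys/devices", "pci"+rootBus)
	for _, s := range p {
		// The slot and function are fixed by the path, but the bus
		// number is assigned by the guest, so glob for it.
		pattern := filepath.Join(dir,
			fmt.Sprintf("????:??:%02x.%x", s.Slot, s.Function))
		matches, err := filepath.Glob(pattern)
		if err != nil || len(matches) != 1 {
			return "", fmt.Errorf("cannot resolve %02x.%x under %s",
				s.Slot, s.Function, dir)
		}
		dir = matches[0]
	}
	return dir, nil
}
```

A single-entry path then resolves directly on the root bus, which is the no-bridge ACPI-hotplug case discussed next.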

> > Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI-to-PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or a PCI-E switch, which introduces at least two virtual P2P bridges). The only way to have no P2P bridges at all is to restrict yourself to vanilla PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol.
>
> When hotplugging through ACPI, no bridge is needed. That's what we use with CH; it's the simplest way of performing PCI hotplug.

Ok. Does that allow PCI-E devices, or only PCI? IIUC, plugging a PCI-E device without a root port would mean treating it as an integrated endpoint, which would be... odd... although it might work anyway.

c3d (Member) commented Sep 21, 2020

I like the idea of having a somewhat general PCI path. However, I have a concern that this might be trying to generalize something that is, at its core, hypervisor-dependent. In other words, what would a kernel running on qemu do with a path that does not have the bridge?

So it looks to me like the agent should have some hypervisor-dependent callback that knows how to deal with the guest device format generated by that hypervisor:

  • For qemu, you ask qmp, so you always expect a bridgeAddr/slotAddr format, and the recently proposed lookup code that uses /sys would work.
  • For clh, ACPI hotplug might mean no bridge, so maybe another format is required, and that would in turn require a different callback in the agent that knows how to deal with it. That specific code might work without a bridge at all, and might be unable to process /br1/br2/device at all.

So what about making the runtime prefix the address with a hypervisor name, and letting the agent invoke different lookup code that may take different input? In other words, replace bridgeAddr/slotAddr with qemu:bridgeAddr/slotAddr, and later allow clh:whateverACPIinfo with the corresponding handling code? (A sketch of this dispatch follows below.)

In my opinion, this also leaves the door open for future changes in either the qmp output format or PCI layout in the VM. So in the future, if qemu 6 changes something major, we could have different callbacks for qemu and qemu6.
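
A sketch of the dispatch c3d proposes here (illustrative names only, with the format-specific lookups stubbed out):

```go
package agent

import (
	"fmt"
	"strings"
)

type lookupFn func(addr string) (string, error)

// Stubs standing in for format-specific lookup code.
func lookupBridgeSlot(addr string) (string, error) { return "", nil }
func lookupACPI(addr string) (string, error)       { return "", nil }

var lookups = map[string]lookupFn{
	"qemu": lookupBridgeSlot, // expects "bridgeAddr/slotAddr"
	"clh":  lookupACPI,       // hypothetical ACPI-hotplug lookup
}

// resolve dispatches on an optional "scheme:" prefix; a bare address
// falls back to the legacy bridgeAddr/slotAddr handling.
func resolve(addr string) (string, error) {
	parts := strings.SplitN(addr, ":", 2)
	if len(parts) != 2 {
		return lookupBridgeSlot(addr)
	}
	fn, ok := lookups[parts[0]]
	if !ok {
		return "", fmt.Errorf("unknown address scheme %q", parts[0])
	}
	return fn(parts[1])
}
```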

dgibson (Contributor) commented Sep 22, 2020

> I like the idea of having a somewhat general PCI path. However, I have a concern that this might be trying to generalize something that is, at its core, hypervisor-dependent.

No, it's really not. The PCI path as I'm describing it absolutely has a well-defined meaning purely in terms of PCI defined entities, regardless of hypervisor or platform. That's exactly why I'm suggesting it.

Well... to deal with the possibility of multiple PCI roots, we do need a platform-dependent (as distinct from hypervisor-dependent) extension to express which PCI root. But, one problem at a time.

> In other words, what would a kernel running on qemu do with a path that does not have the bridge?

Grab the device with that slot on the root bus. With qemu, I don't think you can combine that with PCI-E, but that form is entirely plausible either for cold-plugged devices or for devices on the pc (rather than q35) machine type.

> So it looks to me like the agent should have some hypervisor-dependent callback that knows how to deal with the guest device format generated by that hypervisor:
>
>   • For qemu, you ask qmp, so you always expect a bridgeAddr/slotAddr format, and the recently proposed lookup code that uses /sys would work.

Actually, most instances of this format don't come from querying qemu; instead, we allocate the addresses ourselves and tell qemu what to use. My VFIO code is an exception.

>   • For clh, ACPI hotplug might mean no bridge, so maybe another format is required, and that would in turn require a different callback in the agent that knows how to deal with it. That specific code might work without a bridge at all, and might be unable to process /br1/br2/device at all.
>
> So what about making the runtime prefix the address with a hypervisor name, and letting the agent invoke different lookup code that may take different input? In other words, replace bridgeAddr/slotAddr with qemu:bridgeAddr/slotAddr, and later allow clh:whateverACPIinfo with the corresponding handling code?

Prefixing the string with a "scheme" identifier of some sort isn't necessarily a bad idea (particularly to deal with multiple domains in future). But tying that to the hypervisor is broken (almost by definition: the address needs to be meaningful to the guest, which doesn't even need to be aware of what hypervisor it's running under).

> In my opinion, this also leaves the door open for future changes in either the qmp output format or PCI layout in the VM. So in the future, if qemu 6 changes something major, we could have different callbacks for qemu and qemu6.

There is nothing specific to the qmp output format here.

likebreath (Contributor, Author) commented:

@dgibson Sorry for the delayed response, and thanks again for sharing your insights. Given your much better understanding of and experience in this area than mine, would you mind drafting a PR based on the proposal you mentioned above (to extend the current bridgeAddr/slotAddr format towards a more general PCI path format)? Maybe that's a good way to continue the discussion here? WDYT? If you have any other thoughts, please don't hesitate to share.

Meanwhile, I can follow up on verifying whether the bridgeAddr/slotAddr format would work for CLH. As @sboeuf mentioned earlier: "Because the agent will expect a bridge to be there, the way it finds the device might not be appropriate for the simple use case where there's no PCI-to-PCI bridge." If you need anything else related to CLH, please let me know.

/cc @sboeuf @c3d @amshinde

dgibson (Contributor) commented Oct 7, 2020

@likebreath sorry for the delay. I'm happy to tackle this PCI path cleanup; it's just a question of when. I had been planning to leave this until Kata 2, but it sounds like you might need it before then.

In fact, I've now encountered a very similar problem myself. My #850 PR is blocked because there's a CI failure on VFIO. AFAICT at this stage, that failure is because it's doing a VFIO plug on cloud-hypervisor, and my code only adds PCI path generation for VFIO devices on qemu. It would be great if I could use your clh expertise to help fix this up.

dgibson (Contributor) commented Oct 7, 2020

@likebreath I've created #854 to track this work (the agent side, anyway).

dgibson (Contributor) commented Oct 9, 2020

@likebreath now that #855 is merged, should we close this issue?

likebreath (Contributor, Author) commented:

@dgibson Agree. Closing the issue in favor of #854 and #855.
