device: Accept the PCIAddress in the BDF format for block devices #823
Comments
…evices" This reverts commit c01192e. The above commit was a temporary (not reliable) workaround to support container rootfs through hotplugged block device when using cloud-hypervisor (clh). As the updated version of clh now returns the PCI BDF information for all hotplugged devices, we no longer need to rely on the predicted device path in the guest (a.k.a 'VmPath'). This patch reverts the above commit to remove the special code path for supporting cloud-hypervisor. Fixes: kata-containers#823
The kata-agent now only accepts the PCIAddress from kata-runtime in the "bridgeAddr/deviceAddr" format, which is tied to the device hotplug interface in qemu. This patch extends the kata-agent to accept input PCIAddress in the standard BDF format, which should be more generic to support different hypervisors, e.g. cloud-hypervisor. Fixes: kata-containers#823 Signed-off-by: Bo Chen <[email protected]>
…evices" This reverts commit c01192e. The above commit was a temporary (not reliable) workaround to support container rootfs through hotplugged block device when using cloud-hypervisor (clh). As the updated version of clh now returns the PCI BDF information for all hotplugged devices, we no longer need to rely on the predicted device path in the guest (a.k.a 'VmPath'). This patch reverts the above commit to remove the special code path for supporting cloud-hypervisor. Fixes: kata-containers#823 Signed-off-by: Bo Chen <[email protected]>
The kata-agent now only accepts the PCIAddress from kata-runtime in the "bridgeAddr/deviceAddr" format, which is tied to the device hotplug interface in qemu. This patch extends the kata-agent to accept input PCIAddress in the standard BDF format, which should be more generic to support different hypervisors, e.g. cloud-hypervisor. Fixes: kata-containers#823 Depends-on: github.com/kata-containers/runtime#2909 Signed-off-by: Bo Chen <[email protected]>
…devices" This reverts commit c01192e. The above commit was a temporary (not reliable) workaround to support container rootfs through hotplugged block device when using cloud-hypervisor (clh). As the updated version of clh now returns the PCI BDF information for all hotplugged devices, we no longer need to rely on the predicted device path in the guest (a.k.a 'VmPath'). This patch reverts the above commit to remove the special code path for supporting cloud-hypervisor. Fixes: kata-containers#823 Signed-off-by: Bo Chen <[email protected]>
…evices This reverts commit c01192e. The above commit was a temporary (not reliable) workaround to support container rootfs through hotplugged block device when using cloud-hypervisor (clh). As the updated version of clh now returns the PCI BDF information for all hotplugged devices, we no longer need to rely on the predicted device path in the guest (a.k.a 'VmPath'). This patch reverts the above commit to remove the special code path for supporting cloud-hypervisor. Fixes: kata-containers#823 Signed-off-by: Bo Chen <[email protected]>
I don't think this is a good idea. To be useful as part of the runtime<->agent protocol, any address we use has to be meaningful to both host and guest. A (guest) BDF address is meaningful to the guest kernel, but in general the host can only guess at it.

The BB part depends on the guest enumeration of the bus. Even if we count on the guest firmware to enumerate it in a particular way, the bus numbers don't necessarily have to even remain static for the lifetime of the guest.

The domain (DDDD part) has its own set of issues. On PC-like machines it will generally be consistent, because I think it's used as a parameter in one of the controller registers (although nearly all PC-like machines only use domain 0000 anyway). However PCI domains by definition have no PCI level connection to each other - they're independent host bridges - so there can be no PCI standard way of numbering them. On other platforms (e.g. IBM POWER) the bridges are identified by other means and the domain numbers are just allocated dynamically by the guest kernel (and on POWER it's common to have many domains).

Although the "bridgeAddr/slotAddr" format may have originated with qemu, it has real meaning in PCI space which both the host and guest can interpret, regardless of hypervisor - if CLH can generate (a guess at) guest BDF addresses, it can certainly generate the bridge and slot addresses (the "D" part of the BDF will be the device slot).

I would however suggest extending the bridgeAddr/slotAddr format to a more general "PCI path" by:
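For readers less familiar with the notation, the following is a minimal Python sketch of how a `DDDD:BB:DD.F` address splits into the fields discussed above. It is illustrative only and is not code from kata-agent or kata-runtime; the names `BDF` and `parse_bdf` are hypothetical, and treating a missing domain as 0 is an assumption borrowed from common `lspci`-style usage.

```python
import re
from typing import NamedTuple


class BDF(NamedTuple):
    """Fields of a DDDD:BB:DD.F address (hypothetical helper type)."""
    domain: int    # DDDD: one per host bridge; numbering is platform dependent
    bus: int       # BB: assigned when the guest enumerates the bus
    slot: int      # DD: device slot on its bus (0x00-0x1f)
    function: int  # F: function within the device (0-7)


# The domain is optional here (assumption: default to 0 when omitted).
_BDF_RE = re.compile(
    r"(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def parse_bdf(text: str) -> BDF:
    """Split a BDF string into its numeric fields, or raise ValueError."""
    m = _BDF_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a BDF address: {text!r}")
    domain, bus, slot, function = m.groups()
    slot_n = int(slot, 16)
    if slot_n > 0x1F:
        raise ValueError(f"slot out of range in {text!r}")
    return BDF(int(domain or "0", 16), int(bus, 16), slot_n, int(function))
```

Of these fields, only the slot and function are fixed by where the device is plugged; the bus and domain values are whatever the guest assigned, which is the crux of the objection above.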
@dgibson Thanks a lot for the detailed comments and explanation. This is a really good discussion on defining and improving the agent-runtime protocol for locating (block) devices. Let's bring this topic to the broader set of users in our community. Please feel free to add more people here. WDYT? @kata-containers/architecture-committee @amshinde @egernst @fidencio @sboeuf @rbradford @kata-containers/agent
I'd keep it simple. Since |
@sboeuf see.. I don't think that is keeping it simple. By allowing for both options, you're increasing complexity of the protocol. And increasing complexity of the protocol is a bigger deal than increasing complexity of individual implementations (e.g. having the runtime convert one or more BDFs from the VMM into My view on that is coloured by the fact that I don't think "VMMs with the ability to determine the full BDF" is really a thing. There might be VMMs that think they can predict the BDF, but there's really no way it can be robust and reliable: even if it works now for the simple cases they support, it could well break in future. |
I'm definitely okay with that. If this gives slightly more work to the runtime, so that the agent's protocol can stay simple, it sounds like a good compromise. The part where I'm skeptical this would work is the way the agent is going to find the device in the guest based on the
Fair enough. |
Sorry, I should clarify. I think we should stick to just the "PCI path" form, of which Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI to PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or PCI-E switch, which introduces at least 2 virtual P2P bridges). The only way to not have P2P bridges at all is to restrict yourself to vanilla-PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol. |
What do you mean by PCI path?
When hotplugging through ACPI, no bridge is needed. That's what we use with CH, it's the simplest way of performing PCI hotplug. |
I mean listing the slot+function of every bridge leading to the device, then the device itself.
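A minimal Python sketch of that idea follows, assuming a textual form in which each element is `SS.F` (hex slot, function) and elements are separated by `/`. That syntax, and the name `parse_pci_path`, are assumptions made for illustration; the thread does not fix a concrete encoding. A bare `SS` element is accepted and treated as function 0 so that existing "bridgeAddr/deviceAddr" strings still parse.

```python
import re
from typing import List, Tuple

# One path element: two hex digits for the slot, optional ".F" function.
_ELEMENT_RE = re.compile(r"([0-9a-fA-F]{2})(?:\.([0-7]))?")


def parse_pci_path(path: str) -> List[Tuple[int, int]]:
    """Parse 'SS.F/SS.F/...' into a list of (slot, function) pairs.

    Every element except the last names a bridge on the way to the
    device; the last element names the device itself.
    """
    elements = []
    for part in path.split("/"):
        m = _ELEMENT_RE.fullmatch(part)
        if m is None:
            raise ValueError(f"bad PCI path element {part!r} in {path!r}")
        slot = int(m.group(1), 16)
        if slot > 0x1F:
            raise ValueError(f"slot out of range in element {part!r}")
        function = int(m.group(2) or "0")
        elements.append((slot, function))
    return elements
```

Under these assumptions, `parse_pci_path("02.0/03.0")` describes a device in slot 3 behind the bridge in slot 2 of the root bus, and a single-element path such as `"05.0"` describes a device plugged directly into the root bus. No bus number appears anywhere in the path.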
Ok. Does that allow PCI-E devices or only PCI? IIUC plugging a PCI-E device without a root port would mean treating it as an integrated endpoint, which would be... odd.. although it might work anyway. |
I like the idea of having a somewhat general PCI path. However, I have a concern that this might be trying to generalize something that is, at its core, hypervisor dependent. In other words, what would a kernel running on qemu do with a path that does not have the bridge? So it looks to me like the agent should have some hypervisor-dependent callback that knows how to deal with the guest device format generated by that hypervisor:
So what about making the runtime prefix the address with a hypervisor name, and let the agent invoke different lookup code that may take different input? In other words, replace In my opinion, this also leaves the door open for future changes in either the qmp output format or PCI layout in the VM. So in the future, if qemu 6 changes something major, we could have different callbacks for |
No, it's really not. The PCI path as I'm describing it absolutely has a well-defined meaning purely in terms of PCI defined entities, regardless of hypervisor or platform. That's exactly why I'm suggesting it. Well... to deal with the possibility of multiple PCI roots then we do need a platform dependent (as distinct from hypervisor dependent) extension to express which PCI root. But, one problem at a time.
Grab the device with that slot on the root bus. With qemu, I don't think you can combine that with PCI-E, but that form is entirely plausible for either cold-plugged devices, or devices on the
Actually most instances of this format don't query qemu; instead we allocate the addresses and tell qemu what to use. My VFIO code is an exception.
Prefixing the string with a "scheme" identifier of some sort isn't necessarily a bad idea (particularly to deal with multiple domains in future). But tying that to hypervisor is broken (almost by definition - the address needs to be meaningful to the guest, which doesn't need to even be aware of what hypervisor it's running under).
There is nothing specific to the qmp output format here. |
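To illustrate why a list of slot+function pairs is enough for the guest side, here is a hedged Python sketch of resolving such a path against the Linux sysfs layout, where devices behind a bridge appear as subdirectories of the bridge's own directory (for example `/sys/devices/pci0000:00/0000:00:02.0/0000:01:03.0`). This is not the kata-agent implementation; `resolve_pci_path` is a hypothetical name, and the caller is assumed to supply the root bus directory and an already-parsed list of `(slot, function)` pairs.

```python
import os
import re
from typing import Iterable, Tuple


def resolve_pci_path(root_bus_dir: str,
                     elements: Iterable[Tuple[int, int]]) -> str:
    """Walk sysfs from a root bus directory along (slot, function) pairs.

    At each level the entry whose name ends in ':SS.F' is selected,
    whatever domain and bus number the guest kernel assigned to it.
    """
    current = root_bus_dir
    for slot, function in elements:
        # Match DDDD:BB:SS.F with any domain and bus, but this slot.function.
        pattern = re.compile(
            r"[0-9a-f]{4}:[0-9a-f]{2}:" + re.escape(f"{slot:02x}.{function:x}")
        )
        matches = [
            name for name in os.listdir(current)
            if pattern.fullmatch(name)
            and os.path.isdir(os.path.join(current, name))
        ]
        if len(matches) != 1:
            raise LookupError(
                f"expected one device at {slot:02x}.{function:x} under "
                f"{current}, found {len(matches)}"
            )
        current = os.path.join(current, matches[0])
    return current
```

With a single-element path this reduces to grabbing the device with that slot on the root bus; with more elements it descends through each bridge in turn. The sketch handles one PCI root only; choosing among multiple roots is the platform-dependent part mentioned above and is left out.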
@dgibson Sorry for the delayed response. Thanks again for sharing your insights. Given your much better understanding and experience in the area than myself, would you mind drafting a PR based on the proposal you mentioned above (to extend current Meanwhile, I can follow-up on verifying whether the |
@likebreath sorry for the delay. I'm happy to tackle this PCI path cleanup, it's just a question of when. I had been planning to leave this until Kata 2, but it sounds like you might have need of it before then. In fact, I've now encountered a very similar problem for myself. My #850 PR is blocked because there's a CI failure on VFIO. AFAICT at this stage, that failure is because it's doing a VFIO plug on cloud-hypervisor, and my code only adds PCI path generation for VFIO devices on qemu. It would be great if I can use your clh expertise to help fix this up. |
@likebreath I've created #854 to track this work (the agent side, anyway) |
@likebreath now that #855 is merged, should we close this issue? |
The kata-agent currently accepts the PCIAddress from kata-runtime only in the "bridgeAddr/deviceAddr" format, which is tied to the device hotplug interface in qemu. We should extend the kata-agent to accept an input PCIAddress in the standard BDF format, which should be more generic and support different hypervisors, e.g. cloud-hypervisor.