Updating requiredResources in Application Management API #280
Signed-off-by: Felipe Vicens <[email protected]>
Flavor:
  type: string
  description: |
    Preset configuration for compute, memory, GPU,
    and storage capacity. (i.e - A1.2C4M.GPU8G, A1.2C4M.GPU16G, A1.4C8M,..)
  example: A1.2C2M.GPU8G
We should add a GET API to list the available flavors, so that users know which flavor names they can use.
I also tend to agree :-) Maybe with GET /edge-cloud-zones we can add query parameters to retrieve the list of flavors, and in the future we can extend this to other resources via query parameters. Just a suggestion though.
Agree, I'll add an entry in GET /edge-cloud-zones to report the flavors.
dockerCompose:
  type: boolean
  example: true
  default: false
  description: |
    Enable docker-compose in the virtual machine to deploy applications.
As mentioned, I think it would be better to have docker-compose as a package type, rather than a VM addon. A VM addon to a VM-based deployment means the user will have full access to the VM, and may make changes to the VM that conflict with the state expected by the system which is trying to manage the docker deployment (e.g. in the worst case the user manually uninstalls docker, and then the system will fail trying to install/uninstall/upgrade docker-compose files).
What I would recommend is adding DOCKER_COMPOSE_ZIP as a type to AppManifest.PackageType. So the user uploads a zip file of all their docker compose files, much like a helm chart. The Operator Platform would deploy a specific VM image and manage it, and the user would not have full access directly to the VM (much like users would not have full access directly to a kubernetes cluster).
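To illustrate the recommendation above, here is a minimal sketch of what the enum change might look like. Only DOCKER_COMPOSE_ZIP and AppManifest.PackageType come from the comment; the property name packageType and the description text are assumptions, and the existing enum values are deliberately not reproduced.

```yaml
# Hypothetical sketch, not the actual spec: only DOCKER_COMPOSE_ZIP is taken
# from the review comment; the surrounding layout is assumed.
components:
  schemas:
    AppManifest:
      type: object
      properties:
        packageType:
          type: string
          description: Format of the application package.
          enum:
            # ...existing package types stay unchanged and are not shown here...
            - DOCKER_COMPOSE_ZIP   # zip archive holding the docker compose files
```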
Sounds good, I'll remove the VM addon and create a docker-compose type.
description: Number of GPUs
items:
  $ref: '#/components/schemas/GpuInfo'
type: array
I don't think you really want an array here, right? That would imply the resources could include multiple Kubernetes clusters plus multiple VirtualMachines plus multiple Containers. I think you really only want a single resource request: either a Kubernetes cluster, a VirtualMachine, or a Container.
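A sketch of the single-resource alternative, assuming the property is named requiredResources as in the PR title and reusing the schema names visible in the diff; it drops the array wrapper so that exactly one kind of resource request is allowed.

```yaml
# Sketch under assumptions: same schema names as the diff, without "type: array".
requiredResources:
  description: Exactly one infrastructure resource request per application.
  oneOf:
    - $ref: "#/components/schemas/Kubernetes"
    - $ref: "#/components/schemas/VirtualMachine"
    - $ref: "#/components/schemas/Container"
```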
type: array
items:
  oneOf:
    - $ref: "#/components/schemas/Kubernetes"
As a note, EdgeXR allows the user to pre-create a (kubernetes) cluster and then specify the cluster during AppInstance create, in addition to specifying the cluster-resources to create one on-the-fly (as per this spec). This allows users to, over time, manage multiple AppInstances in the same cluster. That could be supported here by additionally adding a ClusterRef as one of the oneOf objects. I don't know if you have considered whether we should allow users to manage clusters directly. The main advantage of this is you get per-tenant cluster isolation without having to pay the overhead of cluster create for every AppInstance, assuming the user wants to share multiple AppInstances in the same cluster.
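As a rough illustration of the ClusterRef idea, the oneOf list could gain one extra entry. The schema name RequiredResources, the ClusterRef type, and its uuid format are all assumptions made for this sketch; the comment only proposes that such a reference exist.

```yaml
# Hypothetical sketch: ClusterRef shape and the RequiredResources name are assumed.
components:
  schemas:
    RequiredResources:
      oneOf:
        - $ref: "#/components/schemas/Kubernetes"
        - $ref: "#/components/schemas/VirtualMachine"
        - $ref: "#/components/schemas/Container"
        - $ref: "#/components/schemas/ClusterRef"   # reuse a pre-created cluster
    ClusterRef:
      type: string
      format: uuid
      description: Identifier of a cluster created earlier (format is an assumption).
```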
The idea is to manage applications as self-contained units, including the resources they need. This way, the application itself collects all the resources required to work properly. If a second application needs to be deployed on the same cluster, it might indicate that the resource requirements for the first application were overestimated. Here are two options for the developer:
a) Modify the Helm chart to add the application there (even modify the resources to fit - this would require an API for application Update) or
b) Deploy a new application.
- $ref: "#/components/schemas/VirtualMachine"
- $ref: "#/components/schemas/Container"
I find the schema names misleading. For example, VirtualMachine typically implies an instance of a specific image, and Container also implies at least a specific image, none of which is the case here. I would recommend changing the names to KubernetesResources, VMResources, and ContainerResources, or something similar.
Right! I'll modify it.
enum:
  - 1
  - 3
Not sure why an enum is defined for the controlNodes integer, shouldn't it allow any integer greater than 0?
Sorry, I realize now this is for master nodes, so ignore my previous comment. Docs recommend up to 5 nodes for large clusters, I don't think we'll be dealing with large clusters here but perhaps for completeness add an enum value for 5.
Okay, the idea here was to offer a control plane with or without HA, but in the end that is controlled by the operator. I think a better approach is a boolean, controlPlaneHa: true/false.
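The boolean proposed above might be declared along these lines. The property name controlPlaneHa is from the comment; the default value and the description wording are assumptions.

```yaml
# Sketch under assumptions: replaces the controlNodes enum (1, 3) with a flag.
controlPlaneHa:
  type: boolean
  default: false
  description: |
    Request a highly available control plane. The number of control
    nodes behind it is chosen by the operator.
```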
I'm not sure this should be part of the API. Application providers should have an SLA in place with the operators, and they shouldn't care how the control plane of the operator's infra is implemented.
    and storage capacity. (i.e - A1.2C4M.GPU8G, A1.2C4M.GPU16G, A1.4C8M,..)
  example: A1.2C2M.GPU8G

NodePools:
Should be NodePool, not NodePools
Good catch! Changed.
  type: string
  description: Minimum Kubernetes Version.
  example: "1.29"
controlNodes:
Seems like there should be a controlNodesFlavor to indicate which flavor to use for control nodes? Or is it expected that the platform shall choose an appropriate size?
The definition of control nodes is out of the scope for the Application Developer.
  example: A1.2C2M.GPU8G

NodePools:
  description: |
In general, an issue I see with this approach is that it offers the Application Developer too many ways to ask for the compute it needs. Developers can provide so many possible combinations in the API that the platform has to work out where it can serve each diverse combination of resources or clusters. Also, as a developer I may need to run multiple applications on the same cluster, so how can I express that here?
So maybe compute resource creation could have a separate API that returns an identifier or handle for the resource, which could then be used in this proposal to indicate where the app can be deployed.
The approach is to adopt a one-application-to-one-infrastructure-resource approach (VM, Kubernetes cluster, container, Docker Compose). This means we avoid managing infrastructure independently of the application.
For running multiple applications on the same Kubernetes cluster, Helm packages provide a way to bundle them together. A Helm package can contain multiple application charts, such as a database and a web application chart, effectively treating them as a single application for deployment.
This approach aligns well with node pools. Developers can leverage node pools to create clusters with a mix of nodes, such as having one with a GPU and others without, optimizing resource allocation. The application to node pool mapping is done through labels, allowing developers to reference them in Helm chart values for node affinity.
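The label-based mapping described above could be sketched in Helm chart values like this. The label key example.org/node-pool, the pool names, and the component names are all hypothetical, and the chart templates are assumed to copy each nodeSelector into the corresponding pod spec (nodeSelector being the simplest form of node affinity in Kubernetes).

```yaml
# values.yaml for a hypothetical chart; label key and pool names are assumptions.
inference:
  nodeSelector:
    example.org/node-pool: gpu-pool   # schedule on the node pool with GPUs
frontend:
  nodeSelector:
    example.org/node-pool: cpu-pool   # schedule on the node pool without GPUs
```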
This looks like a very resource-heavy approach, tying one app to one piece of infra such as a k8s cluster, unless we provide some way to deploy multiple applications on one cluster. Also, what happens if cluster creation fails? Application onboarding would then fail too, since both are now one atomic package. Another issue is that once an app has been accepted along with its infra, I cannot change the infra, e.g. reduce or increase the resources if needed.
So I still think this will be hard to implement, especially for cluster-type infra, since it could mean creating a cluster dynamically, which can be a very time-consuming process. If we delink infra creation, there are other options: the platform could create clusters offline and provide an API to retrieve the cluster ID, or even provide an infra creation API to manage infra for applications, and that information could be used with the App LCM API to link them together.
So there are ways to do this, but as it stands the approach seems to tightly couple infra and applications and may reduce reusability. Maybe more inputs will help here.
Hi @gunjald,
Sounds good. I think it would be interesting to discuss the creation of an API to manage the infrastructure lifecycle (Create, Update, Delete). Enabling the Kubernetes cluster reference within the Application Management API would be easy.
For now, I think it's safe to keep things this way, allowing developers to use a Kubernetes cluster and define the minimum configuration details required by their application. We can then open a discussion about how to design a more comprehensive API for infrastructure management resources.
> The approach is to adopt a one-application-to-one-infrastructure-resource approach (VM, Kubernetes cluster, container, Docker Compose). This means we avoid managing infrastructure independently of the application.
I think this should be discussed further, as it changes how some of us see the problem we are trying to solve.
While it makes sense for VM and containers, I'm not sure for k8s clusters. It's my understanding that operators want to use the same infra for multiple app providers/app types. In this case, packaging multiple apps in the same Helm Chart, as suggested above, cannot be done.
Signed-off-by: Felipe Vicens <[email protected]>
Signed-off-by: Felipe Vicens <[email protected]>
EdgeCloudZoneFlavors:
  description: List of unique Name IDs of Infrastructure Flavors
  type: array
  items:
    type: string
    description: Flavor ID
    example: A1.2C2M.GPU8G
If we decide to use flavors, I would suggest that each flavor have the list and spec of all resources they provide.
My take is that I don't think we can rely on the flavor id to be sufficient, or even relevant, for some infra providers.
I agree, Nicola, good point. I'll convert the string to an object with the spec information. Thanks!
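One possible shape for the string-to-object conversion is sketched below. The field names flavorId, numCPU, and memory, along with their units, are assumptions (gpuMemory is the attribute name mentioned later in this review); the example values simply read A1.2C2M.GPU8G as 2 CPUs, 2 GB of RAM, and 8 GB of GPU memory, which is also an assumption.

```yaml
# Hypothetical sketch: each flavor carries its own resource spec.
EdgeCloudZoneFlavors:
  description: Flavors offered in the Edge Cloud Zone, each with its resource spec.
  type: array
  items:
    type: object
    required:
      - flavorId
    properties:
      flavorId:
        type: string
        example: A1.2C2M.GPU8G
      numCPU:
        type: integer
        example: 2
      memory:
        type: integer
        description: RAM in gigabytes (unit is an assumption).
        example: 2
      gpuMemory:
        type: integer
        description: GPU memory in gigabytes (unit is an assumption).
        example: 8
```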
Signed-off-by: Felipe Vicens <[email protected]>
Signed-off-by: Felipe Vicens <[email protected]>
@@ -897,13 +897,19 @@ components:
    and the value of the Edge Cloud Provider
    object. This value is used to identify an Edge Cloud zone
    between Edge Clouds from different Edge Cloud Providers.
required:
  - edgeCloudZoneId
  - edgeCloudZoneName
How is uniqueness defined here? Is edgeCloudZoneId or edgeCloudZoneName unique on its own, or only their combination?
SimpleEdgeDiscovery has:
- edgeCloudZoneId is a UUID for the Edge Cloud Zone.
- edgeCloudZoneName is the common name of the closest Edge Cloud Zone to the user device.
- edgeCloudProvider is the name of the operator or cloud provider of the Edge Cloud Zone.
So, edgeCloudZoneId is expected to be unique, e.g. a namespaced URN
I would like to point out that if edgeCloudZoneName (or edgeCloudZoneName + edgeCloudProvider) cannot uniquely identify the zone, i.e. only the UUID can uniquely identify the zone, then we won't be able to support a declarative API. That's probably ok if the API is mainly going to be accessed via a GUI by a human, but if we want to support automation and infra-as-code via yaml files, it would be much nicer to be able to have a declarative API.
Would the edgeCloudZoneId be the only required parameter for the edgeCloudZone?
@@ -925,6 +931,38 @@ components:
    - unknown
  default: unknown

EdgeCloudZoneFlavors:
  description: List of unique Name IDs of Infrastructure Flavors
How should a flavor be understood from a developer's point of view? Does a flavor represent a virtual machine (VM), or a server node with a given set of resources mapped to it?
If, say, it represents a single node, then the gpuMemory attribute alone may not be sufficient to allocate such a resource. Attributes like gpuCount, gpuFamily, etc. may also need to be considered to meet the requirements of workloads that need a GPU.
Hi @gunjald, given that flavors introduce complexity to the API, I'll implement a change that removes them. This will provide greater flexibility for operators to allocate workloads of any size.
@@ -925,6 +931,38 @@ components:
    - unknown
  default: unknown

EdgeCloudZoneFlavors:
There is another parameter, "Flavor", further down in the YAML. How does that correlate with EdgeCloudZoneFlavors?
    Nodepool Name (Autogenerated if not provided in the request)
flavor:
  $ref: '#/components/schemas/Flavor'
numNodes:
Shouldn't it be something like numFlavors for better correlation?
type: string
example: kubernetes
enum:
  - kubernetes
The infraKind is part of the top-level attribute KubernetesResources, and its value "kubernetes" looks redundant, since KubernetesResources itself indicates that this is a Kubernetes resource.
This is how discriminators work in OpenAPI: https://swagger.io/docs/specification/v3_0/data-models/inheritance-and-polymorphism/
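For readers unfamiliar with the pattern, here is a minimal sketch of an OpenAPI 3.0 discriminator keyed on infraKind. The propertyName and the kubernetes value come from the diff; the wrapper name RequiredResources and the virtualMachine mapping key are assumptions for illustration.

```yaml
# Sketch under assumptions: infraKind tells clients and validators which
# oneOf branch a payload belongs to, so each branch must carry it.
components:
  schemas:
    RequiredResources:
      oneOf:
        - $ref: '#/components/schemas/KubernetesResources'
        - $ref: '#/components/schemas/VmResources'
      discriminator:
        propertyName: infraKind
        mapping:
          kubernetes: '#/components/schemas/KubernetesResources'
          virtualMachine: '#/components/schemas/VmResources'   # key is assumed
```

This is why the single-value enum is not redundant: without the infraKind property in the payload, a generic client cannot tell which schema to apply.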
required:
  - nodePools
  - infraKind
  - applicationResources
In my view we should keep applicationResources separate from KubernetesResources. KubernetesResources should point to k8s related information while applicationResources may be a consumer of KubernetesResources. But KubernetesResources should not contain applicationResources and create a tight coupling.
The entire KubernetesResources object is intended to describe the Application's required resources for a Kubernetes deployment. So everything in it is in fact application resources. But having another "ApplicationResources" field inside KubernetesResources I think causes confusion. I would suggest simply flattening out that field, and including CpuPool and GpuPool directly into the KubernetesResources object. Or perhaps renaming ApplicationResources to NodePools.
required:
  - infraKind
  - kubernetesClusterRef
  - applicationResources
As in my previous comment, I suggest delinking applicationResources from KubernetesResourcesRef. We could keep KubernetesResourcesRef for associating Kubernetes entities only, and let other abstractions that need it refer to it.
The latest commit to this pull request introduces modifications to cover this and previous comments. The concept of flavours has been removed, so the cpu/memory required for an application is defined directly by the API consumer (the application provider).
For the Kubernetes case, where an application is expected to be deployed on a k8s cluster, the concept of resource topology is introduced so that the API consumer (the application provider) can describe the minimum k8s infrastructure required to host the application. The topology describes the minimum number of nodes required and the minimum cpu/memory required per node.
Also, to cope with applications whose components need different hardware (e.g. a GPU required by some component), the concept of hardware pool has been introduced, allowing the application provider to define the specific hardware configuration for each application component. With this information, the API server (i.e. an edge orchestrator) will be able to determine the requirements of the k8s cluster and node pools needed to host the application components. In this proposal we start with two types of pools (CPU and GPU) that can later be extended to additional specific hardware (e.g. TPUs or other specialized processors).
Finally, to cope with scenarios of shared k8s clusters, the parameters isStandalone and kubernetesClusterRef have been added.
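To make the description above more concrete, a request payload might look roughly like the following. Only infraKind, applicationResources, isStandalone, kubernetesClusterRef, and the CPU/GPU pool concept appear in this thread; every other field name, the units, and the values are assumptions, so the actual schema in the commit should be treated as authoritative.

```yaml
# Hypothetical requiredResources payload for a standalone Kubernetes deployment.
requiredResources:
  infraKind: kubernetes
  isStandalone: true
  applicationResources:
    cpuPool:
      numCPU: 4                 # total CPU needed by the CPU-only components
      memory: 8192              # assumed to be in megabytes
      topology:
        minNumberOfNodes: 2     # minimum number of nodes
        minNodeCpu: 2           # minimum CPU per node
        minNodeMemory: 4096     # minimum memory per node
    gpuPool:
      numCPU: 2
      memory: 4096
      gpuMemory: 16             # assumed to be in gigabytes
      topology:
        minNumberOfNodes: 1
        minNodeCpu: 2
        minNodeMemory: 4096
```

For the shared-cluster scenario, the same payload would presumably carry a kubernetesClusterRef pointing at the existing cluster instead of requesting a standalone one.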
Signed-off-by: Felipe Vicens <[email protected]>
… instance Signed-off-by: Felipe Vicens <[email protected]>
edgeCloudKubernetesClusterRef:
  $ref: '#/components/schemas/KubernetesClusterRef'
Sorry, just noticed this, but I don't think the kubernetes cluster ref belongs here. It's not really a property of an EdgeCloudZone. I think we need a separate object to be able to specify the optional cluster ref as part of the create appinstance API that can be arrayed. Also discussion #256 is relevant here.
Hi @gainsley, I've added it to GET /apps/{appId}/instances to retrieve the cluster where an application is deployed. I've noticed that it also impacts the POST, where only the edgeCloudZone is specified. What do you think about putting the k8s cluster ref outside the edgeCloudZone but inside appInstanceInfo?
Yes, I think it makes sense to have the clusterref in appInstanceInfo. Unfortunately though, that doesn't help the post /app/{appId}/instances case since that doesn't use appInstanceInfo. There I think you'll need to create a new wrapper object that is { clusterRef (optional), EdgeCloudZoneId } that can then be arrayed.
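The wrapper object suggested above could be sketched as follows. The schema names AppInstanceTarget and AppInstanceTargets are placeholders invented for this illustration, and the EdgeCloudZoneId schema is assumed to exist; KubernetesClusterRef is the schema referenced in the diff.

```yaml
# Hypothetical sketch of { clusterRef (optional), EdgeCloudZoneId }, arrayed.
components:
  schemas:
    AppInstanceTarget:
      type: object
      required:
        - edgeCloudZoneId        # the cluster reference stays optional
      properties:
        edgeCloudZoneId:
          $ref: '#/components/schemas/EdgeCloudZoneId'
        kubernetesClusterRef:
          $ref: '#/components/schemas/KubernetesClusterRef'
    AppInstanceTargets:
      type: array
      minItems: 1
      items:
        $ref: '#/components/schemas/AppInstanceTarget'
```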
Signed-off-by: Felipe Vicens <[email protected]>
Thanks for fixing that, looks good, just need to fix the typo.
@@ -355,7 +355,7 @@ paths:
   content:
     application/json:
       schema:
-        $ref: '#/components/schemas/AppZones'
+        $ref: '#/components/schemas/AppInstaces'
typo: AppInstaces -> AppInstances
   edgeCloudZone:
     $ref: '#/components/schemas/EdgeCloudZone'

-AppZones:
+AppInstaces:
typo: AppInstaces -> AppInstances
Not for this PR, but we may want to consider adding a disk space requirement to the Kubernetes Topology and VmResources/ContainerResources/DockerComposeResources objects. I see they have
Also for a follow-up PR, for GPU-specific resources, it does make sense that users would want to be able to reserve GPU VRAM for their LLM/etc usage, but Kubernetes really only allocates gpu resources based on number of gpus. A better understanding of what Kubernetes provides and what the Nvidia operator provides would inform how users would actually be able to specify GPU resources. I think we probably need a num GPUs field, and maybe a GPU type/family as well, but it needs some exploration.
Signed-off-by: Felipe Vicens <[email protected]>
Hey @gainsley, when setting the required GPU RAM, I was thinking about using NVIDIA's MIG (Multi-Instance GPU) strategy (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#about-multi-instance-gpu | https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#a100-mig-profiles). MIG allows for hard partitioning of large GPUs, enabling workloads to be allocated to specific portions. This way, the operator can share and allocate the resources of a single big GPU across multiple workloads. I agree, we need to explore this further. We'll likely need to add an array of GPUs, such as: [ {"memory": "4G", "family": "Hopper"}, {"memory": "24G", "family": "Ampere"}]
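If the array idea above were adopted, its schema might look something like this. The property name gpus is an assumption; memory and family, together with their example values, are taken from the JSON in the comment, and nothing here reflects a decided design.

```yaml
# Hypothetical sketch of a per-GPU requirement list; not part of this PR.
gpus:
  type: array
  items:
    type: object
    properties:
      memory:
        type: string
        description: GPU memory requested for this entry.
        example: "4G"
      family:
        type: string
        description: GPU architecture family.
        example: Hopper
```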
Signed-off-by: Felipe Vicens <[email protected]>
Thanks for fixing the typos. Yes let's discuss more how to specify gpus in a separate issue.
   edgeCloudZone:
     $ref: '#/components/schemas/EdgeCloudZone'

-AppZones:
+AppInstances:
The name "AppInstances" seems to convey that it refers to information about instances of the application, while the object itself and its description say it is a collection of edge cloud zones where the applications are to be deployed. This inconsistency should be corrected.
I agree, I think it may make more sense to call it "AppZones" as it was before.
Hi, we've updated the object name to AppInstanceLocations to better reflect the information it contains.
@@ -683,16 +683,25 @@ components:
accessPoints:
  $ref: '#/components/schemas/AccessEndpoint'
  minItems: 1
kubernetesClusterRef:
It is not clear how the developer will get the KubernetesClusterRef. Is there supposed to be a query API to find it, or how do we see the API user retrieving this information?
There are two ways to get a ClusterRef:
- From the result of a previous Create AppInstance. Creating an AppInstance will use (or create) an underlying cluster for Kubernetes/Helm apps. After the instance is created, querying the AppInstanceInfo will give the clusterRef of the cluster the instance was deployed to. The use case here is to allow the user to deploy multiple AppInstances to the same cluster (for example database, tracing, logging, workload, frontend etc sets of instances that can communicate with each other in the same cluster).
- From a cluster management API. The system may provide a cluster management API outside of the NBI, or this EAM CAMARA NBI API may be extended in the future to provide a cluster management API. The clusterRefs can then come from a "GET clusters" API.
Signed-off-by: Felipe Vicens <[email protected]>
Signed-off-by: Felipe Vicens <[email protected]>
What type of PR is this?
What this PR does / why we need it:
Enables the definition of different infrastructure for the application, allowing selection from among Kubernetes, Virtual Machines, and Containers. Additionally, the developer can specify the sizing of the Kubernetes cluster for the application, the characteristics of compute resources, and the base add-ons to enable monitoring and ingress in Kubernetes clusters.
Moreover, this PR changes requiredResources to a set of infrastructure resources, enabling developers to specify more than one resource for the application. For example, an application composed of two containers and one virtual machine.
Which issue(s) this PR fixes:
Fixes #253
Discussion #220
Special notes for reviewers:
Changelog input