Merge pull request #480 from sauronalexander/main
Eureka module deployment for eureka
srinivasreddych authored Jun 7, 2024
2 parents 2638869 + 5b22caf commit 87d0f56
Showing 34 changed files with 1,276 additions and 19 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### **Changed**

- Added `modules/examples/eureka` examples
- fix: module `modules/visualization/dcv-image` to update cdk version and cdk_ecr_deployment version
- fix: module `modules/visualization/dcv-eks` to update cdk version
- fixed the `fsx-lustre-on-eks` integration module's metadata export
- remediation to pass end-to-end integration testing of ADDF manifests
- fixed the `fsx-lustre-on-eks` integration module's static provisioning failure
18 changes: 18 additions & 0 deletions manifests/robotic-training-on-eks/README.md
@@ -0,0 +1,18 @@
### Description
This deployment deploys the modules for the blog post: How to expansively train Robot Learning on AWS leveraging rewards functions generated by LLM

In summary, it deploys the following components:
- Networking
  - Creates a new VPC with public/private subnets to host the EKS cluster and FSx
- Bucket
  - A data bucket used to store inputs/outputs
- EKS cluster
  - The core component, used to schedule and deploy training/simulation workloads
- FSx
  - Acts as the external hard drive for training. The file system is mounted into the training containers, and its data is synced to S3.
- ECR
  - This deployment creates two ECR repositories: one to store the robotic training image and one to store the DCV image (DCV is a high-performance remote desktop streaming tool).
- DCV components
  - Builds a DCV image and the K8s resources that stream simulation/training applications running in EKS to the local dev environment.
- Eureka
  - The core component for training robotic simulations. It sets up the permissions needed to control running jobs and talk to LLMs.
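
The components above depend on one another through the `moduleMetadata` references in the manifests of this deployment (for example, `eureka` reads outputs from `eks`, `fsx`, `robotic-applications`, and `eureka-data-bucket`). As a rough illustration only, the sketch below encodes those references as a graph and derives one valid deployment order with Python's standard `graphlib` (Python 3.9+). The `DEPENDS_ON` table and the `deployment_order` helper are hypothetical names made up for this example; they are not part of ADDF or seed-farmer.

```python
from graphlib import TopologicalSorter

# Module name -> modules whose metadata it reads.
# Derived by hand from the valueFrom.moduleMetadata entries in the manifests.
DEPENDS_ON = {
    "networking": [],
    "eureka-data-bucket": [],
    "eks": ["networking"],
    "fsx": ["networking", "eureka-data-bucket"],
    "robotic-applications": [],
    "dcv": [],
    "dcv-image": ["dcv"],
    "dcv-eks": ["dcv-image", "eks"],
    "eureka": ["eks", "robotic-applications", "fsx", "eureka-data-bucket"],
}


def deployment_order(depends_on):
    """Return one ordering in which every module comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, module in enumerate(deployment_order(DEPENDS_ON), start=1):
        print(position, module)
```

Any ordering produced this way is consistent with the group order in `deployment.yaml`, where `optionals` and `buckets` come first and `eureka` comes last.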
2 changes: 2 additions & 0 deletions manifests/robotic-training-on-eks/buckets.yaml
@@ -0,0 +1,2 @@
name: eureka-data-bucket
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/buckets?ref=release/1.6.0
109 changes: 109 additions & 0 deletions manifests/robotic-training-on-eks/core-modules.yaml
@@ -0,0 +1,109 @@
name: eks

path: git::https://github.com/awslabs/idf-modules.git//modules/compute/eks?ref=release/1.6.0&depth=1
dataFiles:
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.29.yaml?ref=release/1.6.0
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.6.0
parameters:
  - name: vpc-id
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: VpcId
  - name: controlplane-subnet-ids
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: PrivateSubnetIds
  - name: dataplane-subnet-ids
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: PrivateSubnetIds
  - name: eks-admin-role-name
    value: Admin
  - name: eks-poweruser-role-name
    value: PowerUser
  - name: eks-read-only-role-name
    value: ReadOnly
  - name: eks-version
    value: "1.29"
  - name: eks-compute
    value:
      eks_nodegroup_config:
        - eks_ng_name: ng-gpu
          eks_node_quantity: 2
          eks_node_max_quantity: 4
          eks_node_min_quantity: 2
          eks_node_disk_size: 100
          eks_node_instance_type: "g5.2xlarge"
          use_gpu_ami: True
          eks_node_labels:
            usage: gpu
      eks_node_spot: False
      eks_secrets_envelope_encryption: False
      eks_api_endpoint_private: False
  - name: eks-addons
    value:
      deploy_aws_lb_controller: True # We deploy it unless set to False
      deploy_external_dns: False # We deploy it unless set to False
      deploy_aws_ebs_csi: False # We deploy it unless set to False
      deploy_aws_efs_csi: False # We deploy it unless set to False
      deploy_aws_fsx_csi: True # We deploy it unless set to False
      deploy_cluster_autoscaler: False # We deploy it unless set to False
      deploy_metrics_server: True # We deploy it unless set to False
      deploy_secretsmanager_csi: False # We deploy it unless set to False
      deploy_external_secrets: False
      deploy_cloudwatch_container_insights_metrics: True # We deploy it unless set to False
      deploy_cloudwatch_container_insights_logs: True
      cloudwatch_container_insights_logs_retention_days: 7
      deploy_adot: False
      deploy_amp: False
      deploy_grafana_for_amp: False
      deploy_kured: False
      deploy_calico: False
      deploy_nginx_controller:
        value: False
        nginx_additional_annotations:
          nginx.ingress.kubernetes.io/whitelist-source-range: "100.64.0.0/10,10.0.0.0/8"
      deploy_kyverno:
        value: False
        kyverno_policies:
          validate:
            - block-ephemeral-containers
            - block-stale-images
            - block-updates-deletes
            - check-deprecated-apis
            - disallow-cri-sock-mount
            - disallow-custom-snippets
            - disallow-empty-ingress-host
            - disallow-helm-tiller
            - disallow-latest-tag
            - disallow-localhost-services
            - disallow-secrets-from-env-vars
            - ensure-probes-different
            - ingress-host-match-tls
            - limit-hostpath-vols
            - prevent-naked-pods
            - require-drop-cap-net-raw
            - require-emptydir-requests-limits
            - require-labels
            - require-pod-requests-limits
            - require-probes
            - restrict-annotations
            - restrict-automount-sa-token
            - restrict-binding-clusteradmin
            - restrict-clusterrole-nodesproxy
            - restrict-escalation-verbs-roles
            - restrict-ingress-classes
            - restrict-ingress-defaultbackend
            - restrict-node-selection
            - restrict-path
            - restrict-service-external-ips
            - restrict-wildcard-resources
            - restrict-wildcard-verbs
            - unique-ingress-host-and-path

49 changes: 49 additions & 0 deletions manifests/robotic-training-on-eks/dcv-eks.yaml
@@ -0,0 +1,49 @@
name: dcv-eks
path: modules/visualization/dcv-eks
parameters:
  - name: dcv-namespace
    value: dcv
  - name: dcv-nodeport
    value: 31980
  - name: dcv-image-uri
    valueFrom:
      moduleMetadata:
        group: dcv-image
        name: dcv-image
        key: DCVImageUri
  - name: eks-cluster-admin-role-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterAdminRoleArn
  - name: eks-cluster-name
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: eks-oidc-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: eks-cluster-open-id-connect-issuer
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterOpenIdConnectIssuer
  - name: eks-cluster-security-group-id
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterSecurityGroupId
  - name: eks-node-role-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksNodeRoleArn
9 changes: 9 additions & 0 deletions manifests/robotic-training-on-eks/dcv-image.yaml
@@ -0,0 +1,9 @@
name: dcv-image
path: modules/visualization/dcv-image
parameters:
  - name: dcv-ecr-repository-name
    valueFrom:
      moduleMetadata:
        group: storage
        name: dcv
        key: EcrRepositoryName
28 changes: 28 additions & 0 deletions manifests/robotic-training-on-eks/deployment.yaml
@@ -0,0 +1,28 @@
name: robotic-training-on-eks
toolchainRegion: us-east-1
groups:
  - name: optionals
    path: manifests/robotic-training-on-eks/optionals.yaml
  - name: buckets
    path: manifests/robotic-training-on-eks/buckets.yaml
  - name: core
    path: manifests/robotic-training-on-eks/core-modules.yaml
  - name: storage
    path: manifests/robotic-training-on-eks/storage.yaml
  - name: dcv-image
    path: manifests/robotic-training-on-eks/dcv-image.yaml
  - name: dcv-eks
    path: manifests/robotic-training-on-eks/dcv-eks.yaml
  - name: eureka
    path: manifests/robotic-training-on-eks/eureka.yaml
targetAccountMappings:
  - alias: primary
    accountId:
      valueFrom:
        envVariable: PRIMARY_ACCOUNT
    default: true
    parametersGlobal:
      dockerCredentialsSecret: aws-addf-docker-credentials
    regionMappings:
      - region: us-east-1
        default: true
53 changes: 53 additions & 0 deletions manifests/robotic-training-on-eks/eureka.yaml
@@ -0,0 +1,53 @@
name: eureka
path: modules/simulations/eureka
parameters:
  - name: eks-cluster-admin-role-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterAdminRoleArn
  - name: eks-cluster-name
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: eks-oidc-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: eks-cluster-open-id-connect-issuer
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterOpenIdConnectIssuer
  - name: application-ecr-name
    valueFrom:
      moduleMetadata:
        group: storage
        name: robotic-applications
        key: EcrRepositoryName
  - name: sqs-name
    value: "training-queue"
  - name: fsx-volume-handle
    valueFrom:
      moduleMetadata:
        group: storage
        name: fsx
        key: FSxLustreFileSystemId
  - name: fsx-mount-point
    valueFrom:
      moduleMetadata:
        group: storage
        name: fsx
        key: FSxLustreMountName
  - name: data-bucket-name
    valueFrom:
      moduleMetadata:
        group: buckets
        name: eureka-data-bucket
        key: ArtifactsBucketName
5 changes: 5 additions & 0 deletions manifests/robotic-training-on-eks/optionals.yaml
@@ -0,0 +1,5 @@
name: networking
path: git::https://github.com/awslabs/idf-modules.git//modules/network/basic-cdk
parameters:
  - name: internet-accessible
    value: true
43 changes: 43 additions & 0 deletions manifests/robotic-training-on-eks/storage.yaml
@@ -0,0 +1,43 @@
name: fsx
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.6.0&depth=1
parameters:
  - name: vpc_id
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: VpcId
  - name: private_subnet_ids
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: PublicSubnetIds
  - name: fs_deployment_type
    value: PERSISTENT_2
  - name: data_bucket_name
    valueFrom:
      moduleMetadata:
        group: buckets
        name: eureka-data-bucket
        key: ArtifactsBucketName
  - name: import_path
    valueFrom:
      moduleMetadata:
        group: buckets
        name: eureka-data-bucket
        key: ArtifactsBucketName
  - name: storage_throughput
    value: 500
---
name: robotic-applications
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/ecr?ref=release/1.6.0
parameters:
  - name: image-tag-mutability
    value: "MUTABLE"
---
name: dcv
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/ecr?ref=release/1.6.0
parameters:
  - name: image-tag-mutability
    value: "MUTABLE"
46 changes: 46 additions & 0 deletions modules/simulations/eureka/README.md
@@ -0,0 +1,46 @@

# examples/eureka


## Description
This module sets up the environment for running robotic training and simulation.

- It creates an FSx static provisioning k8s resource
  - FSx is used as high-performance storage for training inputs/outputs
- It creates an IAM role for simulation
  - Pods can assume this role to read data from S3 and FSx and to call LLMs in Amazon Bedrock.
- It builds an application image
  - This is a ROS2 image that contains the environment needed for training.
- It creates an Amazon SQS message queue
  - The queue is used to control tasks sent by the task controller. The controller sends task configs to the queue, and workers retrieve them from it.
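
The controller/worker flow through the message queue can be pictured with a small sketch. This is not the module's code: it uses Python's in-process `queue.Queue` as a stand-in for Amazon SQS, and the function names and the task config fields (`task_id`, `env`) are hypothetical, chosen only for illustration.

```python
import json
import queue
from typing import Optional

# In-process stand-in for the SQS queue (assumption for this sketch only).
task_queue: "queue.Queue[str]" = queue.Queue()


def enqueue_task(task_config: dict) -> None:
    """Controller side: serialize a task config and send it to the queue."""
    task_queue.put(json.dumps(task_config))


def dequeue_task() -> Optional[dict]:
    """Worker side: fetch the next task config, or None if the queue is empty."""
    try:
        body = task_queue.get_nowait()
    except queue.Empty:
        return None
    return json.loads(body)


def drain(handler) -> int:
    """Worker loop: process tasks until the queue is empty; return the count."""
    processed = 0
    while (task := dequeue_task()) is not None:
        handler(task)
        processed += 1
    return processed
```

A real worker would replace the stand-in with calls to the SQS API and would delete each message only after the task has been handled successfully.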


## Inputs/Outputs

### Input Parameters

#### Required
- `eks-cluster-admin-role-arn`: the role which creates the EKS cluster
- `eks-cluster-name`: the name of the EKS cluster
- `eks-oidc-arn`: full ARN of the OIDC provider
- `eks-cluster-open-id-connect-issuer`: OIDC provider URI
- `application-ecr-name`: the name of the ECR repository which stores images containing the simulation/training logic
- `sqs-name`: the name of the SQS queue this module creates
- `fsx-volume-handle`: file system ID of the FSx file system created by the dependency module
- `fsx-mount-point`: mount point of the FSx file system created by the dependency module
- `data-bucket-name`: the name of the bucket which stores all simulation/training data
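
As a hedged sketch of how a module entry point might validate these inputs: it assumes the common seed-farmer convention that each manifest parameter is exposed as an environment variable named `SEEDFARMER_PARAMETER_` followed by the parameter name in upper snake case. That prefix, and the helper names below, are assumptions for this example and should be checked against the module's own code.

```python
import os
from typing import Dict, Mapping, Optional

# Required parameters, as listed above.
REQUIRED_PARAMETERS = [
    "eks-cluster-admin-role-arn",
    "eks-cluster-name",
    "eks-oidc-arn",
    "eks-cluster-open-id-connect-issuer",
    "application-ecr-name",
    "sqs-name",
    "fsx-volume-handle",
    "fsx-mount-point",
    "data-bucket-name",
]


def param_env_name(name: str, prefix: str = "SEEDFARMER_PARAMETER_") -> str:
    """Map a manifest parameter name to its (assumed) environment variable."""
    return prefix + name.replace("-", "_").upper()


def load_required_parameters(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read every required parameter, failing with one error that lists all gaps."""
    env = os.environ if env is None else env
    values: Dict[str, str] = {}
    missing = []
    for name in REQUIRED_PARAMETERS:
        value = env.get(param_env_name(name))
        if value:
            values[name] = value
        else:
            missing.append(name)
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return values
```

Collecting all missing names before raising means a misconfigured manifest is reported in one pass rather than one parameter at a time.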

### Module Metadata Outputs

- `IamRoleArn`: ARN of the IAM role that EKS pods assume; it carries the permissions needed to run simulation/training
- `ApplicationImageUri`: URI of the application image which contains the simulation/training logic and runs in EKS
- `SqsUrl`: URL of the SQS queue where task controllers enqueue tasks and workers dequeue them
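
A downstream consumer may want to pull the account, region, repository, and queue name out of these outputs. The following is a minimal sketch under the assumption that the values follow the standard ECR image URI and SQS queue URL layouts; `summarize_outputs` is a hypothetical helper and the sample values in the usage section are placeholders.

```python
import json
import re
from urllib.parse import urlparse

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_IMAGE_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def summarize_outputs(metadata_json: str) -> dict:
    """Extract commonly needed fields from the module metadata JSON."""
    outputs = json.loads(metadata_json)

    image = ECR_IMAGE_URI.match(outputs["ApplicationImageUri"])
    if image is None:
        raise ValueError("unexpected ApplicationImageUri format")

    # https://sqs.<region>.amazonaws.com/<account>/<queue-name>
    sqs = urlparse(outputs["SqsUrl"])
    sqs_account, queue_name = sqs.path.strip("/").split("/", 1)

    return {
        "role_name": outputs["IamRoleArn"].split("/")[-1],
        "image_region": image["region"],
        "image_repository": image["repository"],
        "image_tag": image["tag"],
        "sqs_region": sqs.hostname.split(".")[1],
        "sqs_account": sqs_account,
        "queue_name": queue_name,
    }


if __name__ == "__main__":
    sample = json.dumps({
        "IamRoleArn": "arn:aws:iam::123456789012:role/addf-eureka-simulation-role",
        "ApplicationImageUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/robotic-applications:ubuntu-ros2",
        "SqsUrl": "https://sqs.us-west-2.amazonaws.com/123456789012/MyQueue",
    })
    print(summarize_outputs(sample))
```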

#### Output Example

```json
{
    "IamRoleArn": "arn:aws:iam::123456789012:role/addf-eureka-simulation-role",
    "ApplicationImageUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/robotic-applications:ubuntu-ros2",
    "SqsUrl": "https://sqs.us-west-2.amazonaws.com/123456789012/MyQueue"
}
```
