feat: NVIDIA NIM on EKS Pattern #565

Merged
merged 17 commits into from
Jul 11, 2024
2 changes: 2 additions & 0 deletions ai-ml/nvidia-triton-server/.gitignore
@@ -0,0 +1,2 @@
nim-llm/
planfile
10 changes: 10 additions & 0 deletions ai-ml/nvidia-triton-server/README.md
@@ -19,6 +19,7 @@
|------|---------|
| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 3.72 |
| <a name="provider_aws.ecr"></a> [aws.ecr](#provider\_aws.ecr) | >= 3.72 |
| <a name="provider_helm"></a> [helm](#provider\_helm) | >= 2.4.1 |
| <a name="provider_kubernetes"></a> [kubernetes](#provider\_kubernetes) | >= 2.10 |
| <a name="provider_null"></a> [null](#provider\_null) | >= 3.1 |
| <a name="provider_random"></a> [random](#provider\_random) | >= 3.1 |
@@ -29,6 +30,7 @@
|------|--------|---------|
| <a name="module_data_addons"></a> [data\_addons](#module\_data\_addons) | aws-ia/eks-data-addons/aws | ~> 1.32.0 |
| <a name="module_ebs_csi_driver_irsa"></a> [ebs\_csi\_driver\_irsa](#module\_ebs\_csi\_driver\_irsa) | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks | ~> 5.20 |
| <a name="module_efs"></a> [efs](#module\_efs) | terraform-aws-modules/efs/aws | ~> 1.6 |
| <a name="module_eks"></a> [eks](#module\_eks) | terraform-aws-modules/eks/aws | ~> 19.15 |
| <a name="module_eks_blueprints_addons"></a> [eks\_blueprints\_addons](#module\_eks\_blueprints\_addons) | aws-ia/eks-blueprints-addons/aws | ~> 1.2 |
| <a name="module_s3_bucket"></a> [s3\_bucket](#module\_s3\_bucket) | terraform-aws-modules/s3-bucket/aws | 4.1.2 |
@@ -43,12 +45,17 @@
| [aws_iam_policy.triton](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_secretsmanager_secret.grafana](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/secretsmanager_secret) | resource |
| [aws_secretsmanager_secret_version.grafana](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/secretsmanager_secret_version) | resource |
| [helm_release.nim_llm](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource |
| [kubernetes_annotations.disable_gp2](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/annotations) | resource |
| [kubernetes_namespace.nim](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/namespace) | resource |
| [kubernetes_namespace_v1.triton](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/namespace_v1) | resource |
| [kubernetes_persistent_volume_claim_v1.efs_pvc](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/persistent_volume_claim_v1) | resource |
| [kubernetes_secret_v1.huggingface_token](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/secret_v1) | resource |
| [kubernetes_secret_v1.triton](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/secret_v1) | resource |
| [kubernetes_service_account_v1.triton](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/service_account_v1) | resource |
| [kubernetes_storage_class.default_gp3](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/storage_class) | resource |
| [kubernetes_storage_class_v1.efs](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/storage_class_v1) | resource |
| [null_resource.download_nim_deploy](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource |
| [null_resource.sync_local_to_s3](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource |
| [random_password.grafana](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/password) | resource |
| [aws_availability_zones.available](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/availability_zones) | data source |
@@ -65,8 +72,11 @@
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.30"` | no |
| <a name="input_enable_nvidia_nim"></a> [enable\_nvidia\_nim](#input\_enable\_nvidia\_nim) | Toggle to enable or disable NVIDIA NIM pattern resource creation | `bool` | `false` | no |
| <a name="input_enable_nvidia_triton_server"></a> [enable\_nvidia\_triton\_server](#input\_enable\_nvidia\_triton\_server) | Toggle to enable or disable NVIDIA Triton server resource creation | `bool` | `true` | no |
| <a name="input_huggingface_token"></a> [huggingface\_token](#input\_huggingface\_token) | Hugging Face Secret Token | `string` | `"DUMMY_TOKEN_REPLACE_ME"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"nvidia-triton-server"` | no |
| <a name="input_ngc_api_key"></a> [ngc\_api\_key](#input\_ngc\_api\_key) | NGC API Key | `string` | `"DUMMY_NGC_API_KEY_REPLACE_ME"` | no |
| <a name="input_region"></a> [region](#input\_region) | region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |
24 changes: 23 additions & 1 deletion ai-ml/nvidia-triton-server/addons.tf
@@ -78,6 +78,11 @@ module "eks_blueprints_addons" {
vpc-cni = {}
}

#---------------------------------------
# AWS EFS CSI Add-on
#---------------------------------------
enable_aws_efs_csi_driver = true

#---------------------------------------
# AWS Load Balancer Controller Add-on
#---------------------------------------
@@ -128,7 +133,8 @@ module "eks_blueprints_addons" {
kube_prometheus_stack = {
values = [
templatefile("${path.module}/helm-values/kube-prometheus.yaml", {
storage_class_type = kubernetes_storage_class.default_gp3.id
storage_class_type = kubernetes_storage_class.default_gp3.id
nim_llm_dashboard_json = indent(10, file("${path.module}/monitoring/nim-llm-dashboard.json"))
})
]
chart_version = "48.1.1"
@@ -140,6 +146,22 @@
],
}

helm_releases = {
Collaborator

Nice!
"prometheus-adapter" = {
repository = "https://prometheus-community.github.io/helm-charts"
chart = "prometheus-adapter"
namespace = module.eks_blueprints_addons.kube_prometheus_stack.namespace
version = "4.10.0"
values = [
templatefile(
"${path.module}/helm-values/prometheus-adapter.yaml", {
prometheus_namespace = module.eks_blueprints_addons.kube_prometheus_stack.namespace
}
)
]
}
}

}

#---------------------------------------------------------------
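A quick way to confirm these add-ons after `terraform apply` — the release and namespace names below are assumptions based on the values above, not something this diff defines:

```bash
# Sketch: verify the monitoring Helm releases deployed by this module are running.
helm list -A | grep -E 'kube-prometheus-stack|prometheus-adapter'
kubectl get pods -n kube-prometheus-stack
```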
23 changes: 17 additions & 6 deletions ai-ml/nvidia-triton-server/cleanup.sh
@@ -2,8 +2,9 @@

read -p "Enter the region: " region
export AWS_DEFAULT_REGION=$region
export STACK_NAME="nvidia-triton-server"

echo "Destroying RayService..."
echo "Destroying LoadBalancer type service from Nginx ingress controller..."

# Delete the Ingress/SVC before removing the addons
TMPFILE=$(mktemp)
@@ -12,7 +13,7 @@ terraform output -raw configure_kubectl > "$TMPFILE"
if [[ ! $(cat $TMPFILE) == *"No outputs found"* ]]; then
echo "No outputs found, skipping kubectl delete"
source "$TMPFILE"
kubectl delete -f src/service/ray-service.yaml
kubectl delete svc -n ingress-nginx ingress-nginx-controller
fi


@@ -21,7 +22,6 @@ targets=(
"module.data_addons"
"module.eks_blueprints_addons"
"module.eks"
"module.vpc"
)

# Destroy modules in sequence
@@ -41,7 +41,7 @@ echo "Destroying Load Balancers..."

for arn in $(aws resourcegroupstaggingapi get-resources \
--resource-type-filters elasticloadbalancing:loadbalancer \
--tag-filters "Key=elbv2.k8s.aws/cluster,Values=jark-stack" \
--tag-filters "Key=elbv2.k8s.aws/cluster,Values=${STACK_NAME}" \
--query 'ResourceTagMappingList[].ResourceARN' \
--output text); do \
aws elbv2 delete-load-balancer --load-balancer-arn "$arn"; \
@@ -50,19 +50,30 @@ for arn in $(aws resourcegroupstaggingapi get-resources \
echo "Destroying Target Groups..."
for arn in $(aws resourcegroupstaggingapi get-resources \
--resource-type-filters elasticloadbalancing:targetgroup \
--tag-filters "Key=elbv2.k8s.aws/cluster,Values=jark-stack" \
--tag-filters "Key=elbv2.k8s.aws/cluster,Values=${STACK_NAME}" \
--query 'ResourceTagMappingList[].ResourceARN' \
--output text); do \
aws elbv2 delete-target-group --target-group-arn "$arn"; \
done

echo "Destroying Security Groups..."
for sg in $(aws ec2 describe-security-groups \
--filters "Name=tag:elbv2.k8s.aws/cluster,Values=jark-stack" \
--filters "Name=tag:elbv2.k8s.aws/cluster,Values=${STACK_NAME}" \
--query 'SecurityGroups[].GroupId' --output text); do \
aws ec2 delete-security-group --group-id "$sg"; \
done

## Destroy the VPC module
echo "Destroying VPC related resources..."
target="module.vpc"
destroy_output=$(terraform destroy -target="$target" -var="region=$region" -auto-approve 2>&1 | tee /dev/tty)
if [[ ${PIPESTATUS[0]} -eq 0 && $destroy_output == *"Destroy complete"* ]]; then
echo "SUCCESS: Terraform destroy of $target completed successfully"
else
echo "FAILED: Terraform destroy of $target failed"
exit 1
fi

## Final destroy to catch any remaining resources
echo "Destroying remaining resources..."
destroy_output=$(terraform destroy -var="region=$region" -auto-approve 2>&1 | tee /dev/tty)
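Because the script reads the region from stdin, it can also be driven non-interactively; a minimal sketch:

```bash
# Sketch: supply the region to the read prompt and run the full cleanup.
echo "us-west-2" | ./cleanup.sh
```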
41 changes: 17 additions & 24 deletions ai-ml/nvidia-triton-server/helm-values/kube-prometheus.yaml
@@ -21,28 +21,21 @@ alertmanager:

grafana:
enabled: true
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/default
defaultDashboardsEnabled: true
prometheus:
prometheusSpec:
retention: 5h
scrapeInterval: 30s
evaluationInterval: 30s
scrapeTimeout: 10s
serviceMonitorSelectorNilUsesHelmValues: false # This is required to use the serviceMonitorSelector
storageSpec:
volumeClaimTemplate:
metadata:
name: data
spec:
storageClassName: ${storage_class_type}
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
alertmanager:
enabled: false

grafana:
enabled: true
defaultDashboardsEnabled: true
dashboards:
default:
nim-llm-monitoring:
json: |
${nim_llm_dashboard_json}
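The dashboard JSON injected above lands in Grafana's default folder; to inspect it locally, a port-forward along these lines should work (the service name and namespace are assumptions for a default kube-prometheus-stack install):

```bash
# Sketch: open Grafana locally to view the NIM LLM dashboard.
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80
# Browse to http://localhost:3000; the Grafana admin password for this pattern is
# stored in AWS Secrets Manager (see the grafana secret resources in the README).
```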
53 changes: 53 additions & 0 deletions ai-ml/nvidia-triton-server/helm-values/nim-llm.yaml
@@ -0,0 +1,53 @@
# ref: https://github.com/NVIDIA/nim-deploy/blob/main/helm/nim-llm/values.yaml
image:
repository: nvcr.io/nim/meta/llama3-8b-instruct
tag: latest
model:
ngcAPIKey: ${ngc_api_key}
nimCache: /model-store
resources:
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
statefulSet:
enabled: true
persistence:
enabled: true
existingClaim: ${pvc_name}
nodeSelector:
NodeGroupType: g5-gpu-karpenter
type: karpenter
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
metrics:
enabled: true
serviceMonitor:
enabled: true
additionalLabels:
release: prometheus
app: prometheus
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 5
scaleDownStabilizationSecs: 300
metrics:
- type: Pods
pods:
metric:
name: num_requests_running
target:
type: Value
averageValue: 5
ingress:
enabled: true
className: nginx
annotations: {}
hosts:
- paths:
- path: /
pathType: ImplementationSpecific
serviceType: openai
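With `serviceType: openai` the chart exposes an OpenAI-compatible API; a rough smoke test, assuming the release is reachable as `svc/nim-llm` in the `nim` namespace (adjust to whatever `kubectl get svc -n nim` reports):

```bash
# Sketch: port-forward the NIM service and send a single chat completion request.
kubectl port-forward -n nim svc/nim-llm 8000:8000 &
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama3-8b-instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 32}'
# The HPA created by the autoscaling block above can be checked with:
kubectl get hpa -n nim
```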
14 changes: 14 additions & 0 deletions ai-ml/nvidia-triton-server/helm-values/prometheus-adapter.yaml
@@ -0,0 +1,14 @@
# ref: https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus-adapter/values.yaml
prometheus:
url: http://kube-prometheus-stack-prometheus.${prometheus_namespace}
port: 9090
rules:
default: false
custom:
- seriesQuery: '{__name__=~"num_requests_running"}'
resources:
template: <<.Resource>>
name:
matches: "num_requests_running"
as: ""
metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
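This rule exposes `num_requests_running` through the Kubernetes custom metrics API, which is the metric the `nim-llm` HPA above scales on. One way to verify it end to end (the `nim` namespace is an assumption):

```bash
# Sketch: confirm the adapter serves the custom metric consumed by the HPA.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name'
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim/pods/*/num_requests_running" | jq .
```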
11 changes: 10 additions & 1 deletion ai-ml/nvidia-triton-server/install.sh
@@ -17,9 +17,18 @@ if [[ "${TF_VAR_huggingface_token}" = "DUMMY_TOKEN_REPLACE_ME" ]] ; then
exit 1
fi

if [ "$TF_VAR_enable_nvidia_nim" = true ]; then
# Fail if the NGC API key does not start with "nvapi-"
# Obtain your NVIDIA NGC API key from https://docs.nvidia.com/ai-enterprise/deployment-guide-spark-rapids-accelerator/0.1.0/appendix-ngc.html
if [[ ! "$TF_VAR_ngc_api_key" == nvapi-* ]]; then
echo "FAILED: TF_VAR_ngc_api_key must start with 'nvapi-'"
exit 1
fi
fi

echo "Proceed with deployment of targets..."

List of Terraform modules to apply in sequence
# List of Terraform modules to apply in sequence
targets=(
"module.vpc"
"module.eks"
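The checks above expect the secrets as `TF_VAR_*` environment variables; a minimal sketch of preparing a NIM deployment (key values are placeholders):

```bash
# Sketch: export the inputs install.sh validates, then run the installer.
export TF_VAR_enable_nvidia_nim=true
export TF_VAR_ngc_api_key="nvapi-REPLACE_ME"       # must start with "nvapi-"
export TF_VAR_huggingface_token="hf_REPLACE_ME"    # must not be the dummy default
./install.sh
```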