
[Bug]: aws_sagemaker_endpoint maximum_execution_timeout_in_seconds value is ignored #39040

Open
r-archer37 opened this issue Aug 26, 2024 · 10 comments
Labels
bug Addresses a defect in current functionality. service/sagemaker Issues and PRs that pertain to the sagemaker service.

Comments

@r-archer37
Terraform Core Version

1.1.9

AWS Provider Version

5.63.0, 5.63.1

Affected Resource(s)

  • aws_sagemaker_endpoint

Expected Behavior

Modifying a SageMaker endpoint should time out only when maximum_execution_timeout_in_seconds is reached, and not before.

Actual Behavior

Modifying a SageMaker endpoint times out after 600 seconds (10 minutes), regardless of the value of maximum_execution_timeout_in_seconds.

Relevant Error/Panic Output Snippet

aws_sagemaker_endpoint.e: Still modifying... [id=your-endpoint-name-here, 9m41s elapsed]
aws_sagemaker_endpoint.e: Still modifying... [id=your-endpoint-name-here, 9m51s elapsed]
aws_sagemaker_endpoint.e: Still modifying... [id=your-endpoint-name-here, 10m1s elapsed]
╷
│ Error: waiting for SageMaker Endpoint (your-endpoint-name-here) to be in service: timeout while waiting for state to become 'InService' (last state: 'Updating', timeout: 10m0s)
│
│   with aws_sagemaker_endpoint.e,
│   on main.tf line 252, in resource "aws_sagemaker_endpoint" "e":
│  252: resource "aws_sagemaker_endpoint" "e" {
│
╵
Releasing state lock. This may take a few moments...
ERRO[0627] 1 error occurred:
	* exit status 1

Terraform Configuration Files

Sample main.tf

terraform {
  required_version = "= 1.1.9"
  backend "s3" {
  }
}

provider "aws" {
  # The AWS region in which all resources will be created
  region = var.aws_region

  # Only these AWS Account IDs may be operated on by this template
  allowed_account_ids = [var.aws_account_id]
}

data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    region = var.terraform_state_aws_region
    bucket = var.terraform_state_s3_bucket
    key    = "${var.aws_region}/${var.vpc_name}/vpc/terraform.tfstate"
  }
}

resource "aws_security_group" "sagemaker_endpoint" {
  name   = "sagemaker-endpoints-${var.endpoint_name}-allow-tls"
  vpc_id = data.terraform_remote_state.vpc.outputs.vpc_id

  tags = {
    Environment = var.vpc_name
  }
}

resource "aws_security_group_rule" "sagemaker_allow_inbound_tls" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = data.terraform_remote_state.vpc.outputs.private_app_subnet_cidr_blocks
  security_group_id = aws_security_group.sagemaker_endpoint.id
}

resource "aws_security_group_rule" "sagemaker_allow_all_outbound" {
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.sagemaker_endpoint.id
}

resource "aws_sagemaker_model" "model" {
  name               = var.model_name
  execution_role_arn = aws_iam_role.assume_sagemaker_role_for_model_endpoint.arn
  primary_container {
    image          = var.base_model_image_url
    model_data_url = "model-data-url-goes-here"
  }
  vpc_config {
    subnets            = data.terraform_remote_state.vpc.outputs.private_app_subnet_ids
    security_group_ids = [aws_security_group.sagemaker_endpoint.id]
  }
  tags = {
    Environment = var.vpc_name
  }
}

resource "aws_iam_role" "assume_sagemaker_role_for_model_endpoint" {
  name               = "sagemaker-endpoint-${var.endpoint_name}"
  assume_role_policy = data.aws_iam_policy_document.assume_sagemaker_role_for_model_endpoint.json
}

data "aws_iam_policy_document" "additional_sagemaker_permissions" {
  statement {
    actions = [
      "ec2:DescribeSubnets",
      "ec2:DescribeSecurityGroups",
      "ec2:DescribeVpcs",
      "ec2:CreateTags",
      "ec2:DescribeVpcEndpointConnections",
      "ec2:DescribeVpcEndpointConnectionNotifications",
      "ec2:DescribeVpcEndpointServices",
      "ec2:DescribeVpcEndpoints",
      "ec2:DescribeVpcEndpointServiceConfigurations",
      "ec2:DescribeDhcpOptions",
      "ec2:DescribeNetworkInterfaces",
      "ec2:DescribeRouteTables"
    ]
    resources = [
      "*",
    ]
  }
  statement {
    actions = [
      "sagemaker:ListModels",
      "sagemaker:DescribeModel",
      "ecr:BatchGetImage",
      "ecr:GetAuthorizationToken",
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
    ]
    resources = ["*"]
  }
  statement {
    actions = [
      "ec2:CreateNetworkInterface",
      "ec2:CreateNetworkInterfacePermission",
      "ec2:DeleteNetworkInterface",
      "ec2:DeleteNetworkInterfacePermission",
      "ec2:ModifyVpcEndpointServicePermissions",
      "ec2:ModifyVpcEndpointServiceConfiguration",
      "ec2:CreateVpcEndpointConnectionNotification",
      "ec2:AcceptVpcEndpointConnections",
      "ec2:DeleteVpcEndpoints",
      "ec2:DeleteVpcEndpointServiceConfigurations",
      "ec2:ModifyVpcEndpointConnectionNotification",
      "ec2:CreateVpcEndpointServiceConfiguration",
      "ec2:DeleteVpcEndpointConnectionNotifications",
      "ec2:CreateVpcEndpoint",
      "ec2:StartVpcEndpointServicePrivateDnsVerification",
      "ec2:RejectVpcEndpointConnections",
      "ec2:ModifyVpcEndpoint"
    ]
    resources = ["*"]
  }
  statement {
    actions = [
      "route53:DisassociateVPCFromHostedZone",
      "route53:AssociateVPCWithHostedZone"
    ]
    resources = ["*"]
  }
  statement {
    actions   = ["elasticloadbalancing:DescribeLoadBalancers"]
    resources = ["*"]
  }
  statement {
    actions = ["kms:CreateGrant",
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:DescribeKey"]
    resources = [var.endpoint_configuration_kms_key_arn, var.model_registry_s3_bucket_kms_key_arn]
  }
  statement {
    actions = [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:PutMetricData",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:CreateLogGroup",
      "logs:DescribeLogStreams",
    ]
    resources = ["*"]
  }
  statement {
    actions = [
      "s3:GetObject",
      "s3:ListBucket",
      "s3:PutObject",
      "s3:DeleteObject",
      "s3:AbortMultipartUpload",
      "s3:GetBucketLocation",
      "s3:ListAllMyBuckets",
      "s3:GetBucketAcl",
      "s3:PutObjectAcl",
    ]
    resources = [
      "arn:aws:s3:::${var.model_registry_bucket_name}/*",
      "arn:aws:s3:::${var.model_registry_bucket_name}"
    ]
  }
}

resource "aws_iam_policy" "additional_sagemaker_policy" {
  name   = "additional_sagemaker_policy_for_${var.endpoint_name}"
  policy = data.aws_iam_policy_document.additional_sagemaker_permissions.json
}


data "aws_iam_policy_document" "assume_sagemaker_role_for_model_endpoint" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "attach_additional_sagemaker_permissions" {
  role       = aws_iam_role.assume_sagemaker_role_for_model_endpoint.name
  policy_arn = aws_iam_policy.additional_sagemaker_policy.arn
}

resource "aws_appautoscaling_target" "sagemaker_target" {
  min_capacity       = var.min_instance_count
  max_capacity       = var.max_instance_count
  resource_id        = "resource-id-goes-here"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "sagemaker_policy" {
  name               = "${aws_sagemaker_endpoint.e.name}-autoscaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sagemaker_target.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }
    target_value       = var.target_invocations
    scale_in_cooldown  = var.target_scale_in_cooldown
    scale_out_cooldown = var.target_scale_out_cooldown
  }
}

resource "aws_sagemaker_endpoint_configuration" "ec" {
  name        = "ec-name-goes-here"
  kms_key_arn = var.endpoint_configuration_kms_key_arn
  production_variants {
    # Variant name should remain static when we use CANARY deployment
    variant_name           = var.endpoint_name
    initial_instance_count = var.initial_instance_count
    instance_type          = var.instance_type
    model_name             = aws_sagemaker_model.model.name
  }

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Environment = var.vpc_name
  }

}

resource "aws_sagemaker_endpoint" "e" {
  name                 = var.endpoint_name
  endpoint_config_name = aws_sagemaker_endpoint_configuration.ec.name
  tags = {
    Environment = var.vpc_name
  }

  deployment_config {
	  blue_green_update_policy {
      traffic_routing_configuration {
        type = var.sagemaker_blue_green_traffic_routing_type
        wait_interval_in_seconds = var.sagemaker_deployment_wait_interval_in_seconds

        # Canary size configuration when traffic routing type is CANARY
        dynamic "canary_size" {
                for_each = var.sagemaker_blue_green_traffic_routing_type == "CANARY" ? [1] : []
                content {
                  type = var.sagemaker_blue_green_canary_size_type
                  value = var.sagemaker_blue_green_canary_size_value
                }
        }

        dynamic "linear_step_size" {
                for_each = var.sagemaker_blue_green_traffic_routing_type == "LINEAR" ? [1] : []
                content {
                  type = var.sagemaker_blue_green_linear_size_type
                  value = var.sagemaker_blue_green_linear_size_value
                }
        }
      }
      maximum_execution_timeout_in_seconds = var.sagemaker_deployment_maximum_execution_timeout_in_seconds
      termination_wait_in_seconds = var.sagemaker_termination_wait_in_seconds
    }
    auto_rollback_configuration {
      alarms {
        alarm_name = "${var.endpoint_name}_invocation_model_errors"
      }
    }
  }
}


data "terraform_remote_state" "sns_region" {
  backend = "s3"
  config = {
    region = var.terraform_state_aws_region
    bucket = var.terraform_state_s3_bucket
    key    = "data-key-goes-here"
  }
}

resource "aws_cloudwatch_metric_alarm" "cpu_utilization_alarm" {
  count = var.disable_high_cpu_utilization_alarm ? 0 : 1
  alarm_name                = "${var.endpoint_name}_cpu_utilization"
  metric_name               = "CPUUtilization"
  namespace                 = "/aws/sagemaker/Endpoints"
  comparison_operator       = "GreaterThanThreshold"
  unit                      = "Percent"
  evaluation_periods        = var.high_cpu_utilization_evaluation_periods
  period                    = var.high_cpu_utilization_period
  statistic                 = var.high_cpu_utilization_statistic
  threshold                 = var.high_cpu_utilization_threshold
  alarm_actions             = [data.terraform_remote_state.sns_region.outputs.arn]
  ok_actions                = [data.terraform_remote_state.sns_region.outputs.arn]
  insufficient_data_actions = [data.terraform_remote_state.sns_region.outputs.arn]
  dimensions = {
    EndpointName            = var.endpoint_name
    VariantName             = aws_sagemaker_endpoint_configuration.ec.production_variants[0].variant_name
  }
}

resource "aws_cloudwatch_metric_alarm" "memory_utilization_alarm" {
  count = var.disable_high_memory_utilization_alarm ? 0 : 1
  alarm_name                = "${var.endpoint_name}_memory_utilization"
  metric_name               = "MemoryUtilization"
  namespace                 = "/aws/sagemaker/Endpoints"
  comparison_operator       = "GreaterThanThreshold"
  unit                      = "Percent"
  evaluation_periods        = var.high_memory_utilization_evaluation_periods
  period                    = var.high_memory_utilization_period
  statistic                 = var.high_memory_utilization_statistic
  threshold                 = var.high_memory_utilization_threshold
  alarm_actions             = [data.terraform_remote_state.sns_region.outputs.arn]
  ok_actions                = [data.terraform_remote_state.sns_region.outputs.arn]
  insufficient_data_actions = [data.terraform_remote_state.sns_region.outputs.arn]
  dimensions = {
    EndpointName            = var.endpoint_name
    VariantName             = aws_sagemaker_endpoint_configuration.ec.production_variants[0].variant_name
  }
}

resource "aws_cloudwatch_metric_alarm" "disk_utilization_alarm" {
  count = var.disable_high_disk_utilization_alarm ? 0 : 1
  alarm_name                = "${var.endpoint_name}_disk_utilization"
  metric_name               = "DiskUtilization"
  namespace                 = "/aws/sagemaker/Endpoints"
  comparison_operator       = "GreaterThanThreshold"
  unit                      = "Percent"
  evaluation_periods        = var.high_disk_utilization_evaluation_periods
  period                    = var.high_disk_utilization_period
  statistic                 = var.high_disk_utilization_statistic
  threshold                 = var.high_disk_utilization_threshold
  alarm_actions             = [data.terraform_remote_state.sns_region.outputs.arn]
  ok_actions                = [data.terraform_remote_state.sns_region.outputs.arn]
  insufficient_data_actions = [data.terraform_remote_state.sns_region.outputs.arn]
  dimensions = {
    EndpointName            = var.endpoint_name
    VariantName             = aws_sagemaker_endpoint_configuration.ec.production_variants[0].variant_name
  }
}

resource "aws_cloudwatch_metric_alarm" "invocation_model_errors_alarm" {
  count = var.disable_invocation_model_errors_alarm ? 0 : 1
  alarm_name                = "${var.endpoint_name}_invocation_model_errors"
  metric_name               = "InvocationModelErrors"
  namespace                 = "AWS/SageMaker"
  comparison_operator       = "GreaterThanThreshold"
  evaluation_periods        = var.invocation_model_errors_evaluation_periods
  period                    = var.invocation_model_errors_period
  statistic                 = var.invocation_model_errors_statistic
  threshold                 = var.invocation_model_errors_threshold
  alarm_actions             = [data.terraform_remote_state.sns_region.outputs.arn]
  ok_actions                = [data.terraform_remote_state.sns_region.outputs.arn]
  insufficient_data_actions = [data.terraform_remote_state.sns_region.outputs.arn]
  dimensions = {
    EndpointName            = var.endpoint_name
    VariantName             = aws_sagemaker_endpoint_configuration.ec.production_variants[0].variant_name
  }
}

resource "aws_cloudwatch_metric_alarm" "model_latency_alarm" {
  count = var.disable_model_latency_alarm ? 0 : 1
  alarm_name                = "${var.endpoint_name}_model_latency"
  metric_name               = "ModelLatency"
  namespace                 = "AWS/SageMaker"
  comparison_operator       = "GreaterThanThreshold"
  unit                      = "Microseconds"
  evaluation_periods        = var.model_latency_evaluation_periods
  period                    = var.model_latency_period
  statistic                 = var.model_latency_statistic
  threshold                 = var.model_latency_threshold
  alarm_actions             = [data.terraform_remote_state.sns_region.outputs.arn]
  ok_actions                = [data.terraform_remote_state.sns_region.outputs.arn]
  insufficient_data_actions = [data.terraform_remote_state.sns_region.outputs.arn]
  dimensions = {
    EndpointName            = var.endpoint_name
    VariantName             = aws_sagemaker_endpoint_configuration.ec.production_variants[0].variant_name
  }
}

resource "aws_cloudwatch_metric_alarm" "overhead_latency_alarm" {
  count = var.disable_overhead_latency_alarm ? 0 : 1
  alarm_name                = "${var.endpoint_name}_overhead_latency"
  metric_name               = "OverheadLatency"
  namespace                 = "AWS/SageMaker"
  comparison_operator       = "GreaterThanThreshold"
  unit                      = "Microseconds"
  evaluation_periods        = var.overhead_latency_evaluation_periods
  period                    = var.overhead_latency_period
  statistic                 = var.overhead_latency_statistic
  threshold                 = var.overhead_latency_threshold
  alarm_actions             = [data.terraform_remote_state.sns_region.outputs.arn]
  ok_actions                = [data.terraform_remote_state.sns_region.outputs.arn]
  insufficient_data_actions = [data.terraform_remote_state.sns_region.outputs.arn]
  dimensions = {
    EndpointName            = var.endpoint_name
    VariantName             = aws_sagemaker_endpoint_configuration.ec.production_variants[0].variant_name
  }
}

Sample vars.tf

variable "aws_region" {
  description = "The AWS region in which all resources will be created"
  type        = string
}

variable "aws_account_id" {
  description = "An AWS Account ID. Only this ID may be operated on by this template, where all the resources should be created."
  type        = string
}

variable "terraform_state_aws_region" {
  description = "The AWS region of the S3 bucket used to store Terraform remote state"
  type        = string
}

variable "terraform_state_s3_bucket" {
  description = "The name of the S3 bucket used to store Terraform remote state"
  type        = string
}

variable "vpc_name" {
    type = string
    description = "Name of the vpc."
}

variable "initial_instance_count" {
    type = number
    default = 2
    description = "Number of instances (EC2s) to create; if auto-scaling is enabled, this is the number it starts with."
}

variable "min_instance_count" {
    type = number
    default = 2
    description = "Minimum number of instances (EC2s) to scale down to."
}

variable "max_instance_count" {
    type = number
    default = 2
    description = "Maximum number of instances (EC2s) to scale up to."
}

variable "target_invocations" {
    type = number
    description = "The desired average number of times per minute that each instance for a variant is invoked"
}

variable "target_scale_in_cooldown" {
    type = number
    default = 300
    description = "The amount of time, in seconds, after a scale-in activity completes before another scale-in activity can start. 300 is the AWS default."
}

variable "target_scale_out_cooldown" {
    type = number
    default = 300
    description = "The amount of time, in seconds, to wait for a previous scale-out activity to take effect. 300 is the AWS default."
}

variable "instance_type" {
    type = string
    description = "The AWS instance type you want to use to host the Sagemaker endpoint."
}

variable "endpoint_name" {
    type = string
    description = "The name of the Sagemaker endpoint. This ends up in the URL for the endpoint."
}

variable "model_name" {
    type = string
    description = "The AWS Sagemaker Model name from the Sagemaker model registry."
}

variable "endpoint_configuration_kms_key_arn" {
    type = string
    description = "The kms key arn that encrypts the storage device attached to the instance the endpoint is hosted on."
}

variable "base_model_image_url" {
    type = string
    description = "The aws ecr URL to the base container image for your model. This image has to be compatible with your saved/logged model."
}

variable "model_registry_bucket_name" {
    type = string
    description = "The model artifact is copied from the user's s3 location to the s3 'model registry.' Model registry location is model_registry_bucket_name + model_registry_key_prefix + model_name + model_version."
}

variable "model_registry_key_prefix" {
    type = string
    description = "The model artifact is copied from the user's s3 location to the s3 'model registry.' Model registry location is model_registry_bucket_name + model_registry_key_prefix + model_name + model_version."
}

variable "model_registry_s3_bucket_kms_key_arn" {
    type = string
    description = "The kms key that is encrypting the model sitting in s3."
}

variable "model_version" {
    type = string
    description = "The model artifact is copied from the user's s3 location to the s3 'model registry.' Model registry location is model_registry_bucket_name + model_registry_key_prefix + model_name + model_version."
}

variable "sagemaker_blue_green_traffic_routing_type" {
    type = string
    description = "Traffic routing strategy type. Supported values are 'ALL_AT_ONCE', 'CANARY', 'LINEAR'"
    default = "LINEAR"
}

# Canary routing type specific variables

variable "sagemaker_blue_green_canary_size_type" {
  type        = string
  description = "(Required for CANARY) Specifies the endpoint capacity type. Supported values are 'INSTANCE_COUNT', 'CAPACITY_PERCENT'"
  default     = "CAPACITY_PERCENT"
}

variable "sagemaker_blue_green_canary_size_value" {
  type        = number
  description = "(Required for CANARY) Defines the capacity size, either as a number of instances or a capacity percentage."
  default     = 50
}

# End of canary routing type specific variables

# Linear routing type specific variables

variable "sagemaker_blue_green_linear_size_type" {
  type        = string
  description = "(Required for LINEAR) Specifies the endpoint capacity type. Supported values are 'INSTANCE_COUNT', 'CAPACITY_PERCENT'"
  default     = "CAPACITY_PERCENT"
}

variable "sagemaker_blue_green_linear_size_value" {
  type        = number
  description = "(Required for LINEAR) Defines the capacity size, either as a number of instances or a capacity percentage."
  default     = 20
}

# End of linear routing type specific variables

variable "sagemaker_deployment_maximum_execution_timeout_in_seconds" {
    type = number
    description = "Maximum execution timeout for the deployment. Note that the timeout value should be larger than the total waiting time specified in termination_wait_in_seconds and wait_interval_in_seconds. Valid values are between 600 and 14400."
    default = 14400
}

variable "sagemaker_termination_wait_in_seconds" {
    type = number
    description = "(Optional) Additional waiting time in seconds after the completion of an endpoint deployment before terminating the old endpoint fleet. Default is 0. Valid values are between 0 and 3600."
    default = 300
}

variable "sagemaker_deployment_wait_interval_in_seconds" {
    type = number
    description = "(Required) The waiting time (in seconds) between incremental steps to turn on traffic on the new endpoint fleet. Valid values are between 0 and 3600."
    default = 300
}

variable "high_cpu_utilization_threshold" {
  description = "Trigger an alarm if the endpoint instance has a CPU utilization percentage above this threshold"
  default     = 90
  type        = number
}

variable "high_cpu_utilization_period" {
  description = "The period, in seconds, over which to measure the CPU utilization percentage"
  default     = 60
  type        = number
}

variable "high_cpu_utilization_evaluation_periods" {
  description = "The number of periods over which data is compared to the specified threshold."
  default     = 3
  type        = number
}

variable "high_cpu_utilization_statistic" {
  description = "The statistic to apply to the alarm's associated metric. [SampleCount, Average, Sum, Minimum, Maximum]"
  default     = "Average"
  type        = string
}

variable "disable_high_cpu_utilization_alarm" {
  description = "Toggle for the high cpu utilization alarm"
  default     = false
  type        = bool
}

variable "high_memory_utilization_threshold" {
  description = "Trigger an alarm if the endpoint instance has a Memory utilization percentage above this threshold"
  default     = 90
  type        = number
}

variable "high_memory_utilization_period" {
  description = "The period, in seconds, over which to measure the Memory utilization percentage"
  default     = 300
  type        = number
}

variable "high_memory_utilization_evaluation_periods" {
  description = "The number of periods over which data is compared to the specified threshold."
  default     = 3
  type        = number
}

variable "high_memory_utilization_statistic" {
  description = "The statistic to apply to the alarm's associated metric. [SampleCount, Average, Sum, Minimum, Maximum]"
  default     = "Average"
  type        = string
}

variable "disable_high_memory_utilization_alarm" {
  description = "Toggle for the high memory utilization alarm"
  default     = false
  type        = bool
}

variable "high_disk_utilization_threshold" {
  description = "Trigger an alarm if the endpoint instance has a disk utilization percentage above this threshold"
  default     = 90
  type        = number
}

variable "high_disk_utilization_period" {
  description = "The period, in seconds, over which to measure the disk utilization percentage"
  default     = 300
  type        = number
}

variable "high_disk_utilization_evaluation_periods" {
  description = "The number of periods over which data is compared to the specified threshold."
  default     = 3
  type        = number
}

variable "high_disk_utilization_statistic" {
  description = "The statistic to apply to the alarm's associated metric. [SampleCount, Average, Sum, Minimum, Maximum]"
  default     = "Maximum"
  type        = string
}

variable "disable_high_disk_utilization_alarm" {
  description = "Toggle for the high disk utilization alarm"
  default     = false
  type        = bool
}

variable "invocation_model_errors_threshold" {
  description = "Trigger an alarm if the number of model invocation requests which did not result in a 2XX HTTP response exceeds this threshold."
  default     = 1
  type        = number
}

variable "invocation_model_errors_period" {
  description = "The period, in seconds, over which to measure the invocation model errors"
  default     = 60
  type        = number
}

variable "invocation_model_errors_evaluation_periods" {
  description = "The number of periods over which data is compared to the specified threshold."
  default     = 3
  type        = number
}

variable "invocation_model_errors_statistic" {
  description = "The statistic to apply to the alarm's associated metric. [Average, Sum]"
  default     = "Sum"
  type        = string
}

variable "disable_invocation_model_errors_alarm" {
  description = "Toggle for the invocation model errors alarm"
  default     = false
  type        = bool
}

variable "model_latency_threshold" {
  description = "Trigger an alarm if the interval of time taken by a model to respond, as viewed from SageMaker, exceeds this threshold in microseconds."
  default     = 10000000
  type        = number
}

variable "model_latency_period" {
  description = "The period, in seconds, over which to measure the model latency"
  default     = 300
  type        = number
}

variable "model_latency_evaluation_periods" {
  description = "The number of periods over which data is compared to the specified threshold."
  default     = 3
  type        = number
}

variable "model_latency_statistic" {
  description = "The statistic to apply to the alarm's associated metric. [Average, Sum, Min, Max, Sample Count]"
  default     = "Maximum"
  type        = string
}

variable "disable_model_latency_alarm" {
  description = "Toggle for the model latency alarm"
  default     = false
  type        = bool
}

variable "overhead_latency_threshold" {
  description = "Trigger an alarm if the interval of time added by SageMaker overheads to the time taken to respond to a client request exceeds this threshold in microseconds."
  default     = 10000000
  type        = number
}

variable "overhead_latency_period" {
  description = "The period, in seconds, over which to measure the overhead latency"
  default     = 300
  type        = number
}

variable "overhead_latency_evaluation_periods" {
  description = "The number of periods over which data is compared to the specified threshold."
  default     = 3
  type        = number
}

variable "overhead_latency_statistic" {
  description = "The statistic to apply to the alarm's associated metric. [Average, Sum, Min, Max, Sample Count]"
  default     = "Maximum"
  type        = string
}

variable "disable_overhead_latency_alarm" {
  description = "Toggle for the overhead latency alarm"
  default     = false
  type        = bool
}

Sample terragrunt.hcl

terraform {
  source = "module-path-goes-here"
}

terragrunt_version_constraint = ">= 0.37.2, < 0.37.3"
terraform_version_constraint  = ">= 1.1.9, < 1.1.10"

include { # Include all settings from the root terraform.tfvars file
  path = find_in_parent_folders()
}
inputs = merge(merge(
  read_terragrunt_config(find_in_parent_folders("region.hcl")).inputs,
  read_terragrunt_config(find_in_parent_folders("environment.hcl")).inputs),
  {
    endpoint_name                        = "endpoint-name-goes-here"
    model_name                           = "model-name-goes-here"
    instance_type                        = "ml.r5.2xlarge"
    initial_instance_count               = 2
    min_instance_count                   = 2
    max_instance_count                   = 4
    target_invocations                   = 50
    endpoint_configuration_kms_key_arn   = "arn-id-goes-here"
    base_model_image_url                 = "base-model-image-url-goes-here"
    model_registry_bucket_name           = "bucket-name-goes-here"
    model_registry_key_prefix            = "key-prefix-goes-here"
    model_registry_s3_bucket_kms_key_arn = "key-arn-goes-here"
    model_version                        = "model-version-goes-here"
    sagemaker_deployment_wait_interval_in_seconds = 900
    sagemaker_termination_wait_in_seconds = 900
  }
)

Steps to Reproduce

Using AWS provider v5.63.0 or v5.63.1, run an update of a SageMaker endpoint with Blue-Green deploy configured like so:

blue_green_update_policy {
    maximum_execution_timeout_in_seconds = 14400
    termination_wait_in_seconds          = 900

    traffic_routing_configuration {
        type                     = "LINEAR"
        wait_interval_in_seconds = 900

        linear_step_size {
            type  = "CAPACITY_PERCENT"
            value = 20
        }
    }
}

Observe that timeout happens at 10 minutes.
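Some back-of-the-envelope arithmetic shows why a 10-minute waiter can never succeed with this configuration. This is a rough sketch only: the exact rollout schedule is determined by SageMaker, and the step count below simply assumes equal CAPACITY_PERCENT increments.

```go
package main

import "fmt"

// estimateRolloutSeconds gives a lower bound on a LINEAR blue/green rollout:
// one wait interval per traffic-shifting step, plus the termination wait for
// the old fleet. Step count assumes equal-sized capacity increments.
func estimateRolloutSeconds(stepPercent, waitIntervalSec, terminationWaitSec int) int {
	steps := 100 / stepPercent // 20% steps -> 5 traffic shifts
	return steps*waitIntervalSec + terminationWaitSec
}

func main() {
	total := estimateRolloutSeconds(20, 900, 900) // values from the repro config
	fmt.Printf("estimated rollout: %d s; provider waiter gives up after 600 s\n", total)
	// prints: estimated rollout: 5400 s; provider waiter gives up after 600 s
}
```

So even in the best case the rollout needs roughly 5400 seconds, nine times the provider's hardcoded 600-second wait.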

### Debug Output

_No response_

### Panic Output

_No response_

### Important Factoids

_No response_

### References

_No response_

### Would you like to implement a fix?

No
@r-archer37 r-archer37 added the bug Addresses a defect in current functionality. label Aug 26, 2024

Community Note

Voting for Prioritization

  • Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
  • Please see our prioritization guide for information on how we prioritize.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

  • If you are interested in working on this issue, please leave a comment.
  • If this would be your first contribution, please review the contribution guide.

@github-actions github-actions bot added service/appautoscaling Issues and PRs that pertain to the appautoscaling service. service/cloudwatch Issues and PRs that pertain to the cloudwatch service. service/iam Issues and PRs that pertain to the iam service. service/sagemaker Issues and PRs that pertain to the sagemaker service. service/vpc Issues and PRs that pertain to the vpc service. labels Aug 26, 2024
@r-archer37 r-archer37 changed the title [Bug]: maximum_execution_timeout_in_seconds value is ignored [Bug]: aws_sagemaker_endpoint maximum_execution_timeout_in_seconds value is ignored Aug 26, 2024
@terraform-aws-provider terraform-aws-provider bot added the needs-triage Waiting for first response or review from a maintainer. label Aug 26, 2024
@justinretzolk justinretzolk removed service/iam Issues and PRs that pertain to the iam service. service/cloudwatch Issues and PRs that pertain to the cloudwatch service. needs-triage Waiting for first response or review from a maintainer. service/appautoscaling Issues and PRs that pertain to the appautoscaling service. service/vpc Issues and PRs that pertain to the vpc service. labels Aug 26, 2024
@acwwat
Contributor

acwwat commented Aug 27, 2024

@r-archer37 I am not too familiar with SageMaker, but looking at the documentation, it seems that maximum_execution_timeout_in_seconds only controls deployment timeouts once the endpoint has been created and is used to provision resources and deploy models.

The timeout error you see pertains specifically to the creation of the endpoint itself and is dictated by the provider. According to the code, the wait time for the endpoint to become in service is hardcoded to 10 minutes; the same goes for waiting for endpoint deletion. I am not sure how long creating an endpoint generally takes, but we have the option to increase the timeout to a value within reason.

Is there any way you can try to create the same endpoint in the Management Console or via the CLI, just to see how much longer than 10 minutes it takes?

@admirationmr

I'm still facing the same issue when trying to create an endpoint through Terraform (using the latest version of the AWS provider). It takes more than 10 minutes, and as a result the GitHub Actions workflow fails. However, the endpoint is already created on the AWS side. If I execute my Terraform plan again, it will attempt to recreate the resource, because the failed apply was not saved in state even though the resource already exists.
Is there anything we can do? For now I can just import the existing resource into Terraform, but I'd like to avoid that.

@acwwat
Contributor

acwwat commented Aug 30, 2024

Hmmm, I tried to match the acceptance test case to @r-archer37's config as closely as possible, but I still can't get it to time out. Unless someone can create the endpoint manually and measure the time, it might be best to just double the timeout for the endpoint to become in service, from 10 minutes to 20 minutes. I'll submit a PR shortly.

@r-archer37
Author

Hi @acwwat, thanks for your reply! Also, sorry that mine is so delayed.

Referring back to your first comment, I wanted to be sure I was being clear: the maximum_execution_timeout_in_seconds I am referring to is part of the BlueGreenUpdatePolicy object (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BlueGreenUpdatePolicy.html#sagemaker-Type-BlueGreenUpdatePolicy-MaximumExecutionTimeoutInSeconds). It does not apply to the creation of the endpoint, but to the period during which it is being modified using a blue/green strategy. We can see this, for example, by setting our timeouts in a way that's not compatible with it: maximum_execution_timeout_in_seconds = 600, termination_wait_in_seconds = 900. That produces an error along the lines of:

│ 	status code: 400, request id: 27ba5001-076e-4eab-9196-d1e5b610b1c0
│
│   with aws_sagemaker_endpoint.e,
│   on main.tf line 253, in resource "aws_sagemaker_endpoint" "e":
│  253: resource "aws_sagemaker_endpoint" "e" {
│
╵
Releasing state lock. This may take a few moments...
ERRO[0052] 1 error occurred:
exit status 1

I don't think this is the same timeout you're referring to; or, if it is, then it should be overridden by the value we're supplying to this argument, and increasing the hardcoded value would not help.
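Concretely, the incompatible combination mentioned above corresponds to a policy fragment like this (a sketch of just the two relevant arguments), which the API rejects with the 400 shown:

```hcl
blue_green_update_policy {
  # Rejected by the API: the overall deployment window (600s) is
  # shorter than the termination wait (900s), so the combination
  # cannot be satisfied and SageMaker returns a 400.
  maximum_execution_timeout_in_seconds = 600
  termination_wait_in_seconds          = 900
}
```

The fact that SageMaker validates this argument at all shows it is being passed through to the API; the question is whether the provider's own wait logic respects it.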

I am still able to reproduce the error by moving up from version 5.62.0 to 5.63.1, and to help demonstrate that, I am attaching two files showing as much. In each of them you can see the deprecation warning that indicates which version of the provider is in use. Additionally, in the file with the error, you can see that we upgrade to 5.63.1 before running the apply.

plan_success_5620.txt
plan_error_5631.txt

@acwwat
Contributor

acwwat commented Sep 2, 2024

@r-archer37 Thanks for your reply.

To confirm: I understand that you are changing maximum_execution_timeout_in_seconds and termination_wait_in_seconds, which are part of blue_green_update_policy in your configuration, but applying the configuration (i.e. updating the endpoint) times out.

My earlier tests used a configuration that includes these two arguments in the blue_green_update_policy block; however, I was testing creation instead of update, which worked for me. The update scenario seems more involved behind the scenes on the AWS side, because it may shift traffic away from existing compute resources as described in UpdateEndpoint. All of this logic is encapsulated in the (asynchronous) UpdateEndpoint API, which is what the Terraform provider calls. However, the timeout in the resource code for waiting for the endpoint to transition from Updating to InService (polling the DescribeEndpoint API) is still 10 minutes, a hardcoded value shared by both creation and update. Thus those two deployment-specific timeouts might very well be working as intended, and the deployment will finish processing even after Terraform times out.

It also looks like other folks are hitting a similar timeout issue for creation, where it took slightly longer than 10 minutes to complete. I would thus recommend that we see whether 20 minutes is sufficient for now. If not, I think we should look into configurable timeouts for more flexibility, since mileage varies.

I hope this assessment of the situation is correct, but let me know if I am not viewing it from the right angle. Thanks.

@admirationmr

@acwwat Sorry for the confusion. I was referring to the creation of the endpoint, not the blue/green setting. The first two times I tried to create the endpoint through GitHub Actions, it took more than 10 minutes and timed out. However, the API call was still in progress on the AWS side, so after a few moments the endpoint was created. Unfortunately, in the Terraform state it was marked as tainted. As a result, if I retried the job, it would attempt to create the endpoint again, but since it was already present in AWS the attempt would fail. On the third attempt, creating the endpoint took just 7 minutes. I'm not sure why it took longer on the first two occasions and was quicker on the third.

I guess it's luck? Everything was tried in us-east-1...

@acwwat
Contributor

acwwat commented Sep 2, 2024

@admirationmr Thanks, your scenario is just another one that triggers the same problem of the timeout being too short :) #39090 is another similar case I spotted while working on the PR.

Perhaps the 10-minute wait for creation is borderline, and results vary depending on time of use, region, endpoint configuration, etc. We probably won't know for sure, so we just need to pad the timeout a bit. For updates, 10 minutes is likely not sufficient if existing compute resources are being shifted or rotated. I hope 20 minutes can address both cases; if it doesn't, we'll need configurable timeouts to address this once and for all. Hopefully the low-effort fix is sufficient for the time being.

@r-archer37
Author

Thanks @acwwat, yes you have the correct understanding of my situation. Now please help me check my own understanding!

It looks like the value of endpointInServiceTimeout is used for both endpoint creation and update. It was hard-coded to 10 minutes in v5.63.0 (71da162), which would explain the error I reported. Now, as of #39090, it seems the timeout is hard-coded to 1 hour.

But it seems you weren't able to reproduce my error? You can see in the text files in my last comment that with v5.62.0 my endpoint took ~18 minutes to update, but that with v5.63.1 it timed out at exactly 10 minutes. Regardless of the specific value, I believe this is because the provider is overriding the user-supplied maximum_execution_timeout_in_seconds parameter.

Apologies for not being very Go-literate, but can we confirm from the code whether that parameter is being supplied correctly?

@r-archer37
Author

I did some GPT-assisted digging, and I am now more confident that the original title I gave this report is correct: the value supplied in maximum_execution_timeout_in_seconds is ignored. It is correctly set and validated in internal/service/sagemaker/endpoint.go, but it is never used.

Instead, resourceEndpointUpdate calls waitEndpointInService, which relies on endpointInServiceTimeout, which is hard-coded to 60 minutes in https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/sagemaker/wait.go#L28

This doesn't align with AWS's or Terraform's documentation and should be considered a bug.
