Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add readiness probe on TGIS container (caikit+tgis) #156

Open
dtrifiro opened this issue Nov 13, 2023 · 10 comments
Open

add readiness probe on TGIS container (caikit+tgis) #156

dtrifiro opened this issue Nov 13, 2023 · 10 comments
Assignees
Labels
kind/feature New feature

Comments

@dtrifiro
Copy link
Contributor

  • model should be loaded
  • more?

@Xaenalt

@dtrifiro dtrifiro added the kind/feature New feature label Nov 13, 2023
@dtrifiro dtrifiro self-assigned this Nov 13, 2023
@Xaenalt
Copy link
Member

Xaenalt commented Nov 13, 2023

All we should need is model loaded in TGIS, caikit will make the API available kind of regardless of whether or not it's loaded (since it attempts to lazy load the connections, etc), so ensuring the foundation model at least is loaded on TGIS before declaring container ready seems to be the best bet here

@Xaenalt
Copy link
Member

Xaenalt commented Nov 13, 2023

This was observed in perfscale where they have a large amount of contention on the GPU, so it may take a long time for models to be able to reserve space on the GPU, this would potentially also be a problem for a very large FM

@dtrifiro
Copy link
Contributor Author

If text-generation-inference supports grpc healthchecks it should be enough to add liveness/readiness probes to kserve-container for the caikit+tgis and tgis-only ServingRuntimes.

      livenessProbe:
        grpc:
          port: 8033
        initialDelaySeconds: 10
      readinessProbe:
        grpc:
          port: 8033
        initialDelaySeconds: 10

A liveness/readiness probe could also be added for caikit (transformer-container in caikit+tgis servingruntime).
The definition would be similar to the above for grpc, switching grpc port to 8085. Otherwise, to run the healthcheck on the http server:

       readinessProbe:  
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
      livenessProbe:
        httpGet:
          path: /health
          port: 8080

@vaibhavjainwiz
Copy link
Member

I am going to replicate this issue on my env and will try to add liveness/readiness probe to see if it solve the problem.

@dtrifiro
Copy link
Contributor Author

dtrifiro commented Nov 29, 2023

It seems there's an issue with how kserve handles the ServingRuntime. Adding a grpc ReadinessProbe to kserve-container results in an error in the event log:

failed to create deployment "caikit-example-isvc-predictor-00001-deployment": Deployment.apps "caikit
-example-isvc-predictor-00001-deployment" is invalid: spec.template.spec.containers[0].readinessProbe: Required value: must specify a handler type

It seems that readinessProbe.grpc is removed somewhere:

KsvcReconciler log entry
{
  "level": "info",
  "ts": "2023-11-29T12:27:27Z",
  "logger": "KsvcReconciler",
  "msg": "knative service configuration diff (-desired, +observed):",
  "diff": "  v1.ConfigurationSpec{\n  \tTemplate: v1.RevisionTemplateSpec{\n  \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kserve.io/inferenceservice\": \"caikit-example-isvc\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling
.knative.dev/min-scale\": \"1\", \"internal.serving.kserve.io/storage-initializer-sourceuri\": \"pvc://caikit-claim/flan-t5-small-caikit/\", \"sidecar.istio.io/inject\": \"true\", ...}},\n  \t\tSpec: v1.RevisionSpec{\n  \t\t\tPodSpec: v1.PodSpec{\n  \t\t\t\tVolumes:        {{Name: \"config-volu
me\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"caikit-tgis-config\"}}}}},\n  \t\t\t\tInitContainers: nil,\n  \t\t\t\tContainers: []v1.Container{\n  \t\t\t\t\t{\n  \t\t\t\t\t\t... // 6 identical fields\n  \t\t\t\t\t\tEnvFrom: nil,\n  \t\t\t\t\t\tEnv:     {{Name: \"TRANSFORMERS_
CACHE\", Value: \"/tmp/transformers_cache\"}},\n  \t\t\t\t\t\tResources: v1.ResourceRequirements{\n- \t\t\t\t\t\t\tLimits:   v1.ResourceList{},\n+ \t\t\t\t\t\t\tLimits:   nil,\n- \t\t\t\t\t\t\tRequests: v1.ResourceList{},\n+ \t\t\t\t\t\t\tRequests: nil,\n  \t\t\t\t\t\t\tClaims:   nil,\n  \t\t\t
\t\t\t},\n  \t\t\t\t\t\tVolumeMounts:  nil,\n  \t\t\t\t\t\tVolumeDevices: nil,\n  \t\t\t\t\t\tLivenessProbe: nil,\n  \t\t\t\t\t\tReadinessProbe: &v1.Probe{\n  \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n  \t\t\t\t\t\t\t\tExec:      nil,\n  \t\t\t\t\t\t\t\tHTTPGet:   nil,\n  \t\t\t\t\t\t\t\tTC
PSocket: nil,\n- \t\t\t\t\t\t\t\tGRPC:      s\"&GRPCAction{Port:8083,Service:nil,}\",\n+ \t\t\t\t\t\t\t\tGRPC:      nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n  \t\t\t\t\t\t\tTimeoutSeconds:      0,\n  \t\t\t\t\t\t\t... // 4 identical fields\n  \t\t\t\t\t\t},\n  \t\t\t\t
\t\tStartupProbe: nil,\n  \t\t\t\t\t\tLifecycle:    nil,\n  \t\t\t\t\t\t... // 7 identical fields\n  \t\t\t\t\t},\n  \t\t\t\t\t{\n  \t\t\t\t\t\t... // 6 identical fields\n  \t\t\t\t\t\tEnvFrom: nil,\n  \t\t\t\t\t\tEnv:     nil,\n  \t\t\t\t\t\tResources: v1.ResourceRequirements{\n- \t\t\t\t\t\t\
tLimits:   v1.ResourceList{},\n+ \t\t\t\t\t\t\tLimits:   nil,\n- \t\t\t\t\t\t\tRequests: v1.ResourceList{},\n+ \t\t\t\t\t\t\tRequests: nil,\n  \t\t\t\t\t\t\tClaims:   nil,\n  \t\t\t\t\t\t},\n  \t\t\t\t\t\tVolumeMounts:  {{Name: \"config-volume\", ReadOnly: true, MountPath: \"/caikit/config/\"}}
,\n  \t\t\t\t\t\tVolumeDevices: nil,\n  \t\t\t\t\t\tLivenessProbe: nil,\n  \t\t\t\t\t\tReadinessProbe: &v1.Probe{\n  \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n  \t\t\t\t\t\t\t\tExec:      nil,\n  \t\t\t\t\t\t\t\tHTTPGet:   nil,\n  \t\t\t\t\t\t\t\tTCPSocket: &{},\n- \t\t\t\t\t\t\t\tGRPC:
  s\"&GRPCAction{Port:8085,Service:nil,}\",\n+ \t\t\t\t\t\t\t\tGRPC:      nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n  \t\t\t\t\t\t\tTimeoutSeconds:      0,\n  \t\t\t\t\t\t\t... // 4 identical fields\n  \t\t\t\t\t\t},\n  \t\t\t\t\t\tStartupProbe: nil,\n  \t\t\t\t\t\tLife
cycle:    nil,\n  \t\t\t\t\t\t... // 7 identical fields\n  \t\t\t\t\t},\n  \t\t\t\t},\n  \t\t\t\tEphemeralContainers: nil,\n  \t\t\t\tRestartPolicy:       \"\",\n  \t\t\t\t... // 34 identical fields\n  \t\t\t},\n  \t\t\tContainerConcurrency: &0,\n  \t\t\tTimeoutSeconds:       &300,\n  \t\t\t...
 // 2 identical fields\n  \t\t},\n  \t},\n  }\n"
}

Which, when formatted in a sane manner becomes:

reconciler diff
v1.ConfigurationSpec{
    Template: v1.RevisionTemplateSpec{
      ObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kserve.io/inferenceservice\": \"caikit-example-isvc\"}, Annotations: {\"autoscali
ng.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/min-scale\": \"1\", \"internal.serving.kserve.io/storage-initializer-sourceuri\": \"pvc://caikit-claim/flan-t5-small-caikit/\", \"sidecar.isti
o.io/inject\": \"true\", ...}},
      Spec: v1.RevisionSpec{
        PodSpec: v1.PodSpec{
          Volumes:        {{Name: \"config-volume\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"caikit-tgis-config\"
}}}}},
          InitContainers: nil,
          Containers: []v1.Container{
            {
              ... // 6 identical fields
              EnvFrom: nil,
              Env:     {{Name: \"TRANSFORMERS_CACHE\", Value:
 \"/tmp/transformers_cache\"}},
              Resources: v1.ResourceRequirements{
-               Limits:   v1.ResourceList{},
+               Limits:   nil,
-               Requests: v1.ResourceList{},
+               R
equests: nil,
                Claims:   nil,
              },
              VolumeMounts:  nil,
              VolumeDevices: nil,
              LivenessProbe: nil,
              ReadinessProbe: &v1.Probe{
            \
t  ProbeHandler: v1.ProbeHandler{
                  Exec:      nil,
                  HTTPGet:   nil,
                  TCPSocket: nil,
-                 GRPC:      s\"&GRPCAction{Port:8083,Service:nil,}\",
+                 GRPC:      nil,
                },
                InitialDelaySeconds: 0,
                TimeoutSeconds:      0,
                ... // 4 identical fields
              },
              StartupProbe: nil,

        Lifecycle:    nil,
              ... // 7 identical fields
            },
            {
              ... // 6 identical fields
              EnvFrom: nil,
              Env:     nil,
              Resources: v
1.ResourceRequirements{
-               Limits:   v1.ResourceList{},
+               Limits:   nil,
-               Requests: v1.ResourceList{},
+               Requests: nil,
                Claims:   nil,

  },
              VolumeMounts:  {{Name: \"config-volume\", ReadOnly: true, MountPath: \"/caikit/config/\"}},
              VolumeDevices: nil,
              LivenessProbe: nil,
              ReadinessProbe: &v1.Probe{

                ProbeHandler: v1.ProbeHandler{
                  Exec:      nil,
                  HTTPGet:   nil,
                  TCPSocket: &{},
-                 GRPC:      s\"&GRPCAction{Port:8085,Service:nil,}\",
+                 GRPC:      nil,
                },
                InitialDelaySeconds: 0,
                TimeoutSeconds:      0,
                ... // 4 identical fields
              },
              StartupProbe:
 nil,
              Lifecycle:    nil,
              ... // 7 identical fields
            },
          },
          EphemeralContainers: nil,
          RestartPolicy:       \"\",
          ... // 34 identical fields

        },
        ContainerConcurrency: &0,
        TimeoutSeconds:       &300,
        ... // 2 identical fields
      },
    },
  }

@heyselbi
Copy link
Contributor

related: #170

@heyselbi heyselbi assigned dtrifiro and unassigned vaibhavjainwiz Nov 30, 2023
@dtrifiro
Copy link
Contributor Author

dtrifiro commented Dec 5, 2023

Seems we are currently blocked by knative/serving#8949 (comment)

@dtrifiro dtrifiro moved this from In Progress to To-do/Groomed in ODH Model Serving Planning Dec 5, 2023
@skonto
Copy link

skonto commented Dec 5, 2023

@dtrifiro hi do you define probes in multiple containers (just trying to confirm)?

@dtrifiro
Copy link
Contributor Author

dtrifiro commented Dec 6, 2023

@skonto Just one probe in one container, but with multiple containers. I now understand what's going on though. I was initially confused because the behaviour for the grpc probes was different (see #156 (comment) and the grpc section below).

Here's minimal knative serving examples:

http probes

  • two containers, one has an exposed port, the other has a readiness probe
# probes-testing.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: probes-testing
  namespace: probes-testing
spec:
  template:
    spec:
      containers:
        - image: python:3.11-slim
          name: container1
          command: ["python", "-m", "http.server", "8001"]
          # ports:
          #   - containerPort: 8001
          #     protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: 8001
        - image: python:3.11-slim
          name: container2
          command: ["python", "-m", "http.server"]
          ports:
            - containerPort: 8000
              protocol: TCP
          # readinessProbe:
          #   httpGet:
          #     path: /
          #     port: 8000

Trying to kubectl apply -f probes-testing.yaml

Error from server (BadRequest): error when creating "probes-testing.yaml": admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].readinessProbe

The request does not go through, which I guess is not what I'd like, but it makes sense.

grpc probes

A setup similar to the above, but with grpc probes:

# grpc-probes-testing.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: probes-testing
  namespace: probes-testing
spec:
  template:
    spec:
      containers:
        - name: agnhost
          image: registry.k8s.io/e2e-test-images/agnhost:2.35
          command: ["/agnhost", "grpc-health-checking"]
          ports:
            # - containerPort: 5000
            - containerPort: 8080
          readinessProbe:
            grpc:
              port: 5000
        - name: agnhost1
          image: registry.k8s.io/e2e-test-images/agnhost:2.35
          command: ["/agnhost", "grpc-health-checking", "--port", "5001"]
          # ports:
          # - containerPort: 5001
          # - containerPort: 8080
          readinessProbe:
            grpc:
              port: 5001

This manifest applies with no errors.

Getting all resources:

$ kubectl get all
NAME                                               LATESTCREATED          LATESTREADY   READY     REASON
configuration.serving.knative.dev/probes-testing   probes-testing-00001                 Unknown

NAME                                                CONFIG NAME      K8S SERVICE NAME   GENERATION   READY     REASON      ACTUAL REPLICAS   DESIRED REPLICAS
revision.serving.knative.dev/probes-testing-00001   probes-testing                      1            Unknown   Deploying

NAME                                       URL                                             READY     REASON
route.serving.knative.dev/probes-testing   http://probes-testing.kserve-test.example.com   Unknown   RevisionMissing

NAME                                         URL                                             LATESTCREATED          LATESTREADY   READY     REASON
service.serving.knative.dev/probes-testing   http://probes-testing.kserve-test.example.com   probes-testing-00001                 Unknown   RevisionMissing

No pod is scheduled. If I look at the event log, I see the following:

InternalError             Revision/probes-testing-00001                           failed to create deployment "probes-testing-00001-deployment": Deployment.apps "probes-testing-00001-deployment" is invalid: spec.template.spec.containers[1].readinessProbe: Required value: must specify a handler type```

which is the confusing behaviour I was experiencing in the first place

@ReToCode
Copy link

@dtrifiro fyi WIP in Knative Serving: knative/serving#14853

@dtrifiro dtrifiro changed the title add readiness probe on TGIS add readiness probe on TGIS container (caikit+tgis Mar 8, 2024
@dtrifiro dtrifiro changed the title add readiness probe on TGIS container (caikit+tgis add readiness probe on TGIS container (caikit+tgis) Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature
Projects
Status: To-do/Groomed
Development

No branches or pull requests

6 participants