First request with 0 replicas sometimes takes ~20s #586

Closed
abatilo opened this issue Jan 29, 2023 · 5 comments · Fixed by #598

abatilo (Contributor) commented Jan 29, 2023

Report

Sometimes when I send a request to an application that I have configured to scale to 0, the first request takes around 20 seconds to respond, even though the pod is scheduled and ready after only 2-3 seconds.

Expected Behavior

I would expect that if I send a request to an application with 0 replicas, the interceptor returns the response as soon as the pod is up and healthy.

Actual Behavior

Instead, it can take 20-25 seconds before a response is returned.

Steps to Reproduce the Problem

I have a website up that hosts just a stock boilerplate React application with a Go backend service. Feel free to hit it; it's a domain/codebase that I like to use for testing things.

curl https://friendlyfaces.app

Below is the entire YAML that I use for deploying the site:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.kubernetes.io/topology-aware-hints: auto
  name: friendlyfaces
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: friendlyfaces
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: friendlyfaces
  name: friendlyfaces
spec:
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: friendlyfaces
  template:
    metadata:
      labels:
        app: friendlyfaces
    spec:
      containers:
      - command:
        - chamber
        - exec
        - honeycomb
        - --
        - friendlyfaces
        - --static-assets-path
        - /static
        image: 911907402684.dkr.ecr.us-west-2.amazonaws.com/friendlyfaces:03db570bbd7677cc5a45f1939de03dc474ee03da
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
        name: friendlyfaces
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 0.1
            memory: 128Mi
      serviceAccountName: friendlyfaces
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: friendlyfaces
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchLabels:
            app: friendlyfaces
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: friendlyfaces
spec:
  dnsNames:
  - friendlyfaces.app
  - www.friendlyfaces.app
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt
  secretName: friendlyfaces-cert
---
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: friendlyfaces
spec:
  host: friendlyfaces.app
  replicas:
    max: 5
    min: 0
  scaleTargetRef:
    deployment: friendlyfaces
    port: 80
    service: friendlyfaces
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: friendlyfaces
spec:
  entryPoints:
  - websecure
  routes:
  - kind: Rule
    match: Host(`friendlyfaces.app`, `www.friendlyfaces.app`)
    middlewares:
    - name: friendlyfaces
    - name: security-headers
      namespace: traefik
    services:
    - name: keda-add-ons-http-interceptor-proxy
      namespace: keda
      port: 8080
  tls:
    secretName: friendlyfaces-cert
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: friendlyfaces
spec:
  headers:
    customRequestHeaders:
      host: friendlyfaces.app

Logs from KEDA HTTP operator

No logs about this as far as I can tell

What version of the KEDA HTTP Add-on are you running?

v0.4.0

Kubernetes Version

1.24

Platform

Amazon Web Services

Anything else?

I'd love to be a contributor and try to debug this on my own. I'm going to do my best to figure out what's happening but if anyone could give me ideas or point me in the right direction that'd be amazing too.

Thank you all for this project. It's really neat and I like it a lot.

abatilo added the bug (Something isn't working) label Jan 29, 2023
abatilo (Contributor, Author) commented Jan 29, 2023

Okay, something I've noticed: if I restart the interceptors, the first request is fast again:

k -n keda rollout restart deployment/keda-add-ons-http-interceptor

So maybe something with caching? Hmm

JorTurFer (Member) commented

Hi,
Thanks for reporting :)
Are you willing to contribute the fix?

abatilo (Contributor, Author) commented Jan 31, 2023

Hi @JorTurFer!

Yes, sorry, I should have highlighted that more in my original issue submission:

I'd love to be a contributor and try to debug this on my own. I'm going to do my best to figure out what's happening but if anyone could give me ideas or point me in the right direction that'd be amazing too.

I'd love to contribute the fix. I'm working on tracking down the problem; if you have any tips or ideas, I'd love to hear them.

JorTurFer (Member) commented Jan 31, 2023

Oh, sorry, I read it quickly and didn't catch that part :(
In that case, I'd check which component needs the restart (here it seems to be the interceptor). By restarting the interceptor you are removing caches, but you are also forcing new connections between the interceptor and the metrics server. I'd start with that communication, maybe adding some logs to the interceptor/metrics server to detect possibly dropped connections.

TBH, I'm not sure what kind of cache there could be in the interceptor.
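
For illustration, here is a minimal sketch of the kind of connection logging suggested above, assuming a pluggable DialContext-style hook; the wrapper and its names are illustrative and are not the add-on's actual API:

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

type dialContextFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// withDialLogging wraps any DialContext-style function and logs how long each
// dial took and whether it failed, making slow or dropped connections between
// the interceptor and its upstreams visible in the logs.
func withDialLogging(dial dialContextFunc) dialContextFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		start := time.Now()
		conn, err := dial(ctx, network, addr)
		if err != nil {
			log.Printf("dial %s %s failed after %s: %v", network, addr, time.Since(start), err)
			return nil, err
		}
		log.Printf("dial %s %s succeeded in %s", network, addr, time.Since(start))
		return conn, nil
	}
}

func main() {
	dialer := &net.Dialer{Timeout: 500 * time.Millisecond}
	loggedDial := withDialLogging(dialer.DialContext)
	if conn, err := loggedDial(context.Background(), "tcp", "example.com:80"); err == nil {
		conn.Close()
	}
}
```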

abatilo added a commit to abatilo/http-add-on that referenced this issue Feb 7, 2023
The backoff is created in `interceptor/main.go`, and `.Step()` is a pointer receiver method, so calling `.Step()` mutates the underlying `.Steps` count inside the struct. When you have 0 replicas, the default 500ms dial timeout is hit, which practically guarantees that you will end up calling `.Step()`. The value inside the struct is never reset, so every single time you scale from 0 you are guaranteed to reduce `.Steps` by 1, and it never recovers until the interceptor itself is restarted.

This change makes sure that every time we execute `DialContextWithRetry`, we start with a fresh `backoff` whose `.Steps` starts at 5, since it's a clone of the original `backoff`. That clone lives in the scope of the function returned by `DialContextWithRetry`; it is the one that gets decremented, and it is garbage collected afterwards, so we get a brand new 5 steps the next time we execute `DialContextWithRetry`.

Closes kedacore#586
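
As a rough illustration of the pattern this commit message describes (not the add-on's actual code): `wait.Backoff.Step()` from k8s.io/apimachinery has a pointer receiver and decrements `Steps` in place, so the retrying dialer below copies the backoff for every dial instead of sharing one struct across all requests. The function and variable names are simplified assumptions for the sketch.

```go
package main

import (
	"context"
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type dialContextFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// dialContextWithRetry returns a DialContext-style function that retries with
// backoff. Copying origBackoff inside the closure means Step() drains the
// copy's Steps counter rather than the shared original, so every
// scale-from-zero request starts with the full number of retry steps.
func dialContextWithRetry(dialer *net.Dialer, origBackoff wait.Backoff) dialContextFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		backoff := origBackoff // fresh value copy per dial; Step() mutates only this copy
		for {
			conn, err := dialer.DialContext(ctx, network, addr)
			if err == nil {
				return conn, nil
			}
			if backoff.Steps <= 1 {
				// Out of retries; surface the last dial error.
				return nil, err
			}
			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(backoff.Step()):
				// Wait for the next backoff interval, then retry.
			}
		}
	}
}

func main() {
	dialer := &net.Dialer{Timeout: 500 * time.Millisecond}
	backoff := wait.Backoff{Duration: 500 * time.Millisecond, Factor: 2, Steps: 5}
	dial := dialContextWithRetry(dialer, backoff)
	if conn, err := dial(context.Background(), "tcp", "friendlyfaces.default.svc:80"); err == nil {
		conn.Close()
	}
}
```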
abatilo (Contributor, Author) commented Feb 7, 2023

@JorTurFer We did it!

#598

I found the root cause and opened a PR with the fix. Please take a look when you can!

JorTurFer pushed a commit that referenced this issue Feb 7, 2023