
Revert "Emergency fix: Use stable Docker images (#318)" #319

Merged
13 commits merged on Apr 19, 2021

Conversation

mtojek
Contributor

@mtojek mtojek commented Apr 12, 2021

This reverts commit 003c019 - latest emergency fix.

Required changes:

  • kibana config: use xpack.fleet.agents.fleet_server.hosts (see the sketch after this list)
  • kibana config: remove deprecated fields
  • increase resource limits for agent deployed in Kubernetes
  • k8s agent yaml: add RBAC role (see the sketch after this list)
  • regenerate expected test results for AWS integration (updated GeoDB)
  • add missing field for kubernetes.pod
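
For the two Kibana config items above, a minimal sketch of the intended change, assuming the default stack service names (only the xpack.fleet.agents.fleet_server.hosts value is confirmed later in this thread; the deprecated setting and its value are assumptions shown only for contrast):

# Deprecated agent setting removed from the static Kibana config (value assumed):
# xpack.fleet.agents.kibana.host: "http://kibana:5601"
# Replacement, pointing enrolling agents at Fleet Server:
xpack.fleet.agents.fleet_server.hosts: ["http://fleet-server:8220"]

For the RBAC item, a hypothetical minimal sketch of the kind of cluster-scope read access the agent's service account needs; the resource list, verbs, and object names are assumptions, and the actual manifest in this PR may differ:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kind-fleet-agent
rules:
  # Read-only access to the core objects the Kubernetes metricsets query.
  - apiGroups: [""]
    resources: ["nodes", "pods", "namespaces", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kind-fleet-agent
subjects:
  - kind: ServiceAccount
    name: kind-fleet-agent      # assumed service account name
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: kind-fleet-agent
  apiGroup: rbac.authorization.k8s.io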

@mtojek mtojek self-assigned this Apr 12, 2021
@mtojek mtojek requested a review from nchaulet April 12, 2021 08:45
@mtojek
Contributor Author

mtojek commented Apr 12, 2021

@nchaulet This PR should go back to green once the issue around Kibana/Fleet/Agent is solved.

@elasticmachine
Collaborator

elasticmachine commented Apr 12, 2021

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #319 updated

  • Start Time: 2021-04-19T15:48:42.493+0000

  • Duration: 28 min 18 sec

  • Commit: 5122fcb

Test stats 🧪

Test Results: Failed 0, Passed 316, Skipped 1, Total 317

Trends 🧪

[Trend graphs: build times and tests]

@nchaulet
Member

@mtojek @blakerouse From what I tested, this works correctly with agent 8.0.0-SNAPSHOT but not with 7.13. Did we miss a backport somewhere, or did a build fail?

@blakerouse

@nchaulet I have not found a missing backport; I believe it is a snapshot build issue.

@mtojek
Contributor Author

mtojek commented Apr 12, 2021

It seems that there was a successful build yesterday:

43 - 7.13.0-b51da292
2021-04-11 00:12

@mtojek
Contributor Author

mtojek commented Apr 13, 2021

/test

@mtojek
Contributor Author

mtojek commented Apr 14, 2021

/test

@mtojek
Contributor Author

mtojek commented Apr 15, 2021

/test

@mtojek
Contributor Author

mtojek commented Apr 18, 2021

/test

@mtojek
Contributor Author

mtojek commented Apr 18, 2021

/test

@ruflin
Contributor

ruflin commented Apr 19, 2021

I think it is the kibana.host config that is causing the issue. I assume it was renamed. Let me check the code: https://github.com/elastic/elastic-package/blob/master/internal/install/static_kibana_config_yml.go#L21

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

If so, then it's breaking for all parties including test environments and e2e tests.

@ruflin
Contributor

ruflin commented Apr 19, 2021

I think fleet-server.hosts must be used: https://github.com/elastic/kibana/blob/master/x-pack/plugins/fleet/server/index.ts#L58 But the other one is only deprecated 🤔 Let me pull down your code and try it out.

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

I tweaked the config based on the file you linked here. Let's see.

@ruflin
Contributor

ruflin commented Apr 19, 2021

I added the following and removed the kibana part:

xpack.fleet.agents.fleet_server.hosts: ["http://fleet-server:8220"]

We should also test whether it works with the Kibana one.

@@ -19,7 +19,7 @@
   "ip": "127.0.0.1"
 },
 "event": {
-  "ingested": "2021-03-18T12:21:57.668559300Z",
+  "ingested": "2021-04-19T09:58:42.209230300Z",

Contributor


Do you know why this is needed?

Contributor Author


I regenerated the test results since that's simple for us to do, but something still failed. Investigating.

@ruflin
Contributor

ruflin commented Apr 19, 2021

I tried keeping kibana.host in addition, but it didn't work. Let's continue the discussion around this in the Kibana issue. Hopefully this goes green and we can move forward.

@mtojek You mentioned the e2e tests. Do you know where exactly we need to update it?

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

@mtojek You mentioned the e2e tests. Do you know where exactly we need to update it?

I noticed this issue that Manu created: elastic/e2e-testing#1048

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

Yes, I updated the GeoDB references. Let's wait for the CI.

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

/test

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

The Kubernetes service deployer complains about the agent's unhealthy pod. Investigating.

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

I didn't expect this (OOMKilled):

Name:         kind-fleet-agent-clusterscope-bf6fdc5c7-5q87q
Namespace:    kube-system
Priority:     0
Node:         kind-control-plane/172.20.0.2
Start Time:   Mon, 19 Apr 2021 11:59:32 +0000
Labels:       app=kind-fleet-agent-clusterscope
              group=fleet
              pod-template-hash=bf6fdc5c7
Annotations:  <none>
Status:       Running
IP:           10.244.0.5
IPs:
  IP:           10.244.0.5
Controlled By:  ReplicaSet/kind-fleet-agent-clusterscope-bf6fdc5c7
Containers:
  kind-fleet-agent-clusterscope:
    Container ID:   containerd://6a0b8a291fdcb44f88b73d46b61a08bebed2f34af6c13e3d05a39f94f503955d
    Image:          docker.elastic.co/beats/elastic-agent:7.13.0-SNAPSHOT
    Image ID:       docker.elastic.co/beats/elastic-agent@sha256:c41329c125066539715de67a53a314e857c49390189a4f9f17d633599b356f14
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 19 Apr 2021 12:07:18 +0000
      Finished:     Mon, 19 Apr 2021 12:07:33 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      memory:  200Mi
    Requests:
      cpu:     100m
      memory:  100Mi
    Startup:   exec [sh -c grep "Agent is starting" -r . --include=elastic-agent-json.log] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      FLEET_ENROLL:    1
      FLEET_INSECURE:  1
      FLEET_URL:       http://fleet-server:8220
      NODE_NAME:        (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kind-fleet-agent-token-xf4zj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kind-fleet-agent-token-xf4zj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kind-fleet-agent-token-xf4zj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  10m                    default-scheduler  Successfully assigned kube-system/kind-fleet-agent-clusterscope-bf6fdc5c7-5q87q to kind-control-plane
  Normal   Pulling    10m                    kubelet            Pulling image "docker.elastic.co/beats/elastic-agent:7.13.0-SNAPSHOT"
  Normal   Pulled     9m55s                  kubelet            Successfully pulled image "docker.elastic.co/beats/elastic-agent:7.13.0-SNAPSHOT" in 11.176921115s
  Warning  Unhealthy  9m32s                  kubelet            Startup probe failed:
  Normal   Created    7m7s (x5 over 9m52s)   kubelet            Created container kind-fleet-agent-clusterscope
  Normal   Pulled     7m7s (x4 over 9m35s)   kubelet            Container image "docker.elastic.co/beats/elastic-agent:7.13.0-SNAPSHOT" already present on machine
  Normal   Started    7m6s (x5 over 9m52s)   kubelet            Started container kind-fleet-agent-clusterscope
  Warning  BackOff    5m1s (x15 over 9m16s)  kubelet            Back-off restarting failed container
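
For the "increase resource limits" change, a rough sketch of the kind of bump involved in the agent container spec of this manifest; the existing 200Mi limit and 100Mi request are taken from the pod description above, while the new values are assumptions:

resources:
  limits:
    memory: 400Mi   # raised from 200Mi after the OOMKilled restarts (new value assumed)
  requests:
    cpu: 100m
    memory: 200Mi   # raised from 100Mi (new value assumed)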

@ruflin
Contributor

ruflin commented Apr 19, 2021

If you increase the memory, will it go through? Is this the container with fleet-server or without?

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

If you increase the memory, will it go through?

It went through.

Is this the container with fleet-server or without?

Without.

I'm investigating the next issue. It looks like it's related to the Kubernetes pod data stream.

[2021-04-19T13:21:53.343Z] Error: error running package system tests: could not complete test run: failed to validate fields: test case failed: one or more errors found in documents stored in metrics-kubernetes.pod-ep data stream

Unfortunately that's the price (technical debt) we're paying for not using snapshots immediately after releasing.

@ruflin
Contributor

ruflin commented Apr 19, 2021

So is the failing test related to the package or to our setup?

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

Currently it's the integration test:

[0] field "kubernetes.pod.ip" is undefined
[1] field "kubernetes.pod.ip" is undefined
[2] field "kubernetes.pod.ip" is undefined
[3] field "kubernetes.pod.ip" is undefined
[4] field "kubernetes.pod.ip" is undefined
[5] field "kubernetes.pod.ip" is undefined
[6] field "kubernetes.pod.ip" is undefined
[7] field "kubernetes.pod.ip" is undefined
[8] field "kubernetes.pod.ip" is undefined
[9] field "kubernetes.pod.ip" is undefined
[10] field "kubernetes.pod.ip" is undefined
[11] field "kubernetes.pod.ip" is undefined
[12] field "kubernetes.pod.ip" is undefined
[13] field "kubernetes.pod.ip" is undefined

@ruflin
Contributor

ruflin commented Apr 19, 2021

I wonder how this is related to the "stack" change we made?

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

I confirmed with @ChrsMark that a field was added in Beats in the meantime.
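
For reference, a hedged sketch of what adding the missing field could look like, assuming the kubernetes package declares its fields in a fields.yml-style file under the pod data stream (the file path, type, and description below are assumptions):

# data_stream/pod/fields/fields.yml (path assumed)
- name: kubernetes.pod.ip
  type: ip
  description: Kubernetes pod IP address.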

Contributor

@ruflin ruflin left a comment


LGTM

@mtojek
Contributor Author

mtojek commented Apr 19, 2021

/test
