
[RFE] kube-burner audit execution #439

Closed
jtaleric opened this issue Aug 16, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@jtaleric
Contributor

Is your feature request related to a problem? Please describe.
Sort of...

We need kube-burner to help users understand whether their run was "valid". We now have two different latency values (v1, v2)... However, we also emit a log message when a latency value is less than zero.

Describe the solution you'd like
If we run 1000 iterations and 500 of the latency values we retrieve are less than zero (which kube-burner sets to zero), kube-burner should emit a message stating that the values captured during the run are unreliable and not meaningful, since this could indicate an environmental problem.
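A minimal sketch of what such a check might look like, assuming a hypothetical `auditLatencies` helper and an illustrative 50% threshold (neither is part of kube-burner today):

```go
package measurements

import (
	"time"

	log "github.com/sirupsen/logrus"
)

// auditLatencies is a hypothetical post-run check: it warns when too many
// samples came back negative (and were therefore clamped to zero).
func auditLatencies(latencies []time.Duration, negativeCount int) {
	if len(latencies) == 0 {
		return
	}
	ratio := float64(negativeCount) / float64(len(latencies))
	// Illustrative threshold: half or more clamped samples => untrustworthy run.
	if ratio >= 0.5 {
		log.Warnf("%d/%d latency samples were negative and clamped to zero; "+
			"the values captured during this run are unreliable and may indicate "+
			"an environmental problem (e.g. clock skew)", negativeCount, len(latencies))
	}
}
```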

@jtaleric jtaleric added the enhancement New feature or request label Aug 16, 2023
@rsevilla87
Member

Values less than 0 shouldn't be printed anymore.

@jtaleric
Contributor Author

> Values less than 0 shouldn't be printed anymore.

Along with not printing them, we should also exclude those zeroed values from the latency calculations.
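A rough sketch of that idea in Go, assuming the clamped samples are still present as zeros in the slice (the helper names and the quantile math below are illustrative, not kube-burner's actual code):

```go
package measurements

import (
	"sort"
	"time"
)

// filterClamped drops zero (i.e. clamped-negative) samples so they do not
// drag down the computed quantiles.
func filterClamped(latencies []time.Duration) []time.Duration {
	valid := make([]time.Duration, 0, len(latencies))
	for _, l := range latencies {
		if l > 0 {
			valid = append(valid, l)
		}
	}
	return valid
}

// p99 returns the 99th percentile of the given samples (nearest-rank style).
func p99(latencies []time.Duration) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return latencies[int(float64(len(latencies)-1)*0.99)]
}
```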

@vishnuchalla
Collaborator

vishnuchalla commented Aug 21, 2023

Case study full log: nd-200-ocp4.10.txt
Yes, fortunately those were just trace logs until now. We can definitely remove them with the resolution suggested below. Attached is a recent case study where there was no proper NTP sync, or some other setup issue, on the system under test. Here is an example instance where a pod has very unrealistic times recorded; the same happened with many other pods in that environment.

  creationTimestamp: "2023-08-17T00:56:37Z"
  status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:35Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:35Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:36Z"
    status: "True"
    type: PodScheduled
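To make the inconsistency concrete, here is a small illustrative snippet (not kube-burner code) that performs the same subtraction the latency calculation relies on, using the timestamps quoted above; both conditions transitioned before the pod's creationTimestamp, so the resulting durations are negative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the pod snippet above.
	created, _ := time.Parse(time.RFC3339, "2023-08-17T00:56:37Z")
	scheduled, _ := time.Parse(time.RFC3339, "2023-08-17T00:56:36Z")
	ready, _ := time.Parse(time.RFC3339, "2023-08-17T00:56:35Z")

	// Both conditions predate the pod's own creation timestamp.
	fmt.Println(scheduled.Sub(created)) // -1s
	fmt.Println(ready.Sub(created))     // -2s
}
```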

Full describe log of the pod: pod-describe-and-yaml-output.txt
And the alert we saw was related to etcdCommitLatency, which indicates something is wrong with the coordination between the api-server and etcd in updating those timestamps:

time="2023-08-12 03:25:41" level=warning msg="🚨 Alert at 2023-08-12T03:18:59Z: '10 minutes avg. 99th etcd fsync latency on etcd-openupi-j2lgw-master-1 higher than 10ms. 0.013s'" file="alert_manager.go:188"
[{"timestamp":"2023-08-12T03:18:59.504Z","uuid":"4ce7c9b2-1131-48ee-952b-06ef1526fd34","severity":"warning","description":"10 minutes avg. 99th etcd fsync latency on etcd-openupi-j2lgw-master-1 higher than 10ms. 0.013s","metricName":"alert"},{"timestamp":"2023-08-12T03:22:59.504Z","uuid":"4ce7c9b2-1131-48ee-952b-06ef1526fd34","severity":"warning","description":"10 minutes avg. 99th etcd commit latency on etcd-openupi-j2lgw-master-1 higher than 30ms. 0.038s","metricName":"alert"}]

Summary:

  • On top of the existing errors in their SUT, V1 reported 74% of the samples as negative latencies (19788/26716), which is wrong due to the extra round-trip time added by the client-go watcher events and the time.Now() set in our code.
  • V2 reported 35% of the samples as negative latencies (9414/26716), which was exactly the number of pods that had errors in their system.
  • In addition, the V1 latencies were off by at least 2s from the actual timestamps in almost every instance (search for the "Difference between V1 and V2" text in the attached full log for more info), which indicates V1 would report wrong latencies even if the SUT were perfect.

Resolution:

As a solution, we should run an audit using the V2 latencies (the V1 latencies are unreliable due to the error added by the client-go API round-trip times) and report a log message indicating our confidence in the results. Those negative latencies should never occur in reality: all lifecycle stage timestamps, including Scheduled, Initialized, ContainersReady, and PodReady, should always be >= the creationTimestamp. Based on that, we should define a threshold on the V2 latencies; for example, if 10% of the latencies are negative, we should log our confidence level for the user's understanding, since the problem lies entirely in their SUT. As usual, I am open to other thoughts, including what % value to pick for the threshold.
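A sketch of how such a threshold-based confidence report could look, assuming a hypothetical reportConfidence helper; the 10% cut-off and the log wording are placeholders for discussion, not an agreed design:

```go
package measurements

import (
	log "github.com/sirupsen/logrus"
)

// negativeLatencyThreshold is the proposed cut-off for flagging a run;
// the 10% value is only the example from the comment above.
const negativeLatencyThreshold = 0.10

// reportConfidence logs how trustworthy the V2 latency quantiles are,
// based on the fraction of samples that came back negative.
func reportConfidence(totalSamples, negativeSamples int) {
	if totalSamples == 0 {
		return
	}
	ratio := float64(negativeSamples) / float64(totalSamples)
	confidence := (1 - ratio) * 100
	if ratio >= negativeLatencyThreshold {
		log.Warnf("%d/%d pod latency samples were negative (confidence %.1f%%); "+
			"timestamps on the SUT look skewed (e.g. missing NTP sync), treat these results with caution",
			negativeSamples, totalSamples, confidence)
		return
	}
	log.Infof("Latency audit passed: %d/%d negative samples (confidence %.1f%%)",
		negativeSamples, totalSamples, confidence)
}
```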
