
[RFE] kube-burner audit execution #439

Closed
jtaleric opened this issue Aug 16, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@jtaleric
Contributor

Is your feature request related to a problem? Please describe.
Sort of...

We need kube-burner to help users understand whether their run was "valid". We now have two different latency values (v1, v2)... However, we also emit a log message when a latency value is less than zero.

Describe the solution you'd like
If we run 1000 iterations and 500 of the latency values we retrieve are less than zero (which kube-burner sets to zero), kube-burner should emit a message stating that the values captured during the run are unreliable and not meaningful, since this could indicate an environmental problem.
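A minimal sketch of what such a check might look like, assuming a hypothetical `auditLatencies` helper and an illustrative 50% threshold (neither is part of kube-burner today):

```go
package measurements

import (
	"time"

	log "github.com/sirupsen/logrus"
)

// auditLatencies is a hypothetical post-run check: it warns when too many
// samples came back negative (and were therefore clamped to zero).
func auditLatencies(latencies []time.Duration, negativeCount int) {
	if len(latencies) == 0 {
		return
	}
	ratio := float64(negativeCount) / float64(len(latencies))
	// Illustrative threshold: half or more clamped samples => untrustworthy run.
	if ratio >= 0.5 {
		log.Warnf("%d/%d latency samples were negative and clamped to zero; "+
			"the values captured during this run are unreliable and may indicate "+
			"an environmental problem (e.g. clock skew)", negativeCount, len(latencies))
	}
}
```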

@jtaleric jtaleric added the enhancement New feature or request label Aug 16, 2023
@rsevilla87
Member

Values less than 0 shouldn't be printed anymore.

@jtaleric
Contributor Author

> Values less than 0 shouldn't be printed anymore.

Along with not printing them, we should also exclude those zeroed values from the latency calculations.
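A rough sketch of that idea in Go, assuming the clamped samples are still present as zeros in the slice (the helper names and the quantile math below are illustrative, not kube-burner's actual code):

```go
package measurements

import (
	"sort"
	"time"
)

// filterClamped drops zero (i.e. clamped-negative) samples so they do not
// drag down the computed quantiles.
func filterClamped(latencies []time.Duration) []time.Duration {
	valid := make([]time.Duration, 0, len(latencies))
	for _, l := range latencies {
		if l > 0 {
			valid = append(valid, l)
		}
	}
	return valid
}

// p99 returns the 99th percentile of the given samples (nearest-rank style).
func p99(latencies []time.Duration) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return latencies[int(float64(len(latencies)-1)*0.99)]
}
```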

@vishnuchalla
Collaborator

vishnuchalla commented Aug 21, 2023

Case study full log: nd-200-ocp4.10.txt
Yes, fortunately those were just trace logs until now. We can definitely remove them with the resolution suggested below. Attached is a recent case study where there was no proper NTP sync, or some other setup issue, on the system under test. Here is an example instance where a pod has very unrealistic times recorded; the same happened with many other pods in that environment.

  creationTimestamp: "2023-08-17T00:56:37Z"
  status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:35Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:35Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-08-17T00:56:36Z"
    status: "True"
    type: PodScheduled
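To make the inconsistency concrete, here is a small illustrative snippet (not kube-burner code) that performs the same subtraction the latency calculation relies on, using the timestamps quoted above; both conditions transitioned before the pod's creationTimestamp, so the resulting durations are negative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the pod snippet above.
	created, _ := time.Parse(time.RFC3339, "2023-08-17T00:56:37Z")
	scheduled, _ := time.Parse(time.RFC3339, "2023-08-17T00:56:36Z")
	ready, _ := time.Parse(time.RFC3339, "2023-08-17T00:56:35Z")

	// Both conditions predate the pod's own creation timestamp.
	fmt.Println(scheduled.Sub(created)) // -1s
	fmt.Println(ready.Sub(created))     // -2s
}
```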

Full describe log of the pod: pod-describe-and-yaml-output.txt
And the alert we saw was related to etcdCommitLatency, which indicates something is wrong with the coordination between the api-server and etcd in updating those timestamps:

time="2023-08-12 03:25:41" level=warning msg="🚨 Alert at 2023-08-12T03:18:59Z: '10 minutes avg. 99th etcd fsync latency on etcd-openupi-j2lgw-master-1 higher than 10ms. 0.013s'" file="alert_manager.go:188"
[{"timestamp":"2023-08-12T03:18:59.504Z","uuid":"4ce7c9b2-1131-48ee-952b-06ef1526fd34","severity":"warning","description":"10 minutes avg. 99th etcd fsync latency on etcd-openupi-j2lgw-master-1 higher than 10ms. 0.013s","metricName":"alert"},{"timestamp":"2023-08-12T03:22:59.504Z","uuid":"4ce7c9b2-1131-48ee-952b-06ef1526fd34","severity":"warning","description":"10 minutes avg. 99th etcd commit latency on etcd-openupi-j2lgw-master-1 higher than 30ms. 0.038s","metricName":"alert"}]

Summary:

  • On top of the existing errors in their SUT, V1 reported 74% of the samples as negative latencies (19788/26716), which is wrong due to the extra round-trip time added by the client-go watcher events and the time.Now() set in our code.
  • V2 reported 35% of the samples as negative latencies (9414/26716), which was exactly the number of pods that had errors in their system.
  • In addition, the V1 latencies were off by at least 2s from the actual timestamps in almost every instance (search for the "Difference between V1 and V2" text in the attached full log for more info), which indicates V1 would report wrong latencies even if the SUT were perfect.

Resolution:

As a solution, we should run an audit using the V2 latencies (the V1 latencies are unreliable due to the error added by the client-go API round-trip times) and report a log message indicating our confidence in the results. Those negative latencies should never occur in reality: all lifecycle stage timestamps, including Scheduled, Initialized, ContainersReady, and PodReady, should always be >= the creationTimestamp. Based on that, we should define a threshold on the V2 latencies; for example, if 10% of the latencies are negative, we should log our confidence level for the user's understanding, since the problem lies entirely in their SUT. As usual, I am open to other thoughts, including what % value to pick for the threshold.
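A sketch of how such a threshold-based confidence report could look, assuming a hypothetical reportConfidence helper; the 10% cut-off and the log wording are placeholders for discussion, not an agreed design:

```go
package measurements

import (
	log "github.com/sirupsen/logrus"
)

// negativeLatencyThreshold is the proposed cut-off for flagging a run;
// the 10% value is only the example from the comment above.
const negativeLatencyThreshold = 0.10

// reportConfidence logs how trustworthy the V2 latency quantiles are,
// based on the fraction of samples that came back negative.
func reportConfidence(totalSamples, negativeSamples int) {
	if totalSamples == 0 {
		return
	}
	ratio := float64(negativeSamples) / float64(totalSamples)
	confidence := (1 - ratio) * 100
	if ratio >= negativeLatencyThreshold {
		log.Warnf("%d/%d pod latency samples were negative (confidence %.1f%%); "+
			"timestamps on the SUT look skewed (e.g. missing NTP sync), treat these results with caution",
			negativeSamples, totalSamples, confidence)
		return
	}
	log.Infof("Latency audit passed: %d/%d negative samples (confidence %.1f%%)",
		negativeSamples, totalSamples, confidence)
}
```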
