
[BUG] Improve measurement calculation in benchmark #413

Closed
vishnuchalla opened this issue Aug 9, 2023 · 11 comments
Labels
bug Something isn't working
@vishnuchalla
Collaborator

Bug Description

Version: latest
Git Commit: ffa2d6a
Build Date: 2023-08-08-21:35:14
Go Version: go1.20.4
OS/Arch: linux amd64

Describe the bug

In our current implementation, we register measurements on all pods present in the namespace specific to a job in our benchmark run. The problem with this approach is that namespace names are not always unique, so we end up including pods left over from a previous run, sitting in a namespace with the same name, in the latency calculations.

To Reproduce

Execute the initial run with:

kube-burner ocp node-density-heavy --timeout 1h --metrics-endpoint ~/metrics-endpoints.yaml --pods-per-node=50 --log-level=trace --gc=false 2>&1 | tee -a kube-burner-$(date +"%F_%H-%M-%S")

Initial run logs: https://gist.github.com/vishnuchalla/efdac4f963d9a1292bf9fadd0e4ec039

Execute a follow-up run with:

kube-burner ocp node-density-heavy --timeout 1h --metrics-endpoint ~/metrics-endpoints.yaml --pods-per-node=50 --log-level=trace --gc=true 2>&1 | tee -a kube-burner-$(date +"%F_%H-%M-%S")

Follow up run logs: https://gist.github.com/vishnuchalla/1e41f9f501ac71a3218514a6904aafb9

Now, in both runs we should be able to find the same pod being considered for measurements. For example: Pod perfapp-1-0-7c449cc684-h5v2c is ready

Expected behavior

Measurements should only be calculated on the resources that are specific to that benchmark run.

Screenshots or output

Initial run logs: https://gist.github.com/vishnuchalla/efdac4f963d9a1292bf9fadd0e4ec039
Follow up run logs: https://gist.github.com/vishnuchalla/1e41f9f501ac71a3218514a6904aafb9

Additional context

I think we should ideally have a UUID label attached to all the resources created in a benchmark, so that we can easily distinguish them from others and perform any kind of action on them programmatically.
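
A minimal sketch of what that labelling could look like with client-go; the label key, image, and function names here are illustrative assumptions, not kube-burner's actual code:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// createLabeledPod stamps the benchmark run's UUID onto the pod it creates, so
// measurement code can tell this run's pods apart from leftovers of a previous
// run that live in a namespace with the same name.
func createLabeledPod(clientSet kubernetes.Interface, namespace, runUUID string) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "perfapp-",
			Labels: map[string]string{
				// Illustrative label key; kube-burner's real key may differ.
				"kube-burner.io/uuid": runUUID,
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				// Placeholder image, only to keep the example self-contained.
				{Name: "perfapp", Image: "registry.example.com/perfapp:latest"},
			},
		},
	}
	_, err := clientSet.CoreV1().Pods(namespace).Create(context.TODO(), pod, metav1.CreateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := createLabeledPod(clientSet, "benchmark-ns", "example-uuid"); err != nil {
		panic(err)
	}
}
```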

@vishnuchalla vishnuchalla added the bug Something isn't working label Aug 9, 2023
@afcollins
Contributor

Thanks, Vishnu!

For context, the issue that I noticed is that measurements are captured for pods from a previous run when they are cleaned up during a following run, which means latency measurements for a run can include pods from a previous run that was executed without GC. This appears to be one use of the "latency < 0" check, because the measurements I see all have latencies v1=0 and v2>0.

I agree with the approach of filtering by the UUID of the current run, but (what I thought might be) a simple re-order of newMeasurementFactory and Cleanup would also suffice. https://github.com/cloud-bulldozer/kube-burner/blob/master/pkg/burner/job.go#L87-L119
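
For illustration only, a tiny runnable sketch of that ordering; the function names are hypothetical stand-ins for the real code in job.go, not its actual API:

```go
package main

import "fmt"

// Illustrative stand-ins for the steps discussed in pkg/burner/job.go; the
// names and signatures are assumptions, not kube-burner's actual functions.
type job struct{ name string }

func gcPreviousRun(j job)     { fmt.Println("garbage-collecting leftover namespaces for", j.name) }
func startMeasurements(j job) { fmt.Println("starting pod-latency watchers for", j.name) }
func createWorkload(j job)    { fmt.Println("creating objects for", j.name) }

func main() {
	j := job{name: "node-density-heavy"}

	// Proposed order: clean up leftovers first, then start the measurement
	// watchers, then create the workload, so the watchers never observe
	// deletion or Ready events from a previous run's pods.
	gcPreviousRun(j)
	startMeasurements(j)
	createWorkload(j)
}
```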

@vishnuchalla
Collaborator Author

> Thanks, Vishnu!
>
> For context, the issue that I noticed is that measurements are captured for pods from a previous run when they are cleaned up during a following run, which means latency measurements for a run can include pods from a previous run that was executed without GC. This appears to be one use of the "latency < 0" check, because the measurements I see all have latencies v1=0 and v2>0.
>
> I agree with the approach of filtering by the UUID of the current run, but (what I thought might be) a simple re-order of newMeasurementFactory and Cleanup would also suffice. https://github.com/cloud-bulldozer/kube-burner/blob/master/pkg/burner/job.go#L87-L119

Yes, Andrew! We just need to pick the labels correctly.

@rsevilla87
Member

rsevilla87 commented Aug 10, 2023

Makes sense. As discussed internally, this case only happens when one of the previous workloads is not garbage collected. It's curious; it seems like some pods, such as perfapp-1-0-7c449cc684-h5v2c, triggered a Ready event during the 2nd workload's runtime.

Also, if we reuse the UUID by chance, the issue could still happen. But I'd say that is very unlikely and not recommended.

@vishnuchalla
Collaborator Author

Acknowledged! I'm thinking of generating a unique runID for that specific run at program runtime and using it to label run-specific resources.
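
A minimal sketch of that idea; the github.com/google/uuid dependency and the label key are assumptions for illustration, not the final implementation:

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

func main() {
	// The runID is generated once per invocation and is never user-supplied,
	// so two runs cannot collide even if the same --uuid value is reused.
	runID := uuid.New().String()

	// Illustrative label set stamped on every object the run creates; the
	// label key is an assumption, not necessarily what kube-burner will use.
	labels := map[string]string{
		"kube-burner.io/runid": runID,
	}
	fmt.Println(labels)
}
```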

@jtaleric
Contributor

> Acknowledged! I'm thinking of generating a unique runID for that specific run at program runtime and using it to label run-specific resources.

Yeah - I would prefer not using the UUID. Just as we can mimic the behavior you call out here by setting gc=false and then gc=true, the user could literally just provide the same UUID for both executions.

@afcollins
Contributor

If a user uses the same UUID for two runs and gets wrong results because of this edge case, I think the bad measurements are on them. :) We expect the UUID to be unique for a run. (That's the second 'U'.)

@jtaleric
Contributor

> If a user uses the same UUID for two runs and gets wrong results because of this edge case, I think the bad measurements are on them. :) We expect the UUID to be unique for a run. (That's the second 'U'.)

I would argue the same thing if they had gc=false.

@afcollins
Contributor

Argue that gc=false is an edge case so I deserve bad measurements?

Running with gc=false is valid. I occasionally want to check the state of a workload before it gets cleaned up, so I disable GC.

I want a UUID to be random, so I let the tool generate it for me.

We can make an assumption that the UUID is the unique identifier for a run. No need for an additional UID.

Reset.

My initial complaint for this issue wasn't about filtering pods for the run, but about changing the order of operations:
First, clean up any previous workloads.
Next, create pod measurements and run the workload.

The bug I wanted to highlight was creating podMeasurements before cleaning up workloads, and that is getting lost in this discussion.

afcollins added a commit to afcollins/kube-burner that referenced this issue Aug 10, 2023
@afcollins
Contributor

afcollins commented Aug 10, 2023

This is all I wanted: 68dae0f

afcollins added a commit to afcollins/kube-burner that referenced this issue Aug 10, 2023
@jtaleric
Contributor

> Argue that gc=false is an edge case so I deserve bad measurements?

No, but in this situation if you forget to delete the namespace, it is on you.

> Running with gc=false is valid. I occasionally want to check the state of a workload before it gets cleaned up, so I disable GC.

No argument there.

> I want a UUID to be random, so I let the tool generate it for me.
>
> We can make an assumption that the UUID is the unique identifier for a run. No need for an additional UID.

Well, with that statement, we should remove the user-provided UUID -- so we don't get into the situation of the user providing the same UUID for both runs (similar to running with gc=false but forgetting to delete the leftover namespace).

> Reset.
>
> My initial complaint for this issue wasn't about filtering pods for the run, but about changing the order of operations: first, clean up any previous workloads; next, create pod measurements and run the workload.
>
> The bug I wanted to highlight was creating podMeasurements before cleaning up workloads, and that is getting lost in this discussion.

ack - ok.

@vishnuchalla
Collaborator Author

> This is all I wanted: 68dae0f

I wish it were that simple, but doing a cleanup early would impact other job types like patch, and I also think we should never give our users a chance to modify information related to the cluster resources and run into similar issues. So I'm introducing a unique ID in the label (internal to the app's program memory) so that we can filter only the pods that belong to our run and perform measurements on them.

Related PR: #421
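
A minimal sketch of the kind of measurement-side filtering described above, assuming a client-go watch restricted by a label selector; the label key and function are illustrative and not necessarily what the PR implements:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// watchRunPods restricts the pod watcher to pods carrying this run's internal
// runID label, so Ready events from a previous run's pods in a namespace with
// the same name are never recorded as measurements.
func watchRunPods(clientSet kubernetes.Interface, namespace, runID string) error {
	watcher, err := clientSet.CoreV1().Pods(namespace).Watch(context.TODO(), metav1.ListOptions{
		// Assumed label key, used here only for illustration.
		LabelSelector: fmt.Sprintf("kube-burner.io/runid=%s", runID),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()
	for event := range watcher.ResultChan() {
		fmt.Printf("observed %s event for one of this run's pods\n", event.Type)
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := watchRunPods(clientSet, "benchmark-ns", "example-run-id"); err != nil {
		panic(err)
	}
}
```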
