
[BUG] Improve measurement calculation in benchmark #413

Closed
vishnuchalla opened this issue Aug 9, 2023 · 11 comments
Labels
bug Something isn't working
@vishnuchalla
Collaborator

Bug Description

Version: latest
Git Commit: ffa2d6a
Build Date: 2023-08-08-21:35:14
Go Version: go1.20.4
OS/Arch: linux amd64

Describe the bug

In our current implementation, we register measurements on all pods present in the namespace specific to a job in our benchmark run. The problem with this approach is that namespace names are not always unique, so we end up including pods left over from a previous run, sitting in a namespace with the same name, in the latency calculations.

To Reproduce

Execute the initial run with:

kube-burner ocp node-density-heavy --timeout 1h --metrics-endpoint ~/metrics-endpoints.yaml --pods-per-node=50 --log-level=trace --gc=false 2>&1 | tee -a kube-burner-$(date +"%F_%H-%M-%S")

Initial run logs: https://gist.github.com/vishnuchalla/efdac4f963d9a1292bf9fadd0e4ec039

Execute a follow-up run with:

kube-burner ocp node-density-heavy --timeout 1h --metrics-endpoint ~/metrics-endpoints.yaml --pods-per-node=50 --log-level=trace --gc=true 2>&1 | tee -a kube-burner-$(date +"%F_%H-%M-%S")

Follow up run logs: https://gist.github.com/vishnuchalla/1e41f9f501ac71a3218514a6904aafb9

Now, in both runs we should be able to find the same pod being considered for measurements. For example: Pod perfapp-1-0-7c449cc684-h5v2c is ready

Expected behavior

Measurements should only be calculated on the resources that are specific to that benchmark run.

Screenshots or output

Initial run logs: https://gist.github.com/vishnuchalla/efdac4f963d9a1292bf9fadd0e4ec039
Follow up run logs: https://gist.github.com/vishnuchalla/1e41f9f501ac71a3218514a6904aafb9

Additional context

I think we should ideally have a UUID label attached to all the resources created in a benchmark, so that we can easily distinguish them from others and perform any kind of action on them programmatically.
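
A minimal sketch of what that labelling could look like with client-go; the label key, image, and function names here are illustrative assumptions, not kube-burner's actual code:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// createLabeledPod stamps the benchmark run's UUID onto the pod it creates, so
// measurement code can tell this run's pods apart from leftovers of a previous
// run that live in a namespace with the same name.
func createLabeledPod(clientSet kubernetes.Interface, namespace, runUUID string) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "perfapp-",
			Labels: map[string]string{
				// Illustrative label key; kube-burner's real key may differ.
				"kube-burner.io/uuid": runUUID,
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				// Placeholder image, only to keep the example self-contained.
				{Name: "perfapp", Image: "registry.example.com/perfapp:latest"},
			},
		},
	}
	_, err := clientSet.CoreV1().Pods(namespace).Create(context.TODO(), pod, metav1.CreateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := createLabeledPod(clientSet, "benchmark-ns", "example-uuid"); err != nil {
		panic(err)
	}
}
```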

@vishnuchalla vishnuchalla added the bug Something isn't working label Aug 9, 2023
@afcollins
Contributor

Thanks, Vishnu!

For context, the issue that I noticed is that measurements are captured for pods from a previous run when they are cleaned up during a following run, which means latency measurements for a run can include pods from a previous run that was executed without GC. This appears to be one use of the "latency < 0" check, because the measurements I see all have latencies v1=0 and v2>0.

I agree with the approach of filtering by the UUID of the current run, but (what I thought might be) a simple re-order of newMeasurementFactory and Cleanup would also suffice. https://github.com/cloud-bulldozer/kube-burner/blob/master/pkg/burner/job.go#L87-L119
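
For illustration only, a tiny runnable sketch of that ordering; the function names are hypothetical stand-ins for the real code in job.go, not its actual API:

```go
package main

import "fmt"

// Illustrative stand-ins for the steps discussed in pkg/burner/job.go; the
// names and signatures are assumptions, not kube-burner's actual functions.
type job struct{ name string }

func gcPreviousRun(j job)     { fmt.Println("garbage-collecting leftover namespaces for", j.name) }
func startMeasurements(j job) { fmt.Println("starting pod-latency watchers for", j.name) }
func createWorkload(j job)    { fmt.Println("creating objects for", j.name) }

func main() {
	j := job{name: "node-density-heavy"}

	// Proposed order: clean up leftovers first, then start the measurement
	// watchers, then create the workload, so the watchers never observe
	// deletion or Ready events from a previous run's pods.
	gcPreviousRun(j)
	startMeasurements(j)
	createWorkload(j)
}
```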

@vishnuchalla
Collaborator Author

> Thanks, Vishnu!
>
> For context, the issue that I noticed is that measurements are captured for pods from a previous run when they are cleaned up during a following run, which means latency measurements for a run can include pods from a previous run that was executed without GC. This appears to be one use of the "latency < 0" check, because the measurements I see all have latencies v1=0 and v2>0.
>
> I agree with the approach of filtering by the UUID of the current run, but (what I thought might be) a simple re-order of newMeasurementFactory and Cleanup would also suffice. https://github.com/cloud-bulldozer/kube-burner/blob/master/pkg/burner/job.go#L87-L119

Yes, Andrew! We just need to pick the labels correctly.

@rsevilla87
Member

rsevilla87 commented Aug 10, 2023

Makes sense. As discussed internally, this case only happens when one of the previous workloads is not garbage collected. It's curious; it seems like some pods, such as perfapp-1-0-7c449cc684-h5v2c, triggered a Ready event during the 2nd workload's runtime.

Also, if we reuse the UUID by chance, the issue could still happen. But I'd say that is very unlikely and not recommended.

@vishnuchalla
Collaborator Author

Acknowledged! I'm thinking of generating a unique runID for that specific run at program runtime and using it to label run-specific resources.
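
A minimal sketch of that idea; the github.com/google/uuid dependency and the label key are assumptions for illustration, not the final implementation:

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

func main() {
	// The runID is generated once per invocation and is never user-supplied,
	// so two runs cannot collide even if the same --uuid value is reused.
	runID := uuid.New().String()

	// Illustrative label set stamped on every object the run creates; the
	// label key is an assumption, not necessarily what kube-burner will use.
	labels := map[string]string{
		"kube-burner.io/runid": runID,
	}
	fmt.Println(labels)
}
```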

@jtaleric
Contributor

> Acknowledged! I'm thinking of generating a unique runID for that specific run at program runtime and using it to label run-specific resources.

Yeah - I would prefer not using the UUID. Just as we can mimic the behavior you call out here by setting gc=false and then gc=true, the user could literally just provide the same UUID for both executions.

@afcollins
Contributor

If a user uses the same UUID for two runs and gets wrong results because of this edge case, I think the bad measurements are on them. :) We expect the UUID to be unique for a run. (That's the second 'U'.)

@jtaleric
Contributor

> If a user uses the same UUID for two runs and gets wrong results because of this edge case, I think the bad measurements are on them. :) We expect the UUID to be unique for a run. (That's the second 'U'.)

I would argue the same thing if they had gc=false.

@afcollins
Contributor

Argue that gc=false is an edge case so I deserve bad measurements?

Running with gc=false is valid. I occasionally want to check the state of a workload before it gets cleaned up, so I disable GC.

I want a UUID to be random, so I let the tool generate it for me.

We can make an assumption that the UUID is the unique identifier for a run. No need for an additional UID.

Reset.

My initial complaint for this issue wasn't about filtering pods for the run, but about changing the order of operations:
First, clean up any previous workloads.
Next, create pod measurements and run the workload.

The bug I wanted to highlight was creating podMeasurements before cleaning up workloads, and that is getting lost in this discussion.

afcollins added a commit to afcollins/kube-burner that referenced this issue Aug 10, 2023
@afcollins
Contributor

afcollins commented Aug 10, 2023

This is all I wanted: 68dae0f

afcollins added a commit to afcollins/kube-burner that referenced this issue Aug 10, 2023
@jtaleric
Contributor

> Argue that gc=false is an edge case so I deserve bad measurements?

No, but in this situation if you forget to delete the namespace, it is on you.

> Running with gc=false is valid. I occasionally want to check the state of a workload before it gets cleaned up, so I disable GC.

No argument there.

> I want a UUID to be random, so I let the tool generate it for me.
>
> We can make an assumption that the UUID is the unique identifier for a run. No need for an additional UID.

Well, with that statement, we should remove the user-provided UUID -- so we don't get into the situation of the user providing the same UUID for both runs (similar to running with gc=false but forgetting to delete the leftover namespace).

> Reset.
>
> My initial complaint for this issue wasn't about filtering pods for the run, but about changing the order of operations: first, clean up any previous workloads; next, create pod measurements and run the workload.
>
> The bug I wanted to highlight was creating podMeasurements before cleaning up workloads, and that is getting lost in this discussion.

ack - ok.

@vishnuchalla
Collaborator Author

> This is all I wanted: 68dae0f

I wish it were that simple, but doing a cleanup early would impact other job types like patch, and I also think we should never give our users a chance to modify information related to the cluster resources and run into similar issues. So I'm introducing a unique ID in the label (internal to the app's program memory) so that we can filter only the pods that belong to our run and perform measurements on them.

Related PR: #421
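
A minimal sketch of the kind of measurement-side filtering described above, assuming a client-go watch restricted by a label selector; the label key and function are illustrative and not necessarily what the PR implements:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// watchRunPods restricts the pod watcher to pods carrying this run's internal
// runID label, so Ready events from a previous run's pods in a namespace with
// the same name are never recorded as measurements.
func watchRunPods(clientSet kubernetes.Interface, namespace, runID string) error {
	watcher, err := clientSet.CoreV1().Pods(namespace).Watch(context.TODO(), metav1.ListOptions{
		// Assumed label key, used here only for illustration.
		LabelSelector: fmt.Sprintf("kube-burner.io/runid=%s", runID),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()
	for event := range watcher.ResultChan() {
		fmt.Printf("observed %s event for one of this run's pods\n", event.Type)
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := watchRunPods(clientSet, "benchmark-ns", "example-run-id"); err != nil {
		panic(err)
	}
}
```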
