Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A pending workload that's deleted is still listed in the visibility API pendingworkloads results #1555

Closed
astefanutti opened this issue Jan 8, 2024 · 5 comments · Fixed by #1679
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@astefanutti
Copy link
Member

What happened:

If a pending workload is deleted, querying the visibility API stills reports the workload.

What you expected to happen:

The workload should not be listed in the pendingworkloads sub-resource after it's deleted.

How to reproduce it (as minimally and precisely as possible):

  1. Create a workload such that it's going to be pending
  2. Delete the job / workload
  3. Query the visibility API

Anything else we need to know?:

Once the operator Pod is deleted / has restarted, the results are corrected.

Environment:

  • Kubernetes version (use kubectl version): v1.25.3
  • Kueue version (use git describe --tags --dirty --always): v0.6.0-devel-146-ged81667f-dirty
@astefanutti astefanutti added the kind/bug Categorizes issue or PR as related to a bug. label Jan 8, 2024
@mimowo
Copy link
Contributor

mimowo commented Jan 26, 2024

Interesting, because the visibility API serves on-demand the data directly from the in-memory cache used by scheduler, which suggests the cache itself is not updated. Did you have a chance to check if the PendingWorkloads counter is correct:

PendingWorkloads int32 `json:"pendingWorkloads"`
?

@astefanutti
Copy link
Member Author

Interesting, because the visibility API serves on-demand the data directly from the in-memory cache used by scheduler, which suggests the cache itself is not updated.

Right, that'd be my understanding as well. I had traced it down to:

if workload.HasQuotaReservation(wl) {
r.queues.DeleteWorkload(wl)
}

For a pending workload, workload.HasQuotaReservation(wl) returns false, which skips the cache update.

Naively, I'd think that if statement could / should be removed. But I wanted to understand why it's been added in the first place, and if it's correct / safe to remove it.

@mimowo
Copy link
Contributor

mimowo commented Jan 26, 2024

/cc @alculquicondor

@alculquicondor
Copy link
Contributor

oh shoot, this is a bug introduced here: 57c714d#diff-60dd240c20adbd6a189d018d1c216c2d296730f446c341d8bf449fa6657964ffR242

it went from admission==nil to IsWorkloadAdmitted.

I think removing the if (always clear from queues) should be fine too.

And please add an integration test.

@astefanutti
Copy link
Member Author

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
3 participants