
EKS jobs not causing Karpenter to scale nodes #7355

Open
oweng opened this issue Nov 8, 2024 · 3 comments
Labels: documentation, triage/needs-information

Comments

@oweng

oweng commented Nov 8, 2024

Description

I've been looking through the docs, and maybe I am missing something, but we currently have all of our node pools scaled via Karpenter with no issues at all for Deployments.
Recently we started some Dagster deployments, and when the data runs kick off, they launch 25 batch Jobs. When this happens, the Jobs are all pinned to the single node in the node pool, and we don't see Karpenter scaling out. Pod-wise, they all start up and immediately enter a Running state, and the instance becomes more or less unresponsive until they eventually finish their work.
Am I missing something?

@oweng oweng added the documentation and needs-triage labels Nov 8, 2024
@YuriFrayman

Take a look at alternative solutions such as cast.ai, where you can gain significant stability coupled with significant savings.

@gladiatr72

gladiatr72 commented Nov 13, 2024

Sounds like you haven't defined pod.spec.containers.resources.requests.cpu, or, if you have, you've seriously low-balled it. Set ...requests.cpu to 1 and see if that doesn't sort it out. I'm not familiar with Dagster, but I'd also check its docs to determine how it configures concurrency when given no explicit instructions. If it has such a knob, set it to a single worker (or set it however you like, but use that same value for ...requests.cpu).
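
A minimal sketch of what an explicit CPU request on one of those batch Jobs might look like; the Job name and image are placeholders, and wiring the request through Dagster's own run configuration (rather than hand-editing the Job) is a separate step covered in Dagster's docs:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dagster-run-example            # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: run
          image: example.com/dagster-user-code:latest   # placeholder image
          resources:
            requests:
              cpu: "1"                 # gives kube-scheduler a real per-pod footprint to bin-pack against
              memory: 1Gi
            limits:
              memory: 1Gi
```

With a request like this, only as many of the 25 pods as the node's allocatable CPU allows can bind to the existing node; the rest go Pending, which is exactly the signal Karpenter provisions new nodes against.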

@jmdeal jmdeal added the triage/needs-information label and removed the needs-triage label Nov 14, 2024
@jmdeal
Contributor

jmdeal commented Nov 14, 2024

That definitely seems likely. Karpenter is not responsible for scheduling pods; kube-scheduler is. So if the pods scheduled successfully, that means Karpenter fulfilled its purpose of ensuring enough capacity was available on the cluster to satisfy the pods' requests.
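
To make that division of labor concrete, here is a sketch (assuming the Dagster-launched pods currently set no resource requests, which the thread suggests but does not confirm) of the spec kube-scheduler is evaluating; name and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dagster-run-worker             # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: run
      image: example.com/dagster-user-code:latest   # hypothetical image
      # No resources.requests: kube-scheduler treats each pod as needing ~0 CPU,
      # so all 25 pods fit on the one existing node, nothing ever goes Pending,
      # and Karpenter has no unschedulable pods to provision capacity for.
```

Running something like `kubectl describe node` on the affected instance would show the allocated CPU requests staying near zero even while the node itself is saturated, which matches the "node becomes unresponsive" symptom.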
