Configure kube pod process per job type #10200
Conversation
This looks like a good approach to me.
In terms of going forward with the env variables I think we should do one of the following:
- Do what you did for check, for discover and spec as well.
- Instead of calling it check selectors, call it quick jobs selectors or something.
#2 is a bet that we want to just use the same node pool for check, discover, and spec, i.e. all the jobs that return quickly. This seems plausible to me. That said, if we are wrong, it'll be a breaking change to get out of it (or at least awkward to get out of). #1 will be a bit more verbose though.
Only commenting since this is still a WIP.
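To make the two options concrete, here is a sketch of what the environment variables might look like. Only CHECK_JOB_KUBE_NODE_SELECTORS appears in this PR; the other variable names are hypothetical illustrations of options #1 and #2:

```shell
# Option #1 (hypothetical names): one selector variable per quick job type.
export CHECK_JOB_KUBE_NODE_SELECTORS="pool=quick-jobs"
export DISCOVER_JOB_KUBE_NODE_SELECTORS="pool=quick-jobs"
export SPEC_JOB_KUBE_NODE_SELECTORS="pool=quick-jobs"

# Option #2 (hypothetical name): one shared selector for all quick jobs.
export QUICK_JOB_KUBE_NODE_SELECTORS="pool=quick-jobs"
```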
airbyte-workers/src/main/java/io/airbyte/workers/WorkerConfigs.java
airbyte-config/models/src/main/java/io/airbyte/config/EnvConfigs.java
Thanks for the tag - direction looks good to me.
I'll take a closer look when the PR is ready to be reviewed.
@davinchia @cgardens made some updates to this today and added a few tests. I tried to figure out a good way to test this more thoroughly via KubePodProcessIntegrationTest.java, but since that test actually runs against a local kube cluster, I couldn't set job-specific node pools without preventing the pod from ever starting. If either of you have suggestions for testing this more thoroughly, please let me know!
@jrhizor this PR currently has merge conflicts with the resourceRequirement changes you made for replication orchestrator pods. I think your approach conflicts a bit with the approach I took in this PR, because you're adding additional information to WorkerConfigs, while my PR is instantiating a separate WorkerConfigs object for each pod type. I'm not super opinionated here, but I think we should be consistent. So either I can refactor my PR to use a single WorkerConfigs with new fields like this:
Or I can create a new … Let me know if that makes sense; also happy to chat through it tomorrow in person if that's easier.
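A rough sketch of the two shapes being discussed; all type, field, and factory names here are illustrative, not the actual Airbyte API:

```java
import java.util.Map;

final class WorkerConfigsSketch {

  // Approach A (hypothetical): one shared config object that carries
  // per-job-type fields alongside the defaults.
  record SharedWorkerConfigs(
      Map<String, String> defaultNodeSelectors,
      Map<String, String> checkJobNodeSelectors,
      Map<String, String> replicationOrchestratorNodeSelectors) {}

  // Approach B (hypothetical): a small config object instantiated once
  // per job type, each carrying only its own selectors.
  record PerJobWorkerConfigs(Map<String, String> nodeSelectors) {
    static PerJobWorkerConfigs forCheckJob() {
      return new PerJobWorkerConfigs(Map.of("pool", "quick-jobs"));
    }

    static PerJobWorkerConfigs forSyncJob() {
      return new PerJobWorkerConfigs(Map.of("pool", "sync-jobs"));
    }
  }
}
```

Approach B is what this PR does: each job type gets its own WorkerConfigs instance, so adding a new job-specific knob means adding it in one place rather than widening a shared object.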
Maybe a dumb question, but are you allowed to use wildcards with node selectors? If so, you could set the default's node selector to something you know will fail. There may be a real way to do this; this is obviously a bit of a workaround.
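As a sketch of that workaround (the label key and value here are invented, carried by no real node), the default pod template could pin a nodeSelector that can never match, so any pod scheduled with the default config stays Pending:

```yaml
# Hypothetical pod spec fragment: "no-such-pool" is a made-up label value,
# so the scheduler will never place this pod.
apiVersion: v1
kind: Pod
metadata:
  name: default-job-pod
spec:
  nodeSelector:
    airbyte/node-pool: no-such-pool
  containers:
    - name: worker
      image: busybox
```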
Makes sense to me. I left a suggestion that may or may not work for your integration test. @davinchia may have a better idea.
(also you can remove [wip] from the title)
Agreed with your approach.
w.r.t. testing, I would give a manual test in one of our dev envs a shot as a sanity check. I don't think there is a lot of value, relative to the complexity, in setting up an E2E node pool test, since the node selector behavior it would mainly assert is mostly hitting the already well-tested Kube API. The bit with more custom logic here is the env var injection, and it looks like there are tests to make sure the env vars are injected as expected - this seems fine to me.
@jrhizor can you sanity check this PR for the places it touches Replication Orchestrator configs? Key change is here: https://github.com/airbytehq/airbyte/pull/10200/files#diff-b332562b5e3b34bdc4e253e7186267ab8dbc13ec4364fd0d28e7fa418b5bfa6aR138-R154
@davinchia and @cgardens I will plan on releasing this with a new OSS version tomorrow, and will sanity check that new version on Cloud dev before doing the Cloud upgrade. Then once we have the new node pool with appropriate labels, we should be able to add … Of course, let me know if anything there doesn't make sense or should be approached differently!
🚀
Squashed commits:
* … job node selectors
* move status check interval to WorkerConfigs and customize for check worker
* add scaffolding for spec, discover, and sync configs
* optional orElse instead of orElseGet
@pmossman lgtm
QA Summary:
Also found the pod via GCP console on one of the new quick-jobs nodes: https://console.cloud.google.com/kubernetes/pod/us-west3/dev-2-gke-cluster/jobs/ource-pokeapi-sync-7e551fa5-ba30-4348-9fa5-8861ea56a5de-0-buiol/details?project=dev-2-ab-cloud-proj
So, I feel pretty comfortable that this can be merged. Cloud PR should go first, then this one. Will wait until Monday to avoid any issues over the weekend. @davinchia @cgardens for visibility
What
Issue: #10208
Rather than defining a single WorkerConfigs and KubePodProcess to serve every job type, we want job-type-specific configurations so that we can, for example, run Check job pods in a node pool that is tailored to the needs of that particular job type.
How
Before this PR, the WorkerApp instantiated a single WorkerConfigs instance and a single ProcessFactory instance, which were shared across every type of job.
Now, the WorkerApp instantiates one WorkerConfigs for each job type, and one ProcessFactory for each job type.
For now, only the Check job has new behavior (it will pull node selectors from a new environment variable, CHECK_JOB_KUBE_NODE_SELECTORS, and will use a status check interval of 1 second instead of 30). But the scaffolding is now in place to make further job-type-specific customization easier going forward.
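Node selector env vars of this kind are conventionally a comma-separated list of key=value pairs. The class and method names below are illustrative (the actual parsing lives in EnvConfigs and may differ), but they show the shape of turning such a string into the map a Kube pod spec needs:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: parses "pool=quick-jobs,zone=us-west3" into a
// node-selector map. Not the actual Airbyte EnvConfigs implementation.
final class NodeSelectorParser {

  static Map<String, String> parse(final String raw) {
    return Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(pair -> !pair.isEmpty())
        .map(pair -> pair.split("=", 2))
        .collect(Collectors.toMap(kv -> kv[0].trim(), kv -> kv[1].trim()));
  }
}
```

With this shape, setting CHECK_JOB_KUBE_NODE_SELECTORS="pool=quick-jobs" yields a one-entry map that the Check job's KubePodProcess can apply, while the other job types keep their own (possibly empty) selector maps.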