
Configure kube pod process per job type #10200

Merged
merged 2 commits on Feb 15, 2022

Conversation

pmossman
Contributor

@pmossman pmossman commented Feb 9, 2022

What

Issue: #10208

Rather than defining a single WorkerConfigs and KubePodProcess to serve every job type, we want job-type-specific configurations so that we can, for example, run Check job pods in a node pool that is tailored to the needs of that particular job type.

How

Before this PR, the WorkerApp instantiated a single WorkerConfigs instance and a single ProcessFactory instance, which were shared across every type of job.

Now, the WorkerApp instantiates one WorkerConfigs for each job type, and one ProcessFactory for each job type.

For now, only the Check job has new behavior (it will pull node selectors from a new environment variable, CHECK_JOB_KUBE_NODE_SELECTORS, and will use a status check interval of 1 second instead of 30). But the scaffolding is now in place to make further job-type-specific customization easier going forward.
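To make the shape of the change concrete, here is a minimal, self-contained sketch (illustrative names and selector values only, not the actual WorkerConfigs or WorkerApp code): each job type gets its own config object, and the Check configs carry their own node selectors plus the shorter 1-second status check interval.

import java.util.Map;

// Illustrative sketch only; the real WorkerConfigs class carries many more fields
// and is built from environment variables such as CHECK_JOB_KUBE_NODE_SELECTORS.
public class PerJobTypeConfigsSketch {

  // Stand-in for WorkerConfigs, reduced to the two fields this PR customizes for Check jobs.
  record JobTypeConfigs(Map<String, String> nodeSelectors, int statusCheckIntervalSeconds) {}

  public static void main(String[] args) {
    // Shared defaults: no extra node selectors, 30-second status check interval.
    final JobTypeConfigs defaultConfigs = new JobTypeConfigs(Map.of(), 30);

    // Check-specific configs: hypothetical selector value, 1-second status check interval.
    final JobTypeConfigs checkConfigs =
        new JobTypeConfigs(Map.of("cloud.google.com/gke-nodepool", "quick-jobs-pool"), 1);

    // In the real WorkerApp, each of these would feed its own ProcessFactory.
    System.out.println("default: " + defaultConfigs);
    System.out.println("check:   " + checkConfigs);
  }
}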

@github-actions github-actions bot added the area/platform (issues related to the platform) and area/worker (related to worker) labels on Feb 9, 2022
Contributor

@cgardens cgardens left a comment


This looks like a good approach to me.

In terms of going forward with the env variables I think we should do one of the following:

  1. Do what you did for check for discover and spec as well.
  2. Instead of calling it check selectors, call it quick jobs selectors or something.

#2 is a bet that we'll want to use the same node pool for check, discover, and spec, i.e. all the jobs that return quickly. This seems plausible to me. That said, if we're wrong, it'll be a breaking change to get out of (or at least awkward to get out of). #1 will be a bit more verbose, though.

Only commenting since this is still a WIP.

Contributor

@davinchia davinchia left a comment


Thanks for the tag - direction looks good to me.

I'll take a closer look when the PR is ready to be reviewed.

@pmossman pmossman temporarily deployed to more-secrets on February 10, 2022 01:55
@pmossman
Contributor Author

@davinchia @cgardens made some updates to this today and added a few tests. I tried to figure out a good way to test this more thoroughly via KubePodProcessIntegrationTest.java, but since that test is actually running against a local kube cluster, I couldn't actually set job-specific node pools since it would prevent the pod from ever starting. If either of you have suggestions for testing this more thoroughly, please let me know!

@pmossman
Contributor Author

pmossman commented Feb 10, 2022

@jrhizor this PR currently has merge conflicts with the resourceRequirement changes you made for replication orchestrator pods. I think your approach conflicts a little bit with the approach I took in this PR, because you're adding additional information to WorkerConfigs, while my PR is instantiating a separate WorkerConfigs object for each pod type.

I'm not super opinionated here, but I think we should be consistent. So either I can refactor my PR to use a single WorkerConfigs with new fields like this:

WorkerConfigs {
  ...
  private final Map<String, String> workerKubeNodeSelectors;
  private final Map<String, String> specKubeNodeSelectors;
  private final Map<String, String> checkKubeNodeSelectors;
  private final Map<String, String> discoverKubeNodeSelectors;
  ...
}

Or I can create a new WorkerConfigs specifically for the ReplicationWorker that sets its resourceRequirements to the new values (this would be more consistent with the approach in my PR right now). I think I slightly lean towards this approach just because it seems like WorkerConfigs can easily become a huge bloated class with custom configurations for every worker type.
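Roughly, that second option would look something like the sketch below (purely illustrative; the field names, factory method, and resource values are hypothetical, not the real constructors):

import java.util.Map;

// Illustrative only: a dedicated configs object for the replication orchestrator,
// instead of adding replication-specific fields to the shared WorkerConfigs.
record ReplicationOrchestratorConfigsSketch(
    Map<String, String> nodeSelectors,
    String cpuRequest,
    String memoryRequest) {

  // Hypothetical defaults; real resource values would come from env configuration.
  static ReplicationOrchestratorConfigsSketch defaults() {
    return new ReplicationOrchestratorConfigsSketch(Map.of(), "1", "1Gi");
  }
}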

Let me know if that makes sense, also down to chat through it tomorrow in person if that's easier

@cgardens
Contributor

I tried to figure out a good way to test this more thoroughly via KubePodProcessIntegrationTest.java, but since that test is actually running against a local kube cluster, I couldn't actually set job-specific node pools since it would prevent the pod from ever starting.

Maybe dumb, but are you allowed to use wildcards with node selectors? If you can, then you could set the default node selector to something you know will fail, like my-fake-node-pool, and set the job-type-specific selectors to *?

There may be a real way to do this; this is obviously a bit of a workaround.

Contributor

@cgardens cgardens left a comment


Makes sense to me. I left a suggestion that may or may not work for your integration test. @davinchia may have a better idea.

@cgardens
Contributor

(also you can remove [wip] from the title)

@jrhizor
Contributor

jrhizor commented Feb 10, 2022

Or I can create a new WorkerConfigs specifically for the ReplicationWorker that sets its resourceRequirements to the new values (this would be more consistent with the approach in my PR right now). I think I slightly lean towards this approach just because it seems like WorkerConfigs can easily become a huge bloated class with custom configurations for every worker type.

Agreed with your approach.

@pmossman pmossman changed the title from "[WIP] Configure kube pod process per job type" to "Configure kube pod process per job type" on Feb 10, 2022
@davinchia
Contributor

davinchia commented Feb 10, 2022

Regarding testing, I would give a manual test in one of our dev envs a shot as a sanity check. I don't think there's a lot of value relative to the complexity of setting up an E2E node pool test, since the node selector behavior it would mainly assert is mostly exercising the already well-tested Kube API.

The part with more custom logic is the env var injection, and it looks like there are tests to make sure the env vars are injected as expected, so this seems fine to me.

@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 17856f0 to 9fe0d05 on February 10, 2022 23:20
@pmossman pmossman temporarily deployed to more-secrets on February 10, 2022 23:22
@pmossman pmossman temporarily deployed to more-secrets on February 11, 2022 00:12
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 9fe0d05 to 7a67942 on February 11, 2022 00:50
@pmossman pmossman temporarily deployed to more-secrets on February 11, 2022 00:52
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 7a67942 to 3b78abf on February 11, 2022 01:23
@pmossman pmossman temporarily deployed to more-secrets on February 11, 2022 01:25
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 3b78abf to 40855f2 on February 11, 2022 01:36
@pmossman pmossman temporarily deployed to more-secrets on February 11, 2022 01:38
@pmossman
Contributor Author

@jrhizor can you sanity check this PR for the places it touches Replication Orchestrator configs?

Key change is here: https://github.com/airbytehq/airbyte/pull/10200/files#diff-b332562b5e3b34bdc4e253e7186267ab8dbc13ec4364fd0d28e7fa418b5bfa6aR138-R154

@davinchia and @cgardens I plan on releasing this with a new OSS version tomorrow, and will sanity check that new version on Cloud dev before doing the Cloud upgrade. Then once we have the new node pool with appropriate labels, we should be able to add CHECK_JOB_KUBE_NODE_SELECTORS here in order to get those Check pods moved over to the node pool.

Of course, let me know if anything there doesn't make sense or should be approached differently!

Contributor

@davinchia davinchia left a comment


🚀

… job node selectors

* move status check interval to WorkerConfigs and customize for check worker

* add scaffolding for spec, discover, and sync configs

* optional orElse instead of orElseGet
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 40855f2 to f39cf5d on February 11, 2022 16:45
@pmossman pmossman temporarily deployed to more-secrets on February 11, 2022 16:47
@jrhizor
Contributor

jrhizor commented Feb 11, 2022

@pmossman lgtm

@pmossman
Contributor Author

QA Summary:

  • Opened Cloud PR to populate CHECK_JOB_KUBE_NODE_SELECTORS with new quick-jobs node pools that Davin recently added
  • Deployed that Cloud branch to dev-2
  • Built, tagged, and pushed OSS airbyte/worker image from this branch
  • Edited dev-2 airbyte-worker deployment to pull this tagged image
  • Created a new source on dev-2 via UI and described the check pod that was spawned:
❯ kl describe pod ource-pokeapi-sync-7e551fa5-ba30-4348-9fa5-8861ea56a5de-0-buiol -n jobs | grep -i selector

Node-Selectors:  cloud.google.com/gke-nodepool=dev-2-quick-jobs-pool

Also found the pod via GCP console on one of the new quick-jobs nodes: https://console.cloud.google.com/kubernetes/pod/us-west3/dev-2-gke-cluster/jobs/ource-pokeapi-sync-7e551fa5-ba30-4348-9fa5-8861ea56a5de-0-buiol/details?project=dev-2-ab-cloud-proj

So, I feel pretty comfortable that this can be merged. Cloud PR should go first, then this one. Will wait until Monday to avoid any issues over the weekend. @davinchia @cgardens for visibility
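For reference, a minimal sketch of how a selector string like the one above could map to pod node selectors, assuming a comma-separated key=value format (an assumption on my part; the actual parsing in the codebase may differ):

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch: split a "k1=v1,k2=v2" string into a node-selector map.
// The format is assumed here; the real parsing in the codebase may differ.
public class NodeSelectorParseSketch {

  static Map<String, String> parse(final String raw) {
    return Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(s -> s.contains("="))
        .collect(Collectors.toMap(
            s -> s.substring(0, s.indexOf('=')).trim(),
            s -> s.substring(s.indexOf('=') + 1).trim()));
  }

  public static void main(String[] args) {
    // Selector value observed on the dev-2 check pod above.
    System.out.println(parse("cloud.google.com/gke-nodepool=dev-2-quick-jobs-pool"));
    // prints {cloud.google.com/gke-nodepool=dev-2-quick-jobs-pool}
  }
}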
