Add carbonplan cluster + hubs #391

Merged
12 commits merged on May 14, 2021
2 changes: 2 additions & 0 deletions .sops.yaml
@@ -1,5 +1,7 @@
creation_rules:
- path_regex: .*/secrets/.*
gcp_kms: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs
- path_regex: .*/kops/ssh-keys/.*
gcp_kms: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs
- path_regex: config/secrets.yaml$
gcp_kms: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs
182 changes: 182 additions & 0 deletions config/hubs/carbonplan.cluster.yaml
@@ -0,0 +1,182 @@
name: carbonplan
provider: kubeconfig
kubeconfig:
file: secrets/carbonplan.yaml
hubs:
- name: staging
domain: staging.carbonplan.2i2c.cloud
template: daskhub
auth0:
connection: github
config: &carbonPlanHubConfig
scratchBucket:
enabled: false
basehub:
nfsPVC:
nfs:
# from https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
mountOptions:
- rsize=1048576
- wsize=1048576
- timeo=600
- soft # We pick soft over hard, so NFS lockups don't lead to hung processes
- retrans=2
- noresvport
serverIP: fs-8a4e4f8d.efs.us-west-2.amazonaws.com
baseShareName: /
shareCreator:
tolerations:
- key: node-role.kubernetes.io/master
operator: "Exists"
effect: "NoSchedule"
jupyterhub:
homepage:
templateVars:
org:
name: Carbon Plan
logo_url: https://pbs.twimg.com/profile_images/1262387945971101697/5q_X3Ruk_400x400.jpg
url: https://carbonplan.org
designed_by:
name: 2i2c
url: https://2i2c.org
operated_by:
name: 2i2c
url: https://2i2c.org
funded_by:
name: Carbon Plan
url: https://carbonplan.org
singleuser:
initContainers:
# Need to explicitly fix ownership here, since EFS doesn't do anonuid
- name: volume-mount-ownership-fix
image: busybox
command: ["sh", "-c", "id && chown 1000:1000 /home/jovyan && ls -lhd /home/jovyan"]
securityContext:
runAsUser: 0
volumeMounts:
- name: home
mountPath: /home/jovyan
subPath: "{username}"
image:
name: carbonplan/trace-python-notebook
tag: sha-da2d1c9
profileList:
# The mem-guarantees are here so k8s doesn't schedule other pods
# on these nodes.
- display_name: "Small: r5.large"
description: "~2 CPU, ~15G RAM"
kubespawner_override:
# Explicitly unset mem_limit, so it overrides the default memory limit we set in
# basehub/values.yaml
mem_limit: null
mem_guarantee: 12G
node_selector:
node.kubernetes.io/instance-type: r5.large
- display_name: "Medium: r5.xlarge"
description: "~4 CPU, ~30G RAM"
kubespawner_override:
mem_limit: null
mem_guarantee: 29G
node_selector:
node.kubernetes.io/instance-type: r5.xlarge
- display_name: "Large: r5.2xlarge"
description: "~8 CPU, ~60G RAM"
kubespawner_override:
mem_limit: null
mem_guarantee: 60G
node_selector:
node.kubernetes.io/instance-type: r5.2xlarge
- display_name: "Huge: r5.8xlarge"
description: "~32 CPU, ~256G RAM"
kubespawner_override:
mem_limit: null
mem_guarantee: 250G
node_selector:
node.kubernetes.io/instance-type: r5.8xlarge
scheduling:
userPlaceholder:
enabled: false
replicas: 0
userScheduler:
enabled: false
proxy:
service:
type: LoadBalancer
https:
enabled: true
chp:
nodeSelector: {}
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
traefik:
nodeSelector: {}
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
hub:
allowNamedServers: true
networkPolicy:
# FIXME: For dask gateway
enabled: false
readinessProbe:
enabled: false
nodeSelector: {}
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
dask-gateway:
traefik:
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
controller:
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
gateway:
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
# TODO: figure out a replacement for userLimits.
extraConfig:
optionHandler: |
from dask_gateway_server.options import Options, Integer, Float, String
def cluster_options(user):
def option_handler(options):
if ":" not in options.image:
raise ValueError("When specifying an image you must also provide a tag")
extra_annotations = {
"hub.jupyter.org/username": user.name,
"prometheus.io/scrape": "true",
"prometheus.io/port": "8787",
}
extra_labels = {
"hub.jupyter.org/username": user.name,
}
return {
"worker_cores_limit": options.worker_cores,
"worker_cores": min(options.worker_cores / 2, 1),
"worker_memory": "%fG" % options.worker_memory,
"image": options.image,
"scheduler_extra_pod_annotations": extra_annotations,
"worker_extra_pod_annotations": extra_annotations,
"scheduler_extra_pod_labels": extra_labels,
"worker_extra_pod_labels": extra_labels,
}
return Options(
Integer("worker_cores", 2, min=1, max=16, label="Worker Cores"),
Float("worker_memory", 4, min=1, max=32, label="Worker Memory (GiB)"),
String("image", default="pangeo/pangeo-notebook:latest", label="Image"),
handler=option_handler,
)
c.Backend.cluster_options = cluster_options
idle: |
# timeout after 30 minutes of inactivity
c.KubeClusterConfig.idle_timeout = 1800
- name: prod
domain: carbonplan.2i2c.cloud
template: daskhub
auth0:
connection: github
config: *carbonPlanHubConfig
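Note: the cluster_options handler in the dask-gateway extraConfig above is what user-facing Dask Gateway options flow through. A minimal sketch of exercising it from a notebook on the hub (the sizes and image tag are illustrative, not defaults set by this PR):

```python
# Illustrative only: drive the cluster_options handler defined above from a
# user session on the hub; values here are examples, not PR defaults.
from dask_gateway import Gateway

gateway = Gateway()                  # address/auth are picked up from the daskhub environment
options = gateway.cluster_options()  # server-side Options: worker_cores, worker_memory, image
options.worker_cores = 4             # handler sets worker_cores_limit=4, worker_cores request=1
options.worker_memory = 8            # GiB; rendered as "8.000000G" by the handler
options.image = "pangeo/pangeo-notebook:latest"  # must include a tag or the handler raises

cluster = gateway.new_cluster(options)
cluster.scale(4)
client = cluster.get_client()
```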
10 changes: 5 additions & 5 deletions hub-templates/basehub/values.yaml
@@ -52,7 +52,7 @@ jupyterhub:
userScheduler:
enabled: true
nodeSelector:
hub.jupyter.org/pool-name: core-pool
hub.jupyter.org/node-purpose: core
resources:
requests:
# FIXME: Just unset this?
@@ -72,7 +72,7 @@
type: ClusterIP
chp:
nodeSelector:
hub.jupyter.org/pool-name: core-pool
hub.jupyter.org/node-purpose: core
resources:
requests:
# FIXME: We want no guarantees here!!!
@@ -83,7 +83,7 @@
memory: 1Gi
traefik:
nodeSelector:
hub.jupyter.org/pool-name: core-pool
hub.jupyter.org/node-purpose: core
resources:
requests:
memory: 64Mi
@@ -102,7 +102,7 @@
startTimeout: 600 # 10 mins, because sometimes we have too many new nodes coming up together
defaultUrl: /tree
nodeSelector:
hub.jupyter.org/pool-name: user-pool
hub.jupyter.org/node-purpose: user
image:
name: set_automatically_by_automation
tag: 1b83c4f
@@ -183,7 +183,7 @@
JupyterHub:
authenticator_class: oauthenticator.generic.GenericOAuthenticator
nodeSelector:
hub.jupyter.org/pool-name: core-pool
hub.jupyter.org/node-purpose: core
networkPolicy:
enabled: true
ingress:
11 changes: 5 additions & 6 deletions hub-templates/daskhub/values.yaml
@@ -123,10 +123,10 @@ dask-gateway:
# See https://github.com/dask/dask-gateway/blob/master/resources/helm/dask-gateway/values.yaml
controller:
nodeSelector:
hub.jupyter.org/pool-name: core-pool
k8s.dask.org/node-purpose: core
gateway:
nodeSelector:
hub.jupyter.org/pool-name: core-pool
k8s.dask.org/node-purpose: core
backend:
scheduler:
extraPodConfig:
@@ -143,8 +143,7 @@
value: "user"
effect: "NoSchedule"
nodeSelector:
# Schedulers should be in the user pool
hub.jupyter.org/pool-name: user-pool
k8s.dask.org/node-purpose: scheduler
cores:
request: 0.01
limit: 1
@@ -171,7 +170,7 @@
effect: "NoSchedule"
nodeSelector:
# Dask workers get their own pre-emptible pool
hub.jupyter.org/pool-name: dask-worker-pool
k8s.dask.org/node-purpose: worker

# TODO: figure out a replacement for userLimits.
extraConfig:
@@ -217,6 +216,6 @@ dask-gateway:
type: jupyterhub # Use JupyterHub to authenticate with Dask Gateway
traefik:
nodeSelector:
hub.jupyter.org/pool-name: core-pool
k8s.dask.org/node-purpose: core
service:
type: ClusterIP # Access Dask Gateway through JupyterHub. To access the Gateway from outside JupyterHub, this must be changed to a `LoadBalancer`.
37 changes: 32 additions & 5 deletions kops/carbonplan.jsonnet
@@ -15,7 +15,8 @@ local data = {
name: "carbonplanhub.k8s.local"
},
spec+: {
configBase: "s3://2i2c-carbonplan-kops-state"
// FIXME: Not sure if this is necessary?
configBase: "s3://2i2c-carbonplan-kops-state/%s" % data.cluster.metadata.name
},
_config+:: {
zone: zone,
@@ -33,15 +34,16 @@
machineType: "t3.medium",
subnets: [zone],
nodeLabels+: {
"hub.jupyter.org/pool-name": "core-pool"
"hub.jupyter.org/node-purpose": "core",
"k8s.dask.org/node-purpose": "core"
},
// Needs to be at least 1
minSize: 1,
maxSize: 3,
role: "Master"
},
},
nodes: [
notebookNodes: [
ig {
local thisIg = self,
metadata+: {
@@ -56,18 +58,43 @@
maxSize: 20,
role: "Node",
nodeLabels+: {
"hub.jupyter.org/pool-name": thisIg.metadata.name
"hub.jupyter.org/node-purpose": "user",
"k8s.dask.org/node-purpose": "scheduler"
},
taints: [
"hub.jupyter.org_dedicated=user:NoSchedule",
"hub.jupyter.org/dedicated=user:NoSchedule"
],
},
} + n for n in nodes
],
daskNodes: [
ig {
local thisIg = self,
metadata+: {
labels+: {
"kops.k8s.io/cluster": data.cluster.metadata.name
},
name: "dask-%s" % std.strReplace(thisIg.spec.machineType, ".", "-")
},
spec+: {
machineType: n.machineType,
subnets: [zone],
maxSize: 20,
role: "Node",
nodeLabels+: {
"k8s.dask.org/node-purpose": "worker"
},
taints: [
"k8s.dask.org_dedicated=worker:NoSchedule",
"k8s.dask.org/dedicated=worker:NoSchedule"
],
},
} + n for n in nodes
]
};

[
data.cluster,
data.master
] + data.nodes
] + data.notebookNodes + data.daskNodes
34 changes: 33 additions & 1 deletion kops/libsonnet/cluster.jsonnet
@@ -1,4 +1,33 @@
// local cluster(name, configBase, zone, masterIgName, networkCIDR, subnets) = {
// Exports a customizable kops Cluster object.
// https://kops.sigs.k8s.io/cluster_spec/ lists available properties.
//
// The default configuration sets up the following:
//
// 1. One etcd cluster each on the master node for events & api,
// with minimal resource allocations
// 2. Calico for in-cluster networking https://kops.sigs.k8s.io/networking/calico/,
// with the default settings. Explicitly decided against AWS-VPC cluster networking
// due to pod density issues - see https://github.com/2i2c-org/pangeo-hubs/issues/28.
// 3. Nodes in only one subnet in one AZ. Ideally, the master would be multi-AZ but
// the nodes would be single AZ. Multi AZ workers run into problems attaching PVs
// from other AZs (for the hub db PVC, for example), and incur networking costs for no
// clear benefit in our use case. An opinionated set of IP ranges is picked here,
// and the subnet is created in _config.zone.
// 4. kops defaults for networking - a /16 network for the entire cluster,
// with a /19 allocated to the one subnet currently in use. This allows for
// ~8000 currently active pods.
// 5. Kubernetes API and SSH access allowed from everywhere.
// 6. IAM Permissions to pull from ECR.
// 7. Enables feature gates to allow hub services to run on the master node as well.
// 8. Docker as the container runtime.
//
// Supports passing a hidden `_config` object that takes the following
// keys:
// 1. masterInstanceGroupName
// Name of the InstanceGroup that is the master. The etcd clusters will be
// put on this.
// 2. zone
// Zone where the cluster is to be set up
{
_config+:: {
masterInstanceGroupName: "",
Expand Down Expand Up @@ -63,6 +92,9 @@
anonymousAuth: false,
featureGates: {
// These boolean values need to be strings
// Without these, services can't target pods running on the master node.
// We want our hub core services to run on the master node, so we need
// to set these.
LegacyNodeRoleBehavior: "false",
ServiceNodeExclusion: "false"
}
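The comment block at the top of kops/libsonnet/cluster.jsonnet above quotes "~8000 currently active pods" for the /19 allocation; a quick worked check of that figure (assuming the estimate is simply the address count of a /19 block):

```python
# Back-of-the-envelope check of the "~8000 currently active pods" figure:
# a /19 block leaves 32 - 19 = 13 host bits, i.e. 2**13 addresses.
subnet_prefix = 19
addresses = 2 ** (32 - subnet_prefix)
print(addresses)  # 8192 -> roughly 8000 usable pod IPs once reserved addresses are excluded
```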