Accessing stdout from Dask worker and scheduler pods #768
A small proof of concept on the GCP cluster. These permissions are overly broad, since they grant read access to all the pods in the namespace. Ideally it would be filtered to just pods of a certain type / label (hopefully we can do this; I haven't checked).

```yaml
---
# rbac-reader-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: prod
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "watch", "list"]
---
# rbac-reader-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: pangeo
    namespace: prod
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

With those permissions:

```python
import kubernetes.config
import kubernetes.client
from dask_gateway import GatewayCluster
kubernetes.config.load_incluster_config()
cluster = GatewayCluster()
client = cluster.get_client()
cluster.scale(1)
v1 = kubernetes.client.CoreV1Api()
ret = v1.list_namespaced_pod("prod", watch=False)
pods = ret.items
import os

def filter_pods(cluster_name, pods):
    return [
        pod for pod in pods if
        pod.metadata.labels.get("app.kubernetes.io/name") == "dask-gateway" and
        pod.metadata.labels.get("hub.jupyter.org/username") == os.environ["JUPYTERHUB_USER"]
        and cluster_name.split(".")[-1] in set(pod.metadata.name.split("-"))
    ]

mypods = filter_pods(cluster.name, pods)
```

```python
>>> logs = [v1.read_namespaced_pod_log(pod.metadata.name, "prod") for pod in mypods]
>>> logs
['dask_gateway.dask_cli - INFO - Requesting scale to 1 workers from 0\n', '']
```

I imagine things like fetching the pods and filtering them would be packaged into a small library:

```python
import pangeo_cloud

pangeo_cloud.get_logs(cluster.name)
```
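
As a sketch of what such a helper might look like (the `pangeo_cloud` name and `get_logs` signature are hypothetical, not an existing package), it would just bundle the listing, filtering, and log-reading steps from the proof of concept above:

```python
# Hypothetical helper sketch: packages the proof-of-concept steps above.
# Assumes it runs inside a user's notebook pod with the read-only RBAC
# permissions shown earlier.
import os

import kubernetes.client
import kubernetes.config


def get_logs(cluster_name, namespace="prod"):
    """Return {pod name: log text} for this user's Dask Gateway cluster."""
    kubernetes.config.load_incluster_config()
    v1 = kubernetes.client.CoreV1Api()

    pods = v1.list_namespaced_pod(namespace, watch=False).items
    # Same filtering logic as filter_pods above: dask-gateway pods labelled
    # with this JupyterHub user, belonging to this particular cluster.
    mine = [
        pod for pod in pods
        if pod.metadata.labels.get("app.kubernetes.io/name") == "dask-gateway"
        and pod.metadata.labels.get("hub.jupyter.org/username") == os.environ["JUPYTERHUB_USER"]
        and cluster_name.split(".")[-1] in set(pod.metadata.name.split("-"))
    ]
    return {
        pod.metadata.name: v1.read_namespaced_pod_log(pod.metadata.name, namespace)
        for pod in mine
    }
```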
Apparently, this is difficult. You can filter on specific resource names, but not on labels. So if I'm correct, then supporting this means we need to be comfortable with everybody reading everyone else's logs, including the logs from our pods (the hub, traefik, dask-gateway, etc.).
Thanks for looking into this, Tom!

Yes, this is partially what motivated the move to dask-gateway. Personally I hope there is an alternative, but I don't mind if you go ahead with this. Perhaps the long term solution, which might also address #693, is to give each user their own kubernetes namespace.

One last idea: maybe instead of …
Thanks @TomAugspurger for organizing thoughts here. Quick question regarding this bit from above:

> Then we'd provide some method for collecting those to the client (either using `client.get_worker_logs()` or by `client.run(custom_function)` to collect the log output).

Is anything like this possible today, or is this a hypothetical API?
Today, the scheduler collects logs from `distributed.scheduler` onto a deque on the scheduler, and the workers collect logs from `distributed.worker`. Using `client.get_scheduler_logs()` and `client.get_worker_logs()` will bring those back to the client.

That doesn't catch anything else that's logged, however (e.g. `distributed.comm`, other libraries). The hypothetical bit is a way to collect all the logs on each container somewhere and a way to access them from the client.
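
For concreteness, a minimal example of what those existing methods give you today (run from a notebook with a Dask Gateway cluster, as in the snippets above):

```python
from dask_gateway import GatewayCluster

cluster = GatewayCluster()
client = cluster.get_client()
cluster.scale(1)

# Records captured by the distributed.scheduler / distributed.worker loggers.
scheduler_logs = client.get_scheduler_logs()  # sequence of (level, message) pairs
worker_logs = client.get_worker_logs()        # {worker address: (level, message) pairs}

for level, message in scheduler_logs:
    print(level, message)
```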
There are plenty of k8s logging scrapers, but the problem becomes that suddenly you need to: a) scrape logs, b) store logs, c) manage access to logs with a separate auth system that still needs to relate to Dask's auth system, as there should be a one-to-one mapping to grant the correct access.
I think the way to go is to stay within the code base of Dask in general. It is too easy otherwise to end up developing something that is hard to maintain sustainably, as it may lock in too many parameters (k8s, single-namespace deployment, or similar) with too small a user base to motivate its maintenance. Dask's code base can reasonably keep track of what goes on (logs); when interacting with it you already have an established identity, and you can avoid locking in to something that becomes specific to the deployment method (k8s worker pods, same-server worker processes, HPC jobs).

In my opinion, these considerations are similar to a domain I'm more familiar with. Consider a feature request relating to KubeSpawner, which creates the pods for JupyterHub where users get their own Jupyter server. One could implement this feature...
Option c) here is the more robust long-term solution, and I think for something like collecting logs from a Dask worker, it sounds like we should look for the equivalent of option c), or perhaps b), but not a).
As a small proof of concept, with the following in a config file picked up by dask, all the logging messages (from distributed or other libraries) will be sent to a handler on the root logger that keeps the most recent records:

```yaml
logging:
  version: 1
  handlers:
    deque_handler:
      class: distributed.utils.DequeHandler
      level: INFO
  loggers:
    "":
      level: INFO
      handlers:
        - deque_handler
```

That could be loaded when we start a pod (singleuser, scheduler, or worker) by including it in the Docker image. With some code, we can get the logs from the workers:

```python
def get_logs():
    import logging

    from distributed.utils import DequeHandler

    logger = logging.getLogger("")
    handler, = [handler for handler in logger.handlers if isinstance(handler, DequeHandler)]
    return [(msg.levelname, handler.format(msg)) for msg in handler.deque]


client.run(get_logs)
```

And from the scheduler with … But this workflow isn't the nicest. That logging config should be mostly harmless for others, but we aren't the only users of pangeo's docker images. And that … So now I'm wondering if we'd be better served by a new page in the scheduler dashboard that aggregates all the logs from the cluster (right now the worker logs page just shows the logs from `distributed.worker`).
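
Presumably the scheduler-side counterpart of `client.run(get_logs)` is `client.run_on_scheduler(get_logs)` (my reading of the elided text above); a quick sketch of collecting both:

```python
# Sketch: gather the DequeHandler contents from the workers and the scheduler.
# Assumes the logging config above is baked into the image on every pod, and
# that client.run_on_scheduler is the intended scheduler-side counterpart
# (an assumption about the elided text above).
worker_logs = client.run(get_logs)                   # {worker address: [(level, message), ...]}
scheduler_logs = client.run_on_scheduler(get_logs)   # [(level, message), ...]

for address, records in worker_logs.items():
    print(address)
    for level, message in records:
        print("   ", level, message)
```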
The problem: users can't see text printed to stdout on remote pods (primarily Dask's scheduler and workers). This makes debugging certain problems difficult or impossible if you can't track down someone with access to the logs (i.e. someone with access to the kubernetes API).
There are two primary sources of things being printed to stdout:

1. Dask's own logging (the `distributed.*` loggers).
2. Other libraries or code printing directly to stdout (e.g. a C or Fortran library with no concept of Python's logging).
I'd like to make it easy to get both of these.

Problem 1 can be somewhat solved through conventional tools. We'd configure Dask's loggers to output somewhere (probably in-memory, maybe over the network to the scheduler?). Then we'd provide some method for collecting those to the client (either using `client.get_worker_logs()` or by `client.run(custom_function)` to collect the log output).

The second is harder. If a library is printing directly to stdout (say it's some C or Fortran library that has no concept of Python's logging), it's essentially too late for Dask to catch it. Kubernetes, however, does catch it, and displays it in the pod's logs. With the default configuration, Dask's logs with a level of `logging.WARNING` or higher also appear there.

Given that solving the second problem also solves the first, let's focus on that. My proposal has two components:
1. Grant limited kubernetes API permissions (just `get` and `list` for pods and pod logs) to users' jupyter pods.
2. Provide a small helper for fetching and filtering the logs for a user's own cluster (along the lines of the `pangeo_cloud.get_logs` idea elsewhere in this thread).

In the past we had concerns about exposing the kubernetes API to user pods. My hope is that we can be limited in what permissions we grant users. Ideally users would only be able to see (and not modify) objects related to their own pods. I suspect that isolating users from each other won't be feasible, and everyone will be able to read everyone else's pods. Is that a blocker for anyone?
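
One hedged sketch of scoping things client-side: RBAC apparently can't restrict reads by label (see the first comment in this thread), but a helper could at least use a server-side label selector so it only fetches the current user's dask-gateway pods. Not a security boundary, just a convenience:

```python
# Sketch: fetch only this user's dask-gateway pods via a label selector.
# This is a convenience, not access control: the Role above still permits
# reading any pod in the namespace.
import os

import kubernetes.client
import kubernetes.config

kubernetes.config.load_incluster_config()
v1 = kubernetes.client.CoreV1Api()

selector = (
    "app.kubernetes.io/name=dask-gateway,"
    f"hub.jupyter.org/username={os.environ['JUPYTERHUB_USER']}"
)
my_pods = v1.list_namespaced_pod("prod", label_selector=selector).items
logs = {p.metadata.name: v1.read_namespaced_pod_log(p.metadata.name, "prod") for p in my_pods}
```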