K8s 1.17 compatibility #4822

Closed
openrory opened this issue Mar 4, 2020 · 15 comments

Comments

@openrory

openrory commented Mar 4, 2020

/kind question
/platform k8s

Question:
Hi there,
I really appreciate the work you're putting into Kubeflow. However, we will be moving to k8s 1.17 in a few months: will you be supporting k8s 1.17 anytime soon?

Cheers

@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@saschagrunert
Contributor

Hey 👋, do we have an overview somewhere of what needs to be done to support 1.17?

@yanniszark
Contributor

Hi @openrory @saschagrunert!
This issue tracks support for K8s 1.16, which needs a lot of changes: kubeflow/manifests#375
This is something we'd appreciate some help with :)
Also see: #4680 (comment)

I would start from there.
My procedure would be:

  1. Deploy a 1.17 K8s cluster
  2. Follow the guide in https://www.kubeflow.org/docs/started/k8s/kfctl-istio-dex/. It's the one I have checked; it should be OK with 1.16, so maybe with 1.17 as well.

@saschagrunert
Contributor

saschagrunert commented Mar 5, 2020

I've deployed Kubeflow 1.0.0 on the latest Kubernetes master (1f2e1967d1c) via ./hack/local-up-cluster.sh (backed with CRI-O) and see (removed completed jobs):

> kubectl get pods --all-namespaces
NAMESPACE         NAME                                                          READY   STATUS             RESTARTS   AGE
cert-manager      cert-manager-cainjector-c578b68fc-x2jnj                       1/1     Running            0          13m
cert-manager      cert-manager-fcc6cd946-69qfp                                  1/1     Running            0          13m
cert-manager      cert-manager-webhook-657b94c676-74tnp                         1/1     Running            0          13m
istio-system      cluster-local-gateway-78f6cbff8d-zdrh4                        1/1     Running            0          13m
istio-system      grafana-68bcfd88b6-blqkq                                      1/1     Running            0          13m
istio-system      istio-citadel-7dd6877d4d-jvr4s                                1/1     Running            0          13m
istio-system      istio-egressgateway-7c888bd9b9-tzrkn                          1/1     Running            0          13m
istio-system      istio-galley-5bc58d7c89-qwszf                                 1/1     Running            0          13m
istio-system      istio-ingressgateway-866fb99878-sr8c9                         1/1     Running            0          13m
istio-system      istio-pilot-67f9bd57b-qg46h                                   2/2     Running            0          13m
istio-system      istio-policy-749ff546dd-wnkpd                                 2/2     Running            2          13m
istio-system      istio-sidecar-injector-cc5ddbc7-6kqqw                         1/1     Running            0          13m
istio-system      istio-telemetry-6f6d8db656-hrj7m                              2/2     Running            2          13m
istio-system      istio-tracing-84cbc6bc8-pg5sk                                 1/1     Running            0          13m
istio-system      kiali-7879b57b46-clrft                                        1/1     Running            0          13m
istio-system      prometheus-744f885d74-57x82                                   1/1     Running            0          13m
knative-serving   activator-58595c998d-rgb7p                                    2/2     Running            1          10m
knative-serving   autoscaler-7ffb4cf7d7-bh62v                                   2/2     Running            2          10m
knative-serving   autoscaler-hpa-686b99f459-8kj7q                               1/1     Running            0          10m
knative-serving   controller-c6d7f946-cxwn4                                     1/1     Running            0          10m
knative-serving   networking-istio-ff8674ddf-fgtxr                              1/1     Running            0          10m
knative-serving   webhook-6d99c5dbbf-kmq6b                                      1/1     Running            0          10m
kube-system       kube-dns-74b889989-gngpx                                      3/3     Running            0          15m
kubeflow          admission-webhook-bootstrap-stateful-set-0                    1/1     Running            0          10m
kubeflow          admission-webhook-deployment-59bc556b94-gm762                 1/1     Running            0          10m
kubeflow          application-controller-stateful-set-0                         1/1     Running            0          13m
kubeflow          argo-ui-5f845464d7-pkntm                                      1/1     Running            0          10m
kubeflow          centraldashboard-d5c6d6bf-2zvc5                               1/1     Running            0          10m
kubeflow          jupyter-web-app-deployment-544b7d5684-8rg72                   1/1     Running            0          10m
kubeflow          katib-controller-6b87947df8-pdbq4                             1/1     Running            1          10m
kubeflow          katib-db-manager-54b64f99b-p96ns                              0/1     CrashLoopBackOff   3          10m
kubeflow          katib-mysql-74747879d7-r9jpw                                  0/1     CrashLoopBackOff   5          10m
kubeflow          katib-ui-76f84754b6-w86wc                                     1/1     Running            0          10m
kubeflow          kfserving-controller-manager-0                                2/2     Running            1          10m
kubeflow          metacontroller-0                                              1/1     Running            0          10m
kubeflow          metadata-db-79d6cf9d94-9pzb5                                  0/1     CrashLoopBackOff   6          10m
kubeflow          metadata-deployment-5dd4c9d4cf-4bsm2                          0/1     Running            0          10m
kubeflow          metadata-envoy-deployment-5b9f9466d9-dlmnm                    1/1     Running            0          10m
kubeflow          metadata-grpc-deployment-66cf7949ff-gxf4x                     0/1     CrashLoopBackOff   6          10m
kubeflow          metadata-ui-8968fc7d9-8xhh9                                   1/1     Running            0          10m
kubeflow          minio-5dc88dd55c-cpwvn                                        1/1     Running            0          10m
kubeflow          ml-pipeline-55b669bf4d-fgtz7                                  1/1     Running            4          10m
kubeflow          ml-pipeline-ml-pipeline-visualizationserver-c489f5dd8-6mpww   1/1     Running            0          10m
kubeflow          ml-pipeline-persistenceagent-f54b4dcf5-2hqnv                  1/1     Running            1          10m
kubeflow          ml-pipeline-scheduledworkflow-7f5d9d967b-ssx9z                1/1     Running            0          10m
kubeflow          ml-pipeline-ui-7bb97bf8d8-bb9qk                               1/1     Running            0          10m
kubeflow          ml-pipeline-viewer-controller-deployment-584cd7674b-b85ch     1/1     Running            0          10m
kubeflow          mysql-66c5c7bf56-f6b2t                                        1/1     Running            0          10m
kubeflow          notebook-controller-deployment-576589db9d-pjj7t               1/1     Running            0          10m
kubeflow          profiles-deployment-769b65b76d-gqhtq                          2/2     Running            0          10m
kubeflow          pytorch-operator-666dd4cd49-7g9jk                             1/1     Running            0          10m
kubeflow          seldon-controller-manager-5d96986d47-8sdbp                    1/1     Running            0          10m
kubeflow          spark-operatorsparkoperator-7c484c6859-cpnkb                  1/1     Running            0          10m
kubeflow          spartakus-volunteer-7465bcbdc-n7mhx                           1/1     Running            0          10m
kubeflow          tensorboard-6549cd78c9-c7wrl                                  1/1     Running            0          10m
kubeflow          tf-job-operator-7574b968b5-w6szj                              1/1     Running            0          10m
kubeflow          workflow-controller-6db95548dd-4j94c                          1/1     Running            0          10m
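
A listing like the one above can be triaged programmatically. The following is a minimal sketch and not part of the original report: the function name `unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```python
def unhealthy_pods(kubectl_output):
    """Return (namespace, name, status) for pods that are not fully ready."""
    problems = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the prompt line, the header row and anything that is not a pod row.
        if len(fields) < 6 or fields[0] == "NAMESPACE" or "/" not in fields[2]:
            continue
        namespace, name, ready, status = fields[:4]
        if status == "Completed":
            # Finished jobs legitimately report 0/1 ready.
            continue
        ready_count, total = (int(n) for n in ready.split("/"))
        if status != "Running" or ready_count < total:
            problems.append((namespace, name, status))
    return problems


# Three rows copied from the listing above, plus the header.
sample = """\
NAMESPACE  NAME                                  READY  STATUS            RESTARTS  AGE
kubeflow   katib-ui-76f84754b6-w86wc             1/1    Running           0         10m
kubeflow   metadata-db-79d6cf9d94-9pzb5          0/1    CrashLoopBackOff  6         10m
kubeflow   metadata-deployment-5dd4c9d4cf-4bsm2  0/1    Running           0         10m
"""

for namespace, name, status in unhealthy_pods(sample):
    print(f"{namespace}/{name}: {status}")
```

Note that a pod can be in the `Running` state while reporting `0/1` ready, as `metadata-deployment` does here, so the sketch checks the READY column as well as STATUS.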

The first thing I noticed is that the metadata-db pod does not run because of:

> kubectl -n kubeflow logs -f metadata-db-79d6cf9d94-9pzb5
mkdir: cannot create directory '/var/lib/mysql': Permission denied

If I change the datadir to something else, then the mysql server comes up. The same applies to the katib mysql server. After that change I'm able to connect to kubeflow.
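
The comment does not say how the datadir was changed. One possible way, sketched below under stated assumptions, is to pass the `mysqld` option `--datadir` through the container's `args` in the Deployment manifest. The helper name `with_mysql_datadir`, the container name `db-container`, and the path `/var/lib/mysql/datadir` are all hypothetical, and whether the image forwards `args` to `mysqld` is an assumption about the image in use.

```python
import copy


def with_mysql_datadir(deployment, container_name, datadir):
    """Return a copy of a Deployment manifest with --datadir set on one container."""
    patched = copy.deepcopy(deployment)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        if container.get("name") != container_name:
            continue
        # Drop any previous --datadir option, then append the new one.
        args = [a for a in container.get("args", []) if not a.startswith("--datadir")]
        args.append(f"--datadir={datadir}")
        container["args"] = args
    return patched


# Hypothetical, heavily trimmed manifest; real manifests carry many more fields.
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{"name": "db-container"}]}}},
}

patched = with_mysql_datadir(deployment, "db-container", "/var/lib/mysql/datadir")
print(patched["spec"]["template"]["spec"]["containers"][0]["args"])
```

The function works on a deep copy, so the original manifest stays untouched and the two can be compared before anything is applied to the cluster.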

If I now start a notebook, I get more or less the same issue:

> kubectl -n sascha logs -f another-test-0 another-test
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/traitlets/traitlets.py", line 528, in get
    value = obj._trait_values[self.name]
KeyError: 'runtime_dir'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/jupyter-notebook", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/jupyter_core/application.py", line 268, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 663, in launch_instance
    app.initialize(argv)
  File "</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-7>", line 2, in initialize
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/notebook/notebookapp.py", line 1717, in initialize
    self.init_configurables()
  File "/usr/local/lib/python3.6/dist-packages/notebook/notebookapp.py", line 1372, in init_configurables
    connection_dir=self.runtime_dir,
  File "/usr/local/lib/python3.6/dist-packages/traitlets/traitlets.py", line 556, in __get__
    return self.get(obj, cls)
  File "/usr/local/lib/python3.6/dist-packages/traitlets/traitlets.py", line 535, in get
    value = self._validate(obj, dynamic_default())
  File "/usr/local/lib/python3.6/dist-packages/jupyter_core/application.py", line 99, in _runtime_dir_default
    ensure_dir_exists(rd, mode=0o700)
  File "/usr/local/lib/python3.6/dist-packages/jupyter_core/utils/__init__.py", line 13, in ensure_dir_exists
    os.makedirs(path, mode=mode)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'

@saschagrunert
Contributor

Seems related #4827

@yanniszark
Contributor

@saschagrunert the Notebook Pods rely on the FSGroup being set correctly, in order to set correct permissions on the PVC.

I know for example, that hostPath provisioning doesn't support this.
Perhaps you are also using a provisioner that doesn't support it?
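
For reference, `fsGroup` is a field of the Pod-level `securityContext`. The sketch below shows where it sits in a Pod manifest, using a hypothetical helper `with_fs_group` and a placeholder group ID of 100. The value the notebook controller actually sets is not stated in this thread, so treat the number as an assumption.

```python
import copy


def with_fs_group(pod, gid):
    """Return a copy of a Pod manifest with spec.securityContext.fsGroup set."""
    patched = copy.deepcopy(pod)
    security = patched.setdefault("spec", {}).setdefault("securityContext", {})
    # Volumes that support ownership management are made group-accessible to this GID.
    security["fsGroup"] = gid
    return patched


# Hypothetical, trimmed Pod manifest named after the notebook in the log above.
pod = {"kind": "Pod", "metadata": {"name": "another-test-0"}, "spec": {"containers": []}}

patched = with_fs_group(pod, 100)
print(patched["spec"]["securityContext"])
```

Setting the field only helps when the volume plugin honours it. With a provisioner that does not, the mounted directory keeps its original ownership and the process fails with a permission error like the one in the traceback above.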

@saschagrunert
Contributor

> @saschagrunert the Notebook Pods rely on the FSGroup being set correctly, in order to set correct permissions on the PVC.
>
> I know for example, that hostPath provisioning doesn't support this.
> Perhaps you are also using a provisioner that doesn't support it?

Thanks for the hint. I'm not running on an NFS provisioner and then the deployment works seamlessly. I'm continuing my tests now.

@saschagrunert
Contributor

@yanniszark is there a test suite I can run against my local cluster? It looks like the tests running here rely on Google Cloud, right?

@vaskokj

vaskokj commented Mar 6, 2020

@saschagrunert

Are you or aren't you running on the NFS provisioner? I assume you meant "If I'm not running on an NFS provisioner, then the deployment works seamlessly".

If so, Kubeflow doesn't work with it due to mysql containers not starting and functioning properly. #4827 (comment)

I'm trying to figure out why, but it's extremely slow going.

The other error you are seeing with the PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local' is an issue supposedly with "permissions" on the PV.

#4538
#4538 (comment)

@Minkyu-Choi

Hi there,
From kubeflow/manifests#375, it looks like the manifest deployment problem has been solved on k8s 1.16.

Apart from the API version deprecations introduced in k8s 1.16, are there any other incompatibilities blocking a Kubeflow deployment?

We're currently using Kubeflow 1.0 on k8s 1.17 and haven't had any issues so far.
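
The API version deprecations mentioned above refer to the workload API versions that Kubernetes stopped serving in 1.16. A minimal sketch for spotting them in already-parsed manifests follows. The helper name `removed_in_1_16` and the sample manifests are hypothetical, and the table covers only the commonly cited kinds, so it should not be read as exhaustive.

```python
# Kinds and the apiVersions that are no longer served for them as of Kubernetes 1.16.
REMOVED_IN_1_16 = {
    "Deployment": {"extensions/v1beta1", "apps/v1beta1", "apps/v1beta2"},
    "DaemonSet": {"extensions/v1beta1", "apps/v1beta2"},
    "StatefulSet": {"apps/v1beta1", "apps/v1beta2"},
    "ReplicaSet": {"extensions/v1beta1", "apps/v1beta1", "apps/v1beta2"},
    "NetworkPolicy": {"extensions/v1beta1"},
    "PodSecurityPolicy": {"extensions/v1beta1"},
}


def removed_in_1_16(manifests):
    """Return (kind, name, apiVersion) for manifests using a removed API version."""
    hits = []
    for manifest in manifests:
        kind = manifest.get("kind")
        api_version = manifest.get("apiVersion")
        if api_version in REMOVED_IN_1_16.get(kind, set()):
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            hits.append((kind, name, api_version))
    return hits


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"apiVersion": "extensions/v1beta1", "kind": "Deployment", "metadata": {"name": "old-deploy"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "new-deploy"}},
    {"apiVersion": "extensions/v1beta1", "kind": "Ingress", "metadata": {"name": "ingress"}},
]

for kind, name, api_version in removed_in_1_16(manifests):
    print(f"{kind}/{name} uses removed apiVersion {api_version}")
```

The Ingress entry is deliberately not flagged: `extensions/v1beta1` Ingress was still served in 1.16 and 1.17.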

@jlewi
Contributor

jlewi commented May 18, 2020

Removing from 1.1 because we aren't explicitly targeting support for K8s 1.17 in 1.1.

@stale

stale bot commented Aug 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/jupyter 0.83


@issue-label-bot issue-label-bot bot added the area/jupyter Issues related to Jupyter label Aug 16, 2020
@Jeffwan
Member

Jeffwan commented Aug 17, 2020

Tested 1.17 with KF 1.1. We don't see breaking changes there. It should be able to run on 1.17.

@stale stale bot removed the lifecycle/stale label Aug 17, 2020
@stale

stale bot commented Nov 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Nov 24, 2020