Spark on Kubernetes integration #1030
Okay, I just figured out that it is necessary to build the Docker images yourself, based on the executing environment (in this case the jupyter pyspark docker image), and to provide them to the …
Thanks for documenting what you learn @h4gen!
So, some progress over here. I did the following to build the images and provide the necessary information to the
Feel free to skip these steps and use my pre-built images from docker hub to test this out yourself (assuming I made no mistakes so far):
Now I get the following error:
A short Google query suggests that this is a sudo problem in the user pod. Will investigate further in the next few days. Cheers!
Hello everyone, I further investigated the error and came up with a running configuration.
So far everything I tried has worked pretty nicely. I will update this if I encounter any further problems. Cheers!
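Since the exact configuration did not survive in this thread, here is a rough, untested sketch of what a client-mode Spark 2.4 on Kubernetes setup from a notebook pod typically involves; every concrete value below (image name, namespace, port) is a placeholder rather than the configuration referred to above:

```python
# Rough, untested sketch of a client-mode Spark 2.4 on Kubernetes setup run from
# a notebook pod. The notebook pod's service account (daskkubernetes in the pangeo
# setup) must be allowed to create and delete pods for this to work.
import socket
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("jupyter-spark-on-k8s")
    # In-cluster DNS name of the Kubernetes API server.
    .setMaster("k8s://https://kubernetes.default.svc:443")
    # Self-built image matching the notebook environment (placeholder name).
    .set("spark.kubernetes.container.image", "myrepo/pyspark-k8s:2.4.0")
    .set("spark.kubernetes.namespace", "jupyter")   # placeholder namespace
    .set("spark.executor.instances", "2")
    # In client mode the executors connect back to the driver (this pod),
    # so advertise the pod IP and a fixed port.
    .set("spark.driver.host", socket.gethostbyname(socket.gethostname()))
    .set("spark.driver.port", "29413")
)

sc = SparkContext(conf=conf)
```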
Wieeee thank you so much @h4gen! This may help me out personally a lot in the future!
A few further thoughts on this.
Some loose thoughts, no clear answer (writing from mobile): hub.extraConfig can expand the dictionary found in …, but you can also use the chart's singleuser.extraEnv to set it directly from the chart config. If you have a k8s service pointing to some pod, you can reach it at the URI mysvc.mynamespace.svc.cluster.local, btw. Note that jupyter-username is a pod name, and you cannot access it as a network identifier like google.se; if you needed that, you would have to create a service for each user pod, pointing to pods with certain labels. I think the master may always be reached at a fixed IP on GKE and other managed k8s provided by some cloud provider, btw.
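As a concrete (hypothetical) illustration of the extraConfig route: hub.extraConfig is plain Python appended to the hub's jupyterhub_config.py, so spawner environment variables can be set there directly. The variable name and value below are placeholders, not taken from this thread:

```python
# hub.extraConfig runs inside the hub's jupyterhub_config.py, so KubeSpawner
# settings are available; MY_SPARK_MASTER is an arbitrary placeholder name.
c.KubeSpawner.environment = {
    # The API server is reachable from any pod under this in-cluster DNS name.
    "MY_SPARK_MASTER": "k8s://https://kubernetes.default.svc:443",
}
```

The chart's singleuser.extraEnv sets the same environment variables declaratively from config.yaml instead.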
Kubernetes creates the environment variable for you automatically. See: https://kubernetes.io/docs/concepts/services-networking/service/#discovering-services. To get the pod IP, it is probably most convenient to use Kubernetes' downward API: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#use-pod-fields-as-values-for-environment-variables
(You could also just call …)
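One way to wire such a downward-API field into every user pod from hub.extraConfig is KubeSpawner's modify_pod_hook; the sketch below is an assumption on my part (MY_POD_IP is an arbitrary placeholder name), not something taken from this thread:

```python
# Sketch: add a downward-API env var to every spawned user pod.
from kubernetes import client


def add_pod_ip_env(spawner, pod):
    # Expose the pod's own IP to the notebook as MY_POD_IP (placeholder name).
    pod.spec.containers[0].env.append(
        client.V1EnvVar(
            name="MY_POD_IP",
            value_from=client.V1EnvVarSource(
                field_ref=client.V1ObjectFieldSelector(field_path="status.podIP")
            ),
        )
    )
    return pod


c.KubeSpawner.modify_pod_hook = add_pod_ip_env
```

From inside the notebook itself, socket.gethostbyname(socket.gethostname()) usually returns the same pod IP without any extra plumbing.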
Slam dunk @dsludwig! :D
This is a cool write-up with lots of useful bits of information. Could you be persuaded to write it up once more now that you know what "the answer" is and then post it on https://discourse.jupyter.org/c/jupyterhub/z2jh-k8s? Our thinking is that the discourse forum is a lot more discoverable (and better indexed by google?) than GitHub issues. Something to ponder.
Before I do the documentation, I would like to ask again if somebody can give me a hint regarding the Spark UI problem. I assumed it would work with nbserverproxy, but it does not :( I'll sum up all the information I have:
One can see that the exposed … Any ideas on this? I'm a bit lost. Thank you!
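For what it's worth, an untested sketch of the proxy route: the Spark UI listens on port 4040 in the notebook pod, nbserverproxy exposes local ports under /proxy/<port>, and spark.ui.proxyBase tells the UI to generate links relative to that path. The prefix handling below is an assumption, not a confirmed fix for the problem described above:

```python
# Untested sketch: serve the Spark UI through the notebook server's proxy.
import os
from pyspark import SparkConf

# JUPYTERHUB_SERVICE_PREFIX is set by JupyterHub, e.g. "/user/<name>/".
prefix = os.environ.get("JUPYTERHUB_SERVICE_PREFIX", "/")

# Merge this setting into the rest of the Spark configuration.
conf = SparkConf().set("spark.ui.proxyBase", prefix + "proxy/4040")
# With nbserverproxy installed in the user image, the UI should then be
# reachable at https://<hub-address><prefix>proxy/4040/
```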
@h4gen Can you submit this as an issue to nbserverproxy so we can discuss it there? A couple of simple things to try:
Not entirely sure if pangeo is sufficiently similar for this to be useful or not, but I'll throw it up here for the record. I've got Spark 2.4 on Kubernetes running nicely with the standard Toree kernel and version 0.7 of the jupyterhub helm charts.
Create a config.yaml:
Then install. I've elided the Spark RBAC setup, but it's the usual stuff from the spark-on-kubernetes docs. At least for me, figuring out how to use the service account credentials (and that they were needed) and getting the pod context configuration (namespace, pod name, IP) into the environment variables was tricky, since most of the examples I've seen were using downward config maps, which don't seem to "fit" into …
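To make the service-account part concrete, a hedged sketch (the configuration key names come from the Spark 2.4 running-on-kubernetes docs; whether this matches the elided config.yaml above is an assumption): a pod's service account credentials are mounted at a well-known path and can be handed to Spark's client-mode authentication settings.

```python
# Sketch: point Spark's client-mode Kubernetes authentication at the pod's
# mounted service account credentials.
from pyspark import SparkConf

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

conf = (
    SparkConf()
    .set("spark.kubernetes.authenticate.caCertFile", SA_DIR + "/ca.crt")
    .set("spark.kubernetes.authenticate.oauthTokenFile", SA_DIR + "/token")
    # The namespace the pod runs in is mounted alongside the credentials.
    .set("spark.kubernetes.namespace", open(SA_DIR + "/namespace").read().strip())
)
```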
@easel This is super helpful! A few questions regarding your setup:
Hi @h4gen, answers per above:
Thanks for the answers, @easel. A few more questions:
but on the pod this results in an environment variable like:
Does anybody know why this is the case? Thank you!
Having the same issue with …
@metrofun The best way I found to set this dynamically is to write to a
@h4gen were you able to find a way to automatically set SPARK_PUBLIC_DNS? I am running into the same issue where
does not evaluate correctly, and setting environment variables in the postStart lifecycleHooks for the container doesn't seem to work either (it will result in an empty environment variable):
I was thinking of opening a separate issue about setting environment variables that require other environment variables, but wanted to see if you had a solution first.
Edit: Of course the second solution won't work, as that will only set the environment variable in the postStart shell. My workaround is to set SPARK_PUBLIC_DNS in the single-user image's entrypoint script. I've opened an issue about this to see if this functionality is possible with …
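An alternative that avoids touching the image might be to set the variable from Python before the SparkContext is created, since the driver JVM is launched from, and inherits the environment of, the notebook's Python process. This is an untested sketch, not a confirmed fix:

```python
# Untested sketch: set SPARK_PUBLIC_DNS at runtime, before the driver JVM starts.
import os
import socket

os.environ["SPARK_PUBLIC_DNS"] = socket.gethostbyname(socket.gethostname())
# ...now build the SparkConf and SparkContext as usual.
```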
@h4gen have you updated this at all, given the changes over the past year or so?
Hi, @thejaysmith . Sorry no updates from my side. We switched to kubeflow. |
@h4gen thank you so much for writing this up publicly, too much work is repeated when we work in isolation! ❤️ At this point, I'll go ahead and close this issue as it's becoming stale and doesn't involve an action point - it is still very findable with search engines, though. Lots of love from Sweden!
Hello everybody!
I am using pangeo as the configuration for my JupyterHub, but decided to post this issue here as I think it is not pangeo-specific. As some of you may know, the current version of Spark (2.4) introduces PySpark support for Spark's new Kubernetes functionality. I tried to get it running on my cluster by adapting this tutorial. I know this is primarily a PySpark issue, but I thought it might be interesting to discuss it here, as I can imagine it is relevant for other JupyterHub users too. Here is what I did:
1. … This extends the rights of the daskkubernetes service account, which is necessary for pangeo to interact with dask properly.
2. Creating a user pod from the current Jupyter PySpark docker image, which supports Spark 2.4.
3. Getting my master IP with kubectl cluster-info.
4. Creating a new SparkContext in the following way: …

This is the output of the context: …

When I do kubectl get po --all-namespaces, I cannot see a new Spark pod running. The last line sadly gets stuck, and when I interrupt it, this is the output: …
Referring to the tutorial, I think that the SparkContext needs more information to run correctly. Sadly, it does not throw any errors when it is created. Is there anybody with more Spark/Kubernetes knowledge interested in trying it and sharing insights?
Edit: The main problem seems to be that, going by the pyspark API, there is no way to provide the necessary information to the SparkContext.

Thank you very much!
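For completeness: pyspark does let arbitrary settings reach the SparkContext through a SparkConf passed as conf=, which is essentially the route the follow-up comments in this thread converge on. A minimal sketch with placeholder values:

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setMaster("k8s://https://10.0.0.1:443")  # placeholder API server address
    .set("spark.kubernetes.container.image", "myrepo/pyspark-k8s:2.4.0")  # placeholder image
)
sc = SparkContext(conf=conf)
```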