Improving Jupyter for Spark #212
Comments
@mariusvniekerk has a progress bar implementation in Spylon that uses the Spark status tracker and prints progress for stages. This works because cell execution is blocked until a job is done, and so the prints wind up as output in the correct cell. Having an API that a lib like Spylon could call to emit progress information to a named cell / progress bar would make this cleaner and go beyond the Spark case. Ref: https://github.com/MaxPoint/spylon/blob/master/spylon/spark/progress.py#L88 We can try to draft a strawman spec / prototype for this. |
oh yay! :) |
That is what I get for working from my phone. :) @rgbkrk kudos for opening this up and for your description. I want to stress that any solution here needs to be language agnostic. Extensions to the protocol are one option, but I wonder if taking advantage of the comm channel is also a good alternative. It seems that a set of frontend widgets that try to connect to comm targets to find their kernel counterparts might work. |
Wrong person! Sounds all very interesting though. Keep up the good work 👍 |
Thanks! I certainly agree about whatever we do ending up language agnostic. There's more than one thing to tackle, and I want to push us towards evolving specs, even if they're versioned and well specified APIs on the comms. |
This is great, thanks @rgbkrk! I'll try to focus on my protocol and kernel maintainer hats, and let other people bring Spark-user and frontend-developer perspectives. I think we can break out several of these points into concrete Issues/proposals, as has been done with the async/background output Issues already. Apologies for how long this turned out to be. The main question for me in all of these issues is what changes need to happen where, and I think we're getting closer to knowing these things. In particular, I expect that creating a spark-in-ipython package that provides some niceties for using Spark in IPython in Jupyter will make sense (for experimental library-specific stuff like this, there's bound to be some work that belongs in a more agile place than pyspark or ipython). I'd be happy for this to reside on the IPython org, but it doesn't matter much to me.

Async/background output
I think the async/background/update output conversations are making progress, and I'll refer to those issues for further conversation. I fully support both proposals; one is primarily kernel APIs, while the second is a small protocol update.

Progress bars
For progress-bar primitives, it seems to me that the update-in-place HTML output discussed above would do a pretty good job at this, and ultimately be less difficult than defining and supporting progress bar mime-types in the protocol, but I'm open to either approach. While there are bound to continue to be a smaller number of frontends than kernels, if this is a faster-than-glacial spec, I think supporting it as an API in kernel libraries will be easier than supporting it in frontends.

Cluster/computational Context
I think the kernelspec == kernel-startup-environment model works relatively well in terms of sysadmins communicating broad categories of available contexts to users. There is still loads of room for improvement in the tooling for administrators to create and manage these kernelspecs, so I think that can be a project to explore. The case this doesn't support well is smaller user-provided parameters, such as memory / CPU allocations as in a traditional HPC environment. This would be a new feature request for notebooks to be able to declare/store additional parameters for kernel startup. JupyterHub has an answer to this at the notebook-server level, where users can be presented with a form for inputs that influence how their notebook server is spawned. We could adopt a similar mechanism for kernels, but it would require adapting KernelManager to be more like the Hub's Spawner mechanism. The simple/general version of this is to allow specifying environment variables, but that becomes a security issue, as allowing notebooks to specify things like PATH and PYTHONPATH makes kernel startup a vulnerable action for untrusted notebooks.

Kernel Startup and Background Resources
I think adding IOPub startup messages that declare state is a fine thing to do. One version of this could be to add a new message for this at startup. The banner is only stored on the kernel and retrieved via kernel_info.

Uploading libraries
This also ties to the Notebooks vs Workspaces discussion. As a developer and maintainer, I still vastly prefer the directory/repo as the sharing/environment/workspace entity, and strongly dislike using zip files as single-files-that-are-actually-directories. Since this is so common, I think we do need to improve the tooling for interacting with the workspace. If the main challenge is sharing/moving directories around, adding download-as-zip and upload-and-unzip and/or git integration may go a long way. We could also support zip files in the ContentsManager, so that zip files can be treated transparently as directories. All sorts of stuff breaks if you do this, as can be seen with zipimport in Python.

Timing data
As real-time collaboration adds the notion of 'live' document state, we can safely preserve lots more information in transients that persist as long as the notebook is 'running' and are discarded on long-term save to disk. We could do this right now, but the information would be lost on page reload rather than kernel stop, which may be more confusing than helpful. If the information really should be persisted long-term, we have metadata fields available, and a provenance-mode extension could be recording this today.

Style of error reporting
This is a proposal that's been on the docket for years now, and of particular interest to @fperez. Talking to other language authors (Julia, Haskell, Scala, X+Spark) will be super important to do this in a way that works. Discussions have ranged from an opaque mime-bundle (strictly an improvement on today) to structured data allowing good references to lines in the input code. I'm not sure how feasible the structured solution will be in general, but it should be explored, at least. The rest is building good exceptions and tracebacks at the library level, which can be improved upon even now, when all we have are plain text tracebacks. |
Hi everyone ... @cameres and I are also quite interested in Jupyter-Spark integration. We met fperez and a few other Jupyter folks on 21 October to get oriented. Meanwhile it seems events have overtaken us :-). |
Hi @SamPenrose, Glad to hear that you and Connor are looking at this. Howdy to @jezdez too, and thanks for the spark extension. Looping in @parente re: notebook experience survey results that could be shared. |
Here's the raw data and analysis notebooks from the late-2015 UX survey about Jupyter Notebook: https://github.com/jupyter/design/tree/master/surveys/2015-notebook-ux A live, dashboard version is available here: http://jupyter.cloudet.xyz/files/2015-notebook-ux-survey/analysis/deployed_dashboard/index.html |
You can install TensorFlow for Jupyter: if you compile TensorFlow in a conda environment for Python 2.7 or 3.5, then you can use TensorFlow inside Jupyter. |
Just want to give my personal experience on this: I tried all of the Spark kernels for Jupyter out there, and only sparkmagic works out of the box. The only downside is missing tab completion. I spent days just trying to make the other projects work, but it still wasn't anywhere near enough. |
Re: #212 (comment), I added some comments to the |
@willingc that's a good idea. Roadmap is the logical place, I think. Or an Enhancement Proposal. |
Roadmap sounds great. Lots of little PRs or one big PR? |
One PR to start, merge quick and iterate. |
Alright, I'll submit (mostly) my original text and expect others to follow up with new PRs to address things I've missed that we've discussed here. |
I did it in the style of some recent previous ones (to create a separate markdown file). |
Thanks @rgbkrk 🍰 |
From reading this and interacting with folks over the last few months, it seems like a few things occurred:
- update displays was fully implemented across the Jupyter notebook, the IPython kernel, and the nteract notebook
- folks at MaxPoint, IBM, Microsoft, and Mozilla have all been building solutions in this space for various aspects

I'm wondering now where efforts are going and if folks want some channels to communicate efforts/progress on the spark roadmap? I'm certainly interested. |
I am interested as well.
Jacqueline Kazil | @jackiekazil
|
Communication efforts I'm undertaking now (will keep posting):
If others have more to add on to the spark roadmap (even if bringing in more from this thread), I'd love to get more in and communicated! |
I'd love to see standardization of cell-level progress bars with cancellation. This is a key feature of Databricks and Zeppelin, and goes beyond Spark. Zeppelin Interpreter API |
Is Zeppelin's |
Zeppelin cancels the underlying Spark or SQL job, whatever the interpreter implements. But perhaps the recommendation should just be for libraries to catch the SIGINT raised by a kernel interrupt. |
@rgbkrk It just occurred to me that the proposal might be missing a section where the Jupyter community gives a set of architectural recommendations for how to lay out the Jupyter notebook atop Spark. Things like remote kernels, the role of the kernel gateway, and potential changes/extensions needed on the notebook server. This to me is one of the most challenging pieces due to the particular networking requirements of Spark. |
@lbustelo a very worthy addition to the roadmap; we definitely need narrative documentation and operations-focused guides. Our setup at Netflix is fairly specific to our environment, so I won't be the best resource. I'll keep pushing on Jupyter protocols and UI components though. |
@tristanz I'll go ahead and mock up a cancel button for executing cells in nteract as a prototype in the next few days then elicit feedback from you. |
Our current notebook story isn't great:
- opening a second notebook causes it to freeze (Bug 1290148);
- the Spark progress bar doesn't work reliably and can get stuck or disappear;
- Spark jobs can't be cancelled (Bug 1318706);
- Jupyter can just freeze without returning any error whatsoever;
- Scala isn't supported (kernels for it do exist though).

The main issue with Jupyter is that it wasn't built to work with an external query engine. There is an effort underway to improve the status quo [1], but it might take a while until it's production ready. This patch introduces Zeppelin [2] to our stack, which might solve some of the above issues. It provides features like multi-language support within the same notebook [3], progress bars, job cancellation, and an interesting visualization story. This is not a replacement for Jupyter though, merely an experimental tool that one day could graduate to a first-class citizen if it lives up to its marketing claims.

[1] jupyter/jupyter#212 (comment)
[2] https://zeppelin.apache.org/
[3] https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9naXN0LmdpdGh1YnVzZXJjb250ZW50LmNvbS92aXRpbGxvL2JlODA0N2ZmZDg2MzM3ZDc0MzFhOGI3ZDI2NGUwMTNjL3Jhdy8zODEwNTE1ODM0OGVmYmYzYzFmMTYzZTA3NmNkMDJkZDczMTBmNTM5L0xhbmd1YWdlJTI1MjBtYWRuZXNzLmpzb24
Very minor addition: apache/spark#17662 |
Sticking this on my actual backlog so I come back to it. 😉 |
Exploring where Jupyter protocols, formats, and UI need to adapt to make it so that “integration with Spark” is seamless. The word “Spark” could likely be replaced with “TensorFlow” or “Dask” throughout all of this issue.
This is a culmination of feedback, in-progress initiatives, and further R&D that needs to be done to improve how Jupyter provides a wonderful ecosystem for Spark, Hadoop, and other cluster tooling.
Asynchronous/Background Outputs
For environments like Spark and Dask, when one or more jobs are submitted, these frameworks provide a progress bar for the jobs along with a way to cancel the overall job. They also provide links to an external UI (the Spark UI or the Dask dashboard). When finished, you’re able to see the resulting table view.
There’s a similar use case for Hive Jobs, Presto Jobs, etc.
As far as Jupyter is concerned, there are things we can do to make updating a display like this easier without having to use widgets:
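For example, the display-with-id / update-display mechanism lets a library overwrite a named output area in place instead of appending new output. A minimal sketch, assuming a reasonably recent IPython kernel and a frontend that understands update_display_data:

```python
from IPython.display import display
import time

# Show an initial placeholder and keep a handle so the same output area
# can be overwritten later instead of appending new output below it.
handle = display("job submitted...", display_id=True)

for pct in range(0, 101, 25):
    time.sleep(1)  # stand-in for polling e.g. the Spark status tracker
    # Emits an update_display_data message; frontends that understand it
    # rewrite the existing output in place.
    handle.update("progress: {}%".format(pct))
```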
Progress Bar Primitives
As discussed above in the Async Outputs section, we should seek to provide simple clean-cut library tooling for creating progress bars in the frontend. This is not to seek a replacement of e.g. tqdm, but to provide the necessary additions to the messaging spec (whether an actual message or a specific versioned mimetype).
The nested structure for Spark Jobs is something we can likely represent with a clean format for nested content with progress bars and external links.
Approaches:
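As one strawman shape for the nested case: a small, self-describing payload that a kernel library could fill from the Spark status tracker and a frontend could render as stacked bars with links. The mimetype and every field name below are invented for discussion and are not part of any current spec:

```python
# Hypothetical progress payload; the mimetype and all field names here are
# illustrative only, not part of the Jupyter messaging spec.
progress_bundle = {
    "application/vnd.jupyter.progress+json": {
        "label": "count() on events",
        "cancellable": True,
        "links": {"Spark UI": "http://driver-host:4040"},
        "children": [
            {"label": "Stage 1", "completed": 200, "total": 200},
            {"label": "Stage 2", "completed": 58, "total": 400},
        ],
    }
}
```

A frontend that doesn't know the mimetype would simply fall back to whatever other representations the kernel sends alongside it in the mime bundle.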
Cluster/computational Context
Users need a way to get context about their running Spark setup, because they’re dealing with an extra remote resource beyond the kernel. They tend to want to know:
Many users want to think in terms of “clusters” and a notebook that is attached to that cluster. The real distinction is:
notebook ← → kernel ← → cluster (or other background resources)
This can be a cognitive burden for the user. One way of solving this is to have “kernels” that are assumed to be attached to clusters.
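As a rough illustration of the "kernel that is assumed to be attached to a cluster" idea, the cluster context can already be baked into a kernelspec's startup environment today. A sketch assuming PySpark on YARN; the kernel name, paths, and Spark options are illustrative only:

```python
import json
import os

# Illustrative kernelspec that launches an IPython kernel preconfigured to
# talk to a YARN-backed Spark cluster. Paths and Spark options are examples.
kernelspec = {
    "display_name": "PySpark (YARN, cluster-a)",
    "language": "python",
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": "/opt/spark",
        "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell",
    },
}

# Write it into the user's kernels directory (Linux default shown).
path = os.path.expanduser("~/.local/share/jupyter/kernels/pyspark-cluster-a")
os.makedirs(path, exist_ok=True)
with open(os.path.join(path, "kernel.json"), "w") as f:
    json.dump(kernelspec, f, indent=2)
```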
Kernel Startup and Background Resources
Stages a Spark user cares about when opening a notebook and running the first cell:
This seems to me like additional startup messages from the kernel (yes, new message spec) about driver context.
We do have a banner for kernel_info; perhaps we can do banner updates with links to:
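A hypothetical example of what such a startup/banner-update message might carry; the message type and all field names below are invented for illustration and are not part of the current message spec:

```python
# Hypothetical IOPub message announcing driver/cluster state at startup.
# Neither the msg_type nor these content fields exist in the spec today.
startup_update = {
    "header": {"msg_type": "resource_status"},  # invented message type
    "content": {
        "state": "spark-context-ready",
        "links": {
            "Spark UI": "http://driver-host:4040",
            "YARN application master": "http://resourcemanager:8088/cluster/app/application_1234",
            "Driver stdout/stderr": "http://driver-host:8042/logs",
        },
    },
}
```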
Uploading libraries
Spark users need a way to upload JARs to the current Spark cluster (and, many times, they need to restart the cluster afterwards).
This is out of scope for the Jupyter protocols and would need to be implemented outside of the notebook. We need to be able to provide more context, likely through the kernel, about the operating environment and how the cluster is initialized for each kernel.
Unsure of what/how we can handle this, needs exploration.
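For reference, here is what a user can already do from inside a kernel today, assuming pyspark is importable; JARs generally have to be declared when the session is created, which is exactly why cluster restarts come up. Paths and coordinates are placeholders:

```python
from pyspark.sql import SparkSession

# Dependencies are declared when the session/context is created; changing
# them afterwards typically means stopping and recreating the session.
spark = (
    SparkSession.builder
    .appName("notebook-session")
    .config("spark.jars", "/path/to/extra-library.jar")            # local JARs for the JVM side
    .config("spark.jars.packages", "org.example:example:1.0.0")    # or Maven coordinates (placeholder)
    .getOrCreate()
)

# Python-side dependencies can be shipped to executors at runtime.
spark.sparkContext.addPyFile("/path/to/helpers.zip")
```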
Timing Data
The transient information for how long a command took (as well as which user ran it) is available in the message spec, but not persisted to the notebook document. We can certainly show this in the UI, but there is another reason for keeping this output consistent in serialized state:
This is super useful, and part of why it’s in the message spec. An earlier implementation of IPython notebooks (2006) included and displayed all of this information. We found the information useful enough to keep it in the current protocol implementation, but not enough to put it in the UI or persist it to disk.
Downsides of storing it in the document:
Options: opt-in ‘provenance mode’, where we persist a whole bunch of extra metadata. Super unpleasant for version control, but useful in notebooks that run as jobs, especially for analysts.
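A sketch of the kind of per-cell metadata such a provenance extension might record; the "provenance" namespace and its fields are hypothetical, and only the free-form cell metadata dict itself is part of the notebook format:

```python
# Hypothetical cell metadata written by an opt-in "provenance mode"
# extension. The "provenance" key and its fields are invented here; the
# notebook format only guarantees that cell metadata is a free-form dict.
cell_metadata = {
    "provenance": {
        "started": "2017-01-10T16:35:02Z",
        "finished": "2017-01-10T16:37:45Z",
        "duration_seconds": 163,
        "executed_by": "someuser",
        "kernel": "pyspark-cluster-a",
    }
}
```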
Tables and Simple Charts
For many people, being able to plot simple results before getting to deeper visualization frameworks is pretty important: e.g. in pandas we can use the `df.plot()`, `.hist()`, `.bar()`, etc. methods for quick and easy visualization. There are a couple of approaches to this right now, including mime types for plotly and vega being available and easy to use in both nteract and JupyterLab. For tables, there is an open discussion on pandas: pandas-dev/pandas#14386
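As an illustration of the mimetype route, a kernel-side library can publish a Vega-Lite spec directly as a raw mime bundle; frontends that recognize the vegalite mimetype render it as a chart (the exact version suffix supported varies by frontend), and others fall back to the plain-text key:

```python
from IPython.display import display

# A minimal Vega-Lite bar chart published as a raw mime bundle. Frontends
# with a vegalite renderer draw it; others show the text/plain fallback.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
    "data": {"values": [{"x": "a", "y": 3}, {"x": "b", "y": 7}, {"x": "c", "y": 5}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "x", "type": "ordinal"},
        "y": {"field": "y", "type": "quantitative"},
    },
}

display(
    {
        "application/vnd.vegalite.v2+json": spec,
        "text/plain": "<vega-lite chart>",
    },
    raw=True,
)
```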
There's also the spark magics incubation project.
Scala Kernel
Massive effort needs to be put into making the Scala kernel(s) for Jupyter first-class citizens. Competitive angle: it should be so good that Zeppelin or Beaker would adopt it (or create it) and contribute to it. Current contenders:
Style of error reporting
If you’ve ever worked with PySpark, you know the joys of having Python tracebacks embedded in JVM tracebacks. In order to make development easier, we need to bring attention to the right lines to go straight to in a traceback. Spark-centric notebooks do this, showing exactly the error text the user needs and a collapsible area for tracebacks. Jupyter has some open issues and discussion about improving error semantics; it’s really important for kernels, and the libraries building on them, to be able to expose more useful context to users.
At a minimum, we should expose links and/or popups to the spark driver’s stdout/stderr, Spark UI, and YARN application master where relevant. Without these essentials, a notebook user can’t debug issues that aren’t returned in the notebook output cell.
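One hypothetical direction for the structured option: an error payload that keeps the short, user-facing message separate from the full traceback and points at the external resources mentioned above. None of these field names exist in the protocol today; they are purely illustrative:

```python
# Hypothetical structured error payload; the field names are invented for
# illustration and are not part of the Jupyter messaging protocol.
structured_error = {
    "summary": "AnalysisException: Table or view not found: events",
    "language": "python",
    "frames": [
        # Points the frontend at the user's own code rather than library internals.
        {"source": "cell", "line": 3, "code": "df = spark.table('events')"},
    ],
    "full_traceback": "py4j.protocol.Py4JJavaError: An error occurred ...",
    "links": {
        "Spark UI": "http://driver-host:4040",
        "Driver stderr": "http://yarn-node:8042/node/containerlogs/container_1234/stderr",
    },
}
```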
POC ➡️ Production
This part is more JVM-centric and not language agnostic: people need a way to turn code segments into a Scala class, compile it, and generate a JAR. My knowledge of how this can be done is fairly old/outdated.