Improving Jupyter for Spark #212
Comments
@mariusvniekerk has a progress bar implementation in Spylon that uses the Spark status tracker and prints progress for stages. This works because cell execution is blocked until a job is done, and so the prints wind up as output in the correct cell. Having an API that a lib like Spylon could call to emit progress information to a named cell / progress bar would make this cleaner and go beyond the Spark case. Ref: https://github.com/MaxPoint/spylon/blob/master/spylon/spark/progress.py#L88 We can try to draft a strawman spec / prototype for this. |
oh yay! :) |
That is what I get for working from my phone. :) @rgbkrk kudos for opening this up and for your description. I want to stress that any solution here needs to be language agnostic. Extensions to the protocol are one option, but I wonder if taking advantage of the comm channel is also a good alternative. It seems that a set of frontend widgets that try to connect to comm targets to find their kernel counterparts might work. |
Wrong person! Sounds all very interesting though. Keep up the good work 👍 |
Thanks! I certainly agree about whatever we do ending up language agnostic. There's more than one thing to tackle, and I want to push us towards evolving specs, even if they're versioned and well specified APIs on the comms. |
This is great, thanks @rgbkrk! I'll try to focus on my protocol and kernel maintainer hats, and let other people bring Spark-user and frontend-developer perspectives. I think we can break out several of these points into concrete Issues/proposals, as has been done with the async/background output Issues already. Apologies for how long this turned out to be. The main question for me in all of these issues is what changes need to happen where, and I think we're getting closer to knowing these things. In particular, I expect that creating a spark-in-ipython package that provides some niceties for using Spark in IPython in Jupyter will make sense (for experimental library-specific stuff like this, there's bound to be some work that belongs in a more agile place than pyspark or ipython). I'd be happy for this to reside on the IPython org, but it doesn't matter much to me.

Async/background output
I think the async/background/update output conversations are making progress, and I'll refer to those issues for further conversation. I fully support both proposals; one is primarily kernel APIs, while the second is a small protocol update.

Progress bars
For progress-bar primitives, it seems to me that the update-in-place HTML output discussed above would do a pretty good job at this, and ultimately be less difficult than defining and supporting progress bar mime-types in the protocol, but I'm open to either approach. While there are bound to continue to be a smaller number of frontends than kernels, if this is a faster-than-glacial spec, I think supporting it as an API in kernel libraries will be easier than supporting it in frontends.

Cluster/computational Context
I think the kernelspec == kernel-startup-environment model works relatively well in terms of sysadmins communicating broad categories of available contexts to users. There is still loads of room for improvement in the tooling for administrators to create and manage these kernelspecs, so I think that can be a project to explore. The case this doesn't support well is smaller user-provided parameters, such as memory / CPU allocations as in a traditional HPC environment. This would be a new feature request for notebooks to be able to declare/store additional parameters for kernel startup. JupyterHub has an answer to this at the notebook-server level, where users can be presented with a form for inputs that influence how their notebook server is spawned. We could adopt a similar mechanism for kernels, but it would require adapting KernelManager to be more like the Hub's Spawner mechanism. The simple/general version of this is to allow specifying environment variables, but that becomes a security issue, as allowing notebooks to specify things like PATH and PYTHONPATH makes kernel startup a vulnerable action for untrusted notebooks.

Kernel Startup and Background Resources
I think adding IOPub startup messages that declare state is a fine thing to do. One version of this could be to add a new message for this at startup. The banner is only stored on the kernel and retrieved via kernel_info.

Uploading libraries
This also ties to the Notebooks vs Workspaces discussion. As a developer and maintainer, I still vastly prefer the directory/repo as the sharing/environment/workspace entity, and strongly dislike using zip files as single-files-that-are-actually-directories. Since this is so common, I think we do need to improve the tooling for interacting with the workspace. If the main challenge is sharing/moving directories around, adding download-as-zip and upload-and-unzip and/or git integration may go a long way. We could also support zip files in the ContentsManager, so that zip files can be treated transparently as directories. All sorts of stuff breaks if you do this, as can be seen with zipimport in Python.

Timing data
As real-time collaboration adds the notion of 'live' document state, we can safely preserve lots more information in transients that persist as long as the notebook is 'running' and are discarded on long-term save to disk. We could do this right now, but the information would be lost on page reload rather than kernel stop, which may be more confusing than helpful. If the information really should be persisted long-term, we have metadata fields available, and a provenance-mode extension could be recording this today.

Style of error reporting
This is a proposal that's been on the docket for years now, and of particular interest to @fperez. Talking to other language authors (Julia, Haskell, Scala, X+Spark) will be super important to do this in a way that works. Discussions have ranged from an opaque mime-bundle (strictly an improvement on today) to structured data allowing good references to lines in the input code. I'm not sure how feasible the structured solution will be in general, but it should be explored, at least. The rest is building good exceptions and tracebacks at the library level, which can be improved upon even now, when all we have are plain text tracebacks. |
Hi everyone ... @cameres and I are also quite interested in Jupyter-Spark integration. We met fperez and a few other Jupyter folks on 21 October to get oriented. Meanwhile it seems events have overtaken us :-). |
Hi @SamPenrose, Glad to hear that you and Connor are looking at this. Howdy to @jezdez too, and thanks for the spark extension. Looping in @parente re: notebook experience survey results that could be shared. |
Here's the raw data and analysis notebooks from the late-2015 UX survey about Jupyter Notebook: https://github.com/jupyter/design/tree/master/surveys/2015-notebook-ux A live, dashboard version is available here: http://jupyter.cloudet.xyz/files/2015-notebook-ux-survey/analysis/deployed_dashboard/index.html |
You can install TensorFlow for Jupyter: if you compile TensorFlow in a conda environment for Python 2.7 or 3.5, then you can use TensorFlow inside Jupyter. |
Just want to give my personal experience on this: I tried all of the Spark kernels for Jupyter out there, and only sparkmagic works out of the box. The only downside is missing tab completion. I spent days just trying to make the other projects work, but it still wasn't anywhere near enough. |
Re: #212 (comment), I added some comments to the |
@willingc that's a good idea. Roadmap is the logical place, I think. Or an Enhancement Proposal. |
Roadmap sounds great. Lots of little PRs or one big PR? |
One PR to start, merge quick and iterate. |
Alright, I'll submit (mostly) my original text and expect others to follow up with new PRs to address things I've missed that we've discussed here. |
I did it in the style of some recent previous ones (to create a separate markdown file). |
Thanks @rgbkrk 🍰 |
From reading this and interacting with folks over the last few months, it seems like a few things occurred:
- update displays was fully implemented across the Jupyter notebook, the IPython kernel, and the nteract notebook
- folks at MaxPoint, IBM, Microsoft, and Mozilla have all been building solutions in this space for various aspects

I'm wondering now where efforts are going and if folks want some channels to communicate efforts/progress on the spark roadmap? I'm certainly interested. |
I am interested as well.
Jacqueline Kazil | @jackiekazil
|
Communication efforts I'm undertaking now (will keep posting):
If others have more to add on to the spark roadmap (even if bringing in more from this thread), I'd love to get more in and communicated! |
I'd love to see standardization of cell-level progress bars with cancellation. This is a key feature of Databricks and Zeppelin, and goes beyond Spark. Zeppelin Interpreter API |
Is Zeppelin's |
Zeppelin cancels the underlying Spark or SQL job, whatever the interpreter implements. But perhaps the recommendation should just be for libraries to catch the SIGINT raised by a kernel interrupt. |
@rgbkrk It just occurred to me that the proposal might be missing a section where the Jupyter community gives a set of architectural recommendations for how to lay out the Jupyter notebook atop Spark. Things like remote kernels, the role of the kernel gateway, and potential changes/extensions needed on the notebook server. This to me is one of the most challenging pieces due to the particular networking requirements of Spark. |
@lbustelo a very worthy addition to the roadmap; we definitely need narrative documentation and operations-focused guides. Our setup at Netflix is fairly specific to our environment, so I won't be the best resource. I'll keep pushing on Jupyter protocols and UI components though. |
@tristanz I'll go ahead and mock up a cancel button for executing cells in nteract as a prototype in the next few days then elicit feedback from you. |
Our current notebook story isn't great:
- opening a second notebook causes it to freeze (Bug 1290148);
- the Spark progress bar doesn't work reliably and can get stuck or disappear;
- Spark jobs can't be cancelled (Bug 1318706);
- Jupyter can just freeze without returning any error whatsoever;
- Scala isn't supported (kernels for it do exist though).

The main issue with Jupyter is that it wasn't built to work with an external query engine. There is an effort underway to improve the status quo [1], but it might take a while until it's production ready. This patch introduces Zeppelin [2] to our stack, which might solve some of the above issues. It provides features like multi-language support within the same notebook [3], progress bars, job cancellation, and an interesting visualization story. This is not a replacement for Jupyter though, merely an experimental tool that one day could graduate to a first-class citizen if it lives up to its marketing claims.

[1] jupyter/jupyter#212 (comment)
[2] https://zeppelin.apache.org/
[3] https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9naXN0LmdpdGh1YnVzZXJjb250ZW50LmNvbS92aXRpbGxvL2JlODA0N2ZmZDg2MzM3ZDc0MzFhOGI3ZDI2NGUwMTNjL3Jhdy8zODEwNTE1ODM0OGVmYmYzYzFmMTYzZTA3NmNkMDJkZDczMTBmNTM5L0xhbmd1YWdlJTI1MjBtYWRuZXNzLmpzb24
Very minor addition: apache/spark#17662 |
Sticking this on my actual backlog so I come back to it. 😉 |
Exploring where Jupyter protocols, formats, and UI need to adapt to make it so that “integration with Spark” is seamless. The word “Spark” could likely be replaced with “TensorFlow” or “Dask” throughout all of this issue.
This is a culmination of feedback, in-progress initiatives, and further R&D that needs to be done to improve how Jupyter provides a wonderful ecosystem for Spark, Hadoop, and other cluster tooling.
Asynchronous/Background Outputs
For environments like Spark and Dask, when one or more jobs are submitted, these frameworks provide a progress bar for the jobs along with a way to cancel the overall job. They also provide links to an external UI (the Spark UI or the Dask dashboard). When finished, you’re able to see the resulting table view.
There’s a similar use case for Hive Jobs, Presto Jobs, etc.
As far as Jupyter is concerned, there are things we can do to make updating a display like this easier without having to use widgets:
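For example, the display-with-id / update-display mechanism lets a library overwrite a named output area in place instead of appending new output. A minimal sketch, assuming a reasonably recent IPython kernel and a frontend that understands update_display_data:

```python
from IPython.display import display
import time

# Show an initial placeholder and keep a handle so the same output area
# can be overwritten later instead of appending new output below it.
handle = display("job submitted...", display_id=True)

for pct in range(0, 101, 25):
    time.sleep(1)  # stand-in for polling e.g. the Spark status tracker
    # Emits an update_display_data message; frontends that understand it
    # rewrite the existing output in place.
    handle.update("progress: {}%".format(pct))
```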
Progress Bar Primitives
As discussed above in the Async Outputs section, we should seek to provide simple clean-cut library tooling for creating progress bars in the frontend. This is not to seek a replacement of e.g. tqdm, but to provide the necessary additions to the messaging spec (whether an actual message or a specific versioned mimetype).
The nested structure for Spark Jobs is something we can likely represent with a clean format for nested content with progress bars and external links.
Approaches:
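As one strawman shape for the nested case: a small, self-describing payload that a kernel library could fill from the Spark status tracker and a frontend could render as stacked bars with links. The mimetype and every field name below are invented for discussion and are not part of any current spec:

```python
# Hypothetical progress payload; the mimetype and all field names here are
# illustrative only, not part of the Jupyter messaging spec.
progress_bundle = {
    "application/vnd.jupyter.progress+json": {
        "label": "count() on events",
        "cancellable": True,
        "links": {"Spark UI": "http://driver-host:4040"},
        "children": [
            {"label": "Stage 1", "completed": 200, "total": 200},
            {"label": "Stage 2", "completed": 58, "total": 400},
        ],
    }
}
```

A frontend that doesn't know the mimetype would simply fall back to whatever other representations the kernel sends alongside it in the mime bundle.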
Cluster/computational Context
Users need a way to get context about their running Spark setup, because they’re dealing with an extra remote resource beyond the kernel. They tend to want to know:
Many users want to think in terms of “clusters” and a notebook that is attached to that cluster. The real distinction is:
notebook ← → kernel ← → cluster (or other background resources)
This can be a cognitive burden for the user. One way of solving this is to have “kernels” that are assumed to be attached to clusters.
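As a rough illustration of the "kernel that is assumed to be attached to a cluster" idea, the cluster context can already be baked into a kernelspec's startup environment today. A sketch assuming PySpark on YARN; the kernel name, paths, and Spark options are illustrative only:

```python
import json
import os

# Illustrative kernelspec that launches an IPython kernel preconfigured to
# talk to a YARN-backed Spark cluster. Paths and Spark options are examples.
kernelspec = {
    "display_name": "PySpark (YARN, cluster-a)",
    "language": "python",
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": "/opt/spark",
        "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell",
    },
}

# Write it into the user's kernels directory (Linux default shown).
path = os.path.expanduser("~/.local/share/jupyter/kernels/pyspark-cluster-a")
os.makedirs(path, exist_ok=True)
with open(os.path.join(path, "kernel.json"), "w") as f:
    json.dump(kernelspec, f, indent=2)
```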
Kernel Startup and Background Resources
Stages a Spark user cares about when opening a notebook and running the first cell:
This seems to me like additional startup messages from the kernel (yes, new message spec) about driver context.
We do have a banner for kernel_info; perhaps we can do banner updates with links to:
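A hypothetical example of what such a startup/banner-update message might carry; the message type and all field names below are invented for illustration and are not part of the current message spec:

```python
# Hypothetical IOPub message announcing driver/cluster state at startup.
# Neither the msg_type nor these content fields exist in the spec today.
startup_update = {
    "header": {"msg_type": "resource_status"},  # invented message type
    "content": {
        "state": "spark-context-ready",
        "links": {
            "Spark UI": "http://driver-host:4040",
            "YARN application master": "http://resourcemanager:8088/cluster/app/application_1234",
            "Driver stdout/stderr": "http://driver-host:8042/logs",
        },
    },
}
```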
Uploading libraries
Spark users need a way to upload JARs to the current Spark cluster (and, many times, they need to restart the cluster afterwards).
This is out of scope for the Jupyter protocols and would need to be implemented outside of the notebook. We need to be able to provide more context, likely through the kernel, about the operating environment and how the cluster is initialized for each kernel.
Unsure of what/how we can handle this, needs exploration.
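For reference, here is what a user can already do from inside a kernel today, assuming pyspark is importable; JARs generally have to be declared when the session is created, which is exactly why cluster restarts come up. Paths and coordinates are placeholders:

```python
from pyspark.sql import SparkSession

# Dependencies are declared when the session/context is created; changing
# them afterwards typically means stopping and recreating the session.
spark = (
    SparkSession.builder
    .appName("notebook-session")
    .config("spark.jars", "/path/to/extra-library.jar")            # local JARs for the JVM side
    .config("spark.jars.packages", "org.example:example:1.0.0")    # or Maven coordinates (placeholder)
    .getOrCreate()
)

# Python-side dependencies can be shipped to executors at runtime.
spark.sparkContext.addPyFile("/path/to/helpers.zip")
```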
Timing Data
The transient information for how long a command took (as well as which user ran it) is available in the message spec, but not persisted to the notebook document. We can certainly show this in the UI, but there is another reason for keeping this output consistent in serialized state:
This is super useful, and part of why it’s in the message spec. An earlier implementation of IPython notebooks (2006) included and displayed all of this information. We found the information useful enough to keep it in the current protocol implementation, but not enough to put it in the UI or persist it to disk.
Downsides of storing it in the document:
Options: opt-in ‘provenance mode’, where we persist a whole bunch of extra metadata. Super unpleasant for version control, but useful in notebooks that run as jobs, especially for analysts.
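A sketch of the kind of per-cell metadata such a provenance extension might record; the "provenance" namespace and its fields are hypothetical, and only the free-form cell metadata dict itself is part of the notebook format:

```python
# Hypothetical cell metadata written by an opt-in "provenance mode"
# extension. The "provenance" key and its fields are invented here; the
# notebook format only guarantees that cell metadata is a free-form dict.
cell_metadata = {
    "provenance": {
        "started": "2017-01-10T16:35:02Z",
        "finished": "2017-01-10T16:37:45Z",
        "duration_seconds": 163,
        "executed_by": "someuser",
        "kernel": "pyspark-cluster-a",
    }
}
```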
Tables and Simple Charts
For many people, being able to plot simple results before getting to deeper visualization frameworks is pretty important: e.g. in pandas we can use the `df.plot()`, `.hist()`, `.bar()`, etc. methods for quick and easy visualization. There are a couple of approaches to this right now, including mime types for plotly and vega being available and easy to use in both nteract and JupyterLab. For tables, there is an open discussion on pandas: pandas-dev/pandas#14386
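As an illustration of the mimetype route, a kernel-side library can publish a Vega-Lite spec directly as a raw mime bundle; frontends that recognize the vegalite mimetype render it as a chart (the exact version suffix supported varies by frontend), and others fall back to the plain-text key:

```python
from IPython.display import display

# A minimal Vega-Lite bar chart published as a raw mime bundle. Frontends
# with a vegalite renderer draw it; others show the text/plain fallback.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
    "data": {"values": [{"x": "a", "y": 3}, {"x": "b", "y": 7}, {"x": "c", "y": 5}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "x", "type": "ordinal"},
        "y": {"field": "y", "type": "quantitative"},
    },
}

display(
    {
        "application/vnd.vegalite.v2+json": spec,
        "text/plain": "<vega-lite chart>",
    },
    raw=True,
)
```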
There's also the spark magics incubation project.
Scala Kernel
Massive effort needs to be put into making the Scala kernel(s) for Jupyter first-class citizens. Competitive angle: it should be so good that Zeppelin or Beaker would adopt it (or create it) and contribute to it. Current contenders:
Style of error reporting
If you’ve ever worked with PySpark, you know the joys of having Python tracebacks embedded in JVM tracebacks. In order to make development easier, we need to bring attention to the right lines to go straight to in a traceback. Spark-centric notebooks do this, showing exactly the error text the user needs and a collapsible area for tracebacks. Jupyter has some open issues and discussion about improving error semantics; it’s really important for kernels, and the libraries building on them, to be able to expose more useful context to users.
At a minimum, we should expose links and/or popups to the spark driver’s stdout/stderr, Spark UI, and YARN application master where relevant. Without these essentials, a notebook user can’t debug issues that aren’t returned in the notebook output cell.
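One hypothetical direction for the structured option: an error payload that keeps the short, user-facing message separate from the full traceback and points at the external resources mentioned above. None of these field names exist in the protocol today; they are purely illustrative:

```python
# Hypothetical structured error payload; the field names are invented for
# illustration and are not part of the Jupyter messaging protocol.
structured_error = {
    "summary": "AnalysisException: Table or view not found: events",
    "language": "python",
    "frames": [
        # Points the frontend at the user's own code rather than library internals.
        {"source": "cell", "line": 3, "code": "df = spark.table('events')"},
    ],
    "full_traceback": "py4j.protocol.Py4JJavaError: An error occurred ...",
    "links": {
        "Spark UI": "http://driver-host:4040",
        "Driver stderr": "http://yarn-node:8042/node/containerlogs/container_1234/stderr",
    },
}
```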
POC ➡️ Production
This part is more JVM-centric and not language agnostic: people need a way to turn code segments into a Scala class, compile it, and generate a JAR. My knowledge of how this can be done is fairly old/outdated.