From a950a853aa039ddd88cda525a7d8bbb318ecb744 Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Thu, 10 Sep 2015 08:45:37 +0200 Subject: [PATCH 01/11] Description of the proposed format --- ...ormat-for-improved-workflow-integration.md | 97 +++++++++++++++++++ 1 file changed, 97 insertions(+) create mode 100644 a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md new file mode 100644 index 00000000..afdf500b --- /dev/null +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -0,0 +1,97 @@ +# A new notebook document format for improved workflow integration + +## Problem + +Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Examples of such tools are workflow managers, provenance trackers, and version control systems. The core of the problem is that Jupyter's on-disk notebook format is closely tied to Jupyter's functionality and design. Much information of use for other tools, although available at execution time, is not preserved in notebook files. + +## Proposed Enhancement + +Define a data model and file format for notebooks as digital documents. Include as much relevant information as possible that is available during a Jupyter session, in particular a complete log of code executed by the kernel. Design the file format with the needs of other tools in mind. + + +## Detailed Explanation + +### Notebooks as digital documents + +Currently, a Jupyter notebook file is merely an on-disk representation of the internal state of the Jupyter Web tool. 
The file format mixes user-edited and computationally generated information with insufficient distinction, and does not preserve the history of the computation in enough detail to permit replication and other validation approaches. + +The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter Web tool should become just one out of many possible tools that process such notebook documents. + +### A three-layer data model + +The proposed data model for notebook documents consists of three layers: + + 1. A sequence of code blocks in execution order. + 2. A sequence of outputs produced by these code blocks, in execution order + 3. A narrative containing references to specific code blocks and/or + outputs. + +Layers 1 and 3 are user-edited content, subject to version control. Layer 2 consists entirely of computational results. In principle, it can be recomputed at any time. However, since recomputation can be time-consuming, and is often unreliable due to today's fragile computational enviromnents, layer 2 should be archived as well under version control, as a foundation for layer 3. + +Conceptually, each layer is an independent electronic document, with each layer depending on information from lower layers. A layer 3 document could depend on multiple layer 1/2 documents, e.g. in a multiuser setting. A provenance tracking system would treat a layer 1 document exactly like a script, and a layer 2 document exactly like the console output from a script. Provenance trackers may this need to store the layers as separate files or datasets. The default notebook file format should combine all three layers, but facilitate extraction of individual layers. + +Note that today's Jupyter file format contains only level 3, with only the sequence numbers of the output cells preserving a partial trace of the execution order. However, all the information of layers 1 and 2 is available to the kernel. 
+ +In the following, the three layers are described in more detail. + +#### Layer 1 + +A layer 1 document consists of + + 1. a language tag for choosing the right kernel for execution + 2. a sequence of code blocks + +Each code block needs a unique identifier to permit layers 2 and 3 to refer to it. A cryptographic hash function such as SHA-1 can be used to generate such a unique identifier, which has the advantage of making a layer 1 document a content-addressable read-only storage. References to code blocks can thus easily be validated, as any change to a code block modifies its unique identifier. + +#### Layer 2 + +A layer 2 document consists of + + 1. information about the computational environment (kernel type and version, machine name, date, ...) + 2. a reference to the layer 1 document containing the code blocks + 3. a complete sequence of execution records for all code executed since the start of the kernel + +An execution record contains the following information: + + 1. the unique identifier of the code block that was executed + 2. a set of outputs produced by the code blocks + +In the set of outputs, each output item contains: + + 1. a label defining the output type + 2. the output data, conforming to a data model specific to the output type + +Note: this section must be complemented with data models for the standard output types. Overall, outputs are handled very much like in the current notebook format. + +#### Layer 3 + +A layer 3 document consists of + + 1. a list of references to layer 1 documents + 2. a list of references to layer 2 documents + 3. a sequence of cells. + +Each cell is one of + + 1. a documentation cell, containing text content plus a label identifying the format (Markdown etc.) + 2. a code cell, containing the unique identifier of a code block in one of the referenced layer 1 documents + 3. 
an output cell, containing (1) the index of the layer 2 document that contains the output and (2) the sequence index of the output item inside the layer 2 document + + +### File formats + +### Implementation + + +## Pros and Cons + +Pros: +* The computations in notebooks become replicable, at least in an identical computational environment. +* Notebooks can be managed as alternatives to scripts by workflow management tools. +* Alternative notebook editing tools can be developed that support different tastes or needs, while maintaining document compatibility and thus avoiding lock-in to any particular tool. + +Cons: +* A data model and file format defined independently of the Jupyter implementation creates constraints on the future evolution of Jupyter. + +## Interested Contributors +@khinsen From ea3931868203f3c87586ac2c4b9369ddd322d6f6 Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Thu, 10 Sep 2015 10:24:59 +0200 Subject: [PATCH 02/11] General revision --- ...ormat-for-improved-workflow-integration.md | 37 ++++++++++--------- 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index afdf500b..465ec57c 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -13,24 +13,24 @@ Define a data model and file format for notebooks as digital documents. Include ### Notebooks as digital documents -Currently, a Jupyter notebook file is merely an on-disk representation of the internal state of the Jupyter Web tool. 
The file format mixes user-edited and computationally generated information with insufficient distinction, and does not preserve the history of the computation in enough detail to permit replication and other validation approaches. +Currently, a Jupyter notebook file is essentially an on-disk representation of the internal state of the Web-based Jupyter editor. The file format mixes human-edited and computationally generated information with insufficient distinction, and does not preserve the history of the computation in enough detail to permit replication and other validation techniques. -The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter Web tool should become just one out of many possible tools that process such notebook documents. +The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter editor should become just one out of many possible tools that process such notebook documents. ### A three-layer data model The proposed data model for notebook documents consists of three layers: - 1. A sequence of code blocks in execution order. + 1. A sequence of code blocks, in execution order. 2. A sequence of outputs produced by these code blocks, in execution order - 3. A narrative containing references to specific code blocks and/or + 3. A narrative containing references to specific code blocks and outputs. -Layers 1 and 3 are user-edited content, subject to version control. Layer 2 consists entirely of computational results. In principle, it can be recomputed at any time. However, since recomputation can be time-consuming, and is often unreliable due to today's fragile computational enviromnents, layer 2 should be archived as well under version control, as a foundation for layer 3. +Layers 1 and 3 are human-edited content, subject to version control. 
Layer 2 consists entirely of computational results. In principle, it can be recomputed at any time. However, since recomputation can be time-consuming, and is often unreliable due to today's fragile computational enviromnents, layer 2 should be archived as well under version control, as a foundation for layer 3. -Conceptually, each layer is an independent electronic document, with each layer depending on information from lower layers. A layer 3 document could depend on multiple layer 1/2 documents, e.g. in a multiuser setting. A provenance tracking system would treat a layer 1 document exactly like a script, and a layer 2 document exactly like the console output from a script. Provenance trackers may this need to store the layers as separate files or datasets. The default notebook file format should combine all three layers, but facilitate extraction of individual layers. +Conceptually, each layer is an independent electronic document, with each layer depending on information from lower layers. A layer 3 document can depend on multiple layer 1/2 documents. A provenance tracking system would treat a layer 1 document exactly like a script, and a layer 2 document exactly like the console output from a script. Provenance trackers may thus need to store the layers as separate files or datasets. The default notebook file format should combine all three layers, but facilitate extraction of individual layers. -Note that today's Jupyter file format contains only level 3, with only the sequence numbers of the output cells preserving a partial trace of the execution order. However, all the information of layers 1 and 2 is available to the kernel. +Note that today's Jupyter file format resembles layer 3. It contains some information about execution order in the form of the prompt numbers. However, since the executed code is not stored anywhere, replication of the computation is impossible. 
Even if the prompt numbers in the notebook are sequential and start with 1, the code cells might have been edited after execution. The only guarantee that a notebook file makes is that the outputs were obtained from *some* computation. In the following, the three layers are described in more detail. @@ -41,21 +41,21 @@ A layer 1 document consists of 1. a language tag for choosing the right kernel for execution 2. a sequence of code blocks -Each code block needs a unique identifier to permit layers 2 and 3 to refer to it. A cryptographic hash function such as SHA-1 can be used to generate such a unique identifier, which has the advantage of making a layer 1 document a content-addressable read-only storage. References to code blocks can thus easily be validated, as any change to a code block modifies its unique identifier. - #### Layer 2 A layer 2 document consists of - 1. information about the computational environment (kernel type and version, machine name, date, ...) - 2. a reference to the layer 1 document containing the code blocks + 1. a reference to the layer 1 document containing the code blocks + 2. information about the computational environment (kernel type and version, machine name, date, ...) 3. a complete sequence of execution records for all code executed since the start of the kernel An execution record contains the following information: - 1. the unique identifier of the code block that was executed + 1. the SHA-1 hash of the code block that was executed 2. a set of outputs produced by the code blocks +The SHA-1 hash makes it possible to verify consistency with the underlying layer-1 document. + In the set of outputs, each output item contains: 1. a label defining the output type @@ -65,18 +65,19 @@ Note: this section must be complemented with data models for the standard output #### Layer 3 -A layer 3 document consists of +A layer 3 document consists of: - 1. a list of references to layer 1 documents + 1. 
a language tag for choosing the right kernel for execution 2. a list of references to layer 2 documents 3. a sequence of cells. -Each cell is one of +Each cell has one of the following types: - 1. a documentation cell, containing text content plus a label identifying the format (Markdown etc.) - 2. a code cell, containing the unique identifier of a code block in one of the referenced layer 1 documents - 3. an output cell, containing (1) the index of the layer 2 document that contains the output and (2) the sequence index of the output item inside the layer 2 document + - a documentation cell, containing text content plus a label identifying the format (Markdown etc.) + - an code cell, containing a code block + - a reference to an execution record, consisting of (1) the index of the layer 2 document that contains the record and (2) the sequence index of the record inside the layer 2 document +Code cells are for code that has never been executed. Executed code blocks can be retrieved through the execution record from layer 2. 
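A minimal sketch of how a tool might follow a layer 3 reference back to the executed code, assuming the layers are held as plain Python dictionaries. The key names (`code`, `log`, `hash`, `outputs`, `execution`, `cells`) and the helper `resolve_record` are illustrative assumptions, not part of the proposal text; only the use of SHA-1 hashes and the (document index, record index) pair come from the description above.

```python
import hashlib


def sha1(text):
    """Return the hex SHA-1 digest of a code block (UTF-8 encoded)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def resolve_record(layer3, ref):
    """Return (code, outputs) for an execution-record reference.

    `ref` is the pair (index of the layer 2 document, index of the
    record inside that document) stored in a layer 3 cell.
    """
    doc_index, record_index = ref
    layer2 = layer3["execution"][doc_index]
    record = layer2["log"][record_index]
    # The record stores only the hash, so the code block is looked up
    # in the layer 1 document that the layer 2 document refers to.
    for block in layer2["code"]["code"]:
        if sha1(block) == record["hash"]:
            return block, record["outputs"]
    raise LookupError("no code block in layer 1 matches the recorded hash")


# Hypothetical sample data, for illustration only.
layer1 = {"language": "python3",
          "code": ["import math", "math.cos(math.pi)"]}
layer2 = {"code": layer1,
          "environment": "example environment",
          "log": [{"hash": sha1(layer1["code"][0]), "outputs": {}},
                  {"hash": sha1(layer1["code"][1]),
                   "outputs": {"console": "-1.0"}}]}
layer3 = {"language": "python3",
          "execution": [layer2],
          "cells": [{"type": "execution_record", "data": (0, 1)}]}

code, outputs = resolve_record(layer3, layer3["cells"][0]["data"])
print(code)
print(outputs)
```

Looking the block up by hash rather than by position means that a reference fails loudly when the layer 1 code has been edited after execution, instead of silently returning different code.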
### File formats From 0a6639153bf5fa50c5174bdf550fdc9a3f9825fa Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Thu, 10 Sep 2015 10:25:27 +0200 Subject: [PATCH 03/11] Implementation notes --- ...k-document-format-for-improved-workflow-integration.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index 465ec57c..ea1d78af 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -83,6 +83,14 @@ Code cells are for code that has never been executed. Executed code blocks can b ### Implementation +Layers 1 and 2 are managed by the kernel, layer 3 is managed by the notebook client. + +When a kernel is started, it creates a fresh layer 1 and layer 2 document. All code submitted to the kernel, whether through the notebook client or by other means, is appended to layer 1. Outputs are appended to layer 2. For execution requests coming from the notebook client, the kernel returns updates to layers 1 and 2 since the previous execution request, permitting the client to reconstruct layers 1 and 2 as soon as possible. This limits information loss in case of a kernel shutdown or crash. + +The Web client creates new notebooks as layer 3 documents with no attached lower layers. User-edited content is stored as documentation cells or code cells. When a kernel is started, its layer 2 document is attached to the client's layer 3 document. When a code cell is sent to the kernel, it is replaced by a reference to the resulting execution record that is returned by the kernel. 
The client reconstructs layers 1 and 2 incrementally from this information. + +When a kernel is restarted for an existing notebook, its layers 1 and 2 are attached to the client's layer 3 in addition to layers 1 and 2 from earlier kernels. Existing layer 1/2 attachments can be deleted only when no reference to them exists any more in layer 3. + ## Pros and Cons From a9bca1b0e2f773cad61af801aa548c7d27767d65 Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Thu, 10 Sep 2015 10:39:22 +0200 Subject: [PATCH 04/11] More on implementation --- ...tebook-document-format-for-improved-workflow-integration.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index ea1d78af..b3d8aee5 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -91,6 +91,9 @@ The Web client creates new notebooks as layer 3 documents with no attached lower When a kernel is restarted for an existing notebook, its layers 1 and 2 are attached to the client's layer 3 in addition to layers 1 and 2 from earlier kernels. Existing layer 1/2 attachments can be deleted only when no reference to them exists any more in layer 3. +When already executed code is edited, the execution reference is reverted to a code cell again. To maintain the current notebook functionality identically, the outputs from the execution record would have to be copied and stored in an additional type of cell ("stale output cell"). 
However, in the interest of consistency, it seems preferable to modify the current behavior and delete stale output immediately. + +A cleanup operation ("remove all outputs") replaces execution records by code cells and deletes all layer 1/2 attachments. ## Pros and Cons From d0371ce7818f32d43a2d939d8531293098ba8a4c Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Thu, 10 Sep 2015 11:15:27 +0200 Subject: [PATCH 05/11] More on implementation --- ...ormat-for-improved-workflow-integration.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index b3d8aee5..0038ea1d 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -53,8 +53,9 @@ An execution record contains the following information: 1. the SHA-1 hash of the code block that was executed 2. a set of outputs produced by the code blocks - -The SHA-1 hash makes it possible to verify consistency with the underlying layer-1 document. + 3. a SHA-1 hash for the output + +The SHA-1 hash makes it possible to verify consistency with the underlying layer-1 document, and to detect modifications to the execution records by other tools. In the set of outputs, each output item contains: @@ -74,12 +75,15 @@ A layer 3 document consists of: Each cell has one of the following types: - a documentation cell, containing text content plus a label identifying the format (Markdown etc.) 
- - an code cell, containing a code block - a reference to an execution record, consisting of (1) the index of the layer 2 document that contains the record and (2) the sequence index of the record inside the layer 2 document + - a code cell, containing a code block + - a stale output cell, containing output from a prior execution for which no log is available Code cells are for code that has never been executed. Executed code blocks can be retrieved through the execution record from layer 2. -### File formats + +### File format + ### Implementation @@ -91,9 +95,12 @@ The Web client creates new notebooks as layer 3 documents with no attached lower When a kernel is restarted for an existing notebook, its layers 1 and 2 are attached to the client's layer 3 in addition to layers 1 and 2 from earlier kernels. Existing layer 1/2 attachments can be deleted only when no reference to them exists any more in layer 3. -When already executed code is edited, the execution reference is reverted to a code cell again. To maintain the current notebook functionality identically, the outputs from the execution record would have to be copied and stored in an additional type of cell ("stale output cell"). However, in the interest of consistency, it seems preferable to modify the current behavior and delete stale output immediately. +When already executed code is edited, the execution reference is replaced by a code cell plus a stale output cell. The latter should be displayed in a way that clearly marks it as stale. + +When opening a stored notebook, the consistency between layers 1 and 2 must be verified because other tools, in particular version control systems, may create inconsistent notebook files. This check consists of comparing the SHA-1 hashes in layer 2 to freshly computed hashes for layer 1, proceding in execution order. 
If a difference is detected, the layer 2 data is truncated at this point, and references from layer 3 to the invalidated execution records are replaced by code cell/stale output cell pairs. + +A cleanup operation ("remove output / remove all outputs") replaces execution records by code cells. -A cleanup operation ("remove all outputs") replaces execution records by code cells and deletes all layer 1/2 attachments. ## Pros and Cons From 0fa0905617c7a4e037082ed665cd5577a855f8e8 Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Thu, 10 Sep 2015 11:15:45 +0200 Subject: [PATCH 06/11] First thoughts on the file format --- ...ebook-document-format-for-improved-workflow-integration.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index 0038ea1d..a2602419 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -84,6 +84,10 @@ Code cells are for code that has never been executed. Executed code blocks can b ### File format +The main difficulty in defining a file format for the data model described above is suitability for version control. The biggest challenge is support for merging independent changes. In general, this creates an inconsistent notebook document because the computed content (layer 2) is not automatically updated after code changes. The use of SHA-1 hashes makes it possible to detect inconsistencies between layers 1 and 2. + +In order to make diffs readable, a line-oriented format with light markup is desirable for layers 1 and 3. 
Moreover, layer 2 should be placed at the end of a notebook document, following layers 1 and 3. + ### Implementation From 6c8541d128ae18d278decbd84a6a7c9e59559fb0 Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Fri, 11 Sep 2015 10:42:17 +0200 Subject: [PATCH 07/11] Revision --- ...ormat-for-improved-workflow-integration.md | 39 +++++++++++++------ 1 file changed, 28 insertions(+), 11 deletions(-) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index a2602419..4d838b1f 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -4,6 +4,7 @@ Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Examples of such tools are workflow managers, provenance trackers, and version control systems. The core of the problem is that Jupyter's on-disk notebook format is closely tied to Jupyter's functionality and design. Much information of use for other tools, although available at execution time, is not preserved in notebook files. + ## Proposed Enhancement Define a data model and file format for notebooks as digital documents. Include as much relevant information as possible that is available during a Jupyter session, in particular a complete log of code executed by the kernel. Design the file format with the needs of other tools in mind. @@ -13,9 +14,9 @@ Define a data model and file format for notebooks as digital documents. 
Include ### Notebooks as digital documents -Currently, a Jupyter notebook file is essentially an on-disk representation of the internal state of the Web-based Jupyter editor. The file format mixes human-edited and computationally generated information with insufficient distinction, and does not preserve the history of the computation in enough detail to permit replication and other validation techniques. +Currently, a Jupyter notebook file is essentially an on-disk representation of the internal state of the Jupyter notebook client. The file format mixes human-edited and computationally generated information with insufficient distinction, and does not preserve the history of the computation in enough detail to permit replication and other validation techniques. -The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter editor should become just one out of many possible tools that process such notebook documents. +The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter notebook client should become just one out of many possible tools that process such notebook documents. ### A three-layer data model @@ -28,9 +29,9 @@ The proposed data model for notebook documents consists of three layers: Layers 1 and 3 are human-edited content, subject to version control. Layer 2 consists entirely of computational results. In principle, it can be recomputed at any time. However, since recomputation can be time-consuming, and is often unreliable due to today's fragile computational enviromnents, layer 2 should be archived as well under version control, as a foundation for layer 3. -Conceptually, each layer is an independent electronic document, with each layer depending on information from lower layers. A layer 3 document can depend on multiple layer 1/2 documents. 
A provenance tracking system would treat a layer 1 document exactly like a script, and a layer 2 document exactly like the console output from a script. Provenance trackers may thus need to store the layers as separate files or datasets. The default notebook file format should combine all three layers, but facilitate extraction of individual layers. +Conceptually, each layer is an independent electronic document, depending on information from lower layers. A layer 3 document can depend on multiple layer 1/2 documents. A provenance tracking system would treat a layer 1 document exactly like a script, and a layer 2 document exactly like the console output from a script. Provenance trackers may thus need to store the layers as separate files or datasets. The default notebook file format should combine all three layers, but facilitate extraction of individual layers. -Note that today's Jupyter file format resembles layer 3. It contains some information about execution order in the form of the prompt numbers. However, since the executed code is not stored anywhere, replication of the computation is impossible. Even if the prompt numbers in the notebook are sequential and start with 1, the code cells might have been edited after execution. The only guarantee that a notebook file makes is that the outputs were obtained from *some* computation. +Note that today's Jupyter file format resembles layer 3. It contains some information about execution order in the form of the prompt numbers. However, since the executed code is not stored anywhere, replication of the computation is impossible. Even if the prompt numbers in the notebook are sequential and start with 1, the code cells might have been edited after execution, and code might have been submitted to the kernel outside of the notebook. The only guarantee that a notebook file makes is that the outputs were obtained from *some* computation. In the following, the three layers are described in more detail. 
@@ -53,16 +54,16 @@ An execution record contains the following information: 1. the SHA-1 hash of the code block that was executed 2. a set of outputs produced by the code blocks - 3. a SHA-1 hash for the output + 3. a SHA-1 hash for each output -The SHA-1 hash makes it possible to verify consistency with the underlying layer-1 document, and to detect modifications to the execution records by other tools. +The SHA-1 hashes make it possible to verify consistency with the underlying layer-1 document, and to detect modifications to the execution records by other tools. In the set of outputs, each output item contains: 1. a label defining the output type 2. the output data, conforming to a data model specific to the output type -Note: this section must be complemented with data models for the standard output types. Overall, outputs are handled very much like in the current notebook format. +Note: this section must be complemented with data models for the standard output types. Overall, outputs can be handled very much like in the current notebook format. #### Layer 3 @@ -84,13 +85,15 @@ Code cells are for code that has never been executed. Executed code blocks can b ### File format -The main difficulty in defining a file format for the data model described above is suitability for version control. The biggest challenge is support for merging independent changes. In general, this creates an inconsistent notebook document because the computed content (layer 2) is not automatically updated after code changes. The use of SHA-1 hashes makes it possible to detect inconsistencies between layers 1 and 2. +The main difficulty in defining a file format for the data model described above is suitability for version control. The biggest challenge is support for merging independent changes. In general, this creates an inconsistent notebook document because the computed content (layer 2) is not automatically updated after code changes. 
The use of SHA-1 hashes makes it possible to detect such inconsistencies. In order to make diffs readable, a line-oriented format with light markup is desirable for layers 1 and 3. Moreover, layer 2 should be placed at the end of a notebook document, following layers 1 and 3. ### Implementation +#### Jupyter notebook client + Layers 1 and 2 are managed by the kernel, layer 3 is managed by the notebook client. When a kernel is started, it creates a fresh layer 1 and layer 2 document. All code submitted to the kernel, whether through the notebook client or by other means, is appended to layer 1. Outputs are appended to layer 2. For execution requests coming from the notebook client, the kernel returns updates to layers 1 and 2 since the previous execution request, permitting the client to reconstruct layers 1 and 2 as soon as possible. This limits information loss in case of a kernel shutdown or crash. @@ -101,16 +104,30 @@ When a kernel is restarted for an existing notebook, its layers 1 and 2 are atta When already executed code is edited, the execution reference is replaced by a code cell plus a stale output cell. The latter should be displayed in a way that clearly marks it as stale. -When opening a stored notebook, the consistency between layers 1 and 2 must be verified because other tools, in particular version control systems, may create inconsistent notebook files. This check consists of comparing the SHA-1 hashes in layer 2 to freshly computed hashes for layer 1, proceding in execution order. If a difference is detected, the layer 2 data is truncated at this point, and references from layer 3 to the invalidated execution records are replaced by code cell/stale output cell pairs. - A cleanup operation ("remove output / remove all outputs") replaces execution records by code cells. +When opening a stored notebook, all execution records are replaced by code cell/stale output cell pairs and layers 1 and 2 are discarded. 
Then a new kernel is started, creating new layer 1/2 information. + +Publishing tools (including nbviewer) should follow a more careful procedure in order to preserve information about replicability. This requires first of all a verification of the consistency between layers 1 and 2, because other tools, in particular version control systems, may create inconsistent notebook files. The check consists of comparing the SHA-1 hashes in layer 2 to freshly computed hashes for layer 1, proceding in execution order. If a difference is detected, the layer 2 data is truncated at this point, and references from layer 3 to the invalidated execution records are replaced by code cell/stale output cell pairs. A visual marker for stale outputs then tells the reader which parts of the notebook are backed up by replicable computations. + +#### Alternative user interfaces + +A more natural user interface for the document data model would propose two views on the data: + + 1. an interactive shell much like IPython, but code-block oriented rather than line oriented + 2. a notebook editor + +A single command would send a code cell from the notebook editor to the interactive shell for execution. In the interactive shell, a single command would append the current execution record to the notebook being edited. + +An advantage of such a user interface is that it generalizes easily to multi-user setups. + ## Pros and Cons Pros: * The computations in notebooks become replicable, at least in an identical computational environment. -* Notebooks can be managed as alternatives to scripts by workflow management tools. +* Computations in notebooks can be handled by version control, including merge operations for independent changes. +* Notebooks can be managed like scripts by workflow management tools. * Alternative notebook editing tools can be developed that support different tastes or needs, while maintaining document compatibility and thus avoiding lock-in to any particular tool. 
Cons: From c009a69dd7fcb2f188ebe4e2e378329705b6ceae Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Fri, 11 Sep 2015 10:42:31 +0200 Subject: [PATCH 08/11] A simple example for the document structure --- .../data_model_example.py | 36 +++++++++++++++++++ 1 file changed, 36 insertions(+) create mode 100644 a-new-notebook-document-format-for-improved-workflow-integration/data_model_example.py diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/data_model_example.py b/a-new-notebook-document-format-for-improved-workflow-integration/data_model_example.py new file mode 100644 index 00000000..cfe1553a --- /dev/null +++ b/a-new-notebook-document-format-for-improved-workflow-integration/data_model_example.py @@ -0,0 +1,36 @@ +# This file contains an example for the proposed data model implemented +# as Python data structures. + +import hashlib + +def sha1(s): + return hashlib.sha1(s.encode('utf-8')).hexdigest() + +layer1 = {'language': 'python3', + 'code': ["import math", + "x = math.pi\nmath.cos(x)"]} + +layer2 = {'code': layer1, + 'environment': "Python 3.4.3 (default, Apr 15 2015, 21:03:06)\n[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.49)] on darwin", + 'log': [{'hash': sha1(layer1['code'][0]), + 'outputs': {'console': {'data': "", + 'hash': sha1("")}}}, + {'hash': sha1(layer1['code'][1]), + 'outputs': {'console': {'data': "-1.0", + 'hash': sha1("-1.0")}}}]} + +layer3 = {'language': 'python3', + 'execution': [layer2], + 'cells': [{'type': 'documentation', + 'format': 'markdown', + 'data': "# The cosine function\n"}, + {'type': 'documentation', + 'format': 'markdown', + 'data': "First we import the math module:\n"}, + {'type': 'execution_record', + 'data': (0, 0)}, + {'type': 'documentation', + 'format': 'markdown', + 'data': "Now we can compute the cosine:\n"}, + {'type': 'execution_record', + 'data': (0, 1)},]} From c77f1e0c03f1d8e7a09239674d412c68d593fbbe Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Fri, 11 Sep 2015 12:25:29 
+0200 Subject: [PATCH 09/11] Another revision --- ...ormat-for-improved-workflow-integration.md | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index 4d838b1f..32c7d1d1 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -2,21 +2,19 @@ ## Problem -Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Examples of such tools are workflow managers, provenance trackers, and version control systems. The core of the problem is that Jupyter's on-disk notebook format is closely tied to Jupyter's functionality and design. Much information of use for other tools, although available at execution time, is not preserved in notebook files. +Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Version control systems require a clear separation of human-edited content and computed content. The current notebook file format mixes them. Workflow managers and provenance trackers require that all computations be replicable. For interactive computations, replicability requires storing a full log of user actions. The current notebook file format does not preserve this information, although it is available at execution time. +The core of the problem is that Jupyter's notebook file format is closely tied to Jupyter's functionality and design. 
It is essentially an on-disk representation of the internal state of the Jupyter notebook client, storing only the information required to open the notebook later or elsewhere. -## Proposed Enhancement - -Define a data model and file format for notebooks as digital documents. Include as much relevant information as possible that is available during a Jupyter session, in particular a complete log of code executed by the kernel. Design the file format with the needs of other tools in mind. +## Proposed Enhancement -## Detailed Explanation +The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter notebook client should become just one out of many possible tools that process such notebook documents. -### Notebooks as digital documents +This core of this proposal is a new data model and file format notebooks as digital documents. It includes as much relevant information as possible that is available during a Jupyter session, in particular a complete log of the code executed by the kernel. The file format is designed with the specific needs of other tools in mind, in particular version control systems. -Currently, a Jupyter notebook file is essentially an on-disk representation of the internal state of the Jupyter notebook client. The file format mixes human-edited and computationally generated information with insufficient distinction, and does not preserve the history of the computation in enough detail to permit replication and other validation techniques. -The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter notebook client should become just one out of many possible tools that process such notebook documents. +## Detailed Explanation ### A three-layer data model @@ -54,14 +52,14 @@ An execution record contains the following information: 1. the SHA-1 hash of the code block that was executed 2. 
a set of outputs produced by the code blocks - 3. a SHA-1 hash for each output -The SHA-1 hashes make it possible to verify consistency with the underlying layer-1 document, and to detect modifications to the execution records by other tools. - In the set of outputs, each output item contains: 1. a label defining the output type 2. the output data, conforming to a data model specific to the output type + 3. a SHA-1 hash + +The SHA-1 hashes make it possible to verify consistency with the underlying layer-1 document, and to detect accidental modifications to the execution records by other tools. Note: this section must be complemented with data models for the standard output types. Overall, outputs can be handled very much like in the current notebook format. @@ -80,7 +78,7 @@ Each cell has one of the following types: - a code cell, containing a code block - a stale output cell, containing output from a prior execution for which no log is available -Code cells are for code that has never been executed. Executed code blocks can be retrieved through the execution record from layer 2. +Code cells contain code that has not yet been executed. Executed code blocks can be retrieved through the execution record from layer 2. ### File format @@ -121,6 +119,8 @@ A single command would send a code cell from the notebook editor to the interact An advantage of such a user interface is that it generalizes easily to multi-user setups. +An alternative user interface for users whose computations are short is a literate-programming style editor in which the user mixes code and documentation and a run-time system immediately updates all output cells by re-executing all code from start to end. + ## Pros and Cons @@ -131,7 +131,7 @@ Pros: * Alternative notebook editing tools can be developed that support different tastes or needs, while maintaining document compatibility and thus avoiding lock-in to any particular tool. 
Cons: -* A data model and file format defined independently of the Jupyter implementation creates constraints on the future evolution of Jupyter. +* A data model and file format defined independently of the Jupyter implementation generates constraints on the future evolution of Jupyter. ## Interested Contributors @khinsen From c2eeb549f3152df04487ea508c77b02d329fba5a Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Mon, 14 Sep 2015 15:13:42 +0200 Subject: [PATCH 10/11] Another revision --- ...ormat-for-improved-workflow-integration.md | 24 +++++++++++-------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md index 32c7d1d1..1d6e4534 100644 --- a/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md +++ b/a-new-notebook-document-format-for-improved-workflow-integration/a-new-notebook-document-format-for-improved-workflow-integration.md @@ -11,11 +11,13 @@ The core of the problem is that Jupyter's notebook file format is closely tied t The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter notebook client should become just one out of many possible tools that process such notebook documents. -This core of this proposal is a new data model and file format notebooks as digital documents. It includes as much relevant information as possible that is available during a Jupyter session, in particular a complete log of the code executed by the kernel. The file format is designed with the specific needs of other tools in mind, in particular version control systems. 
+The core enhancement is a new data model and file format for notebooks as digital documents. It includes as much relevant information as possible that is available during a Jupyter session, in particular a complete log of the code executed by the kernel. The file format is designed with the specific needs of other tools in mind, in particular version control systems. ## Detailed Explanation +Note: for an illustration of the data model described in the following, see [a simple example](./data_model_example.py) implemented in terms of Python data structures. + ### A three-layer data model The proposed data model for notebook documents consists of three layers: @@ -25,11 +27,11 @@ The proposed data model for notebook documents consists of three layers: 3. A narrative containing references to specific code blocks and outputs. -Layers 1 and 3 are human-edited content, subject to version control. Layer 2 consists entirely of computational results. In principle, it can be recomputed at any time. However, since recomputation can be time-consuming, and is often unreliable due to today's fragile computational enviromnents, layer 2 should be archived as well under version control, as a foundation for layer 3. +Layers 1 and 3 are human-edited content, subject to version control. Layer 2 consists entirely of computational results. In principle, it can be recomputed at any time. However, since recomputation can be time-consuming, and is often unreliable due to today's fragile computational enviromnents, layer 2 must be archived as well under version control, as a foundation for layer 3. Conceptually, each layer is an independent electronic document, depending on information from lower layers. A layer 3 document can depend on multiple layer 1/2 documents. A provenance tracking system would treat a layer 1 document exactly like a script, and a layer 2 document exactly like the console output from a script. 
Provenance trackers may thus need to store the layers as separate files or datasets. The default notebook file format should combine all three layers, but facilitate extraction of individual layers. -Note that today's Jupyter file format resembles layer 3. It contains some information about execution order in the form of the prompt numbers. However, since the executed code is not stored anywhere, replication of the computation is impossible. Even if the prompt numbers in the notebook are sequential and start with 1, the code cells might have been edited after execution, and code might have been submitted to the kernel outside of the notebook. The only guarantee that a notebook file makes is that the outputs were obtained from *some* computation. +Note that today's Jupyter file format resembles layer 3. It contains some information about execution order in the form of the prompt numbers. However, since the executed code is not stored, replication of the computation is impossible. Even if the prompt numbers in the notebook are sequential and start with 1, the code cells might have been edited after execution, and code might have been submitted to the kernel outside of the notebook. The only guarantee that a notebook file makes is that the outputs were obtained from *some* computation. In the following, the three layers are described in more detail. @@ -73,16 +75,18 @@ A layer 3 document consists of: Each cell has one of the following types: - - a documentation cell, containing text content plus a label identifying the format (Markdown etc.) - - a reference to an execution record, consisting of (1) the index of the layer 2 document that contains the record and (2) the sequence index of the record inside the layer 2 document - - a code cell, containing a code block - - a stale output cell, containing output from a prior execution for which no log is available + - documentation: contains text content plus a label identifying the format (Markdown etc.) 
+ - execution record reference: consists of (1) the index of the layer 2 document that contains the record and (2) the sequence index of the record inside the layer 2 document + - code: contains a code block + - stale output: contains output from a prior execution for which no log is available -Code cells contain code that has not yet been executed. Executed code blocks can be retrieved through the execution record from layer 2. +Code cells are provided for code that has not yet been executed. Executed code blocks can be retrieved through the execution record from layer 2. ### File format +Note: a detailed file format definition would be premature at this time and only lead to lengthy bikeshedding. This section lists only important file format design criteria. Initial discussion should focus on the data model. + The main difficulty in defining a file format for the data model described above is suitability for version control. The biggest challenge is support for merging independent changes. In general, this creates an inconsistent notebook document because the computed content (layer 2) is not automatically updated after code changes. The use of SHA-1 hashes makes it possible to detect such inconsistencies. In order to make diffs readable, a line-oriented format with light markup is desirable for layers 1 and 3. Moreover, layer 2 should be placed at the end of a notebook document, following layers 1 and 3. @@ -96,9 +100,9 @@ Layers 1 and 2 are managed by the kernel, layer 3 is managed by the notebook cli When a kernel is started, it creates a fresh layer 1 and layer 2 document. All code submitted to the kernel, whether through the notebook client or by other means, is appended to layer 1. Outputs are appended to layer 2. For execution requests coming from the notebook client, the kernel returns updates to layers 1 and 2 since the previous execution request, permitting the client to reconstruct layers 1 and 2 as soon as possible. 
This limits information loss in case of a kernel shutdown or crash. -The Web client creates new notebooks as layer 3 documents with no attached lower layers. User-edited content is stored as documentation cells or code cells. When a kernel is started, its layer 2 document is attached to the client's layer 3 document. When a code cell is sent to the kernel, it is replaced by a reference to the resulting execution record that is returned by the kernel. The client reconstructs layers 1 and 2 incrementally from this information. +The Web client creates new notebooks as layer 3 documents with no attached lower-level layers. User-edited content is stored in documentation and code cells. When a kernel is started, its layer 2 document is attached to the client's layer 3 document. When a code cell is sent to the kernel, it is replaced by a reference to the resulting execution record that is returned by the kernel. The client reconstructs layers 1 and 2 incrementally from this information. -When a kernel is restarted for an existing notebook, its layers 1 and 2 are attached to the client's layer 3 in addition to layers 1 and 2 from earlier kernels. Existing layer 1/2 attachments can be deleted only when no reference to them exists any more in layer 3. +When a kernel is restarted for an existing notebook, its layers 1 and 2 are attached to the client's layer 3 in addition to layers 1 and 2 from earlier kernel runs. Existing layer 1/2 attachments are deleted only when no reference to them exists any more in layer 3 (i.e. a garbage collection approach). When already executed code is edited, the execution reference is replaced by a code cell plus a stale output cell. The latter should be displayed in a way that clearly marks it as stale. 
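Reviewer note on the consistency check described for publishing tools: the comparison of the SHA-1 hashes in layer 2 with freshly computed hashes for layer 1, in execution order, could look roughly like the sketch below. It reuses the dictionary layout and the `sha1` helper from `data_model_example.py`. The function names `first_inconsistent_record` and `truncate_stale_records` are hypothetical and not part of the proposal, and the replacement of layer 3 references by code cell/stale output cell pairs is left out.

```python
import hashlib


def sha1(text):
    """Hex SHA-1 of a code block, as in data_model_example.py."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def first_inconsistent_record(layer1, layer2):
    """Return the index of the first execution record whose stored code hash
    differs from a fresh hash of the matching layer 1 code block, or None
    if every record is consistent. (Hypothetical helper.)"""
    code_blocks = layer1["code"]
    for index, record in enumerate(layer2["log"]):
        if index >= len(code_blocks) or record["hash"] != sha1(code_blocks[index]):
            return index
    return None


def truncate_stale_records(layer1, layer2):
    """Return a copy of layer 2 with the log truncated at the first
    inconsistent record; layer 2 is returned unchanged if all is consistent."""
    index = first_inconsistent_record(layer1, layer2)
    if index is None:
        return layer2
    truncated = dict(layer2)
    truncated["log"] = layer2["log"][:index]
    return truncated


layer1 = {"language": "python3",
          "code": ["import math", "x = math.pi\nmath.cos(x)"]}
layer2 = {"code": layer1,
          "log": [{"hash": sha1(block), "outputs": {}} for block in layer1["code"]]}

# Simulate a merge that changed the second code block without re-running it.
layer1["code"][1] = "x = math.pi / 2\nmath.cos(x)"
```

Under these assumptions, the second record is reported as inconsistent and everything from that point on is dropped from the log, which matches the "truncated at this point" wording in the text.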
From 2a4d6588864a829d88975790a6123aa14578bfdc Mon Sep 17 00:00:00 2001 From: Konrad Hinsen Date: Tue, 20 Oct 2015 09:25:25 +0200 Subject: [PATCH 11/11] Example notebook that is syntactically valid but semantically invalid --- .../semantically-invalid-notebook.ipynb | 56 +++++++++++++++++++ 1 file changed, 56 insertions(+) create mode 100644 a-new-notebook-document-format-for-improved-workflow-integration/semantically-invalid-notebook.ipynb diff --git a/a-new-notebook-document-format-for-improved-workflow-integration/semantically-invalid-notebook.ipynb b/a-new-notebook-document-format-for-improved-workflow-integration/semantically-invalid-notebook.ipynb new file mode 100644 index 00000000..1d4ea90f --- /dev/null +++ b/a-new-notebook-document-format-for-improved-workflow-integration/semantically-invalid-notebook.ipynb @@ -0,0 +1,56 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "7" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "2+3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}
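Reviewer note on the example notebook above, which shows `7` as the result of `2+3`: the per-output SHA-1 hashes in the proposed execution records could flag this kind of mismatch if the output data was edited after the record was written. The sketch below assumes the record layout from `data_model_example.py`. `make_record` and `modified_outputs` are hypothetical names, and the check only catches accidental edits that leave the stored hash untouched. It cannot tell whether a stored output is the correct result of the code without re-executing it.

```python
import hashlib


def sha1(text):
    """Hex SHA-1 of a string, as in data_model_example.py."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_record(code, console_output):
    """Build an execution record in the layout of data_model_example.py.
    (Hypothetical helper.)"""
    return {"hash": sha1(code),
            "outputs": {"console": {"data": console_output,
                                    "hash": sha1(console_output)}}}


def modified_outputs(record):
    """Return the labels of outputs whose data no longer matches the hash
    stored alongside them. (Hypothetical helper.)"""
    return [label for label, output in record["outputs"].items()
            if sha1(output["data"]) != output["hash"]]


record = make_record("2+3", "5")
# Simulate a hand edit of the stored output that leaves the hash untouched.
record["outputs"]["console"]["data"] = "7"
```

Calling `modified_outputs(record)` on the edited record reports the `console` output, while a freshly built record reports nothing.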