Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package location of document models #149

Open
9 of 42 tasks
afshin opened this issue Jun 20, 2022 · 9 comments
Open
9 of 42 tasks

Package location of document models #149

afshin opened this issue Jun 20, 2022 · 9 comments

Comments

@afshin
Copy link
Member

afshin commented Jun 20, 2022

This issue is to call a vote on the proposal in jupyterlab/jupyterlab#12708

(Full text of top-line comment)

Problem

We have an open PR jupyterlab/jupyterlab#12359 from @hbcarlos that refactors our models for notebooks and text documents to improve the real-time collaboration (RTC) user experience and APIs for extension developer. There are a couple of questions that are coming up in the review of that PR that are not really RTC questions. I am opening a few issues to help us reach consensus/vote to resolve those questions to make it easier for us to review and make decisions about #12359.

In the current issue, we are trying to answer the following question:

In what npm package should our models live?

In JupyterLab master, our models live in a couple of places:

  • shared-models
    • Contains a public interface and a Yjs-based implementation of "shared documents"
    • Includes ISharedFile, ISharedNotebook, ISharedCell, etc. that wrap Yjs and expose public API.
  • nbformat.IOutput has the output model, but it is not a RTC/shared model. It is used in shared-models as the output model and is basically the JSON representation of a single output.
  • cells has a ICellModel that "has a" ISharedCell and is also a public model API.
  • rendermime has an OutputModel that wraps the IOutput into a TS object with a public API.
  • outputarea has an OutputAreaModel object that has a list of OutputModel instances.
  • notebook
    • Has a CellList model that is an observable list of CellModel objects.
    • Has a NotebookModel that "has a" ISharedNotebook and CellLlist

The key point here is that we have multiple layers of model APIs across those in the shared-models package and the (notebook, cells, outputarea packages). Both are currently public APIs that are high-level and semantic - they refer to native Jupyter things rather than the lower-level of Yjs types or ModelDB types.

The approach in #12359 removes the public API in shared-models and incorporates all of that logic into the (notebook, cells, outputarea) packages. As a result of these changes all of the models are in the npm package for that entity – (cells, notebook, outputarea) respectively.

Separate from all of the details of how we are doing RTC, we need to decide which of these two approaches is best for where our models are:

Proposed Solution

The proposal in this issue is to remove the shared-models npm package and move all of our models to the (notebook, cells, outputarea) packages. This decision only pertains to which npm package these models are in, rather than to the details of what those models are.

Pros:

  • Removing the shared-models package reduces the number of npm packages across which our model APIs are spread.
  • Having the models consolidated to its entity's package would give us more flexibility to consolidate, simplify, and refactor those models APIs.
  • Because of this split across multiple npm packages, the shared-models package is only a partial representation of the overall model APIs, so they are less useful on their own.

Cons:

  • In principle, someone might want to use the models in shared-models to build something without our (notebook, cells, outputareas) packages.
  • Removing shared-models is a breaking API. to be clear though, there are other breaking API changes in #12359

Summary

Vote Yes if you agree with the following:

... to remove the shared-models npm package and move all of our models to the (notebook, cells, outputarea) packages. This decision only pertains to which npm package these models are in, rather than to the details of what those models are.

Vote No if you disagree.

@jupyterlab/jupyterlab-council votes


Edited to correct issue link

@afshin afshin added bug Something isn't working vote and removed bug Something isn't working labels Jun 20, 2022
@fperez
Copy link

fperez commented Jun 22, 2022

Before voting, I'd appreciate a bit more clarification on the extent to which this proposal impacts some of the specific points raised by @dmonad during the #12359 discussion.

If I understand correctly, this is mostly about the package locations, but @dmonad was also concerned about other technical issues, specifically related to the complexity of correctly initializing these data structures and later manipulating them without introducing inconsistencies.

Does this particular proposal intersect with those issues, or is the intent that for now, the package "home" for these APIs is the only change, with those issues being left for later discussion?

I'd just like a bit more clarity on the scope of this particular vote. I realize there's probably been a massive amount of discussion at meetings I haven't participated in, and I'm not requesting a rehash of all the details, but in key public votes I think it's useful to summarize key points with sufficient precision for anyone who reads these logs to understand the decision-making process (Python PEPs do an outsanding job with this, and I can attest to having read the discussion/pros/cons sections of important PEPs many times in detail, to understand the evolution of the language).

@afshin
Copy link
Member Author

afshin commented Jun 23, 2022

@fperez,

Does this particular proposal intersect with those issues, or is the intent that for now, the package "home" for these APIs is the only change, with those issues being left for later discussion?

I don't think this approach strictly obviates any RTC implementation beyond defining where the underlying classes reside. So I think @dmonad can reasonably describe this proposal as orthogonal to the conversation in the RTC pull request (Kevin, please correct me if I have misstated your position).

The proposal in this issue does coincide with the implementation in Carlos' PR but a different code implementation could still contain a package structure that accords with the proposed package architecture.

I think the intent of this issue is to shrink the contested API surface areas by proposing a first principles question about how we organize the packages and APIs related to cells, notebooks, and their respective data models. The idea is to propose a question that does not require delving into a huge code base or a huge proposed change to that code base while still being able to offer an informed opinion.

@SylvainCorlay
Copy link
Member

SylvainCorlay commented Jun 27, 2022

In the past few weeks, there have been a lot of discussions on the best next steps for RTC, as the deadline for JupyterLab 4.0 is approaching.

Some context

The first version of JupyterLab including collaborative editing features was JupyterLab 3.1, although it is only enabled by setting a flag. In JupyterLab 3.1, collaborative editing is built upon the Yjs CRDT implementation, which provides a framework for building collaborative applications.

For this first iteration, a key objective was to provide a collaborative experience while retaining backwards compatibility for extension authors that were relying on the current ModelDB-based model implementation. This was a requirement for integrating this work into a 3.x release. This first version reached that objective by having the Yjs-based models (the so-called shared documents) exist in parallel with ModelDB, and synchronize with it.

This first iteration has major issues. First, the bidirectional binding of Yjs and ModelDB model states created significant problems with respect to state initialization, such as occasionally causing data loss. Second, saving the documents was still done through a request from the front-end to the contents API, also causing issues with race conditions, forcing us to make use of a locking approach for the initial loading and the saving of documents from disk. Finally, this created significant overhead both in terms of performance and memory.

Major goals for JupyterLab 4 were (among others) to

  • Implement a back-end Yjs peer that would be the only one responsible for saving the document to disk, removing the locking approach.
  • Resolve that duplication of model data (which also poses some performance issues).

The first objective has been implemented - and merged into the relevant packages. The second objective is the subject of this issue.

Carlos' PR

In PR 12359, Carlos (@hbcarlos) resolves the duplication problem by removing the new "shared documents" added in Lab 3.1 altogether, and by adopting an approach similar to ModelDB, except that the ModelDB replacement is based on Yjs. This ModelDB replacement is essentially Yjs, except that it exposes a subset of API and prevents exposing Yjs data types to other JupyterLab components.

This PR has been significantly iterated upon, and is technically ready to go. It does resolve the catastrophic data loss issues that we were experiencing with JupyterLab 3, and gets us to the point of providing a real working collaborative editing experience in JupyterLab.

Notably, Carlos' PR also includes a large number of new tests and docstrings. The ModelDB equivalent that is implemented on top of Yjs also adds to the line count.

Kevin's objections

Kevin objects to Carlos' approach in PR 12359 and proposes instead to retain the "shared documents" introduced in 3.x, and to remove ModelDB.

I don't want to misrepresent the arguments, but here is my short summary of their nature:

Kevin's arguments in favor of this alternative approach are of several natures. Some pertain to code organization, such as that "shared state" and "application state" should live in two different classes and packages, architectural decisions, such as that the way abstractions for e.g. cells and output areas should be articulated around shared types. He made a draft implementation of this approach in PR 12695.

What is common in both approaches - Yjs becoming the main source of truth

In both Carlos' and Kevin's approaches, the model state is solely represented as a Yjs document even in non-collaborative mode, making Yjs very core to JupyterLab. A single YDoc represents the entire state of a shared document. In both approaches, Yjs becomes the source of truth.

So it better work, otherwise JupyterLab users will face significant regressions in JupyterLab 4.0.

For example, it becomes critical that the so-called "move feature" that is currently missing in Yrs (the rust implementation of Yjs used in the backend) works as expected, or users of JupyterLab won't be able to make "undo/redo'' operations in cells after moving them. We were also resolving significant issues in Yrs/Ypy until recently, that caused some crashes during the test collaborative sessions.

Hence, the move feature is an important priority for JupyterLab 4.0.

The approach of constraining the decision making

There have been many debates on this issue, in the public RTC meeting and in the various PRs and issues, one approach to get closer to an agreement that was proposed was to constrain the problem in an exogenous manner. If there were intrinsic constraints coming from our goals for JupyterLab that could guide some of the decisions, this would be a way to arbitrate the issue. This is why Afshin opened two votes for the JupyterLab council in this issue and Issue 150 that concern the location of the model implementation for documents, and the modeling of metadata.

(I am not touching the question of modeling metadata in this post. This is another place where Carlos' approach differs from the current state of shared documents, although either implementation could probably implement the modeling of the notebook metadata in the way of the other).

@SylvainCorlay
Copy link
Member

Now, on the crux of the matter of this issue (package location of document models), I think it is important that the document model lives in the same package as the document implementation. If we were to write a spreadsheet extension for JupyterLab, we would have spreadsheet model be in the spreadsheet NPM package.

One thing that could (and should probably) be separate is a static schema definition fot the shared state, so that it can be consumed by other implementation.

@ellisonbg
Copy link
Contributor

The voting period for this issues has closed:

  • Yes: 9 votes
  • No: 0 votes
  • Abstain: 0 votes

Given the 14 council members, the 50% quorum is met and the proposal passes.

@afshin
Copy link
Member Author

afshin commented Jul 18, 2022

(This vote is closed but the issue is open to allow others to find it more easily.)

@dmonad
Copy link
Member

dmonad commented Jul 18, 2022

In principle, someone might want to use the models in shared-models to build something without our (notebook, cells, outputareas) packages.

There are already users of our shared-models package that currently has very few dependencies. Ideally - and I believe everyone agrees with that - we want everyone to be able to consume Jupyter packages to build derivates. This will allow others (e.g. VSCode Jupyter extension) to build their product around the same sync API as JupyterLab (they can collaborate via the same backend!). Merging the shared models into the other packages might still allow others to consume the shared models, but we will force them to install almost the complete jupyter ecosystem. The argument about "bundlers will just eliminate this away" is a really awkward one (700MB of dependencies still need to be downloaded).

We implemented this really awesome, complex feature "collaborative editing". It is fine to dedicate a single package to this very complex problem. We are not really missing out on much (maybe this is not the most ideomatic code architecture for some). But it is very practical and might mean a lot to others.

@afshin
Copy link
Member Author

afshin commented Jul 18, 2022

The argument about "bundlers will just eliminate this away" is a really awkward one (700MB of dependencies still need to be downloaded).

The entire @jupyterlab/notebook package weighs in at 69.1MB when I install it in an empty folder and check the size of node_modules. That is ten times larger than if I install the current @jupyterlab/shared-models, for sure. But "700MB of dependencies" is simply inaccurate by an order of magnitude.

I don't think it is helpful to re-ignite an argument to save authors of non-JupyterLab applications from an additional 63MB download. I don't think that's a huge imposition. We're not talking about extension authors — they already need to install many other JupyterLab packages, after all.

We have spent months in this exact conversation and as recently as last week's RTC call, I thought we had consensus on a path forward. In addition, we held this vote for this issue specifically because we had difficulty arriving at the answer otherwise.

It's absolutely fine to disagree. If somebody on the @jupyterlab/jupyterlab-council feels like proposing the opposite of this issue and putting it to a vote that is okay. I advise against doing that. I'd rather release software than legislation.

I recommend we move forward and actually release RTC one day soon.

There has been plenty of discussion, disagreement, and compromise so far. To re-litigate the issue now is a step backward. We've agreed to go with @dmonad's approach despite how difficult the conversation around it has been and how much time and effort was poured into a different approach. We even agreed to ship the exact same API footprint.

Is it really worth arguing over whether a package adds 7MB to node_modules or 70MB? Personally, I don't think it is.

@ellisonbg
Copy link
Contributor

I agree with @afshin that the decision has already been made. If a JupyterLab council member wants us to reconsider that decision/vote, they are free to propose a new vote. Discussion should continue, but at this point, I don't see any reason to not move forward with the approved plan.

I realize that this is a new decision making approach for Jupyter and will likely take time for us to get used to. For those who want to read up on how it all works the decision making approach is described here:

https://github.com/jupyter/governance/blob/master/decision_making.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants