Add support for using cell ID in diffing and merging #639

vidartf · 2022-11-23T01:09:13Z

Fixes #553.

Previous cell ID efforts focused on keeping nbdime from falling over when it encounters IDs, and on making sure nbdime doesn't drop them completely (#566). This PR makes nbdime start using the IDs to make better diffs! While the PR does a bunch of extra things as well (some optimization, some refactoring to have nbdime treat the id field as atomic, some test fixes, improving the union merge strategy), the main logical changes for the cell IDs are relatively small (the changes to the compare_cell_X functions in nbdime/diffing/notebooks.py). To break it down:

In the strictest check, a cell without an ID is never considered equal to a cell with an ID.
In the two less-strict comparisons (moderate and approximate), cells are considered unequal if they both have IDs and the IDs differ. However, these comparisons allow a cell without an ID to be considered similar to a cell with an ID. This allows:
- Commits where cell IDs are added to notebooks that previously didn't have them should still be able to match cells.
- If a notebook ends up with only some cells having IDs (e.g. during a partial merge or rebase), and IDs are added to the remaining cells later, then in the first comparison pass (strict) the cells that already had IDs are matched to each other, and only in the next passes are the cells that gained IDs in the current change compared with their old ID-less versions.
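
The ID rules described above can be sketched roughly as follows. This is only an illustration and not the code in nbdime/diffing/notebooks.py: the helper names `cell_id`, `ids_allow_match_strict` and `ids_allow_match_lenient` are hypothetical, cells are assumed to be plain dicts with an optional `id` key, and the content comparison that would follow is left out.

```python
from typing import Optional


def cell_id(cell: dict) -> Optional[str]:
    """Return the cell's ID, or None if the cell has no ID."""
    return cell.get("id")


def ids_allow_match_strict(a: dict, b: dict) -> bool:
    """Strict rule (hypothetical helper): IDs must be equal, and a cell
    with an ID never matches a cell without one."""
    return cell_id(a) == cell_id(b)


def ids_allow_match_lenient(a: dict, b: dict) -> bool:
    """Moderate/approximate rule (hypothetical helper): only veto the match
    when both cells have IDs and those IDs differ."""
    id_a, id_b = cell_id(a), cell_id(b)
    if id_a is not None and id_b is not None:
        return id_a == id_b
    # At least one side has no ID, so the ID cannot rule out a match.
    return True
```

With these helpers, a cell that just gained an ID is rejected by the strict rule but can still be paired with its ID-less predecessor by the lenient one.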

I want to run some extra manual tests on the change of the differ for the source field, so leaving as draft for now. We also need to have a way to surface the cell ID in the web apps. Maybe include it in the [N] input prompt? This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

krassowski · 2023-06-04T23:27:51Z

> We also need to have a way to surface the cell ID in the web apps. Maybe include it in the [N] input prompt?

Markdown and raw cells do not have execution count prompts:

The cell IDs tend to be long and not user-readable. They would likely fit next to the prompt:

Is this what you had in mind? I can send a PR against this one.

What would you say about only showing those when hovering over the cell to reduce visual clutter?

> This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

Another common habit which will lead to this is when working with e.g. visualisation, if a user duplicates the cell, modifies the code, runs the other cell, compares visualisation results and depending on satisfaction removes one or the other.

I was wondering whether it should be:

if cell_a_id == cell_b_id:
    return True

rather than:

return cell_a_id == cell_b_id

I guess this would have much less of a performance benefit (i.e. it would only help if the notebook structure is largely unchanged).
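
To make the difference between the two variants concrete, here is a minimal sketch. It assumes cells are plain dicts, and it assumes that the first variant would fall through to a content comparison when the IDs differ; both function names and the `source` fallback are illustrative assumptions, not the code under review.

```python
def compare_with_early_return(a: dict, b: dict) -> bool:
    """Variant 1: equal IDs short-circuit to a match, otherwise the
    decision falls through to a content comparison (assumed fallback)."""
    id_a, id_b = a.get("id"), b.get("id")
    if id_a is not None and id_a == id_b:
        return True
    return a.get("source") == b.get("source")


def compare_ids_only(a: dict, b: dict) -> bool:
    """Variant 2: the ID comparison alone decides the outcome."""
    return a.get("id") == b.get("id")


# A duplicated cell: same source, but the copy received a fresh ID.
original = {"id": "aaa", "source": "plot(data)"}
duplicate = {"id": "bbb", "source": "plot(data)"}
```

For the duplicated cell, the first variant still reports a match (via the content fallback) while the second does not. The first variant only skips the content comparison when IDs line up, which is why its speedup would depend on the notebook structure being largely unchanged.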

krassowski · 2023-06-10T20:28:44Z

I opened vidartf#3 implementing the UI changes exposing the cell IDs. I think that these could be improved iteratively later.

I also measured the execution time to understand how much improvement this PR brings. In a notebook of 1000 code cells (10 unique cells duplicated 100 times), benchmarked with hyperfine using nbdime diff path/to/notebook.ipynb:

Scenario	Before (mean ± σ)	After (mean ± σ)	Speedup
no changes	1.133 s ± 0.056 s	460.2 ms ± 17.7 ms	x 2.5
content of last cell changed	10.442 s ± 0.442 s	2.486 s ± 0.060 s	x 4.2
content of first cell changed	10.717 s ± 0.843 s	2.448 s ± 0.093 s	x 4.4
last cell added	10.735 s ± 0.748 s	2.398 s ± 0.159 s	x 4.5
first cell added	13.872 s ± 1.710 s	2.311 s ± 0.254 s	x 6.0
cells randomly reshuffled	16.925 s ± 0.232 s	2.555 s ± 0.053 s	x 6.6
content of every cell changed	387.859 s ± 18.706 s	2.664 s ± 0.078 s	x 145.6

This is pretty impressive, especially for the case of a notebook where content of each cell was modified.
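
A notebook of the shape used in this benchmark (10 unique code cells repeated 100 times) could be generated with a short script like the one below. This is a sketch under assumptions: the cell contents, the function names and the output path are made up for illustration, and the actual notebook used for the measurements is not shown in this thread.

```python
import json
import uuid


def make_benchmark_notebook(n_unique: int = 10, repeats: int = 100) -> dict:
    """Build an nbformat 4.5 notebook dict with n_unique * repeats code cells."""
    # Placeholder sources; the real benchmark cells are unknown.
    unique_sources = [f"value_{i} = {i}\nprint(value_{i})" for i in range(n_unique)]
    cells = []
    for _ in range(repeats):
        for source in unique_sources:
            cells.append({
                "cell_type": "code",
                "execution_count": None,
                "id": uuid.uuid4().hex,  # every cell gets its own ID
                "metadata": {},
                "outputs": [],
                "source": source,
            })
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 5}


def write_benchmark_notebook(path: str) -> None:
    """Write the generated notebook to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(make_benchmark_notebook(), f, indent=1)
```

Calling `write_benchmark_notebook("benchmark.ipynb")` produces a file that can then be edited per scenario (e.g. changing the last cell) before timing the diff.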

krassowski · 2023-06-10T20:32:50Z

This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

With vidartf#3 this case would be seen as:

vidartf · 2023-07-07T17:46:09Z

@krassowski I have now changed how ID comparison is done. It simply uses ID equality as the first pass (any cells that have the same ID will be considered equal, and used for the first pass of creating coherent snakes). It should be more consistent with current behavior when diffing/merging notebooks without IDs (or partial transitions). I'm pretty confident that this should not affect the performance for notebooks with cell IDs, but it might affect performance for mixed notebooks (not sure if positively or negatively).

Note that both the current and the previous implementation allow cells to be considered equal even if they have different ID fields. @minrk suggested at JupyterCon that this should be safe in the case where no cell IDs match between the two sides, but it might not be safe in other scenarios, so maybe that should be the only scenario where the non-ID comparison is invoked? I.e. if a base notebook with IDs branches, and the two sides each add the exact same cell apart from the ID, nbdime should not consider them equal. Formulated as code:

if {a.id for a in notebook_A} & {b.id for b in notebook_B}:  # check for intersection of IDs
    notebook_predicates['cells'] = [ compare_cell_by_ids ]  # only ID comparison is considered
else:
    notebook_predicates['cells'] = [  # IDs have no impact here, so ignore and fall back on old implementation
        compare_cell_approximate,
        compare_cell_moderate,
        compare_cell_strict,
    ]

I would be very happy to have opinions on either of the above points!
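
For readers who want to experiment with the proposal above, a self-contained variant of the snippet might look like this. It is a sketch only: notebooks are assumed to be plain dicts with a `cells` list, the body of `compare_cell_by_ids` is an assumption (only its name appears above), and `select_cell_predicates` and `ids_of` are hypothetical helpers rather than nbdime functions.

```python
from typing import Callable, List, Set


def ids_of(notebook: dict) -> Set[str]:
    """Collect the IDs of all cells that have one."""
    return {cell["id"] for cell in notebook.get("cells", []) if "id" in cell}


def compare_cell_by_ids(a: dict, b: dict) -> bool:
    """Assumed behaviour: cells match only if both have the same ID."""
    return a.get("id") is not None and a.get("id") == b.get("id")


def select_cell_predicates(
    notebook_a: dict,
    notebook_b: dict,
    fallback_predicates: List[Callable[[dict, dict], bool]],
) -> List[Callable[[dict, dict], bool]]:
    """Use ID comparison only when the two notebooks share at least one ID."""
    if ids_of(notebook_a) & ids_of(notebook_b):
        return [compare_cell_by_ids]
    # No shared IDs: IDs carry no information, fall back to content-based predicates.
    return list(fallback_predicates)
```

The content-based predicates are passed in as `fallback_predicates` so the sketch does not depend on nbdime being installed.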

vidartf · 2023-07-07T18:13:54Z

> This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

> Another common habit which will lead to this is when working with e.g. visualisation, if a user duplicates the cell, modifies the code, runs the other cell, compares visualisation results and depending on satisfaction removes one or the other.

These arguments are probably why I would lean towards the current behavior of allowing identical cells with differing IDs to match, but there might be counter-arguments that I haven't considered(?).

krassowski · 2023-08-13T15:18:58Z

> I'm pretty confident that this should not affect the performance for notebooks with cell IDs

For the most part this is true, except for the not very realistic scenario of a notebook with cells randomly reshuffled, in which case there is still a performance boost by a factor of two (rather than six as before).

Scenario	Before (mean ± σ)	After (mean ± σ)	Speedup
no changes	1.133 s ± 0.056 s	469.5 ms ± 14.3 ms	x 2.4
content of last cell changed	10.442 s ± 0.442 s	2.357 s ± 0.027 s	x 4.4
content of first cell changed	10.717 s ± 0.843 s	2.371 s ± 0.071 s	x 4.5
last cell added	10.735 s ± 0.748 s	2.404 s ± 0.089 s	x 4.5
first cell added	13.872 s ± 1.710 s	2.391 s ± 0.091 s	x 5.8
cells randomly reshuffled	16.925 s ± 0.232 s	8.568 s ± 0.075 s	x 2.0
content of every cell changed	387.859 s ± 18.706 s	2.558 s ± 0.107 s	x 151.6

krassowski

Not sure whether there is anything else you would like to do here, but it seems good to go for me. If you believe this needs further user testing, maybe it would be a good idea to merge it together with JupyterLab 4.0 port and release an alpha/beta for more users to test out?

vidartf · 2023-09-08T10:54:45Z

> Not sure whether there is anything else you would like to do here, but it seems good to go for me. If you believe this needs further user testing, maybe it would be a good idea to merge it together with JupyterLab 4.0 port and release an alpha/beta for more users to test out?

I was waiting on input on vidartf#3, but I can merge and iterate on that PR.

fcollonval · 2023-10-26T09:21:21Z

bot please update playwright snapshots

fcollonval · 2023-10-26T09:22:47Z

This works for notebook diff (tested on example8) but not on the merge view (tested on example8).

I don't have time to address the merge case.

github-actions · 2023-10-26T09:51:28Z

Playwright windows-latest snapshots updated.

krassowski · 2023-10-26T15:37:12Z

So this PR does not solve the pre-existing problem with merge (#690) but still provides a notable performance benefit. In my opinion we should merge it and include it in the upcoming release.

Any objections @vidartf?

In the strictest check, a cell without an ID is never considered equal to a cell with an ID, but the two can be matched in the two less-strict comparisons. I.e. if a notebook ends up with only some cells having IDs, and IDs are added to the remaining cells later, then in the first pass the cells that already had IDs are matched to each other, and only in the next passes are the cells that gained IDs compared with their old ID-less versions.

vidartf · 2023-10-27T15:35:10Z

Note: While I have a fix for the failure locally, we still don't have a way to visualize a conflict on cell IDs. The changes in vidartf#3 only display the change in the diff case. I can add a placeholder for now: "conflict on cell ID, use raw text editor to resolve".

krassowski · 2023-10-27T15:39:33Z

Once you push your changes I can thread the cell IDs through the merge case too, though I am not sure about the design.

vidartf · 2023-10-27T16:23:57Z

@krassowski pushed changes here. I expect some UI tests will need re-rendering, but I will let you have a look at improvements first (or maybe in a follow-up PR, as branches here seem to be getting rebased, so new PRs are probably less messy).

vidartf · 2023-10-30T18:32:17Z

bot please update playwright snapshots

github-actions · 2023-10-30T18:43:30Z

Playwright windows-latest snapshots updated.

github-actions · 2023-10-30T18:50:37Z

Playwright ubuntu-22.04 snapshots updated.

vidartf · 2023-11-01T11:30:03Z

bot please update playwright snapshots

github-actions · 2023-11-01T11:34:38Z

Playwright ubuntu-22.04 snapshots updated.

github-actions · 2023-11-01T11:42:58Z

Playwright windows-latest snapshots updated.

vidartf · 2023-11-01T12:18:43Z

@krassowski merged. Let's do any further improvement for cell ID conflicts display/resolution in a follow-up PR.

vidartf closed this May 10, 2023

vidartf reopened this May 10, 2023

vidartf force-pushed the cellid-diff branch from a672cb3 to d8f5300 Compare May 10, 2023 12:47

krassowski mentioned this pull request Jun 10, 2023

Expose cell ID in the UI vidartf/nbdime#3

Merged

vidartf marked this pull request as ready for review July 7, 2023 18:23

fcollonval mentioned this pull request Jul 11, 2023

Weekly Team Meetings: Jul-Dec 2023 jupyterlab/frontends-team-compass#205

Closed

krassowski approved these changes Aug 26, 2023

krassowski mentioned this pull request Sep 6, 2023

Help support by increasing the maintainer list #670

Closed

vidartf force-pushed the cellid-diff branch from bf9742b to 0c721b5 Compare September 8, 2023 11:55

krassowski mentioned this pull request Sep 22, 2023

Merge does not work for some notebooks #690

Closed

fcollonval mentioned this pull request Oct 18, 2023

Migration to jupyterlab 4.0 #659

Closed

fcollonval force-pushed the cellid-diff branch from 0c721b5 to 72dd1bd Compare October 26, 2023 09:19

vidartf added 7 commits October 27, 2023 09:51

Add DiffConfig

7eb83fb

An object that encapsulates differs and predicates, and also the new "is_atomic", which is added so that we can mark e.g. cell IDs as atomic. DiffConfig could potentially also be used in the future to better config notebook ignores etc?
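
As a rough mental model of such a configuration object, consider the sketch below. It is not nbdime's DiffConfig: the class name `SketchDiffConfig`, its fields, and the `/cells/*/id` path notation are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class SketchDiffConfig:
    """Hypothetical stand-in bundling predicates, differs and atomic paths."""

    # Per-path lists of comparison predicates, tried from loosest to strictest.
    predicates: Dict[str, List[Callable[[Any, Any], bool]]] = field(default_factory=dict)
    # Per-path diff functions.
    differs: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    # Paths whose values are replaced wholesale instead of being diffed internally.
    atomic_paths: Set[str] = field(default_factory=set)

    def is_atomic(self, path: str) -> bool:
        """Return True if the value at this path should be treated as indivisible."""
        return path in self.atomic_paths


# Mark cell IDs as atomic so that a changed ID is reported as one replacement
# rather than as a character-level string diff.
config = SketchDiffConfig(atomic_paths={"/cells/*/id"})
```

Keeping the three concerns in one object means a single argument can be threaded through recursive diff calls instead of several parallel ones.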

Optimize source diffing

028b618

When comparing identical sources we can avoid a full diff by a quick equality check.
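
The idea of this optimization can be illustrated with the standard library. The sketch below uses `difflib.SequenceMatcher` as a stand-in differ and a hypothetical function name `diff_source`; nbdime's own diff routines and output format are different.

```python
import difflib
from typing import List, Tuple


def diff_source(a: str, b: str) -> List[Tuple[str, int, int, int, int]]:
    """Return line-based opcodes describing how to turn a into b.

    An empty list means the two sources are identical.
    """
    # Quick equality check: skip the comparatively expensive diff entirely.
    if a == b:
        return []
    a_lines = a.splitlines(keepends=True)
    b_lines = b.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    # Keep only the opcodes that describe an actual change.
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]
```

String equality is a single linear scan, so for notebooks where most cells are unchanged the early return avoids nearly all of the diffing work.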

Optimize mime diff

89dfb23

Avoid recursing for string mimetypes that are identical

Skip diffing if string are equal

496df57

Fix source string diffing

1b318e6

Always use line based diffing for source.

Fix tests for cell IDs

17caf86

Enable previously not working test

vidartf added 5 commits October 27, 2023 17:21

Add capability to output merge decisions to file

51b0571

Good for collecting debug data on merge decisions.

whitelist -> allowlist

70670f7

Add bare minimum cell ID merge conflict support

a6b7b65

Better handle merge processing error

15e6be4

useful command to build all ts

37e2fb2

Update tests with new behavior

9630642

Adds tests for the recently added ability to output merge decisions as raw JSON. As the tests indicate, this is a breaking behavior, but this is a major version release, so should be ok.

Update Playwright Snapshots

cc0be95

Update Playwright Snapshots

e8e2dff

vidartf closed this Oct 30, 2023

vidartf reopened this Oct 30, 2023

Remove ID conflict for UI test merge_test1

e247fea

This is needed to have download test work (assumes no conflicts).

Update Playwright Snapshots

efbcdff

Update Playwright Snapshots

08636b8

Fix mergeview gutter click buttons

9884381

Click should only register as final after mouse up with primary button.

vidartf merged commit c1ea9b9 into jupyter:master Nov 1, 2023
14 checks passed

vidartf deleted the cellid-diff branch November 1, 2023 12:18

renonat mentioned this pull request Nov 21, 2023

Incompatible with nbdime version 4.0.0+ chrisjsewell/pytest-notebook#59

Closed

krassowski mentioned this pull request Nov 27, 2023

nbdime 4.x: TypeError: diff() got an unexpected keyword argument 'predicates' #741

Closed

jupyter deleted a comment from Cloudflare-d Oct 22, 2024
