Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle insertions and deletions (<ins> and <del>) consistently across readers/writers #1560

Open
mfenner opened this issue Aug 24, 2014 · 11 comments

Comments

@mfenner
Copy link

mfenner commented Aug 24, 2014

The new docx reader converts docx track changes into HTML. I suggest the following changes:

  • use <ins> and <del> instead of <span class="insertion"> and `", as they are standard HTML tags and will render in most browsers.
  • don't use a wrapper <span> including author and date for combined insertions/deletions, rather use author and date attributes on both the <ins> and <del> tag.
@mpickering
Copy link
Collaborator

This is probably desirable but technically difficult to get right because
there isn't a specific place in the AST for insertions and deletions.

On Sun, Aug 24, 2014 at 2:17 PM, Martin Fenner [email protected]
wrote:

The new docx reader converts docx track changes into HTML. I suggest the
following changes:

  • use and instead of and `", as
    they are standard HTML tags and will render in most browsers.
  • don't use a wrapper including author and date for combined
    insertions/deletions, rather use author and date attributes on both the
    and tag.


Reply to this email directly or view it on GitHub
#1560.

@jgm
Copy link
Owner

jgm commented Aug 24, 2014

We have the Spans in the AST. I think it would just be a matter of changing
the HTML writer so that it renders these spans with <del> and <ins>
rather than the spans. Probably the author and date attributes should
be included on whatever tags are used.

+++ mpickering [Aug 24 14 06:39 ]:

This is probably desirable but technically difficult to get right
because
there isn't a specific place in the AST for insertions and deletions.
On Sun, Aug 24, 2014 at 2:17 PM, Martin Fenner
[email protected]
wrote:

The new docx reader converts docx track changes into HTML. I suggest
the
following changes:

  • use and instead of and `", as
    they are standard HTML tags and will render in most browsers.
  • don't use a wrapper including author and date for combined
    insertions/deletions, rather use author and date attributes on both
    the
    and tag.


Reply to this email directly or view it on GitHub
#1560.


Reply to this email directly or [1]view it on GitHub.

References

  1. Handle insertions and deletions (<ins> and <del>) consistently across readers/writers #1560 (comment)

@mpickering
Copy link
Collaborator

I get a bit worried about the sustainability of this approach in general especially as it causes unexpected results in a user has "insertion" class but also because you lose type guarantees and it makes code more difficult to understand in someone is unfamiliar.

@jgm
Copy link
Owner

jgm commented Aug 24, 2014

It does add some complexity. @mfenner, why not just add some
CSS rules for these span elements, so they're displayed as
you like?

+++ mpickering [Aug 24 14 08:40 ]:

I get a bit worried about the sustainability of this approach in
general especially as it causes unexpected results in a user has
"insertion" class but also because you lose type guarantees and it
makes code more difficult to understand in someone is unfamiliar.


Reply to this email directly or [1]view it on GitHub.

References

  1. Handle insertions and deletions (<ins> and <del>) consistently across readers/writers #1560 (comment)

@jgm
Copy link
Owner

jgm commented Mar 9, 2017

One way to do this without an AST change would be to special-case these spans in the HTML writer.

@mb21
Copy link
Collaborator

mb21 commented Mar 9, 2017

One way to do this without an AST change would be to special-case these spans in the HTML writer.

I think a dedicated element would work better in the long run.. considering for example how the image/figure hack turned out, I think it's better to include more AST elements.

@mb21 mb21 changed the title Use <ins> instead of <span class="insertion"> for docx reader track changes Handle insertions and deletions (<ins> and <del>) consistently across readers/writers Nov 5, 2018
@mb21 mb21 removed the format:Docx label Nov 5, 2018
@mb21
Copy link
Collaborator

mb21 commented Nov 5, 2018

I've broadened the scope of this issue. If we decide to handle insertions/deletions better across pandoc, there are a few formats to consider.

In HTML and markdown, pandoc 2.4 currently does:

$ echo '<ins>foo</ins>' | pandoc -f html -t native
[Plain [Span ("",["underline"],[]) [Str "foo"]]]

$ echo '<del>foo</del>' | pandoc -f html -t native
[Plain [Strikeout [Str "foo"]]]

$ echo '~~foo~~' | pandoc -f markdown -t native
[Para [Strikeout [Str "foo"]]]

Looking at CriticMarkup, from this closed issue:

critic markup HTML LaTeX
{--[text]--} <del>[text]</del> \st{[text]}
{++[text]++} <ins>[text]</ins> \underline{[text]}
{~~[text1]~>[text2]~~} <del>[text1]</del><ins>[text2]</ins> \st{[text1]}\underline{[text2]}
{==[text]==} <mark>[text]</mark> \hl{[text]}
{>>[text]<<} <aside>[text]</aside> \marginpar{[text]}

See also the LaTeX changes package.

And of course the existing docx --track-changes option. (In the code, look for AcceptChanges in the Docx reader.)


If we go with an AST change, it might seem like we simply could:

  • replace the (non-semantic) Strikeout element with a Deleted element
  • add an Inserted element

However, Strikeout is an Inline element, so it's not clear how to handle changes that span multiple blocks. From pandoc-discuss:

It's not that easy. What kind of native output would this produce?

- My first list item. 
- My second {-- list item. 
- My third --} combined list item. 

Presumably a list with three items, the second and third of which
contain these special StartDelete and StopDelete markers.
But that doesn't tell you the structure you want after applying
the edits -- which is a list with two items. Special logic
for doing this sort of transformation would need to be included
somewhere. And there are much more complex cases I could come up
with.

It seems the fundamental question is whether this should be modelled in the pandoc AST like the HTML <ins> and <del> elements (which are block-level elements, but nonetheless force you to serialize the changes to a tree), or whether it should be more like a plain-text diff kind of thing, where we have starting and ending markers that can cross nodes in the pandoc AST (that's more what CiriticMarkup does). But if it's the second, you can just use an external preprocessor to diff your changes and handle the markers (like pancritic), and pandoc wouldn't have to have anything to do with it.

So, input welcome on how different formats handle this. Especially, I'm unclear on how docx --track-changes and LaTeX handle:

  • changes that span multiple blocks
  • changes that are not contained in any one node of the document tree.

@davidar
Copy link
Contributor

davidar commented Apr 6, 2019

It's not that easy. What kind of native output would this produce?

- My first list item. 
- My second {-- list item. 
- My third --} combined list item. 

As mentioned in that thread, the CriticMarkup spec actually recommends against these kinds of AST-breaking changes:

Avoid Newlines in CriticMarkup

Wrap Markdown Tags Completely

While it may support incomplete Markdown tags in the future, the CriticMarkup processor currently chokes on them. Avoid this:

I really love *italic {~~fonts*~>font-styles*~~}.

Instead, wrap the asterisks completely:

I really love {~~*italic fonts*~>*italic font-styles*~~}.

The one exception to this seems to be for insertion/deletion of paragraph breaks, I'm not sure if there's a clean way to handle that case.

@ickc
Copy link
Contributor

ickc commented Apr 9, 2019

Don’t read too much into the current CriticMarkup “spec” though. It is not a proper parser but just a bunch of regex. (FYI I’m maintaining a fork of the reference implementation of CriticMarkup in the form of pancritic which provides Python 3 support among other things.)

Multiline (ie Newlines crossing) shouldn’t be a big problem but “boundary crossing” between markdown markups and CriticMarkup is.

If it is decided to support CriticMarkup (as a syntax, not its spec) in pandoc then we got to eventually decided on how it should behaves (ie a spec) which will most likely different from current one.

@mgajda
Copy link

mgajda commented Apr 30, 2019

Is it much work to spec out these issues? CriticMarkup is becoming a standard among Markdown editors, but it is difficult to pass it through standard pandoc.

@ttxtea
Copy link

ttxtea commented Jun 16, 2021

I would argue for an AST based pandoc intrinsic implementation. Thanks to ickc the preporcessor variant has been around for some while, but it could gain wider audience if it was just a pandoc filter like the docx->criticmarkup lua filter [1]. The multi author writing process could benefit a lot from round trips like docx->criticmarkup->docx

[1] https://gist.github.com/noamross/12e67a8d8d1fb71c4669cd1ceb9bbcf9

Also criticmarkup's comments seem more like an <aside>, while docx comments can refer to a whole body of text and are translated to [This is a Comment]{.start-comment}highlighted text[]{.end-comment} by --track-changes. Someone has proposed something like ?[highlighted text](This is a comment) instead of {>>This is a comment<<}. No idea if that is really necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants