Ability to convert Word into Coradoc (and to adoc) #115

Open
ronaldtse opened this issue Jul 11, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@ronaldtse
Contributor

Could use the docx gem.

@hmdne
Contributor

hmdne commented Jul 11, 2024

We already kinda support this with w2a: https://github.com/metanorma/coradoc/blob/main/exe/w2a

This could be amended and an API exposed. Related issues: #100, #64

@ronaldtse
Contributor Author

w2a worked poorly because it was actually a Word => HTML => AsciiDoc conversion, which means we lose plenty of information.

We now have a document to process from Word directly to AsciiDoc. Let's use this opportunity to make Coradoc work.

@hmdne
Contributor

hmdne commented Aug 28, 2024

@ronaldtse I have yet to evaluate that in full, and as of now I have no idea how much data is lost. My assumption, based on experience from a long time ago, is that Microsoft Word at least made sure the generated HTML file remained editable.

Another experience of mine: I once had to choose a format for postprocessing, either ODT or DOCX. DOCX felt mostly like a dump of Microsoft Word's memory: very verbose and hard to work with. By comparison, ODT felt much more like HTML, a well-designed format. Even if we use the docx gem, I feel we would have to heavily amend it for this task; for now, it looks like it only supports the most basic elements.

Below is the result of a basic test, in which I created a simple test document in LibreOffice:

image

I have extracted the exact fragment that corresponded to document structure, as that's what we're most interested in. This is ODT:

image

And this is DOCX:

image
image

To compare, below is the entire document converted to HTML:

image

Of course, this document was generated carefully. For instance, I used the proper LibreOffice button to create a heading. I assume that won't be the case all the time.
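
Since the screenshots are not reproduced here, the following is a minimal sketch of the structural difference being described, using hand-written fragments rather than the actual test document. It assumes the usual markup: ODT marks a heading with a `text:h` element carrying `text:outline-level`, while DOCX uses an ordinary `w:p` paragraph whose `w:pStyle` names a heading style. The style ID pattern `Heading1` is an assumption (style IDs vary by template and locale), and the helper names are hypothetical, not part of Coradoc or the docx gem.

```ruby
require "rexml/document"

ODT_NS  = { "text" => "urn:oasis:names:tc:opendocument:xmlns:text:1.0" }.freeze
DOCX_NS = { "w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main" }.freeze

# Hand-written fragments for illustration only.
ODT_SAMPLE = <<~XML
  <office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
               xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <text:h text:outline-level="1">Scope</text:h>
    <text:p>Body text.</text:p>
  </office:text>
XML

DOCX_SAMPLE = <<~XML
  <w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Scope</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Body text.</w:t></w:r></w:p>
  </w:body>
XML

# AsciiDoc reserves a single "=" for the document title, so a level-1
# section heading becomes "==" (an assumption about the desired mapping).
def adoc_heading(level, title)
  "#{'=' * (level + 1)} #{title}"
end

# ODT: the element name itself says "heading"; the level is an attribute.
def odt_headings(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//text:h", ODT_NS).map do |h|
    adoc_heading(h.attributes["text:outline-level"].to_i, h.text.to_s)
  end
end

# DOCX: every block is a w:p; a heading is recognised only via its style,
# and its text has to be reassembled from the w:t runs.
def docx_headings(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//w:p", DOCX_NS).filter_map do |para|
    style = REXML::XPath.first(para, "w:pPr/w:pStyle", DOCX_NS)
    style_id = style ? style.attributes["w:val"].to_s : ""
    level = style_id[/\AHeading(\d)\z/, 1]
    next unless level

    text = REXML::XPath.match(para, ".//w:t", DOCX_NS).map(&:text).join
    adoc_heading(level.to_i, text)
  end
end

puts odt_headings(ODT_SAMPLE)
puts docx_headings(DOCX_SAMPLE)
```

Both functions yield the same AsciiDoc heading for these fragments; the DOCX path simply needs more indirection (style lookup, run concatenation) to get there.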

My proposal would be to:

  1. Attempt work on "Update implementation to be able to transform the ISO Simple Template docx" (#87) using w2a, the current solution.
  2. If we can proceed with that issue without many problems, let's keep w2a as-is and scrap the idea of handling the DOCX format directly.

@ronaldtse
Contributor Author

Thank you for the investigation. Unfortunately, most users use DOCX, not ODT, and we definitely cannot get people to migrate from DOCX to ODT.

If we do DOCX, we should do it right, instead of using ODT or the XSLT stylesheet (as to convert to HTML), because there are semantic losses.

I actually believe that lutaml-model will make parsing and working with a DOCX much easier than using the docx gem itself.

@hmdne
Contributor

hmdne commented Sep 13, 2024

> I actually believe that lutaml-model will make parsing and working with a DOCX much easier than using the docx gem itself.

As I understand, by writing our own rules for handling DOCX.

> Unfortunately, most users use DOCX, not ODT, and we definitely cannot get people to migrate from DOCX to ODT.

But for the implementation, we could keep LibreOffice and use it only to convert DOCX to ODT (instead of to HTML, as we do now), thereby supporting both formats. Since both formats are, as far as I know, semantically interchangeable, this wouldn't hamper the task and would make the implementation simpler.

> If we do DOCX, we should do it right, instead of using ODT or the XSLT stylesheet (as to convert to HTML), because there are semantic losses.

And that could be an alternative to using LibreOffice for that conversion in the future.
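
The DOCX-to-ODT step suggested above could be sketched as follows. This is only an illustration: it assumes a LibreOffice binary named `soffice` is on the `PATH`, and relies on its `--headless`, `--convert-to` and `--outdir` command-line options. The function names are hypothetical and are not part of w2a or Coradoc.

```ruby
# Build the argument list for a headless LibreOffice conversion.
# Kept separate from execution so it can be inspected without LibreOffice.
def docx_to_odt_command(docx_path, outdir, soffice: "soffice")
  [soffice, "--headless", "--convert-to", "odt", "--outdir", outdir, docx_path]
end

# LibreOffice writes the result into outdir under the input's base name.
def odt_path_for(docx_path, outdir)
  File.join(outdir, "#{File.basename(docx_path, '.*')}.odt")
end

# Run the conversion and return the path of the resulting ODT file.
def convert_docx_to_odt(docx_path, outdir)
  ok = system(*docx_to_odt_command(docx_path, outdir))
  raise "LibreOffice conversion failed for #{docx_path}" unless ok

  odt_path_for(docx_path, outdir)
end

# Usage (requires LibreOffice to be installed):
#   odt = convert_docx_to_odt("in/report.docx", "out")
```

The resulting `.odt` file is a ZIP archive whose `content.xml` holds the document structure, which is what a direct ODT front end would read.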

@opoudjis
Contributor

@hmdne You are currently going through the same process I went through 6 years ago. See the readme on

https://github.com/strogonoff/reverse_asciidoctor

(Since my original reverse_adoc readme appears to have been memory-holed.)

reverse_adoc, which you are now reimplementing, had decided to use ODT HTML rather than DOCX HTML, precisely because its HTML was much neater and closer to the semantics.

Ronald does not want to pursue this approach, and he does not want the dependence on LibreOffice. He wants to implement this directly from the object model, with a serialiser currently under development.

@opoudjis
Contributor

opoudjis commented Sep 17, 2024

> Even if we use a docx gem, I feel like we would have to heavily amend it for this task - for now, it looks like it only supports the most basic elements.

See https://github.com/metanorma/html2doc/wiki/Why-not-docx%3F, authored around the same time. I did my own survey of Node-based authoring tools for my day job last year; they had more features than what I found in Ruby in 2018, but most features are still only available if you pay for them, and from what I have seen, I have little confidence that they will ever cope with the complexity of stuff Metanorma expects in a Word document. Even colouring table cells turned out to be surprisingly difficult with Node tools.

The direction this is clearly heading towards is using a full Word SDK: Word-authoring gems, which assume nothing more complex than an image, simply won't cut it, and neither will naive serialisers solve the semantic complexities of OOXML. And that means fully understanding the OOXML spec, if you're going to use a Word SDK.

Sebastian Sauvage was right when he wrote, 10 years ago,

> BANNED. I don’t have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !

(You will find, when you scrutinise real OOXML in documents, how right he is. And how underdocumented Word formatting actually is: there is no documentation of their Word CSS in MHT at all, you can only update it through trial and error.)

@hmdne
Contributor

hmdne commented Sep 29, 2024

@opoudjis
That's the current approach: we take DOCX, convert it to HTML using LibreOffice, and then run it through the existing HTML pipeline. We are not reimplementing reverse_adoc; for the most part, we have split the existing code into two parts: one converts HTML to a Coradoc tree, the other converts that tree to AsciiDoc.

What I propose is to use ODT directly, not ODT HTML. Unlike ODT HTML, ODT itself preserves the semantics, and it is much closer to HTML in terms of readability than DOCX is. I assume this should be a fairly straightforward task, at least to reach parity with what we currently have with ODT HTML.
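
To illustrate the two-part split described above, here is a minimal sketch of a front end that builds an intermediate tree and a back end that serialises it to AsciiDoc. The `Heading` and `Paragraph` structs and both function names are hypothetical stand-ins; Coradoc's real element classes and API are not shown in this thread and will differ. The point is only that a second front end (for ODT, say) could feed the same serialiser.

```ruby
require "rexml/document"

# Hypothetical intermediate nodes, standing in for a Coradoc-like tree.
Heading   = Struct.new(:level, :text)
Paragraph = Struct.new(:text)

# Stage 1: input format -> tree. Only this stage knows about HTML.
# The input here must be well-formed XML, a simplification for the sketch.
def html_to_tree(html)
  body = REXML::Document.new(html).root
  REXML::XPath.match(body, "*").filter_map do |el|
    case el.name
    when /\Ah([1-6])\z/ then Heading.new(Regexp.last_match(1).to_i, el.text.to_s.strip)
    when "p"            then Paragraph.new(el.text.to_s.strip)
    end # unknown elements are dropped in this sketch
  end
end

# Stage 2: tree -> AsciiDoc. This stage knows nothing about the input format.
def tree_to_adoc(nodes)
  blocks = nodes.map do |node|
    case node
    when Heading   then "#{'=' * node.level} #{node.text}"
    when Paragraph then node.text
    end
  end
  "#{blocks.join("\n\n")}\n"
end

HTML_SAMPLE = "<body><h1>Scope</h1><p>Body text.</p></body>"
puts tree_to_adoc(html_to_tree(HTML_SAMPLE))
```

Under this design, supporting ODT directly would mean writing another stage-1 function that returns the same node types, leaving stage 2 untouched.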

@hmdne hmdne mentioned this issue Sep 30, 2024
7 tasks
Development

No branches or pull requests

4 participants