Ability to convert Word into Coradoc (and to adoc) #115
Comments
We already kinda support this with w2a: https://github.com/metanorma/coradoc/blob/main/exe/w2a This could be amended and an API exposed. Related issues: #100, #64
w2a worked poorly because it was actually a Word => HTML => AsciiDoc conversion, which means we lose plenty of information. We now have a document to process from Word directly to AsciiDoc. Let's use this opportunity to make Coradoc work.
@ronaldtse I have yet to evaluate that in full; as of now I have no idea how much data is lost. (My assumption, at least from my experience long ago, is that Microsoft Word made sure the generated HTML file is still editable.) But another experience of mine was that I had to choose a format for postprocessing: either ODT or DOCX. DOCX felt mostly like a dump of Microsoft Word's memory: very verbose and hard to work with. By comparison, ODT felt much more like HTML; it was a well-designed format. Even if we use a

Below, I will show the result of my basic test of creating a simple test document in LibreOffice. I have extracted the exact fragment that corresponds to the document structure, as that's what we're most interested in.

This is ODT:

And this is DOCX:

To compare, below is the entire document converted to HTML:

Of course, this document has been generated carefully. I have, for instance, used the correct button in LibreOffice to generate a heading. I can assume that won't be the case all the time.

My proposal would be to:
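To make the structural difference concrete, here is a minimal Ruby sketch that reads a heading and a paragraph from each format and emits AsciiDoc. The two XML fragments are hand-written illustrations, not the fragments from the test document above, and the helper names (`docx_to_adoc`, `odt_to_adoc`, `plain_text`) are hypothetical. It assumes the DOCX heading style ID has the form `Heading1`, which real documents do not guarantee, and it ignores everything except headings and plain paragraphs.

```ruby
require 'rexml/document'

# Namespace URIs defined by the OOXML and OpenDocument specifications.
W_NS    = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'

# Hypothetical fragment in the style of word/document.xml.
DOCX_FRAGMENT = <<~XML
  <w:body xmlns:w="#{W_NS}">
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Scope</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>This is </w:t></w:r>
      <w:r><w:t>body text.</w:t></w:r>
    </w:p>
  </w:body>
XML

# Hypothetical fragment in the style of content.xml.
ODT_FRAGMENT = <<~XML
  <office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
               xmlns:text="#{TEXT_NS}">
    <text:h text:outline-level="1">Scope</text:h>
    <text:p>This is body text.</text:p>
  </office:text>
XML

# DOCX: every block is a w:p; "heading-ness" lives in a style reference,
# and the text is split across runs (w:r/w:t) that must be re-joined.
def docx_to_adoc(xml)
  doc = REXML::Document.new(xml)
  ns = { 'w' => W_NS }
  REXML::XPath.match(doc, '//w:p', ns).map do |para|
    text = REXML::XPath.match(para, './/w:t', ns).map(&:text).join
    style = REXML::XPath.first(para, 'w:pPr/w:pStyle', ns)
    style_id = style && style.attributes.get_attribute_ns(W_NS, 'val')&.value
    # Assumption: heading styles are named "Heading<N>".
    level = style_id.to_s[/\AHeading(\d+)\z/, 1]
    level ? "#{'=' * (level.to_i + 1)} #{text}" : text
  end.join("\n\n")
end

# Concatenate all text beneath a node, descending into child elements.
def plain_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then plain_text(child)
    else ''
    end
  end.join
end

# ODT: headings and paragraphs are distinct elements (text:h, text:p),
# and the heading level is an attribute on the element itself.
def odt_to_adoc(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.to_a.filter_map do |node|
    next unless node.namespace == TEXT_NS
    case node.name
    when 'h'
      attr = node.attributes.get_attribute_ns(TEXT_NS, 'outline-level')
      level = attr ? attr.value.to_i : 1
      "#{'=' * (level + 1)} #{plain_text(node)}"
    when 'p'
      plain_text(node)
    end
  end.join("\n\n")
end

puts docx_to_adoc(DOCX_FRAGMENT)
puts odt_to_adoc(ODT_FRAGMENT)
```

Both calls print the same two AsciiDoc blocks (`== Scope` followed by the paragraph), but the DOCX path has to resolve a style reference and stitch runs together, while the ODT path reads the element name and one attribute.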
Thank you for the investigation. Unfortunately, most users use DOCX, not ODT, and we definitely cannot get people to migrate from DOCX to ODT. If we do DOCX, we should do it right, instead of using ODT or the XSLT stylesheet (as used to convert to HTML), because there are semantic losses. I actually believe that lutaml-model will make parsing and working with a DOCX much easier than using the
As I understand, by writing our own rules for handling DOCX.
But, for implementation, we could possibly keep LibreOffice, just to use it to convert from DOCX to ODT (instead of converting to HTML, as we do now), therefore supporting both formats. Since both formats are, from what I know, semantically interchangeable, this wouldn't hamper the task, but would make the implementation simpler.
And that could be an alternative to using LibreOffice for that conversion in the future.
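As a rough sketch of the DOCX to ODT step mentioned above: LibreOffice can be driven headlessly from Ruby. This assumes LibreOffice is installed and its `soffice` binary is on the PATH; the helper names (`soffice_convert_command`, `odt_output_path`, `convert_docx_to_odt`) are hypothetical and not part of Coradoc.

```ruby
# Build the argument list for a headless LibreOffice conversion to ODT.
# Passing an array to Kernel#system avoids shell quoting problems.
def soffice_convert_command(input_path, outdir:, binary: 'soffice')
  [binary, '--headless', '--convert-to', 'odt', '--outdir', outdir, input_path]
end

# LibreOffice writes the result into outdir, keeping the input's base name.
def odt_output_path(input_path, outdir:)
  File.join(outdir, "#{File.basename(input_path, '.*')}.odt")
end

# Run the conversion and return the path of the generated ODT file.
# Raises if soffice is missing or exits with a non-zero status.
def convert_docx_to_odt(input_path, outdir:)
  cmd = soffice_convert_command(input_path, outdir: outdir)
  system(*cmd) or raise "LibreOffice conversion failed: #{cmd.join(' ')}"
  odt_output_path(input_path, outdir: outdir)
end
```

The resulting ODT file is a ZIP archive whose `content.xml` holds the document body, so an ODT reader could then serve both input formats.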
@hmdne You are currently going through the same process I went through 6 years ago. See the readme on https://github.com/strogonoff/reverse_asciidoctor (since my original reverse_adoc readme appears to have been memory-holed). reverse_adoc, which you are now reimplementing, decided to use ODT HTML rather than DOCX HTML, precisely because its HTML was much neater and closer to the semantics. Ronald does not want to pursue this approach, and he does not want the dependence on LibreOffice. He wants to implement this directly from the object model, with a serialiser currently under development.
https://github.com/metanorma/html2doc/wiki/Why-not-docx%3F, authored around the same time. I did my own survey of Node-based authoring tools for my day job last year; they had more features than what I found in Ruby in 2018, but most features are still only available if you pay for them, and from what I have seen, I have little confidence that they will ever cope with the complexity of stuff Metanorma expects in a Word document. Even colouring table cells turned out to be surprisingly difficult with Node tools. The direction this is clearly heading towards is using a full Word SDK: Word-authoring gems, which assume nothing more complex than an image, simply won't cut it, and neither will naive serialisers solve the semantic complexities of OOXML. And that means fully understanding the OOXML spec, if you're going to use a Word SDK. Sebastian Sauvage was right when he wrote, 10 years ago,
(You will find, when you scrutinise real OOXML in documents, how right he is. And how underdocumented Word formatting actually is: there is no documentation of their Word CSS in MHT at all, you can only update it through trial and error.)
@opoudjis What I propose is to use ODT directly, not ODT HTML. Compared to the latter, the former preserves the semantics, and it is much closer to HTML in terms of readability than DOCX is. I assume this should be a fairly straightforward task, at least to get parity with what we currently have with ODT HTML.
Could use the docx gem.