`w2a` uses LibreOffice to export Word to HTML, change to use LibreOffice XSLT directly #84

ronaldtse · 2019-11-21T19:23:08Z

I've extracted out LibreOffice's Word related XSLTs here:

https://github.com/metanorma/ooo-word-xslt

This task is to use these XSLT files to transform Word -> HTML directly, without needing to install LibreOffice.

ronaldtse · 2019-11-25T03:51:32Z

This is for @w00lf . Thanks!

ronaldtse · 2019-11-26T13:15:39Z

@w00lf could you please help make this work? Thanks!

w00lf · 2019-11-26T14:16:25Z

> @w00lf could you please help make this work? Thanks!

Sure, let me look more closely at what can be used here.

ronaldtse · 2019-11-26T18:36:38Z

@w00lf I think you can use Nokogiri to run XSLT (which runs libxslt underneath) to transform Word to HTML

e.g. http://craftingruby.com/posts/2014/01/14/transforming-xml-in-ruby-with-xslt-and-nokogiri.html

opoudjis · 2019-11-28T01:57:42Z

Yes, and if you want an example of our existing stack doing this, see the use of XSLT in the html2doc gem: https://github.com/metanorma/html2doc/blob/master/lib/html2doc/math.rb

opoudjis · 2019-11-28T01:58:33Z

I've unassigned myself, but of course please let me know how this goes.

ronaldtse · 2019-11-28T04:02:36Z

Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.

w00lf · 2019-12-01T20:19:48Z

> Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.

Hi there, @ronaldtse. I tried to use the LibreOffice XSLT files (https://github.com/metanorma/ooo-word-xslt) with Nokogiri. Unfortunately, I cannot make them work with the test docx document. For example, ./wordml2ooo/wordml2ooo_text.xsl produces plain text from the document, without any formatting:

<?xml version="1.0"?>
HelloH20i=1n&#x3B2;2i

As I understand it, the main entry point for these XSLT files is wordml2ooo/wordml2ooo.xsl, since it includes all the others, but for me it produces just a blank file containing only the XML declaration:

<?xml version="1.0"?>

Why did you choose these particular XSLT files? Is there any documentation for their structure? Maybe the input docx XML files need to be linked properly before being transformed with these stylesheets?

This is the code I am using for the transform:

require 'nokogiri'

# Parse the main document part taken from the unzipped .docx
document = Nokogiri::XML(File.read('./word/document.xml'))
# Compile the entry-point stylesheet and apply it to the document
template = Nokogiri::XSLT(File.read('./wordml2ooo/wordml2ooo.xsl'))
transformed_document = template.transform(document)
File.write('output.html', transformed_document.to_s)

ronaldtse · 2019-12-02T00:50:02Z

@w00lf the source of the XSLT files is given at:
https://github.com/metanorma/ooo-word-xslt#history

Unfortunately there doesn't seem to be any documentation on how to use the XSLTs.

By searching the source repo (https://github.com/LibreOffice/core/search?q=wordml2ooo&unscoped_q=wordml2ooo), the only place it is used is here:
https://github.com/LibreOffice/core/blob/330df37c7e2af0564bcd2de1f171bed4befcc074/filter/source/config/fragments/filters/MS_Word_2003_XML.xcu#L22

The code points to XMLOasisImporter and XMLOasisExporter, which are the components used to import/export OOO.

A search of XMLOasisImporter provides this: https://github.com/search?p=4&q=org%3ALibreOffice+XMLOasisImporter&type=Code

A Google search indicates that it is a class available for developers to use. The class doesn't seem to do anything special for this process except run XSLTs.

opoudjis · 2019-12-02T12:37:10Z

@ronaldtse,

BZZZZT

Those are the wrong XSLTs. They are the transforms between Microsoft OOXML and Open Office.

It looks like what you want is at: https://github.com/LibreOffice/core/tree/master/filter/source/xslt/odf2xhtml/export/xhtml

But it also looks like that only converts from OpenOffice to XHTML, so in fact you need two stages:

Microsoft OOXML > OpenOffice XML > XHTML

So you still need to get the first stage working with wordml2ooo.xsl

ronaldtse · 2019-12-02T15:39:36Z

@opoudjis yes that's what I was thinking this morning.

Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.

w00lf · 2019-12-04T18:37:12Z

Hi there. After some testing I was able to determine why these particular XSLT files do not work for the test docx I was using. The entry XSLT file https://github.com/metanorma/ooo-word-xslt/blob/master/wordml2ooo/wordml2ooo.xsl#L37 uses w:wordDocument as the entry point for a document. There is no such tag in the docx file word/document.xml; the RTF file (docx extension) uses w:document as its root tag.

I checked random tags from the test document, and it turns out that the wordml2ooo XSLT is missing a number of them. For example, there are no transform rules for the tags sSubSup, w:document, oMathPara, ctrlPr. I searched for the w:wordDocument signature, and this is its description: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats. It seems that these particular XSLT files are for other types of documents, not docx (RTF) itself.

I searched the core LibreOffice repo for mentions of the RTF tags and found this file: https://github.com/LibreOffice/core/blob/1c5465ef1158ebf0f3f64e3343c2ed610024e5a8/writerfilter/source/rtftok/rtfcontrolwords.cxx. It has all the tags from the test docx file, so it seems that this is the file LibreOffice uses to convert RTF files, and there are no XSLT files for them at all. Here is another file: https://github.com/LibreOffice/core/blob/93eeaf0ad902214fb6b4205606b24046a458ee45/starmath/source/rtfexport.cxx.

So, obviously, we cannot use that file separately, and we will need to find another way to do this if we don't want to use LibreOffice anymore. I can look for other gems that can work with docx; what do you think?

ronaldtse · 2019-12-04T19:18:37Z

Interestingly, I think you're right! It seems that the WordML handled by these XSLT files is the 2003 version of WordprocessingML, not the 2007 version.

So with some Googling I found these XSLTs that are docx2foo where foo is the format:

Maybe @Intelligent2013 is more familiar with working with XSLT? Perhaps docx2json would provide a good entry point?

Intelligent2013 · 2019-12-05T10:22:00Z

@ronaldtse
Microsoft Word supports these XML formats (via the Save As command; I use Word 2010, for example):

- .docx: a multi-component .zip file whose main entry-point file is `word\document.xml`. The root tag is `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" ...`
- .xml (XML document Word 2003): a single XML file with root tag `<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" ...`. Another name for it is "WordML".
- .xml (XML document Word): a single XML file with root tag `<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage" ...`. This file contains several components (rels, themes, styles, fonttable, document, etc.), like the .docx zip file, but not compressed into a .zip.

The main LibreOffice formats are:

- .odt (ODF document, Open Document Format): a multi-component .zip file whose main entry point is ./content.xml, with root tag `<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" ...`. This format is similar to .docx (multi-component and zipped).
- .fodt (Flat XML ODF Text Document): a single XML file with root tag `<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" ...`. This file contains several components, like the .odt zip file, but not compressed into a .zip. This format is similar to the MS Word .xml with the `<pkg:package` root.
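The root tags listed above are enough to tell these formats apart before choosing a stylesheet. The following is a rough sketch using REXML; the helper name `detect_format` and the symbol labels are made up for illustration, and malformed XML is not handled (REXML would raise a parse exception).

```ruby
require 'rexml/document'

# [local root name, namespace URI] => format label (labels are hypothetical).
KNOWN_ROOTS = {
  ['document', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'] => :docx_main_part,
  ['wordDocument', 'http://schemas.microsoft.com/office/word/2003/wordml'] => :word_2003_xml,
  ['package', 'http://schemas.microsoft.com/office/2006/xmlPackage'] => :word_flat_package,
  ['document-content', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0'] => :odt_content,
  ['document', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0'] => :fodt
}.freeze

# Look at the root element's local name and namespace URI only, so the
# prefix chosen by the producing application does not matter.
def detect_format(xml)
  root = REXML::Document.new(xml).root
  return :unknown if root.nil?

  KNOWN_ROOTS.fetch([root.name, root.namespace], :unknown)
end

sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"/>'
puts detect_format(sample)
```

Run against word/document.xml from an unpacked .docx, a check like this would report the 2006 namespace rather than the 2003 WordML one that wordml2ooo.xsl expects.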

Regarding the XSLT files:

- **wordml2ooo.xsl** converts an .xml file (with root tag w:wordDocument, i.e. XML document Word 2003) into .fodt
- **ooo2wordml.xsl** converts a .fodt file into an .xml file (with root tag w:wordDocument, i.e. XML document Word 2003)

If you need to convert .docx into HTML, then you need to:

1. Unpack the .docx zip file into some folder.
2. Convert ./word/document.xml (WordprocessingML) into HTML using some XSLT (maybe https://github.com/ottoville/DOCX2HTML.XSL; I haven't worked with it). Please note that a .docx can contain embedded Excel spreadsheets, drawings, etc., each with its own XML format (the full Office Open XML specification, the ECMA-376 standard, is huge: about 5000 pages).

opoudjis · 2019-12-06T01:28:11Z

> (full Office Open XML specification (ECMA-376 standard) is huge, about 5000 pages)

https://sebsauvage.net/wiki/doku.php?id=word_document_generation :

> Possible Solutions:
>
> Generate .docx files (Afterall, that's XML, isn't it ?) BANNED. I don't have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !

opoudjis · 2019-12-06T01:41:49Z

I'm worried about where this is going: if the XSLTs available online do a worse job of converting DOCX to clean HTML with complete coverage, then this approach of externalised XHTML has to be rejected. Phrases like "So far following features are supported" in the https://github.com/ottoville/DOCX2HTML.XSL readme, or code targeting the much simpler Markdown format, do not inspire confidence. At all.

So with any of these XSLTs, we will need to ensure that they generate all the markup that we want to see in Asciidoctor. That includes footnotes, mathematics, images, bookmarks, and so on.

> Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.

... Obviously with a 1500 pp spec, and with a conversion already in place in LibreOffice (and presumably elsewhere) such a brand new XSLT from scratch is not a good use of anybody's time.

w00lf · 2019-12-09T13:45:52Z

@ronaldtse @opoudjis I have tested our options with the unzipped test docx file; here are some results:

https://github.com/ottoville/DOCX2HTML.XSL currently requires XSLT 2.0 support, and the latest version of Nokogiri supports only the 1.0 and 1.1 syntax:

2.5.3 :002 > template = Nokogiri::XSLT(File.read('/Users/mitaraskin/Work/Personal/Metanorma/DOCX2HTML.XSL/docx2html.xsl'))
XPath error : Invalid expression
max(($r2,$g2,$b2))
        ^
XPath error : Invalid expression
min(($r2,$g2,$b2))
        ^
.....
RuntimeError (compilation error: element stylesheet)
xsl:version: only 1.1 features are supported

So if we want to use this XSL, we will still need an external dependency with XSLT 2.0 support.
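A cheap way to spot this kind of mismatch before compiling is to read the `version` attribute on the stylesheet root. The sketch below uses REXML; `xslt_version` and `libxslt_compatible?` are hypothetical helper names, and the check is only a heuristic, since a stylesheet can declare one version and still use features of another.

```ruby
require 'rexml/document'

XSLT_NAMESPACE = 'http://www.w3.org/1999/XSL/Transform'

# Return the declared XSLT version as a string, or nil if the document
# root is not an element in the XSLT namespace.
def xslt_version(stylesheet_xml)
  root = REXML::Document.new(stylesheet_xml).root
  return nil if root.nil? || root.namespace != XSLT_NAMESPACE

  root.attributes['version']
end

# Heuristic only: treat anything declared below 2.0 as a candidate for
# a 1.0 processor such as the libxslt used by Nokogiri.
def libxslt_compatible?(stylesheet_xml)
  version = xslt_version(stylesheet_xml)
  !version.nil? && version.to_f < 2.0
end

sample = '<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'
puts xslt_version(sample)
puts libxslt_compatible?(sample)
```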

https://github.com/chrahunt/docx does not support images at all.
https://github.com/kaleguy/docx2json/blob/master/wordtoxml.xsl does not support images either.

ronaldtse · 2019-12-09T13:47:03Z

@w00lf there's a newer gem, https://github.com/openxml/openxml-docx, that seems to have some basics implemented. Could you try it to see what support it has?

w00lf · 2019-12-09T13:57:58Z

> @w00lf there's a newer gem, https://github.com/openxml/openxml-docx, that seems to have some basics implemented. Could you try it to see what support it has?

I have inspected it a little. It has some support for image embedding: https://github.com/openxml/openxml-docx/blob/fc093111eb6b0640d0b34901de6d39ba3907df3d/examples/image-embedding, but the code itself is focused on docx creation, and it will require some work to use it for parsing docx documents, if that is even possible.

ronaldtse · 2019-12-10T00:47:55Z

I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.

w00lf · 2019-12-10T06:10:26Z

> I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.

There is no Saxon support in MRI Ruby, only in JRuby.

ronaldtse · 2019-12-10T06:11:14Z

@w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.
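For reference, running Saxon as a separate Java process from MRI Ruby could look roughly like this. It is a sketch under assumptions: the jar file name `saxon-he.jar` is a placeholder for wherever Saxon HE is installed, the helper names are invented here, and `java` must be on the PATH. The `-s:`, `-xsl:` and `-o:` options are Saxon's command-line arguments for source, stylesheet and output.

```ruby
require 'open3'

# Build the argument list for Saxon's command-line transform.
# The default jar name is an assumption; adjust it to the local install.
def saxon_command(source:, stylesheet:, output:, jar: 'saxon-he.jar')
  ['java', '-jar', jar, "-s:#{source}", "-xsl:#{stylesheet}", "-o:#{output}"]
end

# Run the transform in a separate Java process and raise if it fails.
# Returns the output path on success.
def transform_with_saxon(**paths)
  _stdout, stderr, status = Open3.capture3(*saxon_command(**paths))
  raise "Saxon failed: #{stderr}" unless status.success?

  paths[:output]
end

# Only prints the command; nothing is executed here.
p saxon_command(source: 'word/document.xml',
                stylesheet: 'docx2html.xsl',
                output: 'output.html')
```

Passing the arguments as an array avoids shell quoting problems with file paths; the cost of this design is a JVM start-up on every conversion.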

opoudjis · 2019-12-10T10:38:16Z

XSLT 2.0 is the Devil's own proprietary monopoly, and anyone who codes in XSLT 2.0 deserves bastinado'ing. And the fact that it is the Devil's own monopoly demonstrates what a niche fail XSLT has become. (Couldn't have happened to a more deserving spec.)

And yes indeed, XSLT 2.0 commits you to Java. If that's not an indictment, I don't know what is.

w00lf · 2019-12-13T14:24:45Z

> @w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.

Is that even OK to do? I thought the initial intent was to get away from third-party dependencies (LibreOffice)?

w00lf · 2020-01-18T06:10:06Z

@opoudjis @ronaldtse what's our plan here next?

opoudjis · 2020-06-05T07:02:52Z

@ronaldtse Strongly suggest this ticket be closed

ronaldtse · 2020-06-09T09:34:03Z

I'm not convinced that an out-of-band Saxon process does harm.

> Is that even OK to do? I thought the initial intent was to get away from third-party dependencies (LibreOffice)?

Yes. We wanted to get away from LibreOffice, not from all third-party dependencies.

ronaldtse added the enhancement New feature or request label Nov 21, 2019

ronaldtse assigned w00lf and opoudjis Nov 21, 2019

opoudjis removed their assignment Nov 28, 2019

ronaldtse mentioned this issue Nov 28, 2019

Rspec local image file processing metanorma/reverse_adoc#38

Merged

ronaldtse unassigned w00lf Jun 3, 2024

ronaldtse transferred this issue from metanorma/reverse_adoc Jun 3, 2024

ronaldtse added this to Metanorma Jul 11, 2024

github-project-automation bot moved this to 🆕 New in Metanorma Jul 11, 2024
