w2a uses LibreOffice to export Word to HTML, change to use LibreOffice XSLT directly #84

Open
ronaldtse opened this issue Nov 21, 2019 · 27 comments
Labels
enhancement New feature or request

Comments

@ronaldtse
Contributor

I've extracted LibreOffice's Word-related XSLTs here:

This task is to use these XSLT files to transform Word -> HTML directly, instead of needing to install LibreOffice.

@ronaldtse ronaldtse added the enhancement New feature or request label Nov 21, 2019
@ronaldtse
Contributor Author

This is for @w00lf . Thanks!

@ronaldtse
Contributor Author

@w00lf could you please help make this work? Thanks!

@w00lf
Contributor

w00lf commented Nov 26, 2019

> @w00lf could you please help make this work? Thanks!

Sure, let me look more closely at what can be used here.

@ronaldtse
Contributor Author

@w00lf I think you can use Nokogiri to run XSLT (which runs libxslt underneath) to transform Word to HTML

e.g. http://craftingruby.com/posts/2014/01/14/transforming-xml-in-ruby-with-xslt-and-nokogiri.html

@opoudjis
Contributor

Yes, and if you want an example of our existing stack doing this, see the use of XSLT in the html2doc gem: https://github.com/metanorma/html2doc/blob/master/lib/html2doc/math.rb

@opoudjis opoudjis removed their assignment Nov 28, 2019
@opoudjis
Contributor

I've unassigned myself, but of course please let me know how this goes.

@ronaldtse
Contributor Author

Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.

@w00lf
Contributor

w00lf commented Dec 1, 2019

> Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.

Hi there, @ronaldtse. I tried to use the LibreOffice XSLTs (https://github.com/metanorma/ooo-word-xslt) with Nokogiri. Unfortunately, I cannot make them work with the test docx document. For example, ./wordml2ooo/wordml2ooo_text.xsl produces plain text from the document without any formatting:

<?xml version="1.0"?>
HelloH20i=1n&#x3B2;2i

As I understand it, the main entry point for these XSLTs is the wordml2ooo/wordml2ooo.xsl file, since it includes all the other XSLTs, but for me it just produces a blank file containing only the XML declaration:

<?xml version="1.0"?>

Why did you choose these particular XSLT files? Is there any documentation for their structure? Maybe the input docx XML files should be linked properly before transforming them with these stylesheets?

This is the code I am using for the transform:

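require 'nokogiri'

# Parse the main document part of the unzipped .docx and the entry stylesheet.
document = Nokogiri::XML(File.read('./word/document.xml'))
template = Nokogiri::XSLT(File.read('./wordml2ooo/wordml2ooo.xsl'))
# Apply the stylesheet and write the serialized result to disk.
transformed_document = template.transform(document)
File.write('output.html', transformed_document.to_s)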

@ronaldtse
Contributor Author

@w00lf the source of the XSLT files is given at:
https://github.com/metanorma/ooo-word-xslt#history

Unfortunately there doesn't seem to be any documentation on how to use the XSLTs.

By searching the source repo (https://github.com/LibreOffice/core/search?q=wordml2ooo&unscoped_q=wordml2ooo), the only place it is used is here:
https://github.com/LibreOffice/core/blob/330df37c7e2af0564bcd2de1f171bed4befcc074/filter/source/config/fragments/filters/MS_Word_2003_XML.xcu#L22

The code points to XMLOasisImporter and XMLOasisExporter, which are the components used to import/export OOO.

A search of XMLOasisImporter provides this: https://github.com/search?p=4&q=org%3ALibreOffice+XMLOasisImporter&type=Code

A Google search indicates it is a class available for developers to use. It doesn't seem like this class does anything special for this process except run XSLTs.

@opoudjis
Contributor

opoudjis commented Dec 2, 2019

@ronaldtse,

BZZZZT

Those are the wrong XSLTs. They are the transforms between Microsoft OOXML and Open Office.

It looks like what you want is at: https://github.com/LibreOffice/core/tree/master/filter/source/xslt/odf2xhtml/export/xhtml

But it also looks like that only converts from OpenOffice to XHTML, so in fact you need two stages:

Microsoft OOXML > OpenOffice XML > XHTML

So you still need to get the first stage working with wordml2ooo.xsl
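
The two-stage idea can be sketched in Ruby as a small pipeline that feeds each stylesheet's output into the next. This is only an illustration: `run_pipeline` and `word_to_xhtml` are hypothetical helpers, the stylesheet paths are placeholders, and it assumes both stages can be run by Nokogiri (libxslt) at all, which the thread has not established.

```ruby
# Apply a list of stylesheet-like objects in order, feeding each stage's
# output into the next. Anything responding to #transform(document) works,
# including Nokogiri::XSLT::Stylesheet.
def run_pipeline(document, stages)
  stages.reduce(document) { |doc, stage| stage.transform(doc) }
end

# Hypothetical wiring for the two stages. Nokogiri is required lazily so the
# helper above stays usable without the gem; the paths are supplied by the
# caller and are not taken from any real layout.
def word_to_xhtml(input_path, stylesheet_paths)
  require 'nokogiri'
  stages = stylesheet_paths.map { |path| Nokogiri::XSLT(File.read(path)) }
  run_pipeline(Nokogiri::XML(File.read(input_path)), stages)
end
```

A call would look like `word_to_xhtml('input.xml', ['wordml2ooo/wordml2ooo.xsl', 'path/to/odf-to-xhtml.xsl'])`, where the second path is a placeholder for whichever odf2xhtml stylesheet turns out to be the entry point.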

@ronaldtse
Contributor Author

@opoudjis yes that's what I was thinking this morning.

Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.

@w00lf
Contributor

w00lf commented Dec 4, 2019

Hi there. After some testing, I was able to determine why these particular XSLT files do not work for the test docx I was using. The entry XSLT file https://github.com/metanorma/ooo-word-xslt/blob/master/wordml2ooo/wordml2ooo.xsl#L37 uses w:wordDocument as the entry point for a document. There is no such tag in the docx file word/document.xml. The RTF file (docx extension) uses w:document as its root tag.

I checked random tags from the test document, and it turns out that the wordml2ooo XSLT is missing a number of them. For example, there are no transform rules for the tags sSubSup, w:document, oMathPara, and ctrlPr. I searched for the w:wordDocument signature; this is its description: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats. It seems that these particular XSLTs are for other types of documents, not docx (RTF) itself.

I searched the core LibreOffice repo for mentions of the RTF tags and found this file: https://github.com/LibreOffice/core/blob/1c5465ef1158ebf0f3f64e3343c2ed610024e5a8/writerfilter/source/rtftok/rtfcontrolwords.cxx. It has all the tags from the test docx file, so it seems that this is the file LibreOffice uses to convert RTF files, and there are no XSLT files for them at all. Here is another file: https://github.com/LibreOffice/core/blob/93eeaf0ad902214fb6b4205606b24046a458ee45/starmath/source/rtfexport.cxx.

So we cannot use that file separately, and we will need to find another way to do this if we don't want to use LibreOffice anymore. I can look for other gems that can work with docx. What do you think?

@ronaldtse
Contributor Author

Interestingly, I think you're right! It seems that these XSLT files target the 2003 version of WordprocessingML, not the 2007 version.

So with some Googling I found these XSLTs that are docx2foo where foo is the format:

Maybe @Intelligent2013 is more familiar with working with XSLT? Perhaps docx2json would provide a good entry point?

@Intelligent2013

@ronaldtse
Microsoft Word supports these XML formats (Save As command; I use Word 2010, for example):

The main formats of LibreOffice are:

  • .odt (ODF document), Open Document Format - a multi-component .zip file whose main entry point is ./content.xml, with root tag <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" . . . This format is similar to .docx (multi-component + zip)
  • .fodt (Flat XML ODF Text Document) - a single XML file with root tag <office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0". . . This file contains several components similar to those in the .odt zip file, but not compressed into a .zip. This format is similar to the MS Word .xml format (with a <pkg:package root)

Regarding xslt:

  • wordml2ooo.xsl - XSLT to convert an .xml file (with root tag w:wordDocument, i.e. a Word 2003 XML document) into .fodt
  • ooo2wordml.xsl - XSLT to convert a .fodt file into an .xml file (with root tag w:wordDocument, i.e. a Word 2003 XML document)

If you need to convert .docx into HTML, then you need to:

  • unpack the .docx zip file into some folder
  • convert ./word/document.xml (WordprocessingML) into HTML using some XSLT (maybe https://github.com/ottoville/DOCX2HTML.XSL; I haven't worked with it). Please note that a .docx can contain Excel spreadsheets, drawings, etc., each with its own XML format (the full Office Open XML specification (ECMA 376 standard) is huge, about 5000 pages).
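
The root tags listed above are enough to tell these formats apart before picking a stylesheet. Below is a minimal Ruby sketch, assuming REXML is available. The `detect_format` helper and the format labels are made up for illustration. The ODF namespace comes from the list above; the WordML 2003, WordprocessingML, and `pkg` namespace URIs are the standard published ones rather than something quoted in this thread.

```ruby
require 'rexml/document'

# [namespace URI, root local name] => illustrative format label.
KNOWN_ROOTS = {
  ['http://schemas.microsoft.com/office/word/2003/wordml', 'wordDocument'] => :word_2003_xml,
  ['http://schemas.openxmlformats.org/wordprocessingml/2006/main', 'document'] => :docx_document_part,
  ['http://schemas.microsoft.com/office/2006/xmlPackage', 'package'] => :word_flat_package,
  ['urn:oasis:names:tc:opendocument:xmlns:office:1.0', 'document'] => :fodt,
  ['urn:oasis:names:tc:opendocument:xmlns:office:1.0', 'document-content'] => :odt_content
}.freeze

# Classify an XML string by the namespace and local name of its root element.
def detect_format(xml)
  root = REXML::Document.new(xml).root
  return :unknown if root.nil?

  KNOWN_ROOTS.fetch([root.namespace, root.name], :unknown)
end
```

With a check like this, a word/document.xml taken from a .docx would be reported as `:docx_document_part`, which makes the mismatch with wordml2ooo.xsl (which expects w:wordDocument) visible before any transform is attempted.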

@opoudjis
Contributor

opoudjis commented Dec 6, 2019

> (the full Office Open XML specification (ECMA 376 standard) is huge, about 5000 pages)

https://sebsauvage.net/wiki/doku.php?id=word_document_generation :

> Possible Solutions:
> Generate .docx files (Afterall, that's XML, isn't it ?) BANNED. I don't have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !

@opoudjis
Contributor

opoudjis commented Dec 6, 2019

I'm worried about where this is going: if the XSLTs available online do a worse job of converting DOCX to clean HTML with complete coverage, then this approach of externalised XHTML has to be rejected. Phrases like "So far following features are supported" in the https://github.com/ottoville/DOCX2HTML.XSL readme, or code targeting the much simpler Markdown format, do not inspire confidence. At all.

So with any of these XSLTs, we will need to ensure that they generate all the markup that we want to see in Asciidoctor. That includes footnotes, mathematics, images, bookmarks, and so on.

> Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.

... Obviously with a 1500 pp spec, and with a conversion already in place in LibreOffice (and presumably elsewhere) such a brand new XSLT from scratch is not a good use of anybody's time.

@w00lf
Contributor

w00lf commented Dec 9, 2019

@ronaldtse @opoudjis I tested this option with the unzipped test docx file. Here are some results:

2.5.3 :002 > template = Nokogiri::XSLT(File.read('/Users/mitaraskin/Work/Personal/Metanorma/DOCX2HTML.XSL/docx2html.xsl'))
XPath error : Invalid expression
max(($r2,$g2,$b2))
        ^
XPath error : Invalid expression
min(($r2,$g2,$b2))
        ^
.....
RuntimeError (compilation error: element stylesheet)
xsl:version: only 1.1 features are supported

So if we want to use this XSL, we will still need an external dependency with XSLT 2.0 support.
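
A cheap pre-check is to read the `version` attribute from the stylesheet root before handing it to Nokogiri. The Ruby sketch below assumes REXML is available. The helper names are hypothetical, and the 1.1 threshold is taken from the libxslt message in the log above. Treat the result as a hint only: a stylesheet can declare one version and still call functions from another.

```ruby
require 'rexml/document'

# Highest declared version libxslt accepts without complaint, going by the
# "only 1.1 features are supported" message it printed.
LIBXSLT_MAX_VERSION = 1.1

# Return the declared XSLT version as a Float, or nil if none is declared.
def stylesheet_version(xsl_source)
  root = REXML::Document.new(xsl_source).root
  version = root && root.attributes['version']
  version && Float(version)
end

# True when the declared version is beyond what libxslt says it supports.
def needs_xslt2_processor?(xsl_source)
  version = stylesheet_version(xsl_source)
  !version.nil? && version > LIBXSLT_MAX_VERSION
end
```

Running `needs_xslt2_processor?(File.read('docx2html.xsl'))` before `Nokogiri::XSLT(...)` would let the caller fall back to another processor instead of failing at compile time.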

@ronaldtse
Contributor Author

@w00lf there's a newer gem https://github.com/openxml/openxml-docx that seems to have some basics implemented. Could you give it a try to see what support it has?

@w00lf
Contributor

w00lf commented Dec 9, 2019

> @w00lf there's a newer gem https://github.com/openxml/openxml-docx that seems to have some basics implemented. Could you give it a try to see what support it has?

I inspected it briefly. It has some support for image embedding: https://github.com/openxml/openxml-docx/blob/fc093111eb6b0640d0b34901de6d39ba3907df3d/examples/image-embedding, but the code itself is focused on docx creation and would require some work to use for parsing docx documents, if that is even possible.

@ronaldtse
Contributor Author

I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.

@w00lf
Contributor

w00lf commented Dec 10, 2019

> I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.

There is no Saxon support in MRI Ruby, only in JRuby.

@ronaldtse
Contributor Author

@w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.
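
Running Saxon as a separate Java process from MRI Ruby could look roughly like the sketch below. It assumes a `java` binary on the PATH and a Saxon HE jar at a path the caller supplies. `saxon_command` and `run_saxon` are hypothetical helper names. The `-s:`, `-xsl:` and `-o:` options are Saxon's command-line options for source, stylesheet, and output.

```ruby
require 'open3'

# Build the argv for Saxon's command-line transformer. The jar path is
# whatever the local install uses; nothing here is specific to this project.
def saxon_command(jar:, source:, stylesheet:, output:)
  ['java', '-jar', jar, "-s:#{source}", "-xsl:#{stylesheet}", "-o:#{output}"]
end

# Run the transform in a child process and raise if Saxon reports failure.
# Returns the output path on success.
def run_saxon(**paths)
  _stdout, stderr, status = Open3.capture3(*saxon_command(**paths))
  raise "Saxon failed: #{stderr}" unless status.success?

  paths[:output]
end
```

Passing the arguments as an array rather than a single shell string avoids quoting problems with file names that contain spaces.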

@opoudjis
Contributor

XSLT 2.0 is the Devil's own proprietary monopoly, and anyone who codes in XSLT 2.0 deserves bastinado'ing. And the fact that it is the Devil's own monopoly demonstrates what a niche fail XSLT has become. (Couldn't have happened to a more deserving spec.)

And yes indeed, XSLT 2.0 commits you to Java. If that's not an indictment, I don't know what is.

@w00lf
Contributor

w00lf commented Dec 13, 2019

> @w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.

Is that even OK to do? I thought the initial intent was to move away from third-party dependencies (LibreOffice)?

@w00lf
Contributor

w00lf commented Jan 18, 2020

@opoudjis @ronaldtse what's our plan here next?

@opoudjis
Contributor

opoudjis commented Jun 5, 2020

@ronaldtse I strongly suggest this ticket be closed.

@ronaldtse
Contributor Author

I'm not convinced that an out-of-band Saxon process does harm.

> Is that even OK to do? I thought the initial intent was to move away from third-party dependencies (LibreOffice)?

Yes. We wanted to get away from LibreOffice, not all third-party dependencies.

@ronaldtse ronaldtse transferred this issue from metanorma/reverse_adoc Jun 3, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Metanorma Jul 11, 2024
Projects
Status: 🆕 New
Development

No branches or pull requests

4 participants