w2a uses LibreOffice to export Word to HTML, change to use LibreOffice XSLT directly #84
Comments
This is for @w00lf . Thanks! |
@w00lf could you please help make this work? Thanks! |
Sure, let me look more closely at what can be used here. |
@w00lf I think you can use Nokogiri to run XSLT, e.g. http://craftingruby.com/posts/2014/01/14/transforming-xml-in-ruby-with-xslt-and-nokogiri.html |
Yes, and if you want an example of our existing stack doing this, see the use of XSLT in the html2doc gem: https://github.com/metanorma/html2doc/blob/master/lib/html2doc/math.rb |
I've unassigned myself, but of course please let me know how this goes. |
Hi there, @ronaldtse. I tried to use the LibreOffice XSLT (https://github.com/metanorma/ooo-word-xslt) with Nokogiri. Unfortunately, I cannot make it work with the test docx document. For example, ./wordml2ooo/wordml2ooo_text.xsl produces plain text from the document without any formatting:
As I understand it, the main entry point for this XSLT is
Why did you choose these particular XSLT files? Is there any documentation for their structure? Maybe the input docx XML files should be linked properly before transforming with those stylesheets? This is the code I am using to transform:
require 'nokogiri'
# Parse the main part of the unzipped docx and the entry-point stylesheet
document = Nokogiri::XML(File.read('./word/document.xml'))
template = Nokogiri::XSLT(File.read('./wordml2ooo/wordml2ooo.xsl'))
# Apply the stylesheet and write the result to disk
transformed_document = template.transform(document)
File.open('output.html', 'w') { |file| file.write(transformed_document) } |
@w00lf the source of the XSLT files is given at: Unfortunately there doesn't seem to be any documentation on how to use the XSLTs. By searching the source repo (https://github.com/LibreOffice/core/search?q=wordml2ooo&unscoped_q=wordml2ooo), the only place it is used is here: The code points to A search of A Google search of it indicates it is a class available for developers to use. It doesn't seem like this class does anything special for this process except run XSLTs. |
BZZZZT Those are the wrong XSLTs. They are the transforms between Microsoft OOXML and Open Office. It looks like what you want is at: https://github.com/LibreOffice/core/tree/master/filter/source/xslt/odf2xhtml/export/xhtml But it also looks like that only converts from OpenOffice to XHTML, so in fact you need two stages: Microsoft OOXML > OpenOffice XML > XHTML So you still need to get the first stage working with wordml2ooo.xsl |
@opoudjis yes that's what I was thinking this morning. Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better. |
Hi there. After some testing I was able to determine why these particular XSLT files do not work for the test docx I was using. I noticed that the entry XSLT file https://github.com/metanorma/ooo-word-xslt/blob/master/wordml2ooo/wordml2ooo.xsl#L37 is using |
Interestingly, I think you're right! It seems that the WordML these XSLT files target is the 2003 version of WordprocessingML, not the 2007 version. So with some Googling I found these XSLTs that are
Maybe @Intelligent2013 is more familiar with working with XSLT? Probably docx2json will provide a very good entry? |
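The 2003-versus-2007 mismatch described above can be checked up front by looking at the root element's namespace before picking a stylesheet. The following is a minimal sketch, not something from this thread: the helper name `word_xml_dialect` is hypothetical, it uses REXML rather than Nokogiri only to stay dependency-free, and the two namespace URIs are the published ones for Word 2003 WordprocessingML and for the OOXML main document part.

```ruby
require 'rexml/document'

# Namespace of the Word 2003 XML format (root element <w:wordDocument>).
WORDML_2003 = 'http://schemas.microsoft.com/office/word/2003/wordml'
# Namespace of word/document.xml inside a .docx (root element <w:document>).
OOXML_MAIN = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical helper: classifies an XML string by its root namespace.
# Returns :wordml_2003, :ooxml, or :unknown.
def word_xml_dialect(xml)
  root = REXML::Document.new(xml).root
  return :unknown if root.nil?

  case root.namespace
  when WORDML_2003 then :wordml_2003
  when OOXML_MAIN then :ooxml
  else :unknown
  end
end
```

Calling `word_xml_dialect(File.read('./word/document.xml'))` on a part taken from an unzipped .docx would be expected to return `:ooxml`, which would explain why stylesheets written against the 2003 namespace match almost nothing in it.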
@ronaldtse
The main formats of LibreOffice are:
Regarding xslt:
If you need to convert .docx into html, then you need:
|
https://sebsauvage.net/wiki/doku.php?id=word_document_generation :
|
I'm worried about where this is going: if the available XSLTs online do a worse job of converting DOCX to a clean HTML with complete coverage, then this approach of externalised XHTML has to be rejected. Phrases like "So far following features are supported" in the https://github.com/ottoville/DOCX2HTML.XSL readme, or code targeting the much simpler Markdown format, do not inspire confidence. At all. So with any of these XSLTs, we will need to ensure that they generate all the markup that we want to see in Asciidoctor. That includes footnotes, mathematics, images, bookmarks, and so on.
... Obviously with a 1500 pp spec, and with a conversion already in place in LibreOffice (and presumably elsewhere) such a brand new XSLT from scratch is not a good use of anybody's time. |
@ronaldtse @opoudjis I have tested our option with unzipped test docx file, here are some results:
So if we are to use this XSL, we will still need an external dependency with XSLT 2.0 support
|
@w00lf there's a newer gem https://github.com/openxml/openxml-docx that seems to have some basics implemented. Could you try it to see what support it has? |
I have inspected it a little bit. It has some support for image embedding: https://github.com/openxml/openxml-docx/blob/fc093111eb6b0640d0b34901de6d39ba3907df3d/examples/image-embedding, but the code itself is focused on docx creation and will require some work in order to use it for parsing docx documents, if that is even possible. |
I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0. |
There is no Saxon support in MRI Ruby, only JRuby. |
@w00lf right, if we need to use XSLT 2.0 we need to run an off-band Java process. |
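As a rough illustration of what such a separate Java process could look like from MRI Ruby, here is a minimal sketch using Open3 from the standard library. The helper names `saxon_command` and `run_saxon` and all file paths are hypothetical. The `-s:`, `-xsl:` and `-o:` options are Saxon's command-line switches for source, stylesheet and output, and the sketch assumes Java and a Saxon-HE jar are available locally.

```ruby
require 'open3'

# Hypothetical helper: builds the argv for Saxon's command-line transformer.
# `jar` is assumed to be the path to a locally available Saxon-HE jar.
def saxon_command(jar:, source:, stylesheet:, output:)
  ['java', '-jar', jar, "-s:#{source}", "-xsl:#{stylesheet}", "-o:#{output}"]
end

# Hypothetical helper: runs the transform in a separate Java process.
# Raises if Saxon exits with a non-zero status; returns the output path.
def run_saxon(**paths)
  _stdout, stderr, status = Open3.capture3(*saxon_command(**paths))
  raise "Saxon failed: #{stderr}" unless status.success?

  paths[:output]
end
```

A call such as `run_saxon(jar: 'saxon-he.jar', source: 'word/document.xml', stylesheet: 'DOCX2HTML.xsl', output: 'output.html')` would then replace the in-process Nokogiri transform; the file names here are placeholders. Passing the arguments as an array avoids shell interpolation of the paths.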
XSLT 2.0 is the Devil's own proprietary monopoly, and anyone who codes in XSLT 2.0 deserves bastinado'ing. And the fact that it is the Devil's own monopoly demonstrates what a niche fail XSLT has become. (Couldn't have happened to a more deserving spec.) And yes indeed, XSLT 2.0 commits you to Java. If that's not an indictment, I don't know what is. |
Is that even OK to do? I thought the initial intent was to move away from third-party dependencies (LibreOffice)? |
@opoudjis @ronaldtse what's our plan here next? |
@ronaldtse Strongly suggest this ticket be closed |
I'm not convinced that an off band Saxon process does harm.
Yes. We wanted to get away from libreoffice, not all third-party dependencies. |
I've extracted out LibreOffice's Word related XSLTs here:
This task is to utilize these xslt files to directly transform Word -> HTML, instead of needing to install LibreOffice.