Conversion from (X)HTML to ODT does not produce 'real' ODT documents, but HTML documents that don't behave like 'regular' ODT documents in LibreOffice #297

Closed
lucaa opened this issue Jun 12, 2022 · 6 comments


lucaa commented Jun 12, 2022

Steps to reproduce:

  • have an HTML file containing some content; in my case, the content is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test Office conversion</title>
</head>
<body>
<h1>Test Title</h1>
<div>Test Content</div>
</body>
</html>

  • convert it to odt with jodconverter, using code like this:

    public static void main(String[] args) throws OfficeException
    {
        File inputFile = new File("src/test/resources/input/basic.html");
        File outputFile = new File("src/test/resources/output/result-basic.odt");

        // Start a local LibreOffice process with the default configuration.
        LocalOfficeManager officeManager = LocalOfficeManager.builder().build();
        officeManager.start();
        try {
            LocalConverter localConverter = LocalConverter.builder()
                .officeManager(officeManager)
                .build();
            // Load the input as HTML and store it as ODT, using the default filters.
            localConverter.convert(inputFile)
                .as(DefaultDocumentFormatRegistry.HTML)
                .to(outputFile)
                .as(DefaultDocumentFormatRegistry.ODT)
                .execute();
        } finally {
            // Stop the office process even if the conversion fails.
            officeManager.stop();
        }
    }

In my case, my locally installed libreoffice is LibreOffice 7.3.3.2, but the same happens with older versions of LibreOffice. I can find out which versions if that's relevant.

Expected result:

  • an odt file is produced and this odt file can be used like any other odt file in libreoffice

Actual result:

  • an odt file is produced - the attached result-basic.odt
  • however, libreoffice does not display this file the way it displays a 'regular' odt file, notably:
    • the default view of this document is 'web'
    • LibreOffice displays it as 'Untitled' rather than with its file name, as if it were an unknown file
    • when using "save as", this document can only be saved in html formats (a regular document can also be saved in ms office formats, etc.)
    • the tracking of comments and history is not available in this document, although it is available in a standard odt file

Note: When converting with libreoffice (the same installation) from the command line, the result is correct:

> libreoffice --convert-to odt basic.html 
convert <redacted>/basic.html -> <redacted>/basic.odt using filter : writerweb8_writer

Result: basic.odt.
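
One way to compare the two files beyond how LibreOffice displays them is to look at the `mimetype` entry that ODF packages carry. The sketch below reads that entry using only the Java standard library. It assumes (this is not confirmed anywhere in this thread) that the difference between result-basic.odt and basic.odt is visible there, for example as `application/vnd.oasis.opendocument.text-web` instead of `application/vnd.oasis.opendocument.text`. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper, not part of jodconverter.
final class OdfMimetypeCheck {

    // Media type that the ODF specification assigns to regular text documents.
    static final String ODT_TEXT = "application/vnd.oasis.opendocument.text";

    /** Returns the trimmed content of the package's "mimetype" entry, or null if there is none. */
    static String readMimetype(Path odfFile) throws IOException {
        try (ZipFile zip = new ZipFile(odfFile.toFile())) {
            ZipEntry entry = zip.getEntry("mimetype");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.US_ASCII).trim();
            }
        }
    }

    /** True only when the package declares itself as a regular text document. */
    static boolean isRegularTextDocument(Path odfFile) throws IOException {
        return ODT_TEXT.equals(readMimetype(odfFile));
    }

    public static void main(String[] args) throws IOException {
        // Print the declared media type of every file given on the command line.
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(file + " -> " + readMimetype(file));
        }
    }
}
```

Running it on both result-basic.odt and basic.odt would show whether the two packages declare the same document type; if they do, the difference has to be looked for elsewhere in the package.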


lucaa commented Jun 12, 2022

The same problem occurs with xhtml files.


lucaa commented Jun 12, 2022

Maybe some extra parameters need to be passed to the converter in order to obtain a correct result?


lucaa commented Jun 12, 2022

With some debugging help, I worked out that the export filter probably needs to be forced: jodconverter uses writer8 when converting, while libreoffice reports using writerweb8_writer.

Apparently this needs to be set in the converter builder, like this:

        Map<String, Object> converterStoreProperties = new HashMap<>();
        converterStoreProperties.put("FilterName", "writerweb8_writer");
        LocalConverter localConverter = LocalConverter.builder()
            .officeManager(officeManager)
            .storeProperties(converterStoreProperties)
            .build();

However, I don't really know what the implications of this change are:

  • is this specific to the input file type, to the output file type, or to the combination of the two? (e.g. if I convert from html to rtf, should I set this too or not?)
  • is this specific to the libreoffice version that is used, should it be different for different versions of libreoffice?
  • is there a better place to configure this?
  • should the default of jodconverter be different?
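
To make the first question concrete: FilterName in the store properties names an export filter, so it presumably depends on the output format and on the kind of document LibreOffice created when loading the input. A minimal sketch of a helper that only sets the filter for that combination is shown below. It is an illustration under assumptions, not jodconverter's behaviour: the class name, the method name and the list of web inputs are hypothetical, and only the odt filter name comes from the conversion log in this thread (`writer_web_pdf_Export` is LibreOffice's PDF export filter for Writer/Web documents, added as a second example).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of jodconverter.
final class WebStoreProperties {

    // Input extensions assumed to be loaded by LibreOffice as Writer/Web documents.
    private static final Set<String> WEB_INPUTS = Set.of("html", "xhtml");

    // Export filters for Writer/Web documents, keyed by output extension.
    private static final Map<String, String> WEB_EXPORT_FILTERS = Map.of(
        "odt", "writerweb8_writer",       // taken from the libreoffice command-line output
        "pdf", "writer_web_pdf_Export");  // not from this thread

    /** Returns store properties forcing a web export filter, or an empty map when none is known. */
    static Map<String, Object> forConversion(String inputExtension, String outputExtension) {
        Map<String, Object> properties = new HashMap<>();
        if (WEB_INPUTS.contains(normalize(inputExtension))) {
            String filter = WEB_EXPORT_FILTERS.get(normalize(outputExtension));
            if (filter != null) {
                properties.put("FilterName", filter);
            }
        }
        return properties;
    }

    private static String normalize(String extension) {
        return extension.toLowerCase(Locale.ROOT);
    }
}
```

The returned map could then be passed to `.storeProperties(...)` on the converter builder when it is not empty. For combinations that are not listed, such as html to rtf, the helper leaves the defaults untouched; whether those defaults are right for web documents is not established here.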


surli commented Jun 23, 2022

> is there a better place to configure this?
> should the default of jodconverter be different?

So IMO this should be fixed there: https://github.com/sbraconnier/jodconverter/blob/master/jodconverter-core/src/main/resources/document-formats.json#L11

sbraconnier (Member) commented Sep 13, 2022

@lucaa @surli
Thanks for your investigation and comments.

In order to manage this in document-formats.json, I will have to add a new DocumentFamily to separate text documents from web documents. I plan to release a new jodconverter version in the next few days, and I'll try to implement this before the release.
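
As a rough picture of what separating the two families could mean, here is a small self-contained sketch in which an output format keeps one export filter per family. It is not jodconverter's actual code or the final design: the `Family` enum and the `FamilyAwareFormat` class are hypothetical names, and the two filter names are the ones mentioned earlier in this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical model of a format with per-family export filters.
final class FamilyAwareFormat {

    // TEXT is a regular Writer document, WEB a Writer/Web document.
    enum Family { TEXT, WEB }

    private final String extension;
    private final Map<Family, String> storeFilters = new EnumMap<>(Family.class);

    FamilyAwareFormat(String extension) {
        this.extension = extension;
    }

    /** Registers the export filter to use for documents of the given family. */
    FamilyAwareFormat storeFilter(Family family, String filterName) {
        storeFilters.put(family, filterName);
        return this;
    }

    /** Returns the export filter for the given family, or null if none is registered. */
    String storeFilterFor(Family family) {
        return storeFilters.get(family);
    }

    String extension() {
        return extension;
    }

    /** An odt format that distinguishes text documents from web documents. */
    static FamilyAwareFormat odt() {
        return new FamilyAwareFormat("odt")
            .storeFilter(Family.TEXT, "writer8")
            .storeFilter(Family.WEB, "writerweb8_writer");
    }
}
```

With a lookup like this, an html input loaded as a web document would be stored as odt through writerweb8_writer without the caller having to pass custom store properties.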

sbraconnier (Member) commented

It will be included in the next version.
