Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) #2593

Merged
merged 9 commits into from
Mar 4, 2024

Conversation

micmarty-deepsense
Copy link
Contributor

@micmarty-deepsense micmarty-deepsense commented Feb 27, 2024

Problem Description

In some cases you might find yourselves in a situation when pandoc won't be able to process an rtf as input file format, because older versions simply do not support that.

RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki

Basically, some user may install the wrong version. The README.md is not be precise enough when mentioning RTF files support:

unstructured/README.md

Lines 120 to 122 in 47b35cc

- `tesseract-ocr` (images and PDFs, install `tesseract-lang` for additional language support)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs, RTFs and Open Office docs)

Example

Installing pandoc from a stable repository, like Debian will give you 2.9 and the official documentation shows clearly that support for rtf was introduced in 2.14 https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21
image

Note that rtf is not there

image

More detail

image

Proposed Solution

  • I've simply added/copied make install-pandoc calls, mimicking other recipes in order to ensure that 3.1.2 will be installed in all cases. Side note: make install-pandoc calls ./scripts/install-pandoc.sh under the hood.
  • Update README file - mention that make install-pandoc is recommended (>=2.14.2)
  • Verify tests that cover rtf cases:
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
  • Update setup_ubuntu.sh if needed?:
    $sudo $pac install -y libreoffice pandoc

@micmarty-deepsense micmarty-deepsense changed the title Install pandoc in Makefile recipes via script consistently Install pandoc consistently, via Makefile recipe Feb 27, 2024
@micmarty-deepsense micmarty-deepsense changed the title Install pandoc consistently, via Makefile recipe Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) Feb 27, 2024
@micmarty-deepsense micmarty-deepsense force-pushed the mike/pandoc-version-unification branch from 488f39b to 817ea72 Compare February 28, 2024 12:24
@micmarty-deepsense
Copy link
Contributor Author

Suggested error message

RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki

Support for RTF files is not available in the current pandoc installation. It was introduced in pandoc 2.14.2.
Reference: https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21

Current version of pandoc: 2.9.2.1
Please, follow the pandoc installation instructions in README.md to install the right version.

Running make test-extra-pypandoc will generate the following output
image

Reinstalling the right version makes all tests passing

sudo apt remove pandoc
make install-pandoc
make test-extra-pypandoc

image

…-related erros

add more verbose error when convert_file_to_text fails due to pandoc
add note in README about pandoc version and the proper way to install it

All test will pass with 3.1.2 and fail with versions older than 2.14.2:
```
make install-pandoc
make test-extra-pypandoc
```
@micmarty-deepsense micmarty-deepsense force-pushed the mike/pandoc-version-unification branch from e56262d to 0629b53 Compare February 28, 2024 16:23
@Klaijan
Copy link
Contributor

Klaijan commented Feb 29, 2024

A little related to this PR - Here at this line https://github.com/Unstructured-IO/unstructured/blob/main/.github/workflows/ci.yml#L438 just for cleanliness, I'm thinking is it ok if we add the pandoc installation to unstructured-inference's install-ci as well since CI will need to run against various file types including rtf. This way we can omit the make install-pandoc on later line.

But since it's not directly related to this PR so it's your call :)

Anyways, just need to update CHANGELOG and update the branch then all LGTM! Switch this to ready for review then I can approve it then!

@micmarty-deepsense micmarty-deepsense marked this pull request as ready for review February 29, 2024 09:00
@micmarty-deepsense micmarty-deepsense changed the title Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) Feb 29, 2024
@micmarty-deepsense micmarty-deepsense self-assigned this Feb 29, 2024
@micmarty-deepsense micmarty-deepsense force-pushed the mike/pandoc-version-unification branch from caf0391 to b792bda Compare February 29, 2024 14:56
@micmarty-deepsense
Copy link
Contributor Author

I've verified if docker-build contains the right version of pandoc and it does:

make docker-build
make docker-start-bash
pandoc --version
pandoc --list-input-formats | grep rtf

image

@micmarty-deepsense
Copy link
Contributor Author

micmarty-deepsense commented Feb 29, 2024

A little related to this PR - Here at this line https://github.com/Unstructured-IO/unstructured/blob/main/.github/workflows/ci.yml#L438 just for cleanliness, I'm thinking is it ok if we add the pandoc installation to unstructured-inference's install-ci as well since CI will need to run against various file types including rtf. This way we can omit the make install-pandoc on later line.

But since it's not directly related to this PR so it's your call :)

Anyways, just need to update CHANGELOG and update the branch then all LGTM! Switch this to ready for review then I can approve it then!

Good catch @Klaijan ! I removed those install-pandoc calls from GH workflows and now it's part of the install-ci recipe

PR is no longer in draft mode, so if it's still legitimate, please approve ;) (when all the checks will pass)

Copy link
Contributor

@Klaijan Klaijan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Klaijan Klaijan added this pull request to the merge queue Mar 1, 2024
@Klaijan Klaijan removed this pull request from the merge queue due to a manual request Mar 1, 2024
@micmarty-deepsense micmarty-deepsense added this pull request to the merge queue Mar 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 1, 2024
@micmarty-deepsense micmarty-deepsense added this pull request to the merge queue Mar 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Mar 1, 2024
@pawel-kmiecik pawel-kmiecik added this pull request to the merge queue Mar 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 1, 2024
@micmarty-deepsense micmarty-deepsense added this pull request to the merge queue Mar 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 1, 2024
@micmarty-deepsense micmarty-deepsense added this pull request to the merge queue Mar 1, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 1, 2024
…upports .rtf files as input format) (#2593)

## Problem Description
In some cases you might find yourselves in a situation when pandoc won't
be able to process an `rtf` as input file format, because older versions
simply do not support that.

```
RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki
```

Basically, some user may install the wrong version. The `README.md` is
not be precise enough when mentioning RTF files support:

https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/README.md?plain=1#L120-L122

## Example
Installing `pandoc` from a [stable repository, like
Debian](https://packages.debian.org/source/bullseye/pandoc) will give
you `2.9` and the official documentation shows clearly that support for
rtf was introduced in `2.14`
https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8)

### Note that `rtf` is not there

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38)

### More detail

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8)

## Proposed Solution 
- [x] I've simply added/copied `make install-pandoc` calls, mimicking
other recipes in order to ensure that `3.1.2` will be installed in all
cases. **Side note**: `make install-pandoc` calls
`./scripts/install-pandoc.sh` under the hood.
- [x] Update README file - mention that `make install-pandoc` is
recommended (`>=2.14.2`)
- [x] Verify tests that cover `rtf` cases:
https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/test_unstructured/file_utils/test_file_conversion.py#L14
- [x] Update `setup_ubuntu.sh` if needed?:
https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/scripts/setup_ubuntu.sh#L87
-
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 1, 2024
@micmarty-deepsense micmarty-deepsense added this pull request to the merge queue Mar 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 4, 2024
@micmarty-deepsense micmarty-deepsense added this pull request to the merge queue Mar 4, 2024
Merged via the queue into main with commit b9aa4b7 Mar 4, 2024
46 checks passed
@micmarty-deepsense micmarty-deepsense deleted the mike/pandoc-version-unification branch March 4, 2024 11:33
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
…upports .rtf files as input format) (Unstructured-IO#2593)

## Problem Description
In some cases you might find yourselves in a situation when pandoc won't
be able to process an `rtf` as input file format, because older versions
simply do not support that.

```
RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki
```

Basically, some user may install the wrong version. The `README.md` is
not be precise enough when mentioning RTF files support:

https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/README.md?plain=1#L120-L122

## Example
Installing `pandoc` from a [stable repository, like
Debian](https://packages.debian.org/source/bullseye/pandoc) will give
you `2.9` and the official documentation shows clearly that support for
rtf was introduced in `2.14`
https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8)

### Note that `rtf` is not there

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38)

### More detail

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8)

## Proposed Solution 
- [x] I've simply added/copied `make install-pandoc` calls, mimicking
other recipes in order to ensure that `3.1.2` will be installed in all
cases. **Side note**: `make install-pandoc` calls
`./scripts/install-pandoc.sh` under the hood.
- [x] Update README file - mention that `make install-pandoc` is
recommended (`>=2.14.2`)
- [x] Verify tests that cover `rtf` cases:
https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/test_unstructured/file_utils/test_file_conversion.py#L14
- [x] Update `setup_ubuntu.sh` if needed?:
https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/scripts/setup_ubuntu.sh#L87
-
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants