fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) (#2593)

## Problem Description
In some cases pandoc cannot process `rtf` as an input file format,
because older versions simply do not support it. The failure looks like
this:

```
RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki
```

In short, a user may install a version that is too old. The `README.md`
is not precise enough when it mentions RTF file support:

https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/README.md?plain=1#L120-L122

## Example
Installing `pandoc` from a [stable repository, like
Debian](https://packages.debian.org/source/bullseye/pandoc) gives
you `2.9`, but the official release notes show that support for
RTF input was introduced in `2.14.2`:
https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8)

### Note that `rtf` is not there

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38)

### More detail

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8)
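
The version requirement described above can be checked before attempting an RTF conversion. The following is a minimal sketch and not part of this commit: the helper names `parse_version`, `supports_rtf_input`, and `installed_pandoc_version` are hypothetical, the `2.14.2` threshold is taken from the release notes linked above, and the code assumes that the first line of `pandoc --version` ends with a purely numeric dotted version (for example `pandoc 3.1.2`).

```python
import shutil
import subprocess
from typing import Optional, Tuple

# Assumption: RTF input support first appeared in pandoc 2.14.2
# (per the release notes linked above).
MIN_RTF_VERSION = (2, 14, 2)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "2.9.2.1" into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def supports_rtf_input(version: str) -> bool:
    """Return True if the given pandoc version is new enough to read RTF."""
    return parse_version(version) >= MIN_RTF_VERSION


def installed_pandoc_version() -> Optional[str]:
    """Return the version printed by `pandoc --version`, or None if pandoc is not on PATH."""
    if shutil.which("pandoc") is None:
        return None
    output = subprocess.run(
        ["pandoc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    # The first line is expected to look like "pandoc 3.1.2"; take its last token.
    return output.splitlines()[0].split()[-1]


if __name__ == "__main__":
    version = installed_pandoc_version()
    if version is None:
        print("pandoc not found on PATH")
    else:
        print(f"pandoc {version}: RTF input supported = {supports_rtf_input(version)}")
```

Comparing tuples of integers rather than raw strings matters here: as strings, `"2.9"` sorts after `"2.14.2"`, which would wrongly report the Debian package as new enough.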

## Proposed Solution 
- [x] I've added `make install-pandoc` calls, mimicking other recipes,
to ensure that `3.1.2` is installed in all cases. **Side note**:
`make install-pandoc` calls `./scripts/install-pandoc.sh` under the hood.
- [x] Update README file - mention that `make install-pandoc` is
recommended (`>=2.14.2`)
- [x] Verify tests that cover `rtf` cases:
https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/test_unstructured/file_utils/test_file_conversion.py#L14
- [x] Update `setup_ubuntu.sh` if needed?:
https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/scripts/setup_ubuntu.sh#L87
micmarty-deepsense authored Mar 4, 2024
1 parent 43250d5 commit b9aa4b7
Showing 7 changed files with 32 additions and 9 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -106,7 +106,6 @@ jobs:
 source .venv/bin/activate
 sudo apt-get update
 sudo apt-get install -y libmagic-dev poppler-utils libreoffice
-make install-pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
 sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
 tesseract --version
@@ -327,7 +326,8 @@ jobs:
 run: |
 source .venv/bin/activate
 sudo apt-get update
-sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
+sudo apt-get install -y libmagic-dev poppler-utils libreoffice
+make install-pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
 sudo apt-get install -y tesseract-ocr
 sudo apt-get install -y tesseract-ocr-kor
@@ -390,7 +390,8 @@ jobs:
 run: |
 source .venv/bin/activate
 sudo apt-get update
-sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
+sudo apt-get install -y libmagic-dev poppler-utils libreoffice
+make install-pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
 sudo apt-get install -y tesseract-ocr
 sudo apt-get install -y tesseract-ocr-kor
@@ -437,7 +438,6 @@ jobs:
 # FIXME (yao): sometimes there is cache but we still miss argilla in the env; so we add make install-ci again
 make install-ci
 sudo apt-get update && sudo apt-get install --yes poppler-utils libreoffice
-make install-pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
 sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
 tesseract --version
3 changes: 2 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -100,7 +100,8 @@ jobs:
 run: |
 source .venv/bin/activate
 sudo apt-get update
-sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
+sudo apt-get install -y libmagic-dev poppler-utils libreoffice
+make install-pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
 sudo apt-get install -y tesseract-ocr
 sudo apt-get install -y tesseract-ocr-kor
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,8 @@
 ### Fixes

 * **Fix SharePoint dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
+* **Include warnings** about the potential risk of installing a version of `pandoc` which does not support RTF files + instructions that will help resolve that issue.
+* **Incorporate the `install-pandoc` Makefile recipe** into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
 * **Fix Google Drive source key** Allow passing string for source connector key.

 ## 0.12.5
4 changes: 2 additions & 2 deletions Makefile
@@ -21,10 +21,10 @@ install-base: install-base-pip-packages install-nltk-models
 install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-all-docs

 .PHONY: install-ci
-install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test
+install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test install-pandoc

 .PHONY: install-base-ci
-install-base-ci: install-base-pip-packages install-nltk-models install-test
+install-base-ci: install-base-pip-packages install-nltk-models install-test install-pandoc

 .PHONY: install-base-pip-packages
 install-base-pip-packages:
2 changes: 1 addition & 1 deletion README.md
@@ -119,7 +119,7 @@ installation.
 - `poppler-utils` (images and PDFs)
 - `tesseract-ocr` (images and PDFs, install `tesseract-lang` for additional language support)
 - `libreoffice` (MS Office docs)
-- `pandoc` (EPUBs, RTFs and Open Office docs)
+- `pandoc` (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version `2.14.2` or newer. Running either `make install-pandoc` or `./scripts/install-pandoc.sh` will install the correct version for you.

 - For suggestions on how to install on the Windows and to learn about dependencies for other features, see the
 installation documentation [here](https://unstructured-io.github.io/unstructured/installing.html).
2 changes: 1 addition & 1 deletion docs/source/introduction/getting_started.rst
@@ -24,7 +24,7 @@ This guide offers concise steps to swiftly install and validate your ``unstructu
 - `poppler-utils` : Needed for images and PDFs.
 - `tesseract-ocr` : Essential for images and PDFs.
 - `libreoffice` : For MS Office documents.
-- `pandoc` : For EPUBs, RTFs, and Open Office documents.
+- `pandoc` : For EPUBs, RTFs, and Open Office documents. Please note that to handle RTF files, you need version `2.14.2` or newer. Running `this script <https://github.com/Unstructured-IO/unstructured/blob/main/scripts/install-pandoc.sh>`__ will install the correct version for you.

 Validating Installation
 -----------------------
20 changes: 20 additions & 0 deletions unstructured/file_utils/file_conversion.py
@@ -21,6 +21,26 @@ def convert_file_to_text(filename: str, source_format: str, target_format: str)
             f"{err}"
         )
         raise FileNotFoundError(msg)
+    except RuntimeError as err:
+        supported_source_formats, _ = pypandoc.get_pandoc_formats()
+
+        if source_format == "rtf" and source_format not in supported_source_formats:
+            additional_info = (
+                "Support for RTF files is not available in the current pandoc installation. "
+                "It was introduced in pandoc 2.14.2.\n"
+                "Reference: https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21"
+            )
+        else:
+            additional_info = ""
+
+        msg = (
+            f"{err}\n\n{additional_info}\n\n"
+            f"Current version of pandoc: {pypandoc.get_pandoc_version()}\n"
+            "Make sure you have the right version installed in your system. "
+            "Please, follow the pandoc installation instructions "
+            "in README.md to install the right version."
+        )
+        raise RuntimeError(msg)

     return text
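
The message assembly added in this hunk can be exercised without a pandoc installation. The sketch below is a simplified stand-in, not the code from the commit: `build_conversion_error_message` is a hypothetical helper, and the supported formats and the pandoc version are passed in as plain arguments, where the commit obtains them from `pypandoc.get_pandoc_formats()` and `pypandoc.get_pandoc_version()`.

```python
from typing import Collection

RTF_HINT = (
    "Support for RTF files is not available in the current pandoc installation. "
    "It was introduced in pandoc 2.14.2.\n"
    "Reference: https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21"
)


def build_conversion_error_message(
    err: Exception,
    source_format: str,
    supported_source_formats: Collection[str],
    pandoc_version: str,
) -> str:
    """Hypothetical helper mirroring the message built in the except block above."""
    # Add the RTF hint only when RTF was requested and this pandoc cannot read it.
    if source_format == "rtf" and source_format not in supported_source_formats:
        additional_info = RTF_HINT
    else:
        additional_info = ""

    return (
        f"{err}\n\n{additional_info}\n\n"
        f"Current version of pandoc: {pandoc_version}\n"
        "Make sure you have the right version installed in your system. "
        "Please, follow the pandoc installation instructions "
        "in README.md to install the right version."
    )


# Older pandoc (no "rtf" among the input formats): the hint is included.
old = build_conversion_error_message(
    RuntimeError('Invalid input format! Got "rtf"'), "rtf", ["docx", "html", "odt"], "2.9.2.1"
)
# Newer pandoc (with "rtf" among the input formats): the hint is omitted.
new = build_conversion_error_message(
    RuntimeError("some other failure"), "rtf", ["docx", "html", "rtf"], "3.1.2"
)
print("2.14.2" in old, "2.14.2" in new)
```

When no hint applies, the two `\n\n` separators around the empty `additional_info` leave four consecutive newlines in the message, which matches the behavior of the hunk above.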
