fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) #2593

Merged on Mar 4, 2024 (9 commits)
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -106,7 +106,6 @@ jobs:
source .venv/bin/activate
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
@@ -327,7 +326,8 @@ jobs:
run: |
source .venv/bin/activate
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y tesseract-ocr-kor
@@ -390,7 +390,8 @@ jobs:
run: |
source .venv/bin/activate
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y tesseract-ocr-kor
@@ -437,7 +438,6 @@ jobs:
# FIXME (yao): sometimes there is cache but we still miss argilla in the env; so we add make install-ci again
make install-ci
sudo apt-get update && sudo apt-get install --yes poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
3 changes: 2 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -100,7 +100,8 @@ jobs:
run: |
source .venv/bin/activate
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y tesseract-ocr-kor
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,8 @@
### Fixes

* **Fix SharePoint dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
* **Include warnings** about the risk of installing a version of `pandoc` that does not support RTF files, plus instructions for resolving the issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
* **Fix Google Drive source key** Allow passing string for source connector key.

## 0.12.5
4 changes: 2 additions & 2 deletions Makefile
@@ -21,10 +21,10 @@ install-base: install-base-pip-packages install-nltk-models
install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-all-docs

.PHONY: install-ci
install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test
install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test install-pandoc

.PHONY: install-base-ci
install-base-ci: install-base-pip-packages install-nltk-models install-test
install-base-ci: install-base-pip-packages install-nltk-models install-test install-pandoc

.PHONY: install-base-pip-packages
install-base-pip-packages:
2 changes: 1 addition & 1 deletion README.md
@@ -119,7 +119,7 @@ installation.
- `poppler-utils` (images and PDFs)
- `tesseract-ocr` (images and PDFs, install `tesseract-lang` for additional language support)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs, RTFs and Open Office docs)
- `pandoc` (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version `2.14.2` or newer. Running either `make install-pandoc` or `./scripts/install-pandoc.sh` will install the correct version for you.

- For suggestions on how to install on the Windows and to learn about dependencies for other features, see the
installation documentation [here](https://unstructured-io.github.io/unstructured/installing.html).
2 changes: 1 addition & 1 deletion docs/source/introduction/getting_started.rst
@@ -24,7 +24,7 @@ This guide offers concise steps to swiftly install and validate your ``unstructu
- `poppler-utils` : Needed for images and PDFs.
- `tesseract-ocr` : Essential for images and PDFs.
- `libreoffice` : For MS Office documents.
- `pandoc` : For EPUBs, RTFs, and Open Office documents.
- `pandoc` : For EPUBs, RTFs, and Open Office documents. Please note that to handle RTF files, you need version `2.14.2` or newer. Running `this script <https://github.com/Unstructured-IO/unstructured/blob/main/scripts/install-pandoc.sh>`__ will install the correct version for you.

Validating Installation
-----------------------
20 changes: 20 additions & 0 deletions unstructured/file_utils/file_conversion.py
@@ -21,6 +21,26 @@ def convert_file_to_text(filename: str, source_format: str, target_format: str)
            f"{err}"
        )
        raise FileNotFoundError(msg)
    except RuntimeError as err:
        supported_source_formats, _ = pypandoc.get_pandoc_formats()

        if source_format == "rtf" and source_format not in supported_source_formats:
            additional_info = (
                "Support for RTF files is not available in the current pandoc installation. "
                "It was introduced in pandoc 2.14.2.\n"
                "Reference: https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21"
            )
        else:
            additional_info = ""

        msg = (
            f"{err}\n\n{additional_info}\n\n"
            f"Current version of pandoc: {pypandoc.get_pandoc_version()}\n"
            "Make sure you have the right version installed in your system. "
            "Please, follow the pandoc installation instructions "
            "in README.md to install the right version."
        )
        raise RuntimeError(msg)

    return text
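
The new `except RuntimeError` branch reports the installed pandoc version only after a conversion has failed. As a rough illustration of the same `2.14.2` threshold checked up front, here is a minimal sketch; the names `MIN_RTF_VERSION`, `parse_version`, and `supports_rtf` are hypothetical and not part of this PR, and the helper assumes version strings made of dot-separated integers such as the one returned by `pypandoc.get_pandoc_version()`.

```python
import re
from typing import Tuple

# Assumption: RTF input support starts at pandoc 2.14.2, as stated in the
# error message added by this PR.
MIN_RTF_VERSION = (2, 14, 2)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '2.14.2' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def supports_rtf(version: str) -> bool:
    """Return True if this pandoc version is at or above the RTF threshold."""
    # Tuples compare element by element, so (2, 9, 2, 1) < (2, 14, 2).
    return parse_version(version) >= MIN_RTF_VERSION


if __name__ == "__main__":
    for candidate in ("2.9.2.1", "2.14.2", "3.1.11"):
        print(candidate, supports_rtf(candidate))
```

In a real environment the argument could come from `pypandoc.get_pandoc_version()`. The PR itself checks `pypandoc.get_pandoc_formats()` instead, which asks pandoc directly which input formats it accepts rather than inferring support from the version number.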
