fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) #2593
…-related errors
add more verbose error when convert_file_to_text fails due to pandoc
add note in README about pandoc version and the proper way to install it

All tests pass with 3.1.2 and fail with versions older than 2.14.2:

```
make install-pandoc
make test-extra-pypandoc
```
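As a rough illustration of the "more verbose error" idea in the commit message above, here is a minimal sketch. It is not the code from this PR: `convert_with_hint` and its `convert` argument are hypothetical names, and the only assumption about the real failure is that it surfaces as a `RuntimeError` containing "Invalid input format", as in the traceback quoted in the PR description.

```python
from typing import Callable

# Assumption: minimum version taken from the commit message above.
MIN_PANDOC_FOR_RTF = "2.14.2"


def convert_with_hint(
    convert: Callable[[str, str, str], str],
    filename: str,
    source_format: str,
    target_format: str,
) -> str:
    """Run `convert` and re-raise pandoc format errors with an install hint.

    `convert` is a hypothetical stand-in for the real conversion call; it takes
    (filename, source_format, target_format) and returns the converted text.
    """
    try:
        return convert(filename, source_format, target_format)
    except RuntimeError as err:
        # Only decorate the error reported for unsupported input formats;
        # any other RuntimeError propagates unchanged.
        if "Invalid input format" not in str(err):
            raise
        hint = (
            f"pandoc could not read '{source_format}' input from {filename}. "
            f"The installed pandoc is probably older than {MIN_PANDOC_FOR_RTF}; "
            "run `make install-pandoc` to install a supported version."
        )
        raise RuntimeError(f"{hint}\nOriginal error: {err}") from err
```

Chaining with `from err` keeps the original pandoc message in the traceback, so the hint adds context without hiding the list of formats pandoc reported.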
A little related to this PR: at this line https://github.com/Unstructured-IO/unstructured/blob/main/.github/workflows/ci.yml#L438, just for cleanliness, I'm wondering whether it's OK if we add the

But since it's not directly related to this PR, it's your call :)

Anyway, you just need to update the CHANGELOG and update the branch, then all LGTM! Switch this to ready for review and then I can approve it!
Good catch @Klaijan! I removed those.

The PR is no longer in draft mode, so if it's still legitimate, please approve ;) (once all the checks pass)
LGTM!
…upports .rtf files as input format) (#2593)

## Problem Description

In some cases pandoc won't be able to process `rtf` as an input file format, because older versions simply do not support it.

```
RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki
```

Basically, a user may install a version that is too old. The `README.md` is not precise enough when it mentions RTF file support: https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/README.md?plain=1#L120-L122

## Example

Installing `pandoc` from a [stable repository, like Debian](https://packages.debian.org/source/bullseye/pandoc) will give you `2.9`, while the official release notes show clearly that support for rtf was introduced in `2.14.2`: https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8)

### Note that `rtf` is not there

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38)

### More detail

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8)

## Proposed Solution

- [x] I've simply added/copied `make install-pandoc` calls, mimicking other recipes, in order to ensure that `3.1.2` will be installed in all cases. **Side note**: `make install-pandoc` calls `./scripts/install-pandoc.sh` under the hood.
- [x] Update README file - mention that `make install-pandoc` is recommended (`>=2.14.2`)
- [x] Verify tests that cover `rtf` cases: https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/test_unstructured/file_utils/test_file_conversion.py#L14
- [x] Update `setup_ubuntu.sh` if needed?: https://github.com/Unstructured-IO/unstructured/blob/47b35ccdd61ffbc376c86e9bb08a2039b042cc2b/scripts/setup_ubuntu.sh#L87
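To check up front whether an installed pandoc is new enough for `rtf` input, something like the following sketch could be used. It is an illustration only, not part of this PR: the helper names are hypothetical, the `2.14.2` threshold is taken from the README note above, and it assumes that `pandoc --version` prints a line of the form `pandoc X.Y.Z`.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Assumption: threshold taken from the ">=2.14.2" recommendation above.
MIN_RTF_VERSION = (2, 14, 2)


def parse_pandoc_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract the version tuple from the text printed by `pandoc --version`."""
    match = re.search(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def supports_rtf_input(version: Tuple[int, ...]) -> bool:
    """Return True if this pandoc version is at least MIN_RTF_VERSION."""
    return version >= MIN_RTF_VERSION


def installed_pandoc_version() -> Optional[Tuple[int, ...]]:
    """Return the version of the pandoc found on PATH, or None if unavailable."""
    if shutil.which("pandoc") is None:
        return None
    result = subprocess.run(
        ["pandoc", "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout)


if __name__ == "__main__":
    version = installed_pandoc_version()
    if version is None:
        print("pandoc not found; run `make install-pandoc`")
    elif not supports_rtf_input(version):
        print("pandoc is too old for rtf input; run `make install-pandoc`")
    else:
        print("pandoc supports rtf input")
```

Comparing tuples rather than strings avoids the trap where `"2.9"` sorts after `"2.14"` lexicographically.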