Feature request: provide command line tool #1

Closed
gnusupport opened this issue Jan 2, 2025 · 15 comments · Fixed by #2 or #3
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

@gnusupport

Just like markitdown, I suggest you provide a command-line tool for PDF conversions.

@AstraBert
Owner

Legit :)

@AstraBert AstraBert self-assigned this Jan 2, 2025
@AstraBert AstraBert added the enhancement New feature or request label Jan 2, 2025
@AstraBert AstraBert linked a pull request Jan 2, 2025 that will close this issue
@AstraBert
Owner

Implemented!🥰

If you install pdfitdown == 0.0.2 you will have the command line tool! (See README for examples)

@gnusupport
Author

👍😊 Gratitude for the 'pdfitdown' tool! Your innovation makes PDF management a breeze. 🌟 Thank you! 💌📄

@gnusupport
Author

bin/pdfitdown -i ../line.md -o ../line.pdf
🚨 Segmentation fault

@gnusupport
Author

I do not know how to give you other errors; I just see a segmentation fault.
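When a command-line tool dies with nothing but "Segmentation fault", a little more detail can be gathered by launching it from a small wrapper and inspecting the exit status. The following is a minimal sketch, not part of pdfitdown: the helper names `describe_exit` and `run_and_report` are hypothetical, and it assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(command: list) -> str:
    """Run a command, capture its output, and describe how it ended."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Example: pass the failing command as arguments to this script.
    if len(sys.argv) > 1:
        print(run_and_report(sys.argv[1:]))
```

If the tool is itself a Python script, running it with `python -X faulthandler` may also print a Python-level traceback at the moment of the crash, which is usually more useful to a maintainer than the bare message.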

@AstraBert
Owner

Can you send me the files you are using for it, so that I can try to reproduce it? :)

@AstraBert AstraBert reopened this Jan 2, 2025
@gnusupport
Author

line.md

@AstraBert
Owner

Hi!

So the issue is related to the PyMuPDF version used by markdown-pdf: being <1.24.6, it is not able to handle empty headings (such as the one at line 59 in your markdown file), and so it produces a segmentation fault. See this issue to know more. Nevertheless, I am going to implement a temporary fix (until markdown-pdf fixes the issue itself) in the next release :)
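For readers stuck on an affected PyMuPDF version, one possible stopgap is to remove empty headings from the markdown before converting it. This is only a sketch under assumptions: it is not the fix shipped in pdfitdown or described in its README, the function name `drop_empty_headings` is hypothetical, and it handles only ATX-style headings (lines made of `#` markers with no text), skipping anything inside fenced code blocks.

```python
import re

# An ATX heading with markers but no text, e.g. "#", "##   ", or "### ###".
EMPTY_HEADING = re.compile(r"^ {0,3}#{1,6}(?:[ \t]+#*)?[ \t]*$")


def drop_empty_headings(markdown_text: str) -> str:
    """Return the text without empty ATX heading lines.

    Lines inside fenced code blocks are left untouched. The result is
    joined with newlines and has no trailing newline.
    """
    kept = []
    in_fence = False
    for line in markdown_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("```") or stripped.startswith("~~~"):
            # Toggle on every fence marker so code content is preserved.
            in_fence = not in_fence
        if not in_fence and EMPTY_HEADING.match(line):
            continue  # skip the empty heading
        kept.append(line)
    return "\n".join(kept)
```

The cleaned text can then be written to a temporary file and passed to the converter in place of the original.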

@AstraBert AstraBert linked a pull request Jan 2, 2025 that will close this issue
@AstraBert AstraBert added the bug Something isn't working label Jan 2, 2025
@AstraBert
Owner

For a temporary fix, see the README.

@gnusupport
Author

Found existing installation: PyMuPDF 1.24.2
Uninstalling PyMuPDF-1.24.2:
  Successfully uninstalled PyMuPDF-1.24.2

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown-pdf 1.3.2 requires PyMuPDF==1.24.2, but you have pymupdf 1.24.6 which is incompatible.
Successfully installed PyMuPDFb-1.24.6 pymupdf-1.24.6
(TTS) ~/TTS
$

I guess I can only wait
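Whether an environment is affected can be checked without running a conversion. The sketch below compares a PyMuPDF version string against the 1.24.6 threshold mentioned earlier in this thread. The helper names are hypothetical, and the parser is deliberately simple: it reads only the leading digits of the first three dot-separated parts, so it is not a full PEP 440 implementation.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# From the thread: PyMuPDF versions below this crash on empty headings.
FIXED_IN = (1, 24, 6)


def parse_version(text: str) -> tuple:
    """Turn '1.24.2' into (1, 24, 2), padding missing parts with zeros."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_affected(pymupdf_version: str) -> bool:
    """True if the given version predates the fix for empty headings."""
    return parse_version(pymupdf_version) < FIXED_IN


def installed_pymupdf_version():
    """Return the installed PyMuPDF version string, or None if absent."""
    try:
        return version("PyMuPDF")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pymupdf_version()
    if found is None:
        print("PyMuPDF is not installed")
    else:
        print(found, "affected" if is_affected(found) else "not affected")
```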

@gnusupport
Author

I have tested it now, it gives very nice output.

@gnusupport
Author

Links are underlined but not working. I guess that is intentional?

@AstraBert
Owner

Yep, pdfitdown for now is intended for textual data ingestion (e.g. for uploading text to vector dbs for RAG apps), and so it does not create functional links. This may change with future releases, tho :)

@AstraBert
Owner

AstraBert commented Jan 3, 2025

Hi @gnusupport, just to let you know, pdfitdown v0.0.4 has native support for PyMuPdf v1.25.1 so you don't have to do the workaround that was necessary for previous versions🥰

@gnusupport
Author

Yes, I am using it. It generates PDFs, and I can use it without links. I will need time to reach the point where I can use it for LLM training; for now it is useful for documents outside that scope.
