Cannot extract_text from weasyprint generated PDF #242
Comments
pypdf2==1.26.0 In these versions, the text is also not extracted for the example above. |
It doesn’t work because PyPDF2 doesn’t use the |
PyPDF2 now uses |
I can make the script work by adding this code in _extract_text:

    if isinstance(op, bytes):
        process_operation(b"Tj", [op.decode('utf-16be')])

Maybe this case is not supported because WeasyPrint uses 2-byte codes for its strings. This code is probably a dirty workaround, but you’ll find the correct fix faster than me 😁. (We get the right text, so |
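To illustrate what the workaround above relies on, here is a minimal standalone sketch of decoding a text operand made of 2-byte codes as UTF-16BE. The helper name `decode_two_byte_operand` is hypothetical and not part of PyPDF2. The sketch assumes the 2-byte codes happen to coincide with UTF-16BE code units; in general the mapping from codes to characters comes from the font, so this is unlikely to be a complete fix.

```python
def decode_two_byte_operand(op):
    """Hypothetical helper, not a PyPDF2 function.

    If the operand arrives as raw bytes, interpret every pair of bytes
    as one big-endian UTF-16 code unit. Anything else is returned as-is.
    """
    if isinstance(op, bytes):
        return op.decode("utf-16be")
    return op


# "Hello" stored with two bytes per character, high byte first.
raw = "Hello".encode("utf-16be")
assert raw == b"\x00H\x00e\x00l\x00l\x00o"

print(decode_two_byte_operand(raw))      # Hello
print(decode_two_byte_operand("plain"))  # plain (already a str, unchanged)
```

An operand with an odd number of bytes would raise `UnicodeDecodeError` here, which is one reason to treat this as a diagnostic aid rather than a robust decoder.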
@liZe Very nice! Thanks for sharing! @pubpub-zz You're the expert here. Do you think adding it like this is ok? |
the problem is a little more tricky. |
@liZe, can you retest with the latest version and give feedback? |
👏 it works 👏 |
@MartinThoma |
Very nice! Amazing work @pubpub-zz and thank you for confirming @liZe 🤗 |
Generating a PDF with the following code ends up not returning anything from extractText. In this issue: Kozea/WeasyPrint/issues/290 @liZe points out that other tools are able to extract the text.