Cannot `extract_text` from weasyprint generated PDF #242

mattjmorrison-imt · 2016-01-12T19:06:01Z

For a PDF generated with the following code, `extract_text` returns nothing.

"""
PyPDF2==2.1.0
WeasyPrint==55.0
"""

from io import BytesIO
from PyPDF2 import PdfReader

# Create example
from weasyprint import HTML
stream = BytesIO()
HTML(string="""
<html>
<body>
<div>Hello World</div>
</body>
</html>
""").write_pdf(stream)
stream.seek(0)

# Try to read "Hello World"
reader = PdfReader(stream)
print(reader.pages[0].extract_text())

In Kozea/WeasyPrint/issues/290, @liZe points out that other tools are able to extract the text.

afedosenko · 2021-08-30T15:31:29Z

pypdf2==1.26.0
weasyprint==53.2

In these versions, the text is also not extracted for the example above.

liZe · 2021-08-30T17:52:46Z

> In these versions, the text is also not extracted for the example above.

It doesn’t work because PyPDF2 doesn’t use the /Encoding and/or the /ToUnicode information included in embedded fonts. As far as I can tell, there’s no easy fix :/ It will probably require a certain amount of work.
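To illustrate what using /ToUnicode involves, here is a minimal sketch, assuming a font whose ToUnicode CMap contains only `bfchar` entries and whose strings use 2-byte codes. The CMap fragment, the `parse_bfchar` and `decode_codes` helpers, and the sample codes are all hypothetical; they are not taken from a WeasyPrint file or from PyPDF2's code.

```python
import re

# Hypothetical ToUnicode CMap fragment (not actual WeasyPrint output).
SAMPLE_CMAP = """
4 beginbfchar
<0001> <0048>
<0002> <0065>
<0003> <006C>
<0004> <006F>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings using the bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for code_hex, uni_hex in pairs:
            # The destination value is UTF-16BE encoded text.
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def decode_codes(raw, mapping, code_size=2):
    """Split a string operand into fixed-size codes and map each one."""
    chars = []
    for i in range(0, len(raw), code_size):
        code = int.from_bytes(raw[i:i + code_size], "big")
        # Unknown codes become the Unicode replacement character.
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


mapping = parse_bfchar(SAMPLE_CMAP)
text = decode_codes(bytes.fromhex("00010002000300030004"), mapping)
print(text)  # Hello
```

Without this lookup, the raw codes (1, 2, 3, ...) have no relation to the characters they stand for, which is consistent with nothing readable being extracted.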

MartinThoma · 2022-06-06T12:17:24Z

PyPDF2 now uses /Encoding and /ToUnicode. Sadly, this issue is still open.

liZe · 2022-06-06T13:12:07Z

> PyPDF2 now uses /Encoding and /ToUnicode. Sadly, this issue is still open.

I can make the script work by adding this code in _extract_text:

if isinstance(op, bytes):
    process_operation(b"Tj", [op.decode('utf-16be')])

Maybe this case is not supported because WeasyPrint uses 2-byte codes for its strings. This code is probably a dirty workaround, but you’ll find the correct fix faster than me 😁.

(We get the right text, so /Encoding and /ToUnicode seem to work correctly as we always use custom encodings in WeasyPrint. 👏🎉)
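What the `decode('utf-16be')` call does to a 2-byte-code string can be sketched as follows. This is only an illustration with a made-up operand; it does not reproduce PyPDF2's `_extract_text`, and how the resulting characters are mapped afterwards is not shown.

```python
# Hypothetical string operand holding three 2-byte codes: 0x0001, 0x0002, 0x0003.
operand = bytes.fromhex("000100020003")

# Treating each byte as its own code yields six codes, half of them zero.
one_byte_codes = list(operand)

# Decoding as UTF-16BE groups the bytes in pairs: one character per 2-byte code.
two_byte_codes = [ord(ch) for ch in operand.decode("utf-16-be")]

print(one_byte_codes)  # [0, 1, 0, 2, 0, 3]
print(two_byte_codes)  # [1, 2, 3]
```

Note that this grouping only holds for codes that are valid UTF-16 on their own: `decode` raises `UnicodeDecodeError` for an unpaired surrogate such as `b"\xd8\x00"`, so it is not a general way to split arbitrary 2-byte codes.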

MartinThoma · 2022-06-06T13:22:06Z

@liZe Very nice! Thanks for sharing!

@pubpub-zz You're the expert here. Do you think adding it like this is ok?

pubpub-zz · 2022-06-06T22:09:28Z

The problem is a little trickier than that. It is under analysis.

pubpub-zz · 2022-06-19T12:13:35Z

@liZe, can you retest with the latest version and give feedback?

liZe · 2022-06-19T16:43:51Z

> @liZe, can you retest with the latest version and give feedback?

👏 it works 👏

pubpub-zz · 2022-06-19T16:46:19Z

@MartinThoma, can you close it then?

MartinThoma · 2022-06-19T16:58:08Z

Very nice! Amazing work @pubpub-zz and thank you for confirming @liZe 🤗

mstamy2 added the workflow-text-extraction label May 19, 2016

MartinThoma added the Has MCVE label Apr 18, 2022

MartinThoma mentioned this issue Jun 6, 2022: ExtractText yields nothing for apparently good PDF #168 (Closed)

MartinThoma changed the title ~~Cannot extractText from weasyprint generated PDF~~ Cannot extract_text from weasyprint generated PDF Jun 10, 2022

pubpub-zz mentioned this issue Jun 10, 2022: improved ExtractText(3) #969 (Merged)

MartinThoma closed this as completed Jun 19, 2022
