This PDF has completely arbitrary and corrupt ToUnicode character mappings, so it's unlikely that pdfminer can do much about it. You can see the problem by copying and pasting text out of it in your browser's PDF viewer (in my case, Chrome). Even the English text is corrupted. For example, "The dancers" on page 3 comes out as:
Issue:
When attempting to extract text from the attached PDF, several pages return cid values instead of readable text. Additionally, pages containing mixed content (text and images) do not return any text at all.
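To narrow down which pages are affected, a rough sketch like the one below may help. It assumes the text of each page is already available as a list of strings (for instance, from calling pdfminer.six's `extract_text` once per page with `page_numbers`), and it relies on pdfminer writing a `(cid:N)` placeholder for glyphs it cannot map to Unicode. The name `cid_report` and the sample strings are hypothetical, not taken from the attached PDF.

```python
import re

# pdfminer emits "(cid:N)" for a glyph that has no usable Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_report(page_texts):
    """Return one (page_number, cid_count, remaining_chars) tuple per page.

    remaining_chars is the length of the text left after removing the
    placeholders and surrounding whitespace; 0 means nothing readable.
    """
    report = []
    for number, text in enumerate(page_texts, start=1):
        cid_count = len(CID_PATTERN.findall(text))
        remaining = len(CID_PATTERN.sub("", text).strip())
        report.append((number, cid_count, remaining))
    return report


# Hypothetical page texts for illustration only.
sample_pages = ["The dancers", "(cid:12)(cid:7) (cid:44)", ""]
for page, cids, chars in cid_report(sample_pages):
    print(f"page {page}: {cids} cid placeholders, {chars} readable characters")
```

A page with a high placeholder count and no readable characters points to a font mapping problem, while a page with zero of both returned no text at all, which may match the mixed text-and-image pages described above.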
Affected PDF:
The Phantom of the Opera.pdf
Code Sample:
Output:
The extracted content includes cid values such as: