Implementation of advanced cmap encodings #2356
Comments
No, there is none. I guess only @pubpub-zz can help you with that. |
@stefan6419846
|
@pubpub-zz Thanks for pointing this out. It seems to indeed work. When looking at this, two questions arose for me:
|
Similar issues occur for "/UniCNS-UTF16-H", "/ETen-B5-H", "/ETen-B5-V" and "/ETenms-B5-H". How should _cmap be modified? |
@actuary-chen can you please share your pdf for analysis? |
Hi,
Maybe this concerns these two files.
Benjamin
|
@actuary-chen the files are not attached. Please attach them directly in the thread |
The issues may come from files such as the attached ones. |
@actuary-chen
|
This issue seems solved. Don't know why it has not been closed automatically |
This has not been closed before as I was looking for a generic solution for implementing all possible encodings in one step instead of opening a new issue for each one. |
We need to check the encodings. I cannot see a global solution. |
Currently, I am trying to extract text from PDF files which partially report some warnings like
I have seen this for both encodings mentioned above and for /StandardEncoding.
Digging through the available resources related to the GBK2K cmaps, I found some Adobe resources as well as the implementation from pdfminer.six, which ships some custom pickled files derived from the Adobe open source components to handle such cases.
Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?
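One lightweight direction, short of shipping full CMap resources, would be to map predefined CMap names to Python codecs that cover the same byte encoding. The sketch below is only an illustration: the table `PREDEFINED_CMAP_CODECS` and the function `decode_with_predefined_cmap` are hypothetical names, not pypdf's API, and the codec choices are assumptions based on the character sets these CMaps are described as covering in the PDF specification. Predefined CMaps really map codes to CIDs, so decoding the bytes directly is an approximation.

```python
from typing import Optional

# Hypothetical lookup table (an assumption, not pypdf's actual data):
# predefined CMap name -> Python codec for the same byte encoding.
PREDEFINED_CMAP_CODECS = {
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
    "/ETen-B5-H": "cp950",
    "/ETen-B5-V": "cp950",
    "/ETenms-B5-H": "cp950",
    "/UniCNS-UTF16-H": "utf-16-be",
}


def decode_with_predefined_cmap(encoding_name: str, raw: bytes) -> Optional[str]:
    """Decode raw string bytes using the codec assumed for a predefined CMap.

    Returns None when the encoding name is not in the table, so the caller
    can fall back to its existing behaviour (for example, emitting a warning).
    """
    codec = PREDEFINED_CMAP_CODECS.get(encoding_name)
    if codec is None:
        return None
    # Undecodable byte sequences become U+FFFD instead of raising.
    return raw.decode(codec, errors="replace")
```

A table like this keeps each new encoding to a one-line change, but it cannot handle fonts whose codes do not follow the named byte encoding; those would still need the full CMap data.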
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
For now, I have no non-confidential file I could share here. Looking at the example file, it seems to be a scan of a document (from a Canon device?) containing Latin characters, with misconfigured or otherwise odd OCR that yields a mix of Latin and Chinese characters inside the text layer.
Traceback
warnings.warn, as currently used, only prints the pypdf code line where this occurred, so there is not much of a traceback.
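If a fuller traceback is needed, one option is to escalate warnings to exceptions with the standard library's warnings filters. This is a minimal sketch: `extract_text_stub` is a hypothetical stand-in for the library call that warns, and its message is a placeholder rather than the actual pypdf output.

```python
import traceback
import warnings


def extract_text_stub() -> str:
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("encoding not implemented (placeholder message)")
    return ""


def run_with_traceback() -> str:
    """Turn warnings into exceptions and return the formatted traceback."""
    with warnings.catch_warnings():
        # Inside this block, every warning is raised as an exception.
        warnings.simplefilter("error")
        try:
            extract_text_stub()
        except UserWarning:
            return traceback.format_exc()
    return ""


if __name__ == "__main__":
    print(run_with_traceback())
```

The same effect is available without code changes by running the interpreter with `-W error`.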