Can't extract proper local-language (e.g. Kannada, Tamil) text from PDF #7

Open
rustiever opened this issue Jun 18, 2020 · 2 comments

Comments

@rustiever

rustiever commented Jun 18, 2020

I only tested on Android, so the error might come from PdfBox.

W/PdfBox-Android( 6641): No Unicode mapping for CID+222 (222) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+254 (254) in font TAUElangoArunthathi
I/chatty  ( 6641): uid=10281(com.example.text_audio) Thread-5 identical 2 lines
W/PdfBox-Android( 6641): No Unicode mapping for CID+254 (254) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+270 (270) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+270 (270) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+262 (262) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+262 (262) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+223 (223) in font TAUElangoArunthathi
W/PdfBox-Android( 6641): No Unicode mapping for CID+223 (223) in font TAUElangoArunthathi

I think specifying the font when calling the method might solve it. Just a guess, I'm not sure.

@AlessioLuciani
Owner

Apparently some characters in the font TAUElangoArunthathi have no Unicode mapping, so I guess PdfBox can't turn them into plain text. Unfortunately, I couldn't reproduce the error: I tried with a Tamil PDF and PdfBox didn't complain. A similar error might occur on iOS too.
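
To narrow down which glyphs are affected, it may help to condense the logcat warnings into a per-font list of CIDs. The snippet below is only a rough sketch: it is not part of this plugin or of PdfBox, the name `unmapped_cids` is hypothetical, and it assumes the warnings always follow the format shown in the log above.

```python
import re
from collections import defaultdict

# Matches warnings in the format seen in the log above, e.g.
#   W/PdfBox-Android( 6641): No Unicode mapping for CID+222 (222) in font TAUElangoArunthathi
# Assumption: the format is stable; other PdfBox versions may word it differently.
WARNING_RE = re.compile(r"No Unicode mapping for CID\+(\d+) \(\d+\) in font (\S+)")


def unmapped_cids(log_lines):
    """Return {font name: sorted list of distinct CIDs that had no Unicode mapping}."""
    found = defaultdict(set)
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:  # unrelated lines (e.g. I/chatty) are skipped
            found[match.group(2)].add(int(match.group(1)))
    return {font: sorted(cids) for font, cids in found.items()}


if __name__ == "__main__":
    sample = [
        "W/PdfBox-Android( 6641): No Unicode mapping for CID+222 (222) in font TAUElangoArunthathi",
        "W/PdfBox-Android( 6641): No Unicode mapping for CID+254 (254) in font TAUElangoArunthathi",
        "I/chatty  ( 6641): uid=10281(com.example.text_audio) Thread-5 identical 2 lines",
        "W/PdfBox-Android( 6641): No Unicode mapping for CID+254 (254) in font TAUElangoArunthathi",
        "W/PdfBox-Android( 6641): No Unicode mapping for CID+270 (270) in font TAUElangoArunthathi",
    ]
    print(unmapped_cids(sample))  # {'TAUElangoArunthathi': [222, 254, 270]}
```

Comparing the resulting CIDs against the glyphs that come out wrong in the extracted text would show whether the missing mappings account for all of the broken characters or only some of them.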

@rustiever
Author

When parsing a Tamil PDF, which font does PdfBox use?
