name2unicode(): handle hexadecimal literals for unicode glyphs in text extraction #230

Merged: 1 commit, Jul 9, 2019

0xabu
Contributor

0xabu commented Feb 25, 2019

This is based on a slight tweak to the fix proposed by @janslifka in #183 (handling of lowercase hex literals, because those showed up in the sample PDF from issue #229).

Fixes #183, #229

@pietermarsman
Member

@0xabu, could you add a regression test that prevents the same mistake in the future?

@pietermarsman
Member

I am not an expert on this matter, but the Adobe Glyph List Specification only talks about hexadecimal numbers, so this seems legitimate.

Also, two people indicate that this solves their problems.

So, I think we should merge this.

@0xabu
Contributor Author

0xabu commented Jul 9, 2019

@pietermarsman thanks for the pointer to the spec. It helps a lot in explaining what is going on, but there are at least two issues:

  • The spec makes it clear that "uni..." is a string of possibly multiple characters, whereas "u..." is always a single character. This code makes no attempt to handle that. In fact, the use of re.search makes it look prone to mis-matching.
  • The spec talks only about uppercase, but clearly PDFs in the wild use lowercase.

Nevertheless, I'm also tempted to merge this because it clearly makes things better than they were before.

0xabu merged commit 6b312ed into pdfminer:master on Jul 9, 2019
eladkehat added a commit to eladkehat/yapdfminer that referenced this pull request Aug 8, 2019
0xabu deleted the unicode_glyph_bug branch on August 16, 2021 14:42
Successfully merging this pull request may close these issues.

Parsing diff for fonts doesn't work correctly