BUG: Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)

This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a `beginbfchar` table in a
ToUnicode CMap.

We take narrow aim at the issue of zero-length (empty) hex
string representations.

We take advantage of the fact that no angle-bracket-delimited hex
string contains a `.` character. When we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal `.` as a placeholder. Later, when we encounter that `.` while
building the character map, we treat it as the empty string it stands for.
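
The placeholder idea can be sketched in isolation roughly as follows. This is a minimal illustration, not the PyPDF2 code itself: the names `PLACEHOLDER`, `normalize_hex_tokens`, and `parse_bfchar_line` are hypothetical, the input is assumed to be a single `beginbfchar` line, and source codes are assumed to be two bytes wide.

```python
from binascii import unhexlify

# "." can never appear inside a <...> hex string, so it is safe as a marker.
PLACEHOLDER = b"."


def normalize_hex_tokens(line: bytes) -> bytes:
    """Strip the angle brackets, writing PLACEHOLDER for an empty <> string."""
    parts = line.split(b"<")
    for i, part in enumerate(parts):
        j = part.find(b">")
        if j >= 0:
            # j == 0 means the hex string was empty ("<>").
            content = PLACEHOLDER if j == 0 else part[:j].replace(b" ", b"")
            parts[i] = content + b" " + part[j + 1:]
    return b" ".join(parts)


def parse_bfchar_line(line: bytes) -> dict:
    """Map each source code to its destination string ("" for PLACEHOLDER)."""
    tokens = [t for t in normalize_hex_tokens(line).split(b" ") if t]
    mapping = {}
    while len(tokens) > 1:
        src, dst = tokens[0], tokens[1]
        map_to = ""
        if dst != PLACEHOLDER:
            map_to = unhexlify(dst).decode("utf-16-be", "surrogatepass")
        # Assumption: two-byte source codes, so utf-16-be is used for the key.
        mapping[unhexlify(src).decode("utf-16-be", "surrogatepass")] = map_to
        tokens = tokens[2:]
    return mapping
```

Without the placeholder, the empty destination would vanish when empty tokens are filtered out, and the source/destination pairs would no longer line up.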

One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
`PageObject::process_operation` so that it doesn't try to read the
final character from an empty string.
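
The guard described above amounts to checking the length before indexing. A small standalone sketch, where `needs_space` is a hypothetical helper (the real check lives inline in `process_operation`):

```python
def needs_space(text: str, offset: float, space_width: float) -> bool:
    """Decide whether a positioning offset should be rendered as a space."""
    return (
        abs(offset) >= space_width
        and abs(offset) <= 8 * space_width
        and len(text) > 0       # guard: text may now be empty
        and text[-1] != " "     # only reached when text is non-empty
    )
```

Because `and` short-circuits, `text[-1]` is never evaluated for an empty string, which avoids the `IndexError`.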

Closes #1111
dkg authored Jul 17, 2022
1 parent baeb7d2 commit ae0ff49
Showing 2 changed files with 15 additions and 4 deletions.
18 changes: 14 additions & 4 deletions PyPDF2/_cmap.py
@@ -191,7 +191,13 @@ def parse_to_unicode(
     for i in range(len(ll)):
         j = ll[i].find(b">")
         if j >= 0:
-            ll[i] = ll[i][:j].replace(b" ", b"") + b" " + ll[i][j + 1 :]
+            if j == 0:
+                # string is empty: stash a placeholder here (see below)
+                # see https://github.com/py-pdf/PyPDF2/issues/1111
+                content = b"."
+            else:
+                content = ll[i][:j].replace(b" ", b"")
+            ll[i] = content + b" " + ll[i][j + 1 :]
     cm = (
         (b" ".join(ll))
         .replace(b"[", b" [ ")
@@ -246,13 +252,17 @@ def parse_to_unicode(
             lst = [x for x in l.split(b" ") if x]
             map_dict[-1] = len(lst[0]) // 2
             while len(lst) > 1:
+                map_to = ""
+                # placeholder (see above) means empty string
+                if lst[1] != b".":
+                    map_to = unhexlify(lst[1]).decode(
+                        "utf-16-be", "surrogatepass"
+                    )  # join is here as some cases where the code was split
                 map_dict[
                     unhexlify(lst[0]).decode(
                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                     )
-                ] = unhexlify(lst[1]).decode(
-                    "utf-16-be", "surrogatepass"
-                )  # join is here as some cases where the code was split
+                ] = map_to
                 int_entry.append(int(lst[0], 16))
                 lst = lst[2:]
     for a, value in map_dict.items():
1 change: 1 addition & 0 deletions PyPDF2/_page.py
@@ -1384,6 +1384,7 @@ def process_operation(operator: bytes, operands: List) -> None:
     if (
         (abs(float(op)) >= _space_width)
         and (abs(float(op)) <= 8 * _space_width)
+        and (len(text) > 0)
         and (text[-1] != " ")
     ):
         process_operation(b"Tj", [" "])