Advanced text extraction #464

Closed
wants to merge 45 commits into from
45 commits
fcd7b19
Update gitIgnore
Oct 14, 2018
08ba779
introduce the dbg function
May 2, 2018
46a073b
introduce lineCallback
Oct 14, 2018
915f40d
Treat TD like Td
Apr 29, 2018
232ca33
use absolute positions for text
May 1, 2018
28d57f8
Basic support for CMAP
May 1, 2018
7737642
Fix CMAP handling to support two byte chars
May 2, 2018
354a24e
Basic support for the Tm operands
May 2, 2018
3444f81
Handle TJ text elements
May 3, 2018
ac4e98b
improve CMAP and TJ handling
May 6, 2018
fd6858d
Fix CMAP to support multiple ranges
May 6, 2018
1c297cb
Increase tolerance for garbage after EOF
May 7, 2018
1461ede
Glyph encoding for Type1C
May 3, 2018
f420505
Extract Fonts and Images
May 15, 2018
1619d0a
fix cmap handling for non unicode
May 24, 2018
91870e9
Fix encoding handling
May 24, 2018
6561bc8
fix getDocumentInfo return value
May 28, 2018
c0a0b74
accept streams with no endstream
Jun 1, 2018
b4603fc
Fix Tm handling
Jun 1, 2018
bbb81f8
PyPdf: improve glyph name handling
Jun 10, 2018
aa2d46d
Add linemargin
Jun 13, 2018
c6dbbb0
add the .notdef glyph name
Jun 13, 2018
efd1eec
handle strange char numbers in cmaps
Jun 19, 2018
1cbce88
Fix handling of rare text elements
Jun 19, 2018
2394f39
don't miss the last line
Jun 29, 2018
49bd3fe
fix BFRANGE handling
Jun 29, 2018
ee824d0
CMAP parsing improvements
Jul 2, 2018
6898737
Debugging enhancement
Jul 2, 2018
22ac7ad
hack for singlebyte CMAP
Jul 2, 2018
2ce9465
Real lines sorting
Jul 5, 2018
a7e57e5
Handle graphics state
Jul 7, 2018
1078110
Fix cmap handling
Jul 10, 2018
b262e5a
Fix XMP metadata extraction
Jul 10, 2018
53257d2
Fix TJ and unicode translation code
Oct 15, 2018
6742852
Refactor cmap handling to support random '\n'
Jul 10, 2018
d2eba4b
fix and enhance cmap handling
Jul 10, 2018
9c30676
fix graphics matrix handling
Jul 23, 2018
053d27d
Fix CMAP handling
Jul 25, 2018
0461d8b
Calculate the rightmost X values with a GraphicsMatrix
Jul 25, 2018
cf14170
extract the correct char widths
Aug 6, 2018
06d8971
Add extractTextState
Oct 16, 2018
add17ac
Fix chr_ abstraction
Oct 21, 2018
28a64bd
Fix Python3 support
Oct 21, 2018
7427b5c
More Python3 cleanups
Oct 21, 2018
5c9f0ed
Fix more Python3 compatibility issues
Oct 22, 2018
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@
 build
 .idea/*
 
+.project
+.vscode/launch.json
+.vscode/settings.json
2 changes: 1 addition & 1 deletion PyPDF2/generic.py
@@ -625,7 +625,7 @@ def readFromStream(stream, pdf):
     pos = stream.tell()
     stream.seek(-10, 1)
     end = stream.read(9)
-    if end == b_("endstream"):
+    if end == b_("endstream") or end[:4] == b_("\nend"):
         # we found it by looking back one character further.
         data["__streamdata__"] = data["__streamdata__"][:-1]
     else:
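To make the relaxed check in this hunk easier to follow, here is a minimal standalone sketch of the condition. It is not code from the PR: the helper names `looks_like_endstream` and `read_lookback_window` are hypothetical, and plain bytes literals stand in for PyPDF2's `b_()` wrapper. It only shows which 9-byte look-back windows the old and new conditions accept.

```python
import io


def looks_like_endstream(window: bytes) -> bool:
    """Hypothetical helper mirroring the new condition from the hunk.

    Old check: the window must be exactly b"endstream".
    New check: a window starting with b"\\nend" is accepted as well,
    which covers b"\\nendstrea" and also b"\\nendobj  ".
    """
    return window == b"endstream" or window[:4] == b"\nend"


def read_lookback_window(stream: io.BytesIO) -> bytes:
    """Step back 10 bytes from the current position and read 9 bytes,
    the same seek/read pair the hunk uses (hypothetical wrapper)."""
    stream.seek(-10, 1)
    return stream.read(9)


# Case 1: the marker sits one byte earlier than expected (old and new check pass).
exact = io.BytesIO(b"abcendstreamX")
exact.seek(0, 2)  # pretend the parser has already read to the end
print(looks_like_endstream(read_lookback_window(exact)))

# Case 2: the window starts with a newline followed by "end" (only the new check passes).
relaxed = io.BytesIO(b"abc\nendobj  X")
relaxed.seek(0, 2)
print(looks_like_endstream(read_lookback_window(relaxed)))
```

Because the new condition compares only the first four bytes, it accepts any window that begins with a newline followed by `end`, not just a shifted `endstream` marker. Whether that breadth is intended is not stated in the part of the diff shown here.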