Advanced text extraction #464

Closed
wants to merge 45 commits into from
Commits
fcd7b19
Update gitIgnore
Oct 14, 2018
08ba779
introduce the dbg function
May 2, 2018
46a073b
introduce lineCallback
Oct 14, 2018
915f40d
Treat TD like Td
Apr 29, 2018
232ca33
use absolute positions for text
May 1, 2018
28d57f8
Basic support for CMAP
May 1, 2018
7737642
Fix CMAP handling to support two byte chars
May 2, 2018
354a24e
Basic support for the Tm operands
May 2, 2018
3444f81
Handle TJ text elements
May 3, 2018
ac4e98b
improve CMAP and TJ handling
May 6, 2018
fd6858d
Fix CMAP to support multiple ranges
May 6, 2018
1c297cb
Increase tolerance for garbage after EOF
May 7, 2018
1461ede
Glyph encoding for Type1C
May 3, 2018
f420505
Extract Fonts and Images
May 15, 2018
1619d0a
fix cmap handling for non unicode
May 24, 2018
91870e9
Fix encoding handling
May 24, 2018
6561bc8
fix getDocumentInfo return value
May 28, 2018
c0a0b74
accept streams with no endstream
Jun 1, 2018
b4603fc
Fix Tm handling
Jun 1, 2018
bbb81f8
PyPdf: improve glyph name handling
Jun 10, 2018
aa2d46d
Add linemargin
Jun 13, 2018
c6dbbb0
add the .notdef glyph name
Jun 13, 2018
efd1eec
handle strange char numbers in cmaps
Jun 19, 2018
1cbce88
Fix handling of rare text elements
Jun 19, 2018
2394f39
don't miss the last line
Jun 29, 2018
49bd3fe
fix BFRANGE handling
Jun 29, 2018
ee824d0
CMAP parsing improvements
Jul 2, 2018
6898737
Debugging enhancement
Jul 2, 2018
22ac7ad
hack for singlebyte CMAP
Jul 2, 2018
2ce9465
Real lines sorting
Jul 5, 2018
a7e57e5
Handle graphics state
Jul 7, 2018
1078110
Fix cmap handling
Jul 10, 2018
b262e5a
Fix XMP metadata extraction
Jul 10, 2018
53257d2
Fix TJ and unicode translation code
Oct 15, 2018
6742852
Refactor cmap handling to support random '\n'
Jul 10, 2018
d2eba4b
fix and enhance cmap handling
Jul 10, 2018
9c30676
fix graphics matrix handling
Jul 23, 2018
053d27d
Fix CMAP handling
Jul 25, 2018
0461d8b
Calculate the rightmost X values with a GraphicsMatrix
Jul 25, 2018
cf14170
extract the correct char widths
Aug 6, 2018
06d8971
Add extractTextState
Oct 16, 2018
add17ac
Fix chr_ abstraction
Oct 21, 2018
28a64bd
Fix Python3 support
Oct 21, 2018
7427b5c
More Python3 cleanups
Oct 21, 2018
5c9f0ed
Fix more Python3 compatibility issues
Oct 22, 2018
introduce lineCallback
The optional lineCallback argument is called once for each line of the
extracted text.
The callback receives a list of line elements, each a dict with the keys
'text', 'x' and 'y'.
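For anyone trying this out, here is a minimal sketch of a callback that consumes those line elements. It does not call PyPDF2 at all: `make_line_collector` and `on_line` are hypothetical names, and the sample elements are made up to mirror the shape described above. With this branch applied, the callback would presumably be passed as `page.extractText(lineCallback=on_line)`.

```python
# Sketch of a lineCallback consumer. Names and sample data are illustrative
# assumptions; only the {'text', 'x', 'y'} element shape comes from the commit.

def make_line_collector():
    """Return (callback, lines); the callback stores one joined string per line."""
    lines = []

    def on_line(line_elements):
        # Each element is a dict with 'text', 'x' and 'y'.
        lines.append(" ".join(el["text"] for el in line_elements))

    return on_line, lines


on_line, lines = make_line_collector()

# Invoke the callback by hand with hypothetical elements, the way the
# extractor would once per detected line.
on_line([{"text": "Invoice", "x": 72, "y": 700},
         {"text": "No. 17", "x": 200, "y": 0}])
on_line([{"text": "Total", "x": 72, "y": -14}])

print(lines)  # ['Invoice No. 17', 'Total']
```

A closure keeps the collected lines next to the callback without a global, which is convenient when several pages are processed in turn.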
Assi.Abramovitz@gmail.com authored and Assi.Abramovitz@gmail.com committed Oct 16, 2018
commit 46a073b9e75eea990cb0428c39d4adb19fd88a4a
23 changes: 21 additions & 2 deletions PyPDF2/pdf.py
@@ -2649,7 +2649,7 @@ def compressContentStreams(self):
                 content = ContentStream(content, self.pdf)
             self[NameObject("/Contents")] = content.flateEncode()

-    def extractText(self):
+    def extractText(self, lineCallback=None):
         """
         Locate all text drawing commands, in the order they are provided in the
         content stream, and extract the text. This works well for some PDF
@@ -2664,6 +2664,8 @@ def extractText(self):
         content = self["/Contents"].getObject()
         if not isinstance(content, ContentStream):
             content = ContentStream(content, self.pdf)
+        lastPosition = (0, 0)
+        lineElements = []
         # Note: we check all strings are TextStringObjects. ByteStringObjects
         # are strings where the byte->string encoding was unknown, so adding
         # them to the text here would be gibberish.
@@ -2672,24 +2674,41 @@ def extractText(self):
                 _text = operands[0]
                 if isinstance(_text, TextStringObject):
                     text += _text
-                    text += "\n"
+                    text += "|"
+                    # print("TD = " + str(lastPosition) + " Tj Text Element:" +_text)
+                    if (lastPosition[1] != 0):
+                        text += "\n"
+                        if (lineCallback != None):
+                            lineCallback(lineElements)
+                        lineElements = []
+                    lineElements.append({ 'text':_text, 'x': lastPosition[0], 'y': lastPosition[1]})
             elif operator == b_("T*"):
+                dbg(2, "T*T*T*T*T*T*T*T*T")
                 text += "\n"
             elif operator == b_("'"):
+                dbg(2, "'''''''''''''''''''''''''''''")
                 text += "\n"
                 _text = operands[0]
                 if isinstance(_text, TextStringObject):
                     text += operands[0]
             elif operator == b_('"'):
+                dbg(2, '""""""""""""""""""""""""""""')
                 _text = operands[2]
                 if isinstance(_text, TextStringObject):
                     text += "\n"
                     text += _text
             elif operator == b_("TJ"):
+                dbg(2, "TJTJTJTJTJTJTJTJTJTJTJ")
                 for i in operands[0]:
                     if isinstance(i, TextStringObject):
                         text += i
                 text += "\n"
+            elif operator == b_("Td"):
+
+                # print("Td: x = " + str(operands[0]) + " y = " + str(operands[1]))
+                lastPosition = (operands[0], operands[1])
+            # else:
+                #print ("operator: " + operator)
         return text

     mediaBox = createRectangleAccessor("/MediaBox", ())
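To make the Td/Tj grouping in this hunk easier to review in isolation, here is a standalone sketch that replays a list of `(operands, operator)` pairs without PyPDF2. `group_lines` and the sample operations are hypothetical. It deviates from the diff in two labeled ways: it skips empty lines instead of passing an empty list to the callback, and it flushes the final line after the loop, which the code in this commit does not appear to do. As in the diff, x and y are the raw Td operands.

```python
# Standalone sketch of the line grouping, assuming plain-string operators and
# operands instead of PyPDF2 objects. Not the PR's implementation.

def group_lines(operations, line_callback=None):
    """Group Tj text into lines; a Tj after a Td with non-zero y starts a new line."""
    last_position = (0, 0)
    line_elements = []
    lines = []

    def flush():
        # Deviation from the diff: empty lines are skipped, not emitted.
        if line_elements:
            finished = list(line_elements)
            lines.append(finished)
            if line_callback is not None:
                line_callback(finished)
            del line_elements[:]

    for operands, operator in operations:
        if operator == "Td":
            last_position = (operands[0], operands[1])
        elif operator == "Tj":
            if last_position[1] != 0:
                flush()
            line_elements.append(
                {"text": operands[0], "x": last_position[0], "y": last_position[1]}
            )

    flush()  # Deviation from the diff: also emit the trailing line.
    return lines


ops = [
    ([72, 700], "Td"), (["Hello"], "Tj"),
    ([100, 0], "Td"), (["world"], "Tj"),   # y == 0: stays on the same line
    ([0, -14], "Td"), (["Next"], "Tj"),    # y != 0: starts a new line
]
result = group_lines(ops)
print([[el["text"] for el in line] for line in result])
# [['Hello', 'world'], ['Next']]
```

Because `last_position` is not reset after a Tj, several Tj operators following one Td with a non-zero y each start their own line, in the sketch and in the diff alike.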