All notable changes in pdfminer.six will be documented in this file.
The format is based on Keep a Changelog.
- Added maxobjects parameter to high_level.extract_pages, to limit the number of objects processed in a page. This makes it possible to process a document faster by intentionally skipping objects in complex pages.
- Fix issue of TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' (#529)
- PermissionError when creating temporary filepaths on Windows when running tests (#469)
- Support for Python 3.4 and 3.5 (#522)
- Unused dependency on sortedcontainers package (#525)
- Support for non-standard output streams that are not binary (#523)
- Support for Python 3.4 and 3.5 (#507)
- Option to disable boxes flow layout analysis when using pdf2txt (#479)
- Support for pathlib.PurePath in open_filename (#492)
- Pass caching parameter to PDFResourceManager in high_level functions (#475)
- Fix .paint_path logic for handling non-rect quadrilaterals and decomposing complex paths (#512)
- Fix out-of-bound access on some PDFs (#483)
- Remove unused rijndael encryption implementation (#465)
- Rename PDFTextExtractionNotAllowedError to PDFTextExtractionNotAllowed to revert breaking change (#461)
- Always try to get CMap, not only for identity encodings (#438)
- Support for painting multiple rectangles at once (#371)
- Validate image object in do_EI is a PDFStream (#451)
- Hiding fallback xref by default from dumppdf.py output (#431)
- Raise a warning instead of an error when extracting text from a non-extractable PDF (#453)
- Switched from pycryptodome to cryptography package for AES decryption (#456)
- Python3 shebang line to script in tools (#408)
- Fix ordering of textlines within a textbox when boxes_flow=None (#412)
- Allow boxes_flow LAParam to be passed as None, validate the input, and update documentation (#396)
- Also accept file-like objects in high-level functions extract_text and extract_pages (#393)
- Text no longer comes in reverse order when advanced layout analysis is disabled (#399)
- Updated misleading documentation for word_margin and char_margin (#407)
- Ignore ValueError when converting font encoding differences (#389)
- Grouping of text lines outside of parent container bounding box (#386)
- Group text lines if they are centered (#384)
- Removed samples/issue-00152-embedded-pdf.pdf because it contains a possible security threat: a JavaScript-enabled object (#364)
- Interpret two's complement integer as unsigned integer (#352)
- Fix font name in html output such that it is recognized by browser (#357)
- Compute correct font height by removing scaling with font bounding box height (#348)
- KeyError when extracting embedded files and a Unicode file specification is missing (#338)
- The command-line utility latin2ascii.py (#360)
- Support for Python 2 (#346)
- Enforce pep8 coding style by adding flake8 to CI (#345)
- Wrong order of text box grouping introduced by PR #315 (#335)
- Simple wrapper to easily extract text from a PDF file (#330)
- Support for extracting JBIG2 encoded images (#311 and #46)
- Sphinx documentation that is published on Read the Docs (#329)
- Unhandled AssertionError when dumping pdf containing reference to object id 0 (#318)
- Debug flag actually changes logging level to debug for pdf2txt.py and dumppdf.py (#325)
- Using argparse instead of getopt for command line interface of dumppdf.py (#321)
- Refactor LTLayoutContainer.group_textboxes for a significant speed up in layout analysis (#315)
- Files for external applications such as django, cgi and pyinstaller (#320)
- Support for Python 2 is dropped on January 1st, 2020 (#307)
- Contribution guidelines in CONTRIBUTING.md (#259)
- Support new encodings OneByteEncoding and DLIdent for CMaps (#283)
- Use six.iteritems() instead of dict().iteritems() to ensure Python2 and Python3 compatibility (#274)
- Properly convert Adobe Glyph names to unicode characters (#263)
- Allow CMap to be a content stream (#283)
- Resolve indirect objects for width and bounding boxes for fonts (#273)
- Actually updating stroke color in graphic state (#298)
- Interpret (invalid) negative font descent as a positive descent (#203)
- Correct colorspace comparison for images (#132)
- Allow for bounding boxes with zero height or width by removing assertion (#246)