ENH: Add orientation param for text_extraction (# 1071) #1175

Merged
merged 8 commits into from
Jul 30, 2022


pubpub-zz (Collaborator) commented Jul 27, 2022

Add a new capability to filter text extraction by orientation.

Deprecations: PageObject.extract_text no longer uses the Tj_sep and TJ_sep parameters.

cf #1071
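
To make the idea of filtering on orientation concrete, here is a minimal standalone sketch. It is not the code from this PR: the helper names `orientation_of` and `filter_by_orientation`, the `(text, matrix)` chunk format, and the tolerance `1e-6` are assumptions chosen for illustration. It classifies each text chunk by the rotation part of its PDF text matrix `(a, b, c, d, e, f)` and keeps only chunks whose orientation is in the requested set.

```python
from typing import Iterable, Sequence, Tuple

# A PDF text matrix is written (a, b, c, d, e, f); a counterclockwise
# rotation by theta gives a=cos, b=sin, c=-sin, d=cos.
Matrix = Sequence[float]


def orientation_of(matrix: Matrix) -> int:
    """Return 0, 90, 180 or 270 for a text matrix (hypothetical helper)."""
    b, d = matrix[1], matrix[3]
    if d > 1e-6:
        return 0      # upright text
    if d < -1e-6:
        return 180    # upside-down text
    if b > 0:
        return 90     # rotated a quarter turn counterclockwise
    return 270        # rotated a quarter turn clockwise


def filter_by_orientation(
    chunks: Iterable[Tuple[str, Matrix]],
    orientations: Tuple[int, ...] = (0, 90, 180, 270),
) -> str:
    """Concatenate only the chunks whose orientation is requested."""
    return "".join(
        text for text, matrix in chunks if orientation_of(matrix) in orientations
    )


chunks = [
    ("up", (1, 0, 0, 1, 0, 0)),       # 0 degrees
    ("side", (0, 1, -1, 0, 0, 0)),    # 90 degrees
    ("down", (-1, 0, 0, -1, 0, 0)),   # 180 degrees
]
upright_only = filter_by_orientation(chunks, orientations=(0,))
```

Against the library itself, the equivalent call would presumably be `page.extract_text(orientations=(0,))`, assuming the new keyword is named `orientations`; check the released signature before relying on it.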

codecov bot commented Jul 27, 2022

Codecov Report

Merging #1175 (03057ac) into main (8c532a0) will increase coverage by 0.02%.
The diff coverage is 97.77%.

@@            Coverage Diff             @@
##             main    #1175      +/-   ##
==========================================
+ Coverage   92.08%   92.11%   +0.02%     
==========================================
  Files          24       24              
  Lines        4866     4897      +31     
  Branches      996     1011      +15     
==========================================
+ Hits         4481     4511      +30     
  Misses        242      242              
- Partials      143      144       +1     
Impacted Files     Coverage Δ
PyPDF2/_page.py    92.81% <97.77%> (+0.24%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MartinThoma MartinThoma changed the title ENH : add orientation param for text_extraction (# 1071) ENH: Add orientation param for text_extraction (# 1071) Jul 30, 2022
@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jul 30, 2022
MartinThoma (Member) commented Jul 30, 2022

Very nice!

It looks good to me - I will merge it tomorrow if the text extraction benchmark looks fine as well. So it should get into the release on Sunday :-)

MartinThoma (Member) commented Jul 30, 2022

For 2201.00029, the score increased from 96.7% to 97.7%; the rest stayed the same 👍

Interestingly, it seems to have killed a lot of newlines:

image

I think I need to design a new benchmark which measures how well newlines are captured. At the moment, this is completely ignored for calculating the score.
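
One possible shape for such a newline-aware score, as a rough sketch only: the function names `newline_positions` and `newline_recall` are hypothetical, and the approach assumes the extracted text contains the same words in the same order as the ground truth, which real extractions will not always satisfy. It records after which word each line break occurs and measures how many of the expected breaks survive.

```python
from typing import Set


def newline_positions(text: str) -> Set[int]:
    """Return the word counts after which a line break occurs."""
    positions: Set[int] = set()
    word_count = 0
    for line in text.strip().splitlines():
        words = line.split()
        if not words:
            continue  # ignore blank lines
        word_count += len(words)
        positions.add(word_count)
    # The end of the text is not a line break between words.
    positions.discard(word_count)
    return positions


def newline_recall(expected: str, extracted: str) -> float:
    """Fraction of expected line breaks that also appear in the extraction."""
    expected_breaks = newline_positions(expected)
    if not expected_breaks:
        return 1.0  # nothing to capture
    found = expected_breaks & newline_positions(extracted)
    return len(found) / len(expected_breaks)


ground_truth = "a b\nc d\ne f"
score = newline_recall(ground_truth, "a b c d\ne f")
```

A recall-style measure like this only penalizes lost newlines; spurious extra newlines would need a matching precision term.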

MartinThoma (Member) commented

However, getting the spaces within and between words right is far more important, and that is where the improvement was 👍

@MartinThoma MartinThoma merged commit 8a27fa4 into py-pdf:main Jul 30, 2022
MartinThoma added a commit that referenced this pull request Jul 31, 2022
New Features (ENH):
-  Add ability to add hex encoded colors to outline items (#1186)
-  Add support for pathlib.Path in PdfMerger.merge (#1190)
-  Add link annotation (#1189)
-  Add capability to filter text extraction by orientation  (#1175)

Bug Fixes (BUG):
-  Named Dest in PDF1.1 (#1174)
-  Incomplete Graphic State save/restore (#1172)

Documentation (DOC):
-  Update changelog url in package metadata (#1180)
-  Table extraction (#1179)
-  Mention pyHanko for signing PDF documents (#1178)
-  We now have CMAP support (#1177)

Maintenance (MAINT):
-  Consistent usage of warnings / log messages (#1164)
-  Consistent terminology for outline items (#1156)

Code Style (STY):
-  Apply pre-commit (#1188)

Full Changelog: 2.8.1...2.9.0
MartinThoma pushed a commit that referenced this pull request Aug 5, 2022
@pubpub-zz pubpub-zz deleted the Orientations branch August 8, 2022 06:55