ExtractText2 #929
New proposal with deeper analysis of font data and text positioning.
New proposal.
I'll start it and will post the results this evening (might take 1-2h; I need to finish some other stuff)
The average stayed the same. Most files improved, but one became drastically worse:
New draft proposal with bug fixes (the bugs also affected the first proposal). @MartinThoma Can you rerun the benchmark? I will also have a look at #858 in order to get the best of both.
Includes:
* XObject processing
* choice between the encoding and ToUnicode fields
* partial compliance with Identity-H/V encoding (processing of 2-byte codes is still missing)
* legacy conversion reintroduced as "old" for comparison
* debug extraction
* typing and tests
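To make the "choice between encoding and ToUnicode" item concrete, here is a minimal sketch of how character codes could be mapped to text. It is not the code from this PR: the function name `decode_pdf_string`, its parameters, the rule "prefer ToUnicode, then the encoding table, then the raw code point", and the 2-byte splitting for Identity-H/V are all assumptions made for illustration.

```python
from typing import Dict, Optional


def decode_pdf_string(
    raw: bytes,
    to_unicode: Optional[Dict[int, str]] = None,
    encoding: Optional[Dict[int, str]] = None,
    two_byte_codes: bool = False,
) -> str:
    """Map the bytes of a shown string to text (hypothetical helper)."""
    # Identity-H/V fonts use 2-byte character codes; simple fonts use 1 byte.
    width = 2 if two_byte_codes else 1
    codes = [
        int.from_bytes(raw[i : i + width], "big")
        for i in range(0, len(raw), width)
    ]

    chars = []
    for code in codes:
        if to_unicode and code in to_unicode:
            # Assumption: a ToUnicode entry wins when the font provides one.
            chars.append(to_unicode[code])
        elif encoding and code in encoding:
            # Otherwise use the font's encoding table.
            chars.append(encoding[code])
        else:
            # Last resort (assumption): treat the code as a Unicode code point.
            chars.append(chr(code))
    return "".join(chars)
```

For example, `decode_pdf_string(b"\x01A", to_unicode={1: "fi"}, encoding={0x41: "Z"})` takes the first code from the ToUnicode map and the second from the encoding table.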
Add tests and refactor; ignore deprecation warnings in tests.
@pubpub-zz I would like to get the Charmap support into PyPDF2 soon and give you (plus some others who made very similar PRs before) full credit for your work. For this reason I would like to avoid merging #924. I suggest the following:
@MartinThoma Sorry to bother you. Can you rerun the benchmark on this version?
No problem - I'm happy that you're doing the heavy lifting 😄 I've just started the benchmark run. I'll share the results tomorrow morning (takes ~20 minutes and I'll go to bed now 😄 )
I get
for https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf:
    import PyPDF2

    reader = PyPDF2.PdfFileReader("GeoTopo.pdf")
    page = reader.pages[13]
    page.extract_text()
I've added the fallback
With that fallback, your PR currently boosts the average from 86% to Looking at the single files:
@pubpub-zz I love you 🤩 🤗 This is a crazy improvement! Now I really want it to be merged 😄 Please let me know how you would like me to continue. Should I merge pubpub-zz:ExtractText2 into py-pdf:pubpub-zz-extractText and then that one into
The PR you've referenced will surely improve some translations. In my current branch the legacy function is still present as extract_oldtext, so people can revert to it if they prefer.
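Since the legacy function is kept for comparison, a small helper like the one below could put the two outputs side by side. This is only a sketch: `compare_extractions` is a hypothetical name, and it assumes that `extract_oldtext` is a page method callable without arguments, which the thread does not confirm. The similarity score comes from the standard library's `difflib`, not from the benchmark used in this PR.

```python
from difflib import SequenceMatcher
from typing import Any, Dict


def compare_extractions(page: Any) -> Dict[str, Any]:
    """Run the new and (if present) legacy extraction on one page."""
    new_text = page.extract_text()

    # Assumption: the legacy extractor is exposed as page.extract_oldtext().
    legacy = getattr(page, "extract_oldtext", None)
    old_text = legacy() if callable(legacy) else None

    # Ratio in [0, 1]; 1.0 means both extractions returned identical text.
    similarity = (
        None
        if old_text is None
        else SequenceMatcher(None, old_text, new_text).ratio()
    )
    return {"new": new_text, "old": old_text, "similarity": similarity}
```

A low similarity on a given page is a hint to inspect that page by hand; it does not say which of the two outputs is the better one.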
Sounds good! Then I'll wait for your ok to get started :-)
I think you should be able to merge this release.
You mean I can merge this PR now? (just want to be sure :-) )
Go :)
New proposal for evaluation for the time being.