Unable to consistently extract field labels from PDFs #3950
Comments
In this case, the field label is not stored with the field itself, but with its so-called |
Access field label as an **inheritable** dictionary value. Addresses #3950.
Thanks @JorjMcKie. Would it be possible to know an approximate timeline for the stable release of this new version?
There's a small chance that we will make a new release this week, but it's more likely to be next week.
Fixed in 1.24.12.
@julian-smith-artifex-com @JorjMcKie thanks for the quick release. I ran the script again locally with the latest version (1.24.12). It works better now, but it still misses the very first field label in the PDF: running the code on not working sample.pdf returns only three field labels, while the PDF contains four.
pip install --upgrade pymupdf  # installs version 1.24.12
Other system details are the same as given at the start of the issue.
import fitz
doc = fitz.Document("not working sample.pdf")
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)
I get the following output:
But the expected output should be
Is this fixable, or is it due to some PDF-level nuance?
…n of field_label. Also recurse to parent if node's string value is empty string. This appears to be what Adobe does. Addresses #3950.
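The lookup rule described in the two commit messages (treat the label as an inheritable value, and also fall back to the parent when the node's own string is empty) can be sketched in plain Python. This is only an illustration on ordinary dictionaries, not PyMuPDF's actual implementation: the helper name `inherited_label` is hypothetical, and the use of the `TU` and `Parent` keys is an assumption based on how PDF form field dictionaries are commonly laid out.

```python
def inherited_label(node, key="TU"):
    """Return the first non-empty value for `key`, walking up via 'Parent'.

    `node` is a plain dict standing in for a PDF field dictionary
    (an assumption made for illustration only).
    """
    seen = set()  # guard against accidental cycles in the parent chain
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        value = node.get(key)
        if value:  # skips both a missing key and an empty string
            return value
        node = node.get("Parent")
    return None


# Hypothetical field tree: the label lives on the parent, not on the kids.
parent = {"T": "name", "TU": "Full name"}
kid_missing = {"Parent": parent}          # no label of its own
kid_empty = {"TU": "", "Parent": parent}  # empty label, should still inherit
kid_own = {"TU": "Own", "Parent": parent}

print(inherited_label(kid_missing))
print(inherited_label(kid_empty))
print(inherited_label(kid_own))
```

The `if value:` test is what separates the two fixes: checking only for a missing key would cover the first case but return the empty string in the second.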
@julian-smith-artifex-com thanks again for the newest fix. Could you please provide me with an estimated time frame for the next release? Thanks
Fixed in 1.24.13.
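For anyone landing here later, a small check like the following may help confirm that an installed version includes both fixes. It is a sketch, not part of PyMuPDF: the helper names are hypothetical, and it assumes the version is a plain dotted string of integers such as "1.24.13".

```python
FIXED_IN = (1, 24, 13)  # release reported in this thread as containing both fixes


def parse_version(text):
    """Turn a dotted version string like '1.24.13' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_label_fix(version_text):
    """True if the given version is at least 1.24.13."""
    return parse_version(version_text) >= FIXED_IN


print(has_label_fix("1.24.1"))
print(has_label_fix("1.24.12"))
print(has_label_fix("1.24.13"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes between versions such as 1.24.9 and 1.24.13. The version string itself would come from the installed package, for example the value PyMuPDF documents as `VersionBind`.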
Description of the bug
For my use case, I am trying to extract widget.field_label from a PDF file. I tried this on two PDFs: the field labels are extracted successfully from one, but not from the other. In case it helps, I used Master PDF Editor to add the field labels to both PDFs.
This is the PDF for which I am able to extract the field labels from all the widgets:
working sample.pdf
This is the PDF for which I am not able to extract the field labels, even after adding them:
not working sample.pdf
Is this a PDF/editor-level nuance, or a bug?
How to reproduce the bug
The reproduction of the problem should be fairly simple:
PDF files:
working sample.pdf
not working sample.pdf
For working sample.pdf, I get the following output:
This is correct and expected; it covers all the available field labels.
For not working sample.pdf, I get the following output:
But the expected output for not working sample.pdf should be (not necessarily in the same order):
These are all the available field labels in the PDF.
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.10