Unable to consistently extract field labels from PDFs #3950

Rutvik-Trivedi · 2024-10-16T07:26:00Z

Description of the bug

For my use case, I am trying to read the widget.field_label attribute of the form fields in a PDF file. I tried this on two PDFs: the field labels are extracted successfully from one, but not from the other. In case it helps, I used Master PDF Editor to add the field labels to the PDFs.

This is the PDF for which I am able to extract the field labels from all the widgets -
working sample.pdf

This is the PDF for which I am not able to extract the field labels even after adding the labels -
not working sample.pdf

Is this a PDF/Editor level nuance? Or a bug?

How to reproduce the bug

The reproduction of the problem should be fairly simple:

import fitz
doc = fitz.Document("working sample.pdf")  # Or "not working sample.pdf"
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)

PDF files:
working sample.pdf
not working sample.pdf

For working sample.pdf, I get the following output:

{{ firstName }}
{{ lastName }}
{{ address.street }}
{{ address.apt }}
{{ address.zipcode }}
{{ address.city }}
{{ spirit }}
{{ today }}
{{ evil | check }}
{{ language.french | X }}
{{ language.esperento | X }}
{{ language.latin | X }}
{{ sig | paste }}

This is correct and expected: it covers all the available field labels.

For not working sample.pdf, I get the following output:

""
None
None
None

But the expected output for not working sample.pdf should be (not necessarily in the same order):

{{ named_insured }}
{{ insurance_line }}
{{ policy_period_start_date }}
{{ policy_period_end_date }}

These are all the available field labels in the PDF.

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.10

JorjMcKie · 2024-10-16T10:35:14Z

In this case, the field label is not stored with the field itself, but with its so-called Parent. The current code looks at the field's Parent only for field_name, while it should also do so for field_label.
The fix is trivial and should be available in an upcoming version.
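
To illustrate what "stored with the Parent" means, here is a minimal sketch that models PDF field dictionaries as plain Python dicts and looks up the label key (TU in the PDF specification) as an inheritable value. The dict layout and the helper name inheritable_value are hypothetical and exist only for illustration; this is not PyMuPDF's actual implementation.

```python
def inheritable_value(node, key):
    """Return node[key], falling back to ancestors reached via 'Parent'."""
    seen = set()  # guards against a malformed, cyclic Parent chain
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        if key in node:
            return node[key]
        node = node.get("Parent")
    return None


# Hypothetical field tree: the label lives on the parent, not on the widget.
parent = {"T": "named_insured", "TU": "{{ named_insured }}"}
kid = {"Subtype": "Widget", "Parent": parent}

print(kid.get("TU"))                 # direct lookup finds nothing: None
print(inheritable_value(kid, "TU"))  # inherited lookup: {{ named_insured }}
```

A lookup that only inspects the widget's own dictionary returns None here, which matches the symptom reported above.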

Access field label as an **inheritable** dictionary value. Addresses #3950.

Rutvik-Trivedi · 2024-10-16T14:42:20Z

Thanks @JorjMcKie . Would it be possible to know an approximate timeline for the stable release of this new version?

julian-smith-artifex-com · 2024-10-17T10:52:43Z

There's a small chance that we will make a new release this week, but it's more likely to be next week.

julian-smith-artifex-com · 2024-10-21T14:44:15Z

Fixed in 1.24.12.

Rutvik-Trivedi · 2024-10-22T09:18:42Z

@julian-smith-artifex-com @JorjMcKie thanks for the quick release. I ran the script again locally with the latest version (1.24.12). It works better now, but it is still missing the very first field label from the PDF. When I run this code again on not working sample.pdf, I get only three field labels, while there are four in the PDF.

pip install --upgrade pymupdf  # installs version 1.24.12; other system details are the same as at the start of the issue

import fitz
doc = fitz.Document("not working sample.pdf")
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)

I get the following output:

<empty string as the first output ("")>
{{ policy_period_start_date }}
{{ policy_period_end_date }}
{{ insurance_line }}

But the expected output should be

{{ named_insured }}   # This comes as an empty string in the actual output
{{ policy_period_start_date }}
{{ policy_period_end_date }}
{{ insurance_line }}

Is this something that is fixable, or is it due to some PDF-level nuance?
If it is the former, is there anything I can change in the source code locally to try out a quick fix?
If it is the latter, what do I need to consider while editing a PDF so that the field labels are extracted properly?
Thanks

…n of field_label. Also recurse to parent if node's string value is empty string. This appears to be what Adobe does. Addresses #3950.
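
The behaviour described in the commit message above (treat an empty string like a missing value and keep recursing to the parent) could be approximated in user code on an older version. The sketch below is not the code from the fix. It assumes a lookup callable shaped like PyMuPDF's Document.xref_get_key, returning a (type, value) tuple such as ("string", "...") for strings, ("xref", "3 0 R") for indirect references and ("null", "null") for missing keys. The function name label_from_ancestors and the fake object table are hypothetical, and the object numbers are invented rather than taken from the attached PDF.

```python
def label_from_ancestors(get_key, xref, max_depth=32):
    """Return the first non-empty TU found on the object or its ancestors."""
    for _ in range(max_depth):  # bounded, in case of a cyclic Parent chain
        kind, value = get_key(xref, "TU")
        if kind == "string" and value:  # an empty string counts as "not set"
            return value
        kind, value = get_key(xref, "Parent")
        if kind != "xref":  # no parent left to inspect
            return None
        xref = int(value.split()[0])  # "3 0 R" -> 3
    return None


# Invented stand-in for a PDF object table, for illustration only.
objects = {
    7: {"TU": ("string", ""), "Parent": ("xref", "3 0 R")},
    3: {"TU": ("string", "{{ named_insured }}")},
}


def fake_get_key(xref, key):
    return objects.get(xref, {}).get(key, ("null", "null"))


print(label_from_ancestors(fake_get_key, 7))  # {{ named_insured }}
```

With a real document, one would presumably pass doc.xref_get_key and widget.xref in place of the fake lookup and object number. Check the return format against the installed PyMuPDF version before relying on this.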

Rutvik-Trivedi · 2024-10-24T14:26:00Z

@julian-smith-artifex-com thanks again for the newest fix. Could you please provide me with an estimated time-frame for the next release? Thanks

julian-smith-artifex-com · 2024-10-29T16:24:51Z

Fixed in 1.24.13.

JorjMcKie added a commit that referenced this issue Oct 16, 2024

Update __init__.py

dccb979

Access field label as an **inheritable** dictionary value. Addresses #3950.

JorjMcKie mentioned this issue Oct 16, 2024

Update __init__.py #3951

Merged

JorjMcKie added the fix developed release schedule to be determined label Oct 16, 2024

JorjMcKie added a commit that referenced this issue Oct 16, 2024

Update __init__.py

1c661bb

Access field label as an **inheritable** dictionary value. Addresses #3950.

julian-smith-artifex-com added the Fixed in next release label Oct 17, 2024

julian-smith-artifex-com closed this as completed Oct 21, 2024

julian-smith-artifex-com reopened this Oct 22, 2024

julian-smith-artifex-com removed Fixed in next release fix developed release schedule to be determined labels Oct 22, 2024

julian-smith-artifex-com mentioned this issue Oct 22, 2024

src/__init__.py tests/: avoid segv from fz_samples_get() with empty p… #3979

Merged

julian-smith-artifex-com added the Fixed in next release label Oct 22, 2024

julian-smith-artifex-com closed this as completed Oct 29, 2024
