PDF xmp timestamps aren't extracted with timezone info #1868

Closed
patrickdalla opened this issue Sep 6, 2023 · 15 comments · Fixed by #1874

@patrickdalla
Collaborator

XMP metadata dates, such as xmp:MetadataDate and xmp:ModifyDate, are being extracted without their timezone info, so they are reported as UTC without the corresponding offset adjustment.

@lfcnassif
Member

Could you test the standalone Tika or PDFBox versions we use on your sample(s) to see if it is a dependency issue? Last year I fixed a similar issue with Exif dates in Tika:
https://issues.apache.org/jira/browse/TIKA-3815

@patrickdalla
Collaborator Author

patrickdalla commented Sep 6, 2023 via email

@patrickdalla
Collaborator Author

I've run:
[root@localhost Downloads]# java -jar tika-app-2.9.0.jar sobreavisoEditado3.pdf | grep xmp
that returned

WARN  [main] 07:42:34,238 org.apache.pdfbox.pdmodel.font.PDType1Font Using fallback font LiberationSans for base font Symbol
WARN  [main] 07:42:34,241 org.apache.pdfbox.pdmodel.font.PDType1Font Using fallback font LiberationSans for base font ZapfDingbats
<meta name="xmp:ModifyDate" content="2023-09-06T13:35:38Z"/>
<meta name="xmp:MetadataDate" content="2023-09-06T13:35:38Z"/>
<meta name="xmpTPg:NPages" content="11"/>

The IPED hex viewer showed the second ModifyDate tag as:
(screenshot)

So this is an issue in the custom PDF Tika parser. I will report it there. Should I close it here?

@lfcnassif
Member

You can report on https://issues.apache.org/jira/projects/TIKA/issues

(But maybe it comes from PDFBox Tika dependency...)

We should keep this open until it is fixed in the upstream library and we update Tika. Although it comes from a dependency, it is a bug affecting the whole software.

@patrickdalla
Collaborator Author

patrickdalla commented Sep 11, 2023

The command:
java -jar pdfbox-app-2.0.29.jar ExtractXMP -console sobreavisoEditado3.pdf

Returned the correct timestamp with zone info.
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?><x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="" xmp:ModifyDate="2023-09-06T13:35:38-04:00" xmlns:xmp="http://ns.adobe.com/xap/1.0/"><xmp:MetadataDate>2023-09-06T13:35:38-04:00</xmp:MetadataDate></rdf:Description></rdf:RDF></x:xmpmeta><?xpacket end="w"?>

So it seems to be a Tika parser issue.
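For reference, a small sketch of the UTC value one would expect for the timestamp PDFBox printed above. It uses only java.time; the class and method names (ExpectedUtcDemo, toUtc) are made up for illustration and are not part of Tika, PDFBox, or IPED.

```java
import java.time.OffsetDateTime;

class ExpectedUtcDemo {
    // Parses an ISO-8601 date-time with offset and renders the same instant in UTC.
    static String toUtc(String xmpDate) {
        return OffsetDateTime.parse(xmpDate).toInstant().toString();
    }

    public static void main(String[] args) {
        // Value taken from the ExtractXMP output in this comment.
        System.out.println(toUtc("2023-09-06T13:35:38-04:00")); // 2023-09-06T17:35:38Z
    }
}
```

So a correct extraction should report 17:35:38Z, whereas tika-app reported 13:35:38Z.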

@lfcnassif
Member

I would appreciate it if you could report it to them once you get permission, since I'm leaving on vacation now and will be back on the weekend. Thanks.

@patrickdalla
Collaborator Author

patrickdalla commented Sep 11, 2023 via email

@patrickdalla
Collaborator Author

patrickdalla commented Sep 11, 2023

I found the problem. When the timezone of a Calendar object is set, the remaining internal fields are not synced immediately; the update is deferred. Only when other methods are called, like date.get(Calendar.HOUR_OF_DAY), does the object notice that its internal state is out of sync and update it.

So, when the info is read from the PDF XMP metadata, the instance's original timezone is UTC; it is then set to the timezone given in the metadata (-4 in my case), but no other internal fields are changed.

When the formatDate method of DateUtils is called, it sets the timezone of the Calendar object again, this time to UTC. Only afterwards, in the doFormatDate method, when Calendar's get method is called, are all the other fields synced. But since the timezone is UTC by then, the HOUR field is not shifted, as the object considers this field up to date.

I could work around this just by calling "date.get(Calendar.HOUR_OF_DAY);" right before "date.setTimeZone(UTC);" in DateUtils, as the get call forces the first timezone setting (made when the metadata was read) to be applied, so the later timezone change also shifts the HOUR_OF_DAY field correctly.

I could do this because we use the iped.engine.tika.SyncMetadata class, which overrides the Metadata set method. I did it just to test my hypothesis, which proved to be true.

So we can implement it this way, or wait for a definitive fix from Tika. What do you think, @lfcnassif?

@lfcnassif
Member

lfcnassif commented Sep 12, 2023

So we can implement it this way, or wait for a definitive fix from Tika. What do you think, @lfcnassif?

Hi @patrickdalla. Seems Tim Allison fixed the bug you reported on https://issues.apache.org/jira/browse/TIKA-4126, right?

But I'm not sure when the next Tika version will be released. By the way, we already have a ticket to upgrade Tika (#1744), but that is a big change and needs lots of testing... If we can fix this issue on our side, that would be great! I just ask you to put a comment next to the Tika version in our pom.xml stating that we should revert your fix when Tika is upgraded.

@patrickdalla
Collaborator Author

Right, I will commit it to the same branch as the PDF carve improvement PR.

@patrickdalla
Collaborator Author

Let me explain the problem better, as it seems to be a Calendar usage problem.
When the fields of the calendar are set (timezone included), the other possibly affected fields are not updated immediately. There is a "complete" method in the Calendar class, but it is protected. This method calls another protected Calendar method, "computeFields", if the internal field state has changed; that state is tracked by the areAllFieldsSet boolean field.
When the setTimeZone method is called, areAllFieldsSet becomes false, marking the fields as being in an inconsistent state.
The subsequent computation that should be done (shifting the field values by the timezone offset) is relative to UTC.

So, for example, when you set the HOUR_OF_DAY field to 13 and later set the timezone to -4, the HOUR_OF_DAY field still holds the value 13. The Metadata class implementation normalizes all internal date values to UTC, so it calls setTimeZone again, still without any recomputation of the HOUR_OF_DAY field, which keeps the value 13.

When the date is formatted to a string, calendar.get(Calendar.HOUR_OF_DAY) is called; it detects the inconsistent state (areAllFieldsSet = false) and calls computeFields. But as the zone is now UTC, this method does not change the value of the HOUR_OF_DAY field.

So, to solve this problem, I put the call calendar.get(Calendar.HOUR_OF_DAY) before setTimeZone(UTC), to force the computeFields call while the original zone is still configured (loaded from the PDF XMP metadata in our case), before changing it to UTC.

That call was added in the SyncMetadata class, in the method "void set(Property property, Calendar date)".
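A minimal, self-contained reproduction of the behavior described above, using only java.util.Calendar. It is a sketch, not Tika or IPED code: the names CalendarLazyZoneDemo, xmpLikeCalendar, hourWithoutWorkaround and hourWithWorkaround are made up for illustration, and it assumes the calendar fields are set first and the zone afterwards, as in the 13 / -4 example.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class CalendarLazyZoneDemo {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Hypothetical stand-in for the parsed XMP date: created in UTC,
    // fields set to the local values, zone set afterwards.
    static Calendar xmpLikeCalendar() {
        Calendar c = new GregorianCalendar(UTC);
        c.clear();
        c.set(2023, Calendar.SEPTEMBER, 6, 13, 35, 38);
        c.setTimeZone(TimeZone.getTimeZone("GMT-04:00"));
        return c;
    }

    // Normalizing to UTC before any field was read: the pending field
    // values end up being interpreted in UTC, so the hour is not shifted.
    static int hourWithoutWorkaround() {
        Calendar c = xmpLikeCalendar();
        c.setTimeZone(UTC);
        return c.get(Calendar.HOUR_OF_DAY);
    }

    // Reading a field first forces the pending values to be resolved in
    // the -04:00 zone; the later switch to UTC then shifts the hour.
    static int hourWithWorkaround() {
        Calendar c = xmpLikeCalendar();
        c.get(Calendar.HOUR_OF_DAY);
        c.setTimeZone(UTC);
        return c.get(Calendar.HOUR_OF_DAY);
    }

    public static void main(String[] args) {
        System.out.println(hourWithoutWorkaround()); // 13
        System.out.println(hourWithWorkaround());    // 17
    }
}
```

The only difference between the two methods is the extra get call, which is the same one-line change made in SyncMetadata.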

Tim Allison used another approach to solve it. His approach converts the calendar to an Instant, without setting the normalized UTC timezone, since Instant's toString method already formats it in ISO-8601 (DateTimeFormatter.ISO_INSTANT). But I could not find an easy way to use his approach without changing the Tika code directly.
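A rough sketch of the Instant-based idea as I understand it from the description above; this is not the actual Tika patch. The names InstantFormatDemo, formatViaInstant and sample are hypothetical, and the sample calendar assumes the same fields-then-zone setup as before.

```java
import java.time.Instant;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class InstantFormatDemo {
    // Formats the date through Instant; the Calendar's zone is never
    // switched to UTC, so the pending fields are resolved in their own zone.
    static String formatViaInstant(Calendar date) {
        Instant instant = date.toInstant();
        return instant.toString(); // ISO-8601, always in UTC ("Z")
    }

    // Hypothetical stand-in for the parsed XMP date (13:35:38 at -04:00).
    static Calendar sample() {
        Calendar c = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        c.clear();
        c.set(2023, Calendar.SEPTEMBER, 6, 13, 35, 38);
        c.setTimeZone(TimeZone.getTimeZone("GMT-04:00"));
        return c;
    }

    public static void main(String[] args) {
        System.out.println(formatViaInstant(sample())); // 2023-09-06T17:35:38Z
    }
}
```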

patrickdalla added a commit that referenced this issue Sep 12, 2023
before
any subsequent date transformation that can occur.
@patrickdalla
Collaborator Author

Pushed to PDF_Carve_Improvement

@lfcnassif
Member

Right, I will commit it to the same branch as the PDF carve improvement PR.

Could you please revert it and push to another branch and PR?

patrickdalla added a commit that referenced this issue Sep 12, 2023
…mputation before any subsequent date transformation that can occur."

This reverts commit 7811b9a.
patrickdalla added a commit that referenced this issue Sep 12, 2023
@patrickdalla patrickdalla linked a pull request Sep 12, 2023 that will close this issue
@patrickdalla
Collaborator Author

Done

@lfcnassif
Member

lfcnassif commented Sep 12, 2023

Thank you! And sorry for the additional effort from your side...

lfcnassif pushed a commit that referenced this issue Oct 4, 2023
@lfcnassif lfcnassif changed the title from "PDF xmp timestamps aren't extracted with zone info" to "PDF xmp timestamps aren't extracted with timezone info" Oct 4, 2023