PDF xmp timestamps aren't extracted with timezone info #1868
Could you test the standalone Tika or PDFBox versions we use on your sample(s) to see if it is a dependency issue? Last year I fixed a similar issue with Exif dates in Tika: https://issues.apache.org/jira/browse/TIKA-3815 |
Right. |
You can report it on https://issues.apache.org/jira/projects/TIKA/issues (but maybe it comes from Tika's PDFBox dependency...). We should keep this open until it is fixed in the upstream library and we update Tika. Although it comes from a dependency, it is a bug affecting the full software. |
The command: Returned the correct timestamp with zone info. So it seems to be a TIKA parser issue. |
I would appreciate if you could report to them when you get permission, since I'm going to travel on vacation now and I'll be back on the weekend, thanks. |
I have already got the permission and registered the issue. |
I found the problem. When the timezone of a Calendar object is set, the remaining internal fields are not synced immediately; the update is deferred. Only when other methods are called, like date.get(Calendar.HOUR_OF_DAY), does the object detect that its internal state is out of sync and update it. So, when the info is read from the PDF XMP metadata, the instance's original timezone is UTC and it is set to the timezone given in the metadata (-4 in my case), but no other internal field is changed. When the formatDate method of DateUtils is called, it sets the timezone of the Calendar object again, this time to UTC. Only afterwards, in the doFormatDate method, when the get method of Calendar is called, is the state of all other fields synced; but as the timezone is now UTC, the HOUR field is not shifted, since the object considers this field up to date. I could work around this just by calling "date.get(Calendar.HOUR_OF_DAY);" right before "date.setTimeZone(UTC);" in DateUtils, as the get method forces the first timezone setting (done when the metadata was read) to be applied, so the later timezone change also shifts the hour_of_day field correctly. I could do this because we use the iped.engine.tika.SyncMetadata class, which overrides the Metadata set method. I did this just to test my hypothesis, which turned out to be true. So we can implement it this way, or wait for a definitive fix from TIKA. What do you think @lfcnassif? |
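For readers who want to reproduce the behaviour described above outside of Tika, here is a minimal sketch using only java.util.Calendar. The class name CalendarZoneDemo, its helper methods and the sample date are made up for illustration and are not code from Tika, PDFBox or IPED. It assumes the parser fills in the wall-clock fields first and attaches the -04:00 zone afterwards, as the comment above describes.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical demo class, not part of Tika or IPED.
class CalendarZoneDemo {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");
    static final TimeZone MINUS_4 = TimeZone.getTimeZone("GMT-04:00");

    // Sets the wall-clock fields to 2023-09-06 13:00:00 and only then
    // attaches the -04:00 zone. Nothing is recomputed at this point.
    static Calendar parsedLikeXmp() {
        Calendar cal = new GregorianCalendar(UTC);
        cal.clear();
        cal.set(2023, Calendar.SEPTEMBER, 6, 13, 0, 0);
        cal.setTimeZone(MINUS_4);
        return cal;
    }

    // Normalizes to UTC and returns the hour that a formatter would read.
    static int hourInUtc(boolean forceCompute) {
        Calendar cal = parsedLikeXmp();
        if (forceCompute) {
            // Workaround: resolve the pending fields against -04:00 first.
            cal.get(Calendar.HOUR_OF_DAY);
        }
        cal.setTimeZone(UTC);
        return cal.get(Calendar.HOUR_OF_DAY);
    }

    public static void main(String[] args) {
        System.out.println("without get(): " + hourInUtc(false)); // 13, zone offset lost
        System.out.println("with get():    " + hourInUtc(true));  // 17, i.e. 13:00-04:00 in UTC
    }
}
```

The only difference between the two calls is the extra get() before the second setTimeZone, which is the workaround proposed above.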
Hi @patrickdalla. It seems Tim Allison fixed the bug you reported on https://issues.apache.org/jira/browse/TIKA-4126, right? But I'm not sure when the next Tika version will be released. By the way, we already have a ticket to upgrade Tika (#1744), but that is a big change and needs lots of testing... If we can fix this issue on our side, that would be great! I just ask you to put a comment next to the Tika version in our pom.xml stating that we should revert your fix when Tika is upgraded. |
Right, I will commit to the same branch as the PDF carve improvement PR. |
Let me explain the problem better, as it seems to be a Calendar usage problem. For example, when you set the HOUR OF DAY field to 13 and later set the timezone to -4, the HOUR OF DAY field still holds the value 13. The Metadata class implementation normalizes all internal date values to UTC, so it calls setTimeZone again, still without any recomputation of the HOUR OF DAY field, which remains 13. When the date is formatted to a string, calendar.get(Calendar.HOUR_OF_DAY) is called, which detects the inconsistent state (areAllFieldsSet = false) and calls computeFields. But as the zone is now UTC, this method does not change the value of the HOUR OF DAY field. So, to solve this problem, I put the call calendar.get(Calendar.HOUR_OF_DAY) before setTimeZone(UTC), to force the computeFields call with the zone already configured (loaded from the PDF XMP metadata in our case) before changing it to UTC. That call was added in the SyncMetadata class, in the method "void set(Property property, Calendar date)". Tim Allison used another approach to solve it. His approach converts the Calendar to an Instant, without setting the UTC-normalized timezone, as Instant's toString method already formats it in ISO-8601 (DateTimeFormatter.ISO_INSTANT). But I could not find an easy way to use his approach without changing the TIKA code directly. |
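To illustrate the Instant-based idea mentioned above, here is a rough sketch. It is not the actual TIKA-4126 patch; the class InstantFormatDemo and its methods are hypothetical names used only for this example. It relies on Calendar.toInstant() (available since Java 8), which resolves the pending fields against the zone currently attached to the Calendar, so no setTimeZone(UTC) call is needed before formatting.

```java
import java.time.Instant;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical demo class, not the code used in Tika.
class InstantFormatDemo {
    // Same construction order as in the problem description:
    // wall-clock fields first, -04:00 zone attached afterwards.
    static Calendar sample() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(2023, Calendar.SEPTEMBER, 6, 13, 0, 0);
        cal.setTimeZone(TimeZone.getTimeZone("GMT-04:00"));
        return cal;
    }

    // Instant.toString() uses DateTimeFormatter.ISO_INSTANT, which is always UTC.
    static String toIsoUtc(Calendar cal) {
        Instant instant = cal.toInstant();
        return instant.toString();
    }

    public static void main(String[] args) {
        System.out.println(toIsoUtc(sample())); // 2023-09-06T17:00:00Z
    }
}
```

Because the Calendar's zone is never overwritten before the conversion, the -04:00 offset is taken into account when the time value is computed.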
before any subsequent date transformation that can occur.
Pushed to PDF_Carve_Improvement |
Please could you revert and push to another branch and PR? |
…mputation before any subsequent date transformation that can occur." This reverts commit 7811b9a.
Done |
Thank you! And sorry for the additional effort from your side... |
XMP metadata dates, like xmp:MetadataDate and xmp:ModifyDate, are being extracted without the zone info, and so are stored as UTC but without the correct adjustment.