Exiv2 doesn't correctly handle characters outside the Basic Multilingual Plane #1279
Comments
This is very helpful. Thank you very much. As a native English speaker, I have no feel for Unicode, code pages and the other magic involved. I understand 7-bit ASCII and little else. I'm very pleased to say that @LeoHsiao1 has recently done a lot of work on our test suite. As he is Chinese, I hope he'll be able to comment and investigate this matter. From your use of terms such as Basic Multilingual Plane, you sound knowledgeable about this topic. I would very much appreciate help with this matter. The code involved isn't complicated as Exiv2 delegates to the iconv library. With the right people on this task, I am confident of success. |
Some comments about this. I think the syntax is:
exif["Exif.Photo.UserComment"] = std::string("charset=Unicode EX\xf0\x9f\x98\x80");
659 rmills@rmillsmbp:~/temp $ curl -LO --silent https://clanmills.com/Stonehenge.jpg
660 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg
Exif.Photo.UserComment Undefined 44 charset=Ascii
661 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode EX\xf0\x9f\x98\x80" Stonehenge.jpg
662 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg
Exif.Photo.UserComment Undefined 44 charset=Unicode EX\xf0\x9f\x98\x80
663 rmills@rmillsmbp:~/temp $ exiftool Stonehenge.jpg | grep -i comment
User Comment : EX\xf0\x9f\x98\x80
664 rmills@rmillsmbp:~/temp $
Please don't regard this as a denial that there could be problems with our Unicode and other charset handling. I think it has encoded the 18-byte string "EX\xf0\x9f\x98\x80" as 36 Unicode bytes + 8 bytes for the charset definition. I can dump the "raw data" with the program tvisitor (which is in my book) https://clanmills.com/exiv2/book/
665 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg | grep -i comment
428 | 0x9286 Exif.Photo.UserComment | UNDEFINED | 44 | 820 | UNICODE_E_X_\_x_f_0_\_x_9_f_\_x_9_8_ +++
666 rmills@rmillsmbp:~/temp $
I have no expertise in working with character sets. We'll need a specialist to help with this. Perhaps @LeoHsiao1 knows what's involved. |
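The byte counts in the transcript above can be checked with a small sketch. It assumes the shell passed the `\xf0\x9f\x98\x80` escapes through literally (as backslashes, letters and digits) and that the value is stored as an 8-byte charset header followed by 16-bit units. The helper name `user_comment_size` is made up for illustration and is not part of Exiv2.

```python
# What exiv2 appears to have received: the escapes as literal ASCII text.
literal = r"EX\xf0\x9f\x98\x80"
# What was presumably intended: E, X, then U+1F600 (grinning face).
intended = "EX\U0001F600"

CHARSET_HEADER = 8  # bytes reserved for the charset identifier


def user_comment_size(text):
    """Hypothetical helper: charset header plus the text as UTF-16 bytes."""
    return CHARSET_HEADER + len(text.encode("utf-16-le"))


print(len(literal))                 # 18 characters
print(user_comment_size(literal))   # 8 + 36 = 44, the size exiv2 reported
print(user_comment_size(intended))  # 8 + 4 * 2 = 16
```

If that reading is right, the 44-byte value says nothing yet about how a real emoji is handled, because no emoji ever reached the library in this test.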
It's not hopeless. Something is working when I cut'n'paste your bamboo poles 塅
As documented in the man page exiv2.1, I can use this to encode "Robin" in Unicode:
677 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg
Exif.Photo.UserComment Undefined 10 charset=Unicode 饕
678 rmills@rmillsmbp:~/temp $ exiftool Stonehenge.jpg | grep -i comment
User Comment : 饕
679 rmills@rmillsmbp:~/temp $ |
Everything seems to be working OK. Unicode U+2103 is the Degree Celsius sign (℃). https://en.wikipedia.org/wiki/Degree_symbol
736 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode It's 18 ℃ outside" Stonehenge.jpg ;exiv2 -g Comment Stonehenge.jpg
Exif.Photo.UserComment Undefined 42 charset=Unicode It's 18 ℃ outside
I am using the program dmpf from my book to confirm that macOS Terminal did insert the correct Unicode.
737 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg | grep -e Comment -e Stonehenge.jpg
STRUCTURE OF JPEG FILE (II): Stonehenge.jpg
STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286:830->3142
END: Stonehenge.jpg:12->15286:830->3142
428 | 0x9286 Exif.Photo.UserComment | UNDEFINED | 42 | 3972 | UNICODE_I_t_'_s_ _1_8_ _.! _o_u_t_s_ +++
END: Stonehenge.jpg:12->15286
STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
END: Stonehenge.jpg:12->15286
END: Stonehenge.jpg:12->15286
STRUCTURE OF 8BIM FILE (MM): Stonehenge.jpg:17928->78
STRUCTURE OF IPTC FILE (MM): Stonehenge.jpg:17928->78:12->39
END: Stonehenge.jpg:17928->78:12->39
END: Stonehenge.jpg:17928->78
END: Stonehenge.jpg
738 rmills@rmillsmbp:~/temp $ dmpf Stonehenge.jpg --skip=$((12+3972)) --count=60 --width=20 bs=2
0xf90 3984: UNICODE_I_t_'_s_ _1_ -> 4e55 4349 444f 45 49 74 27 73 20 31
0xfa4 4004: 8_ _.! _o_u_t_s_i_d_ -> 38 20 2103 20 6f 75 74 73 69 64
----
0xfb8 4024: e_._.__....___.___09 -> 65 2 2 100 201 1 0 1 0 3930
I can use \u2103 to put more Degrees Celsius and \u2109 for Degrees Fahrenheit.
739 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode 1\u2103 degreesC == 1.8\u2109 degreesF" Stonehenge.jpg ;exiv2 -g Comment Stonehenge.jpg
Exif.Photo.UserComment Undefined 64 charset=Unicode 1℃ degreesC == 1.8℉ degreesF
740 rmills@rmillsmbp:~/temp $
This appears to work well. I'll wait for @LeoHsiao1 to comment before closing this. |
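For readers following the dmpf dump, here is a rough sketch that decodes the 16-bit words shown above. It assumes the words are little-endian (the file is reported as II) and that every word is one BMP character, which holds for this test string but not in general.

```python
import struct

# First four 16-bit words from the dump: the charset identifier.
header_words = [0x4E55, 0x4349, 0x444F, 0x0045]
header = struct.pack("<4H", *header_words)
print(header)  # b'UNICODE\x00'

# Words from the second dump row. All are below 0x10000, so each word
# maps directly to one character.
text_words = [0x38, 0x20, 0x2103, 0x20, 0x6F, 0x75, 0x74, 0x73, 0x69, 0x64]
fragment = "".join(chr(w) for w in text_words)
print(fragment)

# Size check for the two comments written above: header + 2 bytes per character.
first = "It's 18 \u2103 outside"
second = "1\u2103 degreesC == 1.8\u2109 degreesF"
print(len(header) + 2 * len(first))   # 42
print(len(header) + 2 * len(second))  # 64
```

Both sizes match what exiv2 printed, which is consistent with the point that BMP characters round-trip correctly.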
You've only been testing with Basic Multilingual Plane codes (U+FFFF and lower). The problem arises with the supplementary planes (U+10000 and above). Most of the characters in this range are obscure languages or little-used Chinese ideographs, but the range U+1F300 to U+1FAFF contains emoji and other symbols. |
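To make the distinction concrete, here is a minimal sketch of how UTF-16 represents a BMP character versus a supplementary-plane one. The helper names are illustrative only.

```python
def utf16_units(ch):
    """Return the 16-bit code units Python's UTF-16 encoder produces."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) by hand."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print([hex(u) for u in utf16_units("\u2103")])       # ['0x2103']
print([hex(u) for u in utf16_units("\U0001F600")])   # ['0xd83d', '0xde00']
print([hex(u) for u in surrogate_pair(0x1F600)])     # ['0xd83d', '0xde00']
```

The degree sign fits in one unit, while the smiley needs two, so any code path that assumes one unit per character only works for the BMP.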
I really don't know if I can help with this as it's totally beyond my limited skills in this area. Let's hear what @LeoHsiao1 has to say. I did try unsuccessfully to use the emoji characters this afternoon. Then I thought of messing with ℃ and that worked. If this concerns "obscure languages", why are you reporting this? What is your use case? |
I'm working on a photo organizer, and discovered the bug while testing sticking a smiley face into a photo description. |
OK. Let's hear what @LeoHsiao1 has to say. Exiv2 is about metadata. Exiv2 delegates Unicode to libiconv. |
As this topic is of great interest to me, I did a few tests to verify
my own understanding of this field and its content.
From reading the specs, I had concluded that, in order for the field
to contain anything other than plain ASCII, it had to start with the
string 'UNICODE' in plain ASCII and then be followed by the UTF-16
Unicode string.
My big problem was that I had been unable to find images with such
Unicode comments, especially images which I felt confident enough that
they actually would contain valid user comments.
My test with the given 'image' was to open it in a Hex editor and edit
the string of interest until the one utility, which I trust at this
stage, with this, WPMeta, showed the expected output.
FWIW, for the modified file it shows 2 smilies - because that is what
the bytes I entered represent, while for the original file it shows
similar characters, Chinese, I assume, as in the original post. At this
stage, I am not 100% sure this is the 'correct' way, partly because
Exiftool 11.63 does not even show anything for the UserComment for
either the original or my modified image, while for my modified image
Exiv2 0.27.3 gives
Copyright :
Exif comment : charset=Unicode Ôÿ¦Ôÿ¦
As for the test program shown by Carnildo, it obviously assumes that
Exiv2 will expect a UTF-8 string.
To be continued, I am sure :-)
Attached are my modified image as well as a screenshot of the hex editor
data for the relevant section.
Arnold
|
A literal reading of the EXIF 2.3 standard implies that it uses the Unicode standard from 1991, which would be version 1.0. We really, really don't want to follow the standard in that regard: Unicode 1.0 doesn't support Chinese, has different encodings for Korean and Tibetan, doesn't support BiDi formatting, and is largely incompatible with later versions of Unicode. The question becomes how far Exiv2 should go with ignoring the standard, which in turn becomes a question of how far other EXIF software has gone with ignoring the standard. From a practical standpoint, this comes down to encoding Unicode text as UCS-2 versus encoding it as UTF-16. UTF-16 permits the full range of Unicode characters, but may be incompatible with readers that expect UCS-2 (if there are any -- UTF-16 was introduced with Unicode 2.0, in 1996), or with poorly-written software that assumes that one 16-bit Unicode value equals one character. UCS-2 should be compatible with everything that supports Unicode, but only permits the first 65,536 Unicode characters. |
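A small sketch of the practical difference: a string is safe for a UCS-2 reader only if every character is at or below U+FFFF, and the "one 16-bit value equals one character" assumption breaks as soon as it is not. The function names here are hypothetical.

```python
def fits_in_ucs2(text):
    """True if every character is a single 16-bit code point."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf16_unit_count(text):
    """Number of 16-bit units the text occupies when encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


for sample in ("It's 18 \u2103 outside", "EX\U0001F600"):
    print(repr(sample), fits_in_ucs2(sample), len(sample), utf16_unit_count(sample))
```

For the first sample the character count and unit count agree (17 and 17). For the second they differ (3 characters, 4 units), which is exactly the case a UCS-2-only writer cannot represent.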
Thanks, Arnold. Your observation "to be continued" makes me nervous! We need to recruit an expert in this field. |
Enough! This matter is closed. In 12 years of working on Exiv2, this is the first time I have closed an issue because it is outside the scope of the project. The Exiv2 project is a cross-platform C++ library for 4 metadata standards in about 20 image formats. It supports Unicode for UserComment (and two other tags) by delegating to iconv. Without an expert to investigate other scenarios and "obscure" languages, nothing further can be done. |
This is solidly within the scope: the function call |
Are you offering to work on the code to deal with this? |
I've fixed it as follows. However iconv issues a warning:
773 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ bin/exiv2 -M"set Exif.Photo.UserComment charset=Unicode Smile: 😀" ~/temp/Stonehenge.jpg ;exiv2 -g Comment ~/temp/Stonehenge.jpg
Warning: iconv: Invalid argument (errno = 22) inbytesleft = 1
Exif.Photo.UserComment Undefined 19 Warning: iconv: Invalid argument (errno = 22) inbytesleft = 1
charset=Unicode Smile: 😀
774 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $
As I said yesterday. I need specialist help to deal with this. |
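As a hedged observation on the 19-byte size reported above: the arithmetic below compares what the value would occupy as UTF-8 and as UTF-16, each with an 8-byte charset header. It only checks byte counts and does not establish what Exiv2 or iconv actually did internally.

```python
text = "Smile: \U0001F600"
HEADER = 8  # charset identifier

utf8_size = HEADER + len(text.encode("utf-8"))
utf16_size = HEADER + len(text.encode("utf-16-le"))

print(utf8_size)   # 8 + 7 + 4 = 19
print(utf16_size)  # 8 + 7 * 2 + 4 = 26
```

The reported size matches the UTF-8 figure rather than the UTF-16 one, which may be worth keeping in mind alongside the iconv warning.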
I am reopening this issue, removing it from the 0.27.4 milestone and I am no longer assigned to this issue. |
@Carnildo: things are not as simple as you seem to think. Exiv2 does not live in its own world where it can do whatever the coders (or some users) think appropriate. Even though the Exif spec 2.3 may be ambiguous, Robin, and whoever else modifies the code, has to stick to the spec because other users depend on that commitment. Perhaps the documentation does not spell out the restrictions imposed by this particular field, but in that case it is the documentation which needs updating. |
I just saw this issue today. |
Thank You @LeoHsiao1 and Thank You @tester0077 for your feedback and input on this issue. There was a discussion last month concerning charset=Unicode and Chinese. I believe the comments here by @LeoHsiao1 confirm that our support works adequately. #1258 (comment) |
I developed my own version of pyexiv2, which passes Unicode strings to the C++ code as bytes. In principle, the C++ code is:
std::string key = py::bytes("something");
std::string value = py::bytes("something");
exifData[key] = value;
The Python example:
>>> import pyexiv2
>>> img = pyexiv2.Image(r'D:\test.png')
>>> img.modify_exif({'Exif.Image.ImageDescription': 'test-中文-'}, encoding='UTF-8')
>>> img.read_exif(encoding='UTF-8')
{'Exif.Image.ImageDescription': 'test-中文-'} |
As far as I understand things now, the problem reported really only
relates to UserComment, which the Exif spec treats very differently.
Another aspect of this report relates to a question of how one should
present input strings to Exiv2 when they are intended for that field.
FWIW, I am a bit confused as to where and how to respond best to this
and related issues, because discussions relating to it/them, overlap
several threads.
As well, the more I read and work with these issues, the more nuances
become apparent. The issue raised in #1258, related to
Exif.Image.Artist, IMO, also needs to take into account the software
which (presumably) added the string.
What I find curious is that not one of these 'editors' included the
Software string, except the Android version, which announces
itself as "Picasa".
Only some of them include the ExifVersion - though whether they apply
the spec accordingly, I am still unsure of.
Then again, if the originating software does not write the expected
string in the correct format, it is unfair, though somehow flattering,
to blame Exiv2 for making that apparent, even if, perchance, its rendition
is also flawed.
At this point I have taken to viewing the data mainly in their hex format
and find the differences 'interesting', though I have not really been
able to come to any final conclusion because my understanding of what
there ought to be is evolving as I see more data.
@robin: you refer to tvisitor in several places in the correspondence as
well as in the document (and it lists some of the functions used in the
app) you are writing. Is there a compiled or compileable version of the
full program available?
Arnold
|
Arnold. There are three "Comment" Exif tags and they are UserComment, GPSProcessingMethod and GPSAreaInformation. They are in fact stored as an undefined byte stream. The first 8 bytes define the charset, followed by the encoded byte stream. For example:
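A minimal sketch of that layout, for illustration only. The 8-byte identifiers are the character codes defined in the Exif specification. The function names, and the choice of little-endian UTF-16 for the Unicode payload, are assumptions made here and not a description of Exiv2's own code.

```python
CHARSET_CODES = {
    "Ascii": b"ASCII\x00\x00\x00",
    "Jis": b"JIS\x00\x00\x00\x00\x00",
    "Unicode": b"UNICODE\x00",
    "Undefined": b"\x00" * 8,
}


def build_user_comment(text, charset="Unicode", codec="utf-16-le"):
    """Hypothetical helper: 8-byte charset code followed by the encoded text."""
    if charset == "Unicode":
        payload = text.encode(codec)  # byte order is an assumption
    elif charset == "Ascii":
        payload = text.encode("ascii")
    else:
        raise ValueError("charset not handled in this sketch")
    return CHARSET_CODES[charset] + payload


def parse_user_comment(raw, codec="utf-16-le"):
    """Hypothetical helper: split the header from the payload and decode it."""
    header, payload = raw[:8], raw[8:]
    if header == CHARSET_CODES["Unicode"]:
        return "Unicode", payload.decode(codec)
    if header == CHARSET_CODES["Ascii"]:
        return "Ascii", payload.decode("ascii")
    raise ValueError("charset not handled in this sketch")


raw = build_user_comment("It's 18 \u2103 outside")
print(len(raw))                 # 42
print(parse_user_comment(raw))
print(len(build_user_comment("Robin", "Ascii")))  # 13
```

Because the payload here goes through a real UTF-16 encoder, a supplementary-plane character would be written as a surrogate pair rather than as raw UTF-8 bytes.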
The point made by @Carnildo about accepting smileys and other Unicode strings outside the Basic Multilingual Plane may be worthy of attention; however, I don't intend to work on this as I know almost nothing about Unicode, JIS, code pages and related technology. To make progress with this, I believe we need a specialist on the team.
The Exif.Image.Artist is defined in the Adobe TIFF 6.0 specification as follows:
Exiv2 allows you to enter non 7-bit ASCII characters into "Ascii" tags. For example:
So © is two bytes.
The smiley (a 4 byte character) survives similar treatment:
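As a rough check of those byte counts, assuming the terminal sent UTF-8 (which is typical on macOS but is an assumption here):

```python
copyright_sign = "\u00A9"   # ©
smiley = "\U0001F600"       # grinning face

print(len(copyright_sign.encode("utf-8")))  # 2
print(len(smiley.encode("utf-8")))          # 4

# An "Ascii" tag stores the bytes plus a terminating NUL.
artist = "Copyright \u00A9 2020"
print(len(artist.encode("utf-8")) + 1)      # 17 + 1 = 18
```

This only shows that the bytes pass through an "Ascii" tag unchanged. It does not make the tag's content ASCII, so readers that enforce 7-bit ASCII may still display it differently.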
The code for tvisitor.cpp (and utilities dmpf.cpp and args.cpp) is documented in my book and available (source only) from svn://dev.exiv2.org/svn/team/book The book is available at https://clanmills.com/exiv2/book Please beware those materials are "work in progress" and change frequently. Your feedback is welcome and appreciated, however I don't provide support and hope to avoid discussion while the book is in development. These utilities are written in C++11 and I believe they build on most desktop platforms. They are "single file" programs with no dependencies. |
Hi Robin,
all of this seems to be a work in progress, for myself in any case.
On 2020-09-08 11:22 AM, Robin Mills wrote:
Arnold. There are three "Comment" Exif tags and they are UserComment,
GPSProcessingMethod and GPSAreaInformation. They are in fact stored as
an undefined byte stream. The first 8 bytes define the charset
following by the encoded byte stream.
Understood, though for now, my focus is mostly on UserComment. The other
2 I am not as familiar with as I'd like, though from what I see in the
Exif 2 spec, they seem to be of the same kind.
FWIW, testing the results of Exiv2 modifying data in a file using the
output from Exiv2 is not really conclusive of anything but the fact that
Exiv2's reading and writing of metadata is consistent. :-(
The point that has been made by @Carnildo
<https://github.com/Carnildo> about accepting smiley's and other
Unicode strings outside the Basic Multilingual Plane maybe worthy of
attention, however I don't intend to work on this I know almost
nothing about Unicode, Jis, codepages and related technology. To make
progress with this, I believe we need a specialist on the team.
Understood as well.
FWIW & IMO, the 'Artist' field is not intended (using the spec you
quote) to receive anything but ASCII. If some app allows the user to
enter characters from an extended char set, then the problem is really
with that app not following the spec, and certainly not with Exiv2.
The Exif.Image.Artist is defined in the Adobe Tiff 6.0 specification
as follows:
screenshot_31
<https://user-images.githubusercontent.com/529982/92511632-7d598b00-f205-11ea-9a3f-0029931742c5.png>
Exiv2 allows you to enter non 7-bit ascii characters into "Ascii"
tags. For example:
1155 ***@***.***:~/temp $ exiv2 -M'set Exif.Image.Artist Copyright © 2020' Stonehenge.jpg
1156 ***@***.***:~/temp $ exiv2 -g UserComment Stonehenge.jpg
Exif.Photo.UserComment Undefined 13 charset=Ascii Robin
1157 ***@***.***:~/temp $ exiv2 -g Artist Stonehenge.jpg
Exif.Image.Artist Ascii 18 Copyright © 2020
1158 ***@***.***:~/temp $
Again, Exiv2 may allow this sort of data entry and reproduce it on
output, but that does not show that that is what the spec intended;
rather, it verifies the fact that some apps may interpret the spec to (the
best of) their understanding or preference. The images provided by
norbertj42 <https://github.com/norbertj42> give ample evidence of that
notion.
The code for tvisitor.cpp (and utilities dmpf.cpp and args.cpp) is
documented in my book and available (source only) from
svn://dev.exiv2.org/svn/team/book The book is available at
https://clanmills.com/exiv2/book Please beware those materials are
"work in progress" and change frequently. Your feedback is welcome and
appreciated, however I don't provide support and hope to avoid
discussion while the book is in development. These utilities are
written in C++11 and I believe they build on most desktop platforms.
They are "single file" programs with no dependencies.
Thank you, Robin.
Found the code and have compiled tvisitor under MSVC 2019 without much
of any fuss. Had to do some casting and ignore a bunch of warnings, but
it runs OK. Still have to sort out how to use the options, but that will
come with time.
Arnold
|
@tester0077 I've updated dmpf.cpp to build without warnings using msvc2019. I believe the other programs tvisitor/args/visitor are already "warning free". parse.cpp produces many many warnings. It's fine to ignore those warnings because you should have no reason to use parse.exe. That is Dave Coffin's code and is currently in the repos (and built by Xcode) as I am using it to understand and document CRW. It will be removed before the book is finished. Dave's code is parse.c and mentioned in the book. |
I'm closing this issue. I don't believe we have the necessary skills to pursue this matter. I'm delighted to say that Exiv2 has a team of 8 enthusiastic contributors and I am working on a plan to release v1.00 on 2021-12-15. We don't have the skills in the team to work with 'characters outside the Basic Multilingual Plane'. |
Describe the bug
When setting an EXIF comment ("Exif.Photo.UserComment") to a value that contains characters outside the Basic Multilingual Plane, the value is written as-is, which produces incorrect Unicode in the file.
To Reproduce
The following code, tested with Exiv2 0.27.3, demonstrates the problem.
It will produce the following output on the command line:
Windows 10 File Explorer and exiftool agree that the comment that was written to the file is three Unicode characters: "塅鿰肘"
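The three characters can be reproduced with a short sketch. It assumes the comment in the report was "EX" followed by U+1F600 supplied as UTF-8, as the string quoted elsewhere in this thread suggests, and that readers interpret the stored bytes as little-endian UTF-16.

```python
intended = "EX\U0001F600"

# The UTF-8 bytes, written to the file as-is after the charset header.
utf8_bytes = intended.encode("utf-8")
print(utf8_bytes)  # b'EX\xf0\x9f\x98\x80'

# A reader that expects UTF-16 pairs those six bytes into three 16-bit units.
misread = utf8_bytes.decode("utf-16-le")
print([hex(ord(ch)) for ch in misread])  # ['0x5845', '0x9ff0', '0x8098']

# What a UTF-16 writer would have stored instead: E, X and a surrogate pair.
expected_bytes = intended.encode("utf-16-le")
print(expected_bytes.hex())  # 450058003dd800de
```

The code points U+5845, U+9FF0 and U+8098 appear to be the three CJK characters that File Explorer and exiftool display, which supports the description that the value was written without conversion.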
Example 2x2-pixel image with incorrect metadata:
Expected behavior
Either:
Desktop (please complete the following information):