Exiv2 doesn't correctly handle characters outside the Basic Multilingual Plane #1279

Closed
Carnildo opened this issue Sep 6, 2020 · 25 comments

Comments

@Carnildo

Carnildo commented Sep 6, 2020

Describe the bug
When setting an EXIF comment ("Exif.Photo.UserComment") to a value that contains characters outside the Basic Multilingual Plane, the value is written as-is, which produces incorrect Unicode in the file.

To Reproduce
The following code, tested with Exiv2 0.27.3, demonstrates the problem.

#include <exiv2/exiv2.hpp>
int main(void)
{
    auto exivImage = Exiv2::ImageFactory::open("img.jpg", false);
    exivImage->readMetadata();
    auto exif = exivImage->exifData();
    exif["Exif.Photo.UserComment"] = std::string("charset=\"Unicode\" EX\xf0\x9f\x98\x80"); // the letters "EX", followed by a smiley
    exivImage->setExifData(exif);
    exivImage->writeMetadata();
}

It will produce the following output on the command line:

Warning: iconv: Invalid or incomplete multibyte or wide character (errno = 84) inbytesleft = 4

Windows 10 File Explorer and exiftool agree that the comment that was written to the file is three Unicode characters: "塅鿰肘"
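The three characters can be reproduced by reinterpreting the raw bytes. A minimal sketch in Python, assuming (as the warning suggests) that the conversion failed, the six UTF-8 bytes were stored unchanged, and readers then decoded them as little-endian 16-bit units:

```python
# The comment text from the report: "EX" followed by U+1F600, as UTF-8 bytes.
utf8_bytes = "EX\U0001F600".encode("utf-8")

# Assumption: these six bytes reached the file unconverted and are read back
# as three little-endian 16-bit code units.
misread = utf8_bytes.decode("utf-16-le")

print(utf8_bytes.hex(" "))                     # 45 58 f0 9f 98 80
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+5845', 'U+9FF0', 'U+8098']
```

The three code points are the ones File Explorer and exiftool display.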

Example 2x2-pixel image with incorrect metadata:
[image: img]

Expected behavior
Either:

  1. The string gets converted to UTF-16 and stored in the EXIF data, or
  2. An exception is thrown, since Unicode code points outside the BMP are not valid UCS-2.
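The two options can be sketched in Python. `encode_unicode_comment` and its `allow_supplementary` flag are hypothetical names for illustration, not part of Exiv2:

```python
def encode_unicode_comment(text, allow_supplementary):
    """Hypothetical helper: encode a comment body as little-endian 16-bit
    units under one of the two proposed policies."""
    outside_bmp = [ch for ch in text if ord(ch) > 0xFFFF]
    if outside_bmp and not allow_supplementary:
        # Option 2: refuse, because UCS-2 cannot represent these code points.
        raise ValueError(f"not representable in UCS-2: {outside_bmp!r}")
    # Option 1: UTF-16 stores each supplementary character as a surrogate pair.
    return text.encode("utf-16-le")


encoded = encode_unicode_comment("EX\U0001F600", allow_supplementary=True)
print(encoded.hex(" "))  # 45 00 58 00 3d d8 00 de
```

With `allow_supplementary=False` the same call raises `ValueError` instead of writing anything.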

Desktop (please complete the following information):

  • OS: Linux
  • Compiler & Version GCC 9.3.0
@Carnildo Carnildo added the bug label Sep 6, 2020
@clanmills
Collaborator

clanmills commented Sep 6, 2020

This is very helpful. Thank you very much. As a native English speaker, I have no feel for Unicode, code pages and the other magic involved. I understand 7-bit ASCII and little else.

I'm very pleased to say that @LeoHsiao1 has recently done a lot of work on our test suite. As he is Chinese, I hope he'll be able to comment and investigate this matter.

From your comments, such as the mention of the Basic Multilingual Plane, you sound knowledgeable about this topic. I would very much appreciate help with this matter. The code involved isn't complicated, as Exiv2 delegates to the iconv library.

With the right people on this task, I am confident of success.

@clanmills
Collaborator

Some comments about this. I think the syntax is:

exif["Exif.Photo.UserComment"] = std::string("charset=Unicode EX\xf0\x9f\x98\x80");
659 rmills@rmillsmbp:~/temp $ curl -LO --silent https://clanmills.com/Stonehenge.jpg
660 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  44  charset=Ascii                                     
661 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode EX\xf0\x9f\x98\x80" Stonehenge.jpg 
662 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  44  charset=Unicode EX\xf0\x9f\x98\x80
663 rmills@rmillsmbp:~/temp $ exiftool Stonehenge.jpg | grep -i comment
User Comment                    : EX\xf0\x9f\x98\x80
664 rmills@rmillsmbp:~/temp $ 

Please don't regard this as a denial that there could be problems with our Unicode and other charset handling.

I think it has encoded the 18-byte string "EX\xf0\x9f\x98\x80" (the shell passed the backslash escapes through as literal text) as 36 bytes of Unicode plus 8 bytes for the charset definition, 44 bytes in all. I can dump the "raw data" with the program tvisitor (which is in my book): https://clanmills.com/exiv2/book/
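The size arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation, assuming the escapes arrived as literal text and each character takes two bytes:

```python
# A raw string keeps the backslashes, so the body is 18 ordinary ASCII
# characters rather than "EX" plus an emoji.
literal = r"EX\xf0\x9f\x98\x80"

charset_header = 8                       # bytes reserved for the charset code
body = len(literal.encode("utf-16-le"))  # two bytes per BMP character
print(len(literal), body, charset_header + body)  # 18 36 44
```

The total matches the size of 44 reported by exiv2 above.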

665 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg | grep -i comment
         428 | 0x9286 Exif.Photo.UserComment       | UNDEFINED |       44 |       820 | UNICODE_E_X_\_x_f_0_\_x_9_f_\_x_9_8_ +++
666 rmills@rmillsmbp:~/temp $ 

I have no expertise in working with character sets. We'll need a specialist to help with this. Perhaps @LeoHsiao1 knows what's involved.

@clanmills
Collaborator

It's not hopeless. Something is working when I cut'n'paste your bamboo poles 塅

672 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode 塅" Stonehenge.jpg 
673 rmills@rmillsmbp:~/temp $ 
673 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  10  charset=Unicode 塅
674 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg | grep -i comment
         428 | 0x9286 Exif.Photo.UserComment       | UNDEFINED |       10 |       820 | UNICODE_EX
675 rmills@rmillsmbp:~/temp $ 

As documented in the man page exiv2.1, I can use this to encode "Robin" in Unicode:

charset=Unicode \u0052\u006f\u0062\u0069\u006e
677 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  10  charset=Unicode 饕
678 rmills@rmillsmbp:~/temp $ exiftool Stonehenge.jpg | grep -i comment
User Comment                    : 饕
679 rmills@rmillsmbp:~/temp $ 

@clanmills
Collaborator

Everything seems to be working OK. Unicode \u2103 is the Degree Celsius sign (℃). https://en.wikipedia.org/wiki/Degree_symbol

736 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode It's 18 ℃ outside" Stonehenge.jpg ;exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  42  charset=Unicode It's 18 ℃ outside

I am using the program dmpf from my book to confirm that macOS Terminal did insert the correct Unicode.

737 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg  | grep -e Comment -e Stonehenge.jpg 
STRUCTURE OF JPEG FILE (II): Stonehenge.jpg
  STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
    STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
      STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286:830->3142
      END: Stonehenge.jpg:12->15286:830->3142
         428 | 0x9286 Exif.Photo.UserComment       | UNDEFINED |       42 |      3972 | UNICODE_I_t_'_s_ _1_8_ _.! _o_u_t_s_ +++
    END: Stonehenge.jpg:12->15286
    STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
    END: Stonehenge.jpg:12->15286
  END: Stonehenge.jpg:12->15286
  STRUCTURE OF 8BIM FILE (MM): Stonehenge.jpg:17928->78
    STRUCTURE OF IPTC FILE (MM): Stonehenge.jpg:17928->78:12->39
    END: Stonehenge.jpg:17928->78:12->39
  END: Stonehenge.jpg:17928->78
END: Stonehenge.jpg
738 rmills@rmillsmbp:~/temp $ dmpf Stonehenge.jpg --skip=$((12+3972)) --count=60 --width=20 bs=2
   0xf90     3984: UNICODE_I_t_'_s_ _1_  ->  4e55 4349 444f   45   49   74   27   73   20   31
   0xfa4     4004: 8_ _.! _o_u_t_s_i_d_  ->    38   20 2103   20   6f   75   74   73   69   64
                                                       ----
   0xfb8     4024: e_._.__....___.___09  ->    65    2    2  100  201    1    0    1    0 3930

I can use \u2103 to put more Degrees Celsius and \u2109 for Degrees Fahrenheit.

739 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode 1\u2103 degreesC == 1.8\u2109 degreesF" Stonehenge.jpg ;exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  64  charset=Unicode 1℃ degreesC == 1.8℉ degreesF
740 rmills@rmillsmbp:~/temp $

This appears to work well. I'll wait for @LeoHsiao1 to comment before closing this.
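The sizes reported above can be cross-checked in Python. `expand_u_escapes` is a hypothetical stand-in for the escape handling, not Exiv2's actual implementation, and the size formula assumes an 8-byte charset code plus two bytes per character:

```python
import re


def expand_u_escapes(text):
    # Replace each backslash-u escape followed by four hex digits
    # with the character it names.
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)


comment = expand_u_escapes(r"1\u2103 degreesC == 1.8\u2109 degreesF")
size = 8 + len(comment.encode("utf-16-le"))
print(len(comment), size)  # 28 64
```

Both degree signs are in the Basic Multilingual Plane, so each still fits in a single 16-bit unit.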

@Carnildo
Author

Carnildo commented Sep 6, 2020

You've only been testing with Basic Multilingual Plane codes (U+FFFF and lower). The problem arises with the supplementary planes (U+10000 and above). Most of the characters in this range are obscure languages or little-used Chinese ideographs, but the range U+1F300 to U+1FAFF contains emoji and other symbols.
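A quick way to tell the two cases apart is to compare each code point with U+FFFF. A minimal sketch (the function name is illustrative):

```python
def outside_bmp(text):
    """List the code points in text that lie above U+FFFF."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF]


print(outside_bmp("It's 18 \u2103 outside"))  # []
print(outside_bmp("Smile: \U0001F600"))       # ['U+1F600']
```

The degree sign tested earlier never leaves the BMP, which is why those tests passed.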

@clanmills
Collaborator

I really don't know if I can help with this as it's totally beyond my limited skills in this area. Let's hear what @LeoHsiao1 has to say.

I did try unsuccessfully to use the emoji characters this afternoon. Then I thought of messing with ℃ and that worked.

If this concerns "obscure languages", why are you reporting this? What is your use case?

@Carnildo
Author

Carnildo commented Sep 6, 2020

I'm working on a photo organizer, and discovered the bug while testing sticking a smiley face into a photo description.

@clanmills
Collaborator

clanmills commented Sep 6, 2020

OK. Let's hear what @LeoHsiao1 has to say. Exiv2 is about metadata. Exiv2 delegates Unicode conversion to libiconv.

@clanmills clanmills removed the bug label Sep 6, 2020
@clanmills clanmills self-assigned this Sep 6, 2020
@clanmills clanmills added this to the v0.27.4 milestone Sep 6, 2020
@tester0077
Collaborator

tester0077 commented Sep 6, 2020 via email

@Carnildo
Author

Carnildo commented Sep 6, 2020

A literal reading of the EXIF 2.3 standard implies that it uses the Unicode standard from 1991, which would be version 1.0. We really, really don't want to follow the standard in that regard: Unicode 1.0 doesn't support Chinese, has different encodings for Korean and Tibetan, doesn't support BiDi formatting, and is largely incompatible with later versions of Unicode.

The question becomes how far Exiv2 should go with ignoring the standard, which in turn becomes a question of how far other EXIF software has gone with ignoring the standard. From a practical standpoint, this comes down to encoding Unicode text as UCS-2 versus encoding it as UTF-16.

UTF-16 permits the full range of Unicode characters, but may be incompatible with readers that expect UCS-2 (if there are any -- UTF-16 was introduced with Unicode 2.0, in 1996), or with poorly-written software that assumes that one 16-bit Unicode value equals one character.

UCS-2 should be compatible with everything that supports Unicode, but only permits the first 65,536 Unicode characters.
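The difference between the two encodings shows up only above U+FFFF. A sketch of the UTF-16 rule in Python, written out by hand rather than using the built-in codec so the surrogate arithmetic is visible:

```python
def to_utf16_units(text):
    """Return the 16-bit code units UTF-16 uses for text."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            units.append(cp)  # identical to UCS-2
        else:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units


bmp_only = to_utf16_units("18 \u2103")
with_emoji = to_utf16_units("Q\U0001F600")
print([hex(u) for u in bmp_only])    # ['0x31', '0x38', '0x20', '0x2103']
print([hex(u) for u in with_emoji])  # ['0x51', '0xd83d', '0xde00']
```

The second string has two characters but three 16-bit units, which is the case that trips up software assuming one unit per character.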

@clanmills
Collaborator

Thanks, Arnold. Your observation "to be continued" makes me nervous! We need to recruit an expert in this field.

@clanmills
Collaborator

Enough! This matter is closed. In 12 years of working on Exiv2, this is the first time I have closed an issue because it is outside the scope of the project.

The Exiv2 project is a cross-platform C++ library for 4 metadata standards in about 20 image formats. It supports Unicode for UserComment (and two other tags) by delegating to iconv.

Without an expert to investigate other scenarios and "obscure" languages, nothing further can be done.

@Carnildo
Author

Carnildo commented Sep 7, 2020

This is solidly within the scope: the function call exif["Exif.Photo.UserComment"] = std::string("charset=\"Unicode\" Q\xf0\x9f\x98\x80"); causes the Exiv2 library to produce a file that is not valid under any reading of the EXIF 2.3 standard. I've proposed two fixes (either throw an error, or store the string as UTF-16 rather than UCS-2); I don't know which is better.

@clanmills
Collaborator

Are you offering to work on the code to deal with this?

@clanmills
Collaborator

I've changed it as follows. However, iconv still issues a warning:

773 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ bin/exiv2 -M"set Exif.Photo.UserComment charset=Unicode Smile: 😀" ~/temp/Stonehenge.jpg ;exiv2 -g Comment ~/temp/Stonehenge.jpg 
Warning: iconv: Invalid argument (errno = 22) inbytesleft = 1
Exif.Photo.UserComment                       Undefined  19  Warning: iconv: Invalid argument (errno = 22) inbytesleft = 1
charset=Unicode Smile: 😀
774 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ 
774 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ git diff
diff --git a/src/value.cpp b/src/value.cpp
index 5bd815e2..ab21d5be 100644
--- a/src/value.cpp
+++ b/src/value.cpp
@@ -511,7 +511,7 @@ namespace Exiv2 {
         }
         if (charsetId == unicode) {
             const char* to = byteOrder_ == littleEndian ? "UCS-2LE" : "UCS-2BE";
-            convertStringCharset(c, "UTF-8", to);
+            convertStringCharset(c, "UTF-16", to);
         }
         const std::string code(CharsetInfo::code(charsetId), 8);
         return StringValueBase::read(code + c);
775 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ 

As I said yesterday, I need specialist help to deal with this.
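One possible reading of the new warning, sketched in Python rather than with iconv itself: the command-line text is still UTF-8, and it has an odd number of bytes, so a decoder told to expect UTF-16 is left with one stray byte. This is an assumption about the cause, not a confirmed diagnosis:

```python
utf8_bytes = "Smile: \U0001F600".encode("utf-8")
print(len(utf8_bytes))  # 11: seven ASCII bytes plus a four-byte emoji

# Treating the UTF-8 bytes as UTF-16 input fails on the leftover byte.
try:
    utf8_bytes.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)    # True

# Declaring the input as UTF-8 and the output as UTF-16 converts cleanly.
converted = utf8_bytes.decode("utf-8").encode("utf-16-le")
print(len(converted))   # 18
```

Under that reading, the reported size of 19 would be the 8-byte charset code plus the 11 unconverted bytes.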

@clanmills clanmills removed this from the v0.27.4 milestone Sep 7, 2020
@clanmills clanmills removed their assignment Sep 7, 2020
@clanmills
Collaborator

clanmills commented Sep 7, 2020

I am reopening this issue, removing it from the 0.27.4 milestone and I am no longer assigned to this issue.

@clanmills clanmills reopened this Sep 7, 2020
@tester0077
Collaborator

@Carnildo: things are not as simple as you seem to think. Exiv2 does not live in its own world where it can do whatever the coders (or some users) think appropriate. Even though the Exif 2.3 spec may be ambiguous, Robin, and whoever else modifies the code, has to stick to the spec because other users depend on that commitment. Perhaps the documentation does not spell out the restrictions imposed by this particular field, but in that case it is the documentation that needs updating.

@LeoHsiao1
Contributor

I just saw this issue today.
I am not familiar with the Basic Multilingual Plane. As a Chinese speaker, the encoding I use most often is UTF-8.
Exiv2 supports UTF-8 characters, which would have satisfied almost all my needs. Otherwise I wouldn't have continued to use exiv2.

@clanmills
Collaborator

Thank You @LeoHsiao1 and Thank You @tester0077 for your feedback and input on this issue.

There was a discussion last month concerning charset=Unicode and Chinese. I believe the comments here by @LeoHsiao1 confirm that our support works adequately. #1258 (comment)

@LeoHsiao1
Contributor

I developed my own version of pyexiv2, which converts Unicode strings to bytes and then saves them to the image.

C++ code, in principle:

// "something" is a placeholder for the bytes passed in from Python
std::string key = py::bytes("something");
std::string value = py::bytes("something");
exifData[key] = value;

The Python example:

>>> import pyexiv2
>>> img = pyexiv2.Image(r'D:\test.png')
>>> img.modify_exif({'Exif.Image.ImageDescription': 'test-中文-'}, encoding='UTF-8')
>>> img.read_exif(encoding='UTF-8')
{'Exif.Image.ImageDescription': 'test-中文-'}

@tester0077
Collaborator

tester0077 commented Sep 8, 2020 via email

@clanmills
Collaborator

Arnold. There are three "Comment" Exif tags: UserComment, GPSProcessingMethod and GPSAreaInformation. They are in fact stored as an undefined byte stream. The first 8 bytes define the charset, followed by the encoded byte stream.

For example:

1143 rmills@rmillsmbp:~/temp $ curl -OL --silent https://clanmills.com/Stonehenge.jpg
1145 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Photo.UserComment charset=Unicode Robin' Stonehenge.jpg 
1146 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  18  charset=Unicode Robin   # 18 = 8 + 5x2
1147 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Photo.UserComment charset=Ascii Robin' Stonehenge.jpg 
1148 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  13  charset=Ascii Robin # 13 = 8 + 5
1149 rmills@rmillsmbp:~/temp $ 
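The layout can be sketched in Python. The two 8-byte codes are the ones the Exif specification defines for ASCII and Unicode; `build_comment` and `split_comment` are illustrative names, and little-endian byte order is assumed:

```python
CHARSET_CODES = {
    "Ascii": b"ASCII\x00\x00\x00",
    "Unicode": b"UNICODE\x00",
}


def build_comment(charset, text):
    """Prefix the encoded text with its 8-byte charset code."""
    if charset == "Unicode":
        # For BMP-only text, UTF-16 and UCS-2 give the same bytes.
        body = text.encode("utf-16-le")
    else:
        body = text.encode("ascii")
    return CHARSET_CODES[charset] + body


def split_comment(raw):
    """Separate the charset code from the encoded byte stream."""
    return raw[:8], raw[8:]


print(len(build_comment("Unicode", "Robin")))  # 18
print(len(build_comment("Ascii", "Robin")))    # 13
```

These are the same sizes exiv2 reports in the transcript above.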

The point made by @Carnildo about accepting smileys and other Unicode strings outside the Basic Multilingual Plane may be worthy of attention. However, I don't intend to work on this, as I know almost nothing about Unicode, JIS, code pages and related technology. To make progress with this, I believe we need a specialist on the team.

The Exif.Image.Artist tag is defined in the Adobe TIFF 6.0 specification as follows:
[image: screenshot_31]

Exiv2 allows you to enter characters outside 7-bit ASCII into "Ascii" tags. For example:

1155 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Image.Artist Copyright © 2020' Stonehenge.jpg 
1156 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  13  charset=Ascii Robin
1157 rmills@rmillsmbp:~/temp $ exiv2 -g Artist Stonehenge.jpg 
Exif.Image.Artist                            Ascii      18  Copyright © 2020
1158 rmills@rmillsmbp:~/temp $ 

So © is stored in the metadata as two bytes, 0xc2 0xa9 (its UTF-8 encoding).

1160 rmills@rmillsmbp:~/temp $ echo © | dmpf -
       0        0: ...                               ->  c2 a9 0a
1161 rmills@rmillsmbp:~/temp $ 

The smiley (a 4 byte character) survives similar treatment:

1162 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Image.Artist Smile 😀 please' Stonehenge.jpg 
1163 rmills@rmillsmbp:~/temp $ exiv2 -g Artist Stonehenge.jpg 
Exif.Image.Artist                            Ascii      18  Smile 😀 please
1164 rmills@rmillsmbp:~/temp $ 
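The byte counts reported for the Artist tag are consistent with the text being stored as UTF-8. A sketch in Python, assuming the reported count includes one terminating NUL byte:

```python
samples = ("Copyright \u00a9 2020", "Smile \U0001F600 please")

stored_sizes = []
for text in samples:
    stored = text.encode("utf-8") + b"\x00"  # assumed trailing NUL
    stored_sizes.append(len(stored))
    print(len(text), len(stored))
# prints "16 18" then "14 18"

print("\u00a9".encode("utf-8").hex(" "))      # c2 a9
print("\U0001F600".encode("utf-8").hex(" "))  # f0 9f 98 80
```

No 16-bit conversion is involved for "Ascii" tags, which would explain why the four-byte smiley passes through untouched.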

The code for tvisitor.cpp (and utilities dmpf.cpp and args.cpp) is documented in my book and available (source only) from svn://dev.exiv2.org/svn/team/book The book is available at https://clanmills.com/exiv2/book Please beware those materials are "work in progress" and change frequently. Your feedback is welcome and appreciated, however I don't provide support and hope to avoid discussion while the book is in development. These utilities are written in C++11 and I believe they build on most desktop platforms. They are "single file" programs with no dependencies.

@tester0077
Collaborator

tester0077 commented Sep 8, 2020 via email

@clanmills
Collaborator

@tester0077 I've updated dmpf.cpp to build without warnings using msvc2019. I believe the other programs tvisitor/args/visitor are already "warning free".

parse.cpp produces many many warnings. It's fine to ignore those warnings because you should have no reason to use parse.exe. That is Dave Coffin's code and is currently in the repos (and built by Xcode) as I am using it to understand and document CRW. It will be removed before the book is finished. Dave's code is parse.c and mentioned in the book.

@clanmills
Collaborator

I'm closing this issue. I don't believe we have the necessary skills to pursue this matter.

I'm delighted to say that Exiv2 has a team of 8 enthusiastic contributors and I am working on a plan to release v1.00 on 2021-12-15. We don't have the skills in the team to work with 'characters outside the Basic Multilingual Plane'.

@clanmills clanmills added this to the Backlog milestone Apr 17, 2021
@kevinbackhouse kevinbackhouse removed this from the Backlog milestone Nov 4, 2023