Clarifying Kanji Mode encodation in QR Code #267

terryburton · 2024-06-18T14:33:13Z

terryburton
Jun 18, 2024
Maintainer

The following aims to clarify the intent of the QR Code symbology specification in relation to the common misuse of Kanji Mode encodation as a character set interpretation indicator. It is intended for developers of QR Code decoders and will be raised with the standards committee responsible for the ISO/IEC 18004 standard.

Decoder developers are commonly misled by some of the wording in the current specification, which results in incorrect decodes of correctly encoded symbols, especially in cases where the encoder has legitimately applied opportunistic Kanji Mode compaction to non-Kanji character data or byte data: #267 (comment)

Note: Section numbers are based on the 18004:2023 DIS version of the document. All text referred to is equivalent to the 2005 and later versions of the standard.

Kanji Mode is for data compaction, not character set indication

The purpose of Kanji Mode is to provide a data compaction opportunity, as with any other symbology encodation mode. It conveys absolutely nothing about the character set interpretation for data encoded in such a mode. It is entirely legitimate for certain input byte sequences to be compacted using Kanji Mode encodation irrespective of whether or not they are intended to be interpreted as Kanji characters. Such byte sequences might instead originate from raw binary data, or from characters or fragments of multi-byte characters from other code pages, to be interpreted according to their respective character set (as signalled by ECI), and not as Shift JIS.

Section 7.3.6 "Kanji mode":

... It may be possible to achieve a shorter bit stream by using the Kanji mode compaction rules when an appropriate sequence of byte values occurs in the data (i.e. lead bytes in the ranges 81 to 9F and/or E0 to EA followed by trailer bytes in the range 40 to FC, except 7F, or EB followed by 40 to BF).

Figure H.1 "Shift JIS character values":

According to JIS X 0208:1997, Annex 1, leading and trailing bytes within the ranges shown shaded are assigned to Shift JIS Kanji characters. Any pairs of bytes within these ranges may be encoded using the Kanji mode compaction scheme.
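To make the byte ranges concrete, here is a minimal Python sketch of the eligibility test an encoder might apply to a pair of input bytes. The function name `is_kanji_compactable` is hypothetical, and treating lead byte EB as limited to trailers 40 to BF (and excluding 7F there too) is my reading of the quoted text, not a quotation from the standard.

```python
def is_kanji_compactable(lead: int, trail: int) -> bool:
    """Return True if the byte pair falls inside the shaded Shift JIS ranges.

    Says nothing about what the bytes *mean*; it only tests whether the
    Kanji Mode compaction scheme is able to represent them.
    """
    if trail == 0x7F:
        # Assumption: 7F is excluded as a trailer for every lead byte.
        return False
    if lead == 0xEB:
        return 0x40 <= trail <= 0xBF
    if 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEA:
        return 0x40 <= trail <= 0xFC
    return False


# Fragments of UTF-8 text can qualify just as well as real Shift JIS Kanji.
print(is_kanji_compactable(0xE7, 0x9C))  # two bytes taken from a UTF-8 sequence
print(is_kanji_compactable(0x93, 0x5F))  # a genuine Shift JIS double byte
print(is_kanji_compactable(0x8B, 0x31))  # trailer out of range
```

Note that the test takes only byte values as input. There is no character set parameter because the compaction decision does not depend on one.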

Data compaction has nothing whatsoever to do with character set encoding ("interpretation"), which is communicated exclusively using ECI, nor does it have any bearing on the default interpretation.

Section 7.3.2 "Extended Channel Interpretation (ECI) mode":

The default interpretation for QR Code is ECI 000003 representing the ISO/IEC 8859-1 character set. // International applications using other character sets should use the ECI protocol. For instance, the interpretation corresponding to the JIS8 (JIS X 0201, 7-bit and 8-bit coded character sets for information interchange) and Shift JIS (JIS X 0208:1997, 7-bit and 8-bit double byte coded KANJI sets for information interchange) character sets is ECI 000020.

Wider context: Reader to host interface

Barcode symbology standards are prescriptive and presuppose a hypothetical situation in which a barcode reader is connected to a host over a physical interface that can reliably transport byte values encoded within the symbol, with no additional signalling.

The barcode symbol is a carrier for a sequence of bytes. When a barcode is scanned by a reader, the transmission protocol defined by the symbology standard specifies how the content of the symbol is to be sent over the interface in the form of raw byte values, irrespective of whether the real-world interface has rich capabilities (such as an interface driver over USB) or is especially restrictive (such as a keyboard wedge). Within the symbol there may be non-data function characters and macros that determine how the codeword contents are de-compacted into bytes for transmission to the host, but these non-data characters are not transmitted.

The transmission protocol defines that each transmitted byte of message content corresponds directly to a single data character. This simple data stream is known as the “Basic Channel”. The barcode standards specify that the interpretation of all of the bytes received from the reader (i.e. the entire message) must be strictly in accordance with the symbology's default character encoding. Typically this is ISO/IEC 8859-1, but some symbologies may differ.

Critically, the message that is received by the host is simply the concatenation of AIM Symbology Identifier with the data character bytes obtained from the decode process. The host has zero knowledge of the internal, symbology-specific encodation of the barcode message data such as whether it used ASCII Mode encodation, Byte Mode encodation or some mixture of modes.

Relating this to the issue at hand, a Kanji Mode compaction segment cannot be selectively rendered as Kanji characters (overriding the default interpretation) because according to the standard transfer protocol the mode indication is not even present in the data received by the host. This is not a limitation of the interface, but rather the intent of the symbology designers — the data compaction strategy does not fall within the purview of the receiver of the message.

ECI describes an "Extended Channel", indicated unambiguously via an AIM Symbology Identifier modifier, which provides an escape mechanism that allows "indicators" to be placed in the byte-based character data that provide the user with the opportunity to signal a change of character set interpretation (amongst other things) to be honoured by the host. Note that ECI does not provide access to the internal encodation (such as compaction modes, use of non-data characters) of the data carried within the barcode symbol. It is well-specified and ubiquitous in the sense that every barcode specification since PDF417 includes a description of the transmission protocol for the Extended Channel. Nevertheless it is yet to receive universal adoption within barcode readers. Further information about ECI (endorsed by the AIM Technical Symbology Committee) is provided here: https://www.linkedin.com/pulse/enhanced-channel-interpretation-terry-burton/
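As a rough illustration of the two transmission protocols, the following Python sketch builds the byte stream that a reader would hand to the host. The function names are hypothetical. The symbology identifiers `]Q1` (ECI protocol not implemented) and `]Q2` (ECI protocol implemented), the six-digit `\NNNNNN` escape and the doubling of backslash data bytes are stated from my recollection of the transmission rules and should be checked against the standard before relying on them.

```python
def transmit_basic_channel(data: bytes) -> bytes:
    """Basic Channel: symbology identifier followed by the raw data bytes."""
    return b"]Q1" + data


def transmit_extended_channel(parts) -> bytes:
    """Extended Channel: ECI indicators are escaped into the byte stream.

    `parts` is a list of ("eci", int) and ("data", bytes) tuples in message
    order (a hypothetical structure used only for this sketch).
    """
    out = bytearray(b"]Q2")
    for kind, value in parts:
        if kind == "eci":
            out += b"\\%06d" % value                  # e.g. \000026
        else:
            out += value.replace(b"\\", b"\\\\")     # double any data backslash
    return bytes(out)


print(transmit_basic_channel(b"ABC"))
print(transmit_extended_channel([("eci", 26), ("data", b"A\\B")]))
```

Neither function receives any information about encodation modes, which mirrors the point made above: mode selection is not part of what the host is given.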

Modern decoding libraries are typically executed on devices that host both an integrated camera and the end user application ("host"), so the "wire" over which the transmission protocol is performed is nonexistent. Peeking over the fence at the internal encoding of a symbol and using this to trigger changes of character set interpretation is both fundamentally wrong and a layering violation. Doing so creates ambiguities when decoding certain types of data.

When trying to understand a symbology author's intent, you should consider that they will have designed the barcode so that it can be read out over the hypothetical byte-mode interface described above, with the transmission either using Basic Channel (default transmission protocol) or Extended Channel (ECI transmission protocol).

The lack of universal support for ECI by decoders presents challenges for implementers. Within closed systems, and only in the absence of ECI, in order to exchange non-Latin messages using barcodes it may be necessary to mutually agree a non-standard character set interpretation (other than ISO/IEC 8859-1) for the entire message. However, when ECI is indicated the decoder must interpret all parts of the message (irrespective of the internal encodation) using the character set that is indicated by ECI — without exception.
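The rule in the paragraph above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the names `ECI_CODECS` and `interpret` are hypothetical, only the three ECI values mentioned in this post are mapped, and a single ECI is assumed to apply to the whole message (a real Extended Channel stream may switch interpretation part-way through).

```python
# ECI number -> Python codec name, for the ECIs mentioned in this post only.
ECI_CODECS = {3: "iso-8859-1", 20: "shift_jis", 26: "utf-8"}


def interpret(data: bytes, eci=None, agreed_codec=None) -> str:
    """Interpret the complete byte sequence using exactly one character set.

    - ECI present: the ECI decides, without exception.
    - ECI absent: the default is ISO/IEC 8859-1, unless a closed system has
      mutually agreed on something else (`agreed_codec`).
    Encodation modes are deliberately not an input to this function.
    """
    if eci is not None:
        codec = ECI_CODECS[eci]  # KeyError here stands in for "unsupported ECI"
    else:
        codec = agreed_codec or "iso-8859-1"
    return data.decode(codec)


print(interpret(bytes.fromhex("e79c8b"), eci=26))
print(interpret(bytes.fromhex("935f"), eci=20))
print(interpret(bytes.fromhex("e79c8b"), agreed_codec="utf-8"))
```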

It is entirely inappropriate for solution providers and decoder developers to reinvent the QR Code transfer protocol by electing to decode certain message segments using a character set other than that of the overall message (in the case of the Basic Channel) or the currently indicated character set (in the case of the Extended Channel) whilst claiming to conform to the ISO/IEC 18004 standard.

Furthermore, if an ECI-enabled symbol is encountered and the scanner device does not support the Extended Channel transfer protocol, then the device must fail to scan and deliver no data to the host. Any data read from an ECI-enabled symbol that is delivered by a device that lacks support for the Extended Channel risks being invalid since proper ECI escaping will not have been performed during data transfer.

ISO/IEC 18004 specification issues

The data capacities presented in Section 5.1.e and Table 7 for what is described as Kanji character inputs do not take into account the requisite ECI \000020 indicator that is necessary for the Shift JIS double bytes to be interpreted as Kanji characters.

The values are unchanged from versions of the specification prior to QR Code 2005, when the default character encoding was indeed Shift JIS and therefore no ECI indicator was required.
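The effect on capacity can be estimated with a short Python sketch. It assumes a 4-bit mode indicator, an 8-bit Kanji character count (symbol versions 1 to 9), 13 bits per double byte, and 12 bits for the ECI \000020 header (4-bit ECI mode indicator plus an 8-bit designator). The function name and the figure of 152 data bits for a version 1-L symbol are my own assumptions, quoted from memory of the capacity tables.

```python
def kanji_capacity(data_bits: int, count_bits: int = 8, with_eci: bool = False) -> int:
    """Largest number of 13-bit double bytes that fit in `data_bits`."""
    overhead = 4 + count_bits          # Kanji mode indicator + character count
    if with_eci:
        overhead += 4 + 8              # ECI mode indicator + one-byte designator
    return (data_bits - overhead) // 13


V1_L_DATA_BITS = 152  # assumption: 19 data codewords in a version 1-L symbol

print(kanji_capacity(V1_L_DATA_BITS))                 # without the ECI header
print(kanji_capacity(V1_L_DATA_BITS, with_eci=True))  # with ECI \000020
```

Under these assumptions the ECI header costs one double byte of capacity in the smallest symbol. The exact figures for other versions would need to be recomputed against the published tables.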

Additionally, whilst the use of Kanji Mode compaction is permitted in Micro QR Code symbol versions M3 and M4, ECI is not permitted, with the default interpretation set as ISO/IEC 8859-1. Therefore it is only practical to encode Kanji characters in M3 and M4 symbols within closed systems in which the character set interpretation is defined to be Shift JIS by mutual agreement between trade partners.

Proposed clarification to Section 5.1 "Basic characteristics"

Consider Clause 5.1 b) "Encodable character set":

Kanji characters. Kanji characters in QR Code can be compacted into 13 bits.

For clarity, a note should be added referring to section 7.3.2 (presented above). Whilst it is true that a QR Code symbol can efficiently encode Kanji characters having double bytes in Shift JIS using Kanji Mode compaction, the message must be transferred under ECI \000020 in order to actually express to the host that such bytes are to be interpreted as Shift JIS. Otherwise the Kanji Mode compacted bytes will have regular ISO/IEC 8859-1 interpretation, just as with any other data.

Proposed clarification to Section 7.4.7 "Kanji Mode"

To retain generality with respect to byte-based data that is not of Kanji character origin, throughout this section it should not refer to "[Kanji] input characters" but rather to "double-byte inputs having values within the Shift JIS table given in Figure H.1" (that may or may not represent an actual Kanji input character).

Example language changes:

"The number of input double bytes (having values within the Shift JIS tables given in Figure H.1) is converted to its binary equivalent and added as the character count indicator after the mode indicator and before the binary data sequence."
"For double-byte inputs with values from 8140 to 9FFC within the Shift JIS table:"

Within the worked examples, the first line should be changed to indicate that a double byte might not be intended to represent a Kanji character:

Input double byte / Kanji character:    93 5F / "点"       E4 AA / "茗"
(Shift JIS value):                          935F              E4AA
...

The introduction and key to the length formula should be changed similarly to refer to double bytes / byte pairs:

For any number of input byte pairs the length of the bit stream...
...
    D = number of input byte pairs

Worked example

A sender intends to send the following message using a QR Code: TRANS202404110011看看看16

Aside: Since this message contains the Chinese character 看 it cannot be encoded within a regular (non-ECI) QR Code which specifies ISO/IEC 8859-1 as the default character encoding. Therefore an alternative character encoding must be chosen (in this case we pick UTF-8), and this choice must be communicated either via an ECI \000026 indicator in the symbol itself (which requires that the recipient has an ECI-compliant barcode reader) or, as is common, by a mutual agreement between sender and recipient to use UTF-8 character interpretation instead of standards-compliant ISO/IEC 8859-1.

The message has the following encoding in UTF-8 (expressed as hex bytes, with spaces for effect):

54 52 41 4e 53 32 30 32 34 30 34 31 31 30 30 31 31   e7 9c 8b   e7 9c 8b   e7 9c 8b   31 36

Note: It should be clear from the above that the UTF-8 encoding of the character "看" is the three-byte sequence { e7 9c 8b }.

An optimising encoder correctly determines that the shortest way to represent these values in QR Code is to use the following mode segments:

ECI Indicator: "\000026" (Required unless UTF-8 is agreed between sender and recipient)
  0111          => ECI
  00011010      => \000026

Byte Mode:  { 54 52 41 4e 53 }  "TRANS"
  0100          => Byte Mode (8 bits into 1 byte)
  00000101      => Length = 5
  01010100      => "T"
  01010010      => "R"
  01000001      => "A"
  01001110      => "N"
  01010011      => "S"

Numeric Mode:  { 32 30 32 34 30 34 31 31 30 30 31 31 }  "202404110011"
  0001          => Numeric Mode (10 bits into 3 bytes)
  0000001100    => Length = 12
  0011001010    => "202"
  0110010100    => "404"
  0001101110    => "110"
  0000001011    => "011"

Kanji Mode:  { e7 9c 8b   e7 9c 8b   e7 9c }  "看看" + first two bytes of final 看
  1000          => Kanji Mode (13 bits into 2 bytes)
  00000100      => Length = 4 (double bytes)
  1110011011100 => { e7 9c } i.e. 0xE79C (- 0xC140) => 0x265C => 26*C0 + 5C = 0x1CDC
  0100000100111 => { 8b e7 } i.e. 0x8BE7 (- 0x8140) => 0x0AA7 => 0A*C0 + A7 = 0x0827
  1010010001011 => { 9c 8b } i.e. 0x9C8B (- 0x8140) => 0x1B4B => 1B*C0 + 4B = 0x148B
  1110011011100 => { e7 9c } i.e. 0xE79C (- 0xC140) => 0x265C => 26*C0 + 5C = 0x1CDC

Byte Mode:  { 8b   31 36 }  Last byte of final 看 + "16"
  0100          => Byte Mode (8 bits into 1 byte)
  00000011      => Length = 3
  10001011      => { 8b }
  00110001      => "1"
  00110110      => "6"

Kanji Mode has been selected by the optimiser simply because it compacts a subset of the bytes ({ e7 9c 8b e7 9c 8b e7 9c }) more efficiently than any other mode. What message characters those bytes represent is entirely irrelevant for the purpose of the bitstream encodation process. (They happen to encode parts of the three-byte Chinese characters, and have nothing to do with two-byte Kanji characters.)
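The arithmetic in the Kanji Mode segment above can be reproduced with a small Python sketch. The function name `kanji_codeword` is hypothetical. The subtraction constants and the multiplication by C0 are those shown in the listing.

```python
def kanji_codeword(lead: int, trail: int) -> int:
    """Compact one double byte into a 13-bit Kanji Mode codeword."""
    if not (0x40 <= trail <= 0xFC) or trail == 0x7F:
        raise ValueError("trailer byte cannot be compacted in Kanji Mode")
    value = (lead << 8) | trail
    if 0x8140 <= value <= 0x9FFC:
        value -= 0x8140
    elif 0xE040 <= value <= 0xEBBF:
        value -= 0xC140
    else:
        raise ValueError("byte pair cannot be compacted in Kanji Mode")
    # Most significant byte times C0, plus least significant byte.
    return (value >> 8) * 0xC0 + (value & 0xFF)


data = bytes.fromhex("e79c8be79c8be79c")
for i in range(0, len(data), 2):
    codeword = kanji_codeword(data[i], data[i + 1])
    print(f"{data[i]:02x} {data[i + 1]:02x} => 0x{codeword:04X} => {codeword:013b}")
```

The input is just eight bytes. Nothing in the computation asks which character set they came from.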

These are correct renditions of QR Code symbols, encoded as described above (shown with and without a UTF-8 ECI indicator):

[Images: two QR Code symbols. Left: with ECI \000026. Right: without ECI.]

(The rightmost symbol is only suitable where UTF-8 encoding is mutually agreed between sender and recipient.)

A correctly functioning scanner will concatenate the bytes decoded from each mode segment to produce an overall byte sequence:

54 52 41 4e 53 32 30 32 34 30 34 31 31 30 30 31 31   e7 9c 8b   e7 9c 8b   e7 9c 8b   31 36

This overall byte sequence is then interpreted using the character set encoding that is in effect (i.e. UTF-8) to recover the intended message:

TRANS202404110011看看看16
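The decode path described above can be sketched in Python as follows. This is an illustrative de-compactor, not a full decoder: it assumes the character count widths for symbol versions 1 to 9 (10 bits for Numeric, 8 bits for Byte and Kanji), handles only the modes used in this example and only a one-byte ECI designator, and takes the Kanji Mode character count as 4 (the number of double bytes). The names `BITS` and `decompact` are hypothetical.

```python
from typing import Optional, Tuple

# Bit stream of the ECI variant of the example, ended by a 0000 terminator.
BITS = "".join("""
0111 00011010
0100 00000101 01010100 01010010 01000001 01001110 01010011
0001 0000001100 0011001010 0110010100 0001101110 0000001011
1000 00000100 1110011011100 0100000100111 1010010001011 1110011011100
0100 00000011 10001011 00110001 00110110
0000
""".split())


def decompact(bits: str) -> Tuple[Optional[int], bytes]:
    """Return (ECI number or None, concatenated message bytes)."""
    pos = 0
    eci = None
    out = bytearray()

    def take(n: int) -> int:
        nonlocal pos
        value = int(bits[pos:pos + n], 2)
        pos += n
        return value

    while pos + 4 <= len(bits):
        mode = take(4)
        if mode == 0b0000:            # Terminator
            break
        elif mode == 0b0111:          # ECI (one-byte designator only)
            eci = take(8)
        elif mode == 0b0100:          # Byte Mode
            count = take(8)
            out += bytes(take(8) for _ in range(count))
        elif mode == 0b0001:          # Numeric Mode
            count = take(10)
            while count >= 3:
                out += b"%03d" % take(10)
                count -= 3
            if count == 2:
                out += b"%02d" % take(7)
            elif count == 1:
                out += b"%d" % take(4)
        elif mode == 0b1000:          # Kanji Mode: yields bytes, not characters
            for _ in range(take(8)):
                codeword = take(13)
                value = ((codeword // 0xC0) << 8) | (codeword % 0xC0)
                value += 0x8140 if value < 0x1F00 else 0xC140
                out += value.to_bytes(2, "big")
        else:
            raise ValueError(f"mode {mode:04b} not handled in this sketch")
    return eci, bytes(out)


eci, data = decompact(BITS)
# The interpretation is applied once, to the whole message (ECI 26 = UTF-8).
print(eci, data.decode("utf-8"))
```

Every branch appends plain bytes to the same buffer. The character set is applied only after the loop has finished, which is the layering the standard intends.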

Malfunctioning scanners are observed to erroneously interpret each decoded mode segment separately, and then concatenate the individually interpreted messages. When they encounter the Kanji mode segment they wrongly interpret the four compacted pairs of bytes according to the Shift JIS character encoding (E79C => 逵, 8BE7 => 狗, 9C8B => 恚, E79C => 逵). Recall that Kanji Mode is only a means of data compaction and has nothing whatsoever to do with the data's interpretation.

An incorrect extraction may look something like this:

TRANS202404110011逵狗恚逵�16

Note: "�" represents the byte { 8b }, which is not valid on its own in UTF-8.
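For contrast, the defect can be mimicked in a few lines of Python. This is a sketch of the faulty behaviour as described above, not code taken from any particular scanner. The segment list and the function name `decode_per_segment` are hypothetical, and Python's `shift_jis` codec stands in for whatever Shift JIS table a given product uses.

```python
# De-compacted bytes of each mode segment from the worked example.
SEGMENTS = [
    ("byte", b"TRANS"),
    ("numeric", b"202404110011"),
    ("kanji", bytes.fromhex("e79c8be79c8be79c")),
    ("byte", bytes.fromhex("8b3136")),
]


def decode_per_segment(segments, charset: str = "utf-8") -> str:
    """DEFECTIVE on purpose: lets the encodation mode pick the character set."""
    parts = []
    for mode, data in segments:
        codec = "shift_jis" if mode == "kanji" else charset
        parts.append(data.decode(codec, errors="replace"))
    return "".join(parts)


intended = b"".join(data for _, data in SEGMENTS).decode("utf-8")
wrong = decode_per_segment(SEGMENTS)

print(intended)
print(wrong)
print(wrong == intended)
```

The stranded byte { 8b } at the start of the final Byte Mode segment can no longer be joined to its two leading bytes, so it decodes to the replacement character.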

Incidentally, if a scanner returns the following message when scanning the non-ECI symbol then it is likely that it is interpreting the encoded byte sequence using ISO/IEC 8859-1:

TRANS202404110011ç��ç��ç��16

Note: "�" represents the bytes { 9c } and { 8b }, which have no printable character assigned in ISO/IEC 8859-1.

This decoding is entirely correct according to the standard, but is not the original intended message. Hence the need for ECI or mutually agreed interpretation as UTF-8 between sender and recipient.

However, if a scanner returns the above message when scanning the ECI \000026 symbol then it is defective: It does not support ECI (since it is erroneously decoding the message as ISO/IEC 8859-1 rather than as the ECI-indicated UTF-8) and therefore must not return any scan result at all when encountering an ECI-enabled symbol.

terryburton · 2024-06-18T16:32:20Z

terryburton
Jun 18, 2024
Maintainer Author

Incorrect behaviour (conflates Kanji Mode with Shift JIS interpretation):

ZXing: Add decoding hint DecodeHintType.QR_ASSUME_SPEC_CONFORM_INPUT zxing/zxing#1498, generate QRCode with Chinese character is not right metafloor/bwip-js#335 (comment)
zxing-cpp for non-ECI symbols: Tested version 2.2.0 on 2024-09-20.
Google ML Kit barcode scanning API (used by recent Android and iPhone native camera apps): Tested version 3.2.0 and 6.0.0 on both platforms on 2024-06-19.
Scandit SDK. Tested version 6.26.0 on 2024-09-20.
REA CodeScan. Tested version 4.5.2.356369 on 2024-09-20.

Correct behaviour (does not conflate Kanji Mode with Shift JIS interpretation):

zxing-cpp when ECI is indicated. Tested version 2.2.0 on 2024-09-20.
Zebra DS6707 scanner: generate QRCode with Chinese character is not right metafloor/bwip-js#335 (comment)
Alipay: generate QRCode with Chinese character is not right metafloor/bwip-js#335 (comment)
Cognex Mobile Barcode SDK. Tested version 2.7.0 on 2024-09-20.

Related issues:

Lynx D432: Hard lockup when encountering bits that decode to an invalid Shift JIS value, e.g. 0000000111111 -> 817F
