Clarifying Kanji Mode encodation in QR Code #267
terryburton
Incorrect behaviour (conflates Kanji Mode with Shift JIS interpretation):
Correct behaviour (does not conflate Kanji Mode with Shift JIS interpretation):
Related issues:
The following aims to clarify the intent of the QR Code symbology specification in relation to the common misuse of Kanji Mode encodation as a character set interpretation indicator. It is intended for developers of QR Code decoders and will be raised with the standards committee responsible for the ISO/IEC 18004 standard.
Decoder developers are commonly misled by some of the wording in the current specification, which results in incorrect decodes of correctly encoded symbols, especially where the encoder has legitimately applied opportunistic Kanji Mode compaction to non-Kanji character data or byte data: #267 (comment)
Note: Section numbers are based on the 18004:2023 DIS version of the document. All text referred to is equivalent to the 2005 and later versions of the standard.
Kanji Mode is for data compaction, not character set indication
The purpose of Kanji Mode is to provide a data compaction opportunity, as with any other symbology encodation mode. It conveys nothing about the character set interpretation of data encoded in such a mode. It is entirely legitimate for certain input byte sequences to be compacted using Kanji Mode encodation irrespective of whether or not they are intended to be interpreted as Kanji characters. Such byte sequences might instead originate from raw binary data, or from characters or fragments of multi-byte characters from other code pages. In those cases they are to be interpreted according to their respective character set (as signalled by ECI), and not as Shift JIS.
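To illustrate that Kanji Mode works on byte values alone, the following is a minimal Python sketch of the 13-bit packing and unpacking, based on a reading of the Kanji Mode encodation rules. The function names are hypothetical, and the trailing-byte check is a simplification of the ranges in Figure H.1; it is an illustration, not a reference implementation.

```python
def kanji_compact(pair: bytes) -> int:
    """Pack one byte pair into a 13-bit Kanji Mode value (hypothetical helper)."""
    # Simplified range check on the trailing byte (assumption, see lead-in).
    if len(pair) != 2 or not 0x40 <= pair[1] <= 0xFC:
        raise ValueError("not a compactable byte pair")
    value = int.from_bytes(pair, "big")
    if 0x8140 <= value <= 0x9FFC:
        value -= 0x8140
    elif 0xE040 <= value <= 0xEBBF:
        value -= 0xC140
    else:
        raise ValueError("byte pair is outside the Kanji Mode range")
    # Most significant byte times 0xC0, plus least significant byte.
    return (value >> 8) * 0xC0 + (value & 0xFF)


def kanji_expand(packed: int) -> bytes:
    """Recover the original byte pair from a 13-bit Kanji Mode value."""
    value = ((packed // 0xC0) << 8) | (packed % 0xC0)
    value += 0x8140 if value < 0x1F00 else 0xC140
    return value.to_bytes(2, "big")


# Eight bytes taken from the UTF-8 encoding of a Chinese string.
raw = bytes.fromhex("e79c8be79c8be79c")
pairs = [raw[i:i + 2] for i in range(0, len(raw), 2)]
packed = [kanji_compact(p) for p in pairs]

# The round trip returns the same bytes; no character set is involved.
assert b"".join(kanji_expand(v) for v in packed) == raw
```

Nothing in either function refers to a character table: the input and output are byte pairs, which is why the compaction says nothing about how those bytes should later be interpreted.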
Section 7.3.6 "Kanji mode":
Figure H.1 "Shift JIS character values":
Data compaction has nothing whatsoever to do with character set encoding ("interpretation"), which is communicated exclusively using ECI, nor does it have any bearing on the default interpretation.
Section 7.3.2 "Extended Channel Interpretation (ECI) mode":
Wider context: Reader to host interface
Barcode symbology standards are prescriptive and presuppose a hypothetical situation in which a barcode reader is connected to a host over a physical interface that can reliably transport byte values encoded within the symbol, with no additional signalling.
The barcode symbol is a carrier for a sequence of bytes. When a barcode is scanned by a reader, the transmission protocol defined by the symbology standard specifies how the content of the symbol is to be sent over the interface in the form of raw byte values, irrespective of whether the real-world interface has rich capabilities (such as an interface driver over USB) or is especially restrictive (such as a keyboard wedge). Within the symbol there may be non-data function characters and macros that determine how the codeword contents are de-compacted into bytes for transmission to the host, but these non-data characters are not transmitted.
The transmission protocol defines that each transmitted byte of message content corresponds directly to a single data character. This simple data stream is known as the “Basic Channel”. The barcode standards specify that the interpretation of all of the bytes received from the reader (i.e. the entire message) must be strictly in accordance with the symbology's default character encoding. Typically this is ISO/IEC 8859-1, but some symbologies may differ.
Critically, the message that is received by the host is simply the concatenation of the AIM Symbology Identifier with the data character bytes obtained from the decode process. The host has zero knowledge of the internal, symbology-specific encodation of the barcode message data, such as whether it used ASCII Mode encodation, Byte Mode encodation or some mixture of modes.
Relating this to the issue at hand, a Kanji Mode compaction segment cannot be selectively rendered as Kanji characters (overriding the default interpretation) because according to the standard transfer protocol the mode indication is not even present in the data received by the host. This is not a limitation of the interface, but rather the intent of the symbology designers — the data compaction strategy does not fall within the purview of the receiver of the message.
ECI describes an "Extended Channel", indicated unambiguously via an AIM Symbology Identifier modifier, which provides an escape mechanism that allows "indicators" to be placed in the byte-based character data that provide the user with the opportunity to signal a change of character set interpretation (amongst other things) to be honoured by the host. Note that ECI does not provide access to the internal encodation (such as compaction modes, use of non-data characters) of the data carried within the barcode symbol. It is well-specified and ubiquitous in the sense that every barcode specification since PDF417 includes a description of the transmission protocol for the Extended Channel. Nevertheless it is yet to receive universal adoption within barcode readers. Further information about ECI (endorsed by the AIM Technical Symbology Committee) is provided here: https://www.linkedin.com/pulse/enhanced-channel-interpretation-terry-burton/
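As a rough illustration of how a host might honour such indicators, the Python sketch below walks an Extended Channel byte stream. It assumes the escaping commonly described for the ECI protocol (an indicator is sent as a backslash followed by six digits, and a literal backslash in the data is doubled) and a small, hypothetical table mapping ECI values to Python codec names. The function name and the table are illustrative only.

```python
# Hypothetical mapping from ECI assignment numbers to Python codec names.
ECI_CHARSETS = {3: "iso-8859-1", 20: "shift_jis", 26: "utf-8"}


def interpret_extended_channel(stream: bytes) -> str:
    """Interpret an Extended Channel byte stream (illustrative sketch)."""
    charset = "iso-8859-1"  # default interpretation until an ECI says otherwise
    run = bytearray()       # bytes collected under the current interpretation
    text = []
    i = 0
    while i < len(stream):
        if stream[i] != 0x5C:
            run.append(stream[i])
            i += 1
        elif stream[i + 1:i + 2] == b"\\":
            run.append(0x5C)  # doubled backslash is a literal backslash
            i += 2
        else:
            digits = stream[i + 1:i + 7]
            if len(digits) != 6 or not digits.isdigit():
                raise ValueError("malformed ECI escape")
            eci = int(digits)
            if eci not in ECI_CHARSETS:
                raise ValueError(f"unsupported ECI {eci:06d}")
            # Close the current run, then switch interpretation.
            text.append(run.decode(charset))
            run.clear()
            charset = ECI_CHARSETS[eci]
            i += 7
    text.append(run.decode(charset))
    return "".join(text)
```

The only input is the transmitted byte stream. Compaction modes never appear in it, so they cannot influence the interpretation.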
Modern decoding libraries are typically executed on devices that host both an integrated camera and the end-user application ("host"), so the "wire" over which the transmission protocol is performed is nonexistent. Peeking over the fence at the internal encoding of a symbol and using this to trigger changes of character set interpretation is both fundamentally wrong and a layering violation. Doing so creates ambiguities when decoding certain types of data.
When trying to understand a symbology author's intent, you should consider that they will have designed the barcode so that it can be read out over the hypothetical byte-mode interface described above, with the transmission using either the Basic Channel (default transmission protocol) or the Extended Channel (ECI transmission protocol).
The lack of universal support for ECI by decoders presents challenges for implementers. Within closed systems, and only in the absence of ECI, in order to exchange non-Latin messages using barcodes it may be necessary to mutually agree a non-standard character set interpretation (other than ISO/IEC 8859-1) for the entire message. However, when ECI is indicated the decoder must interpret all parts of the message (irrespective of the internal encodation) using the character set that is indicated by ECI — without exception.
It is entirely inappropriate for solution providers and decoder developers to reinvent the QR Code transfer protocol by electing to decode certain message segments using a character set other than that of the overall message (in the case of the Basic Channel) or the currently indicated character set (in the case of the Extended Channel) whilst claiming to conform to the ISO/IEC 18004 standard.
Furthermore, if an ECI-enabled symbol is encountered and the scanner device does not support the Extended Channel transfer protocol, then the device must fail to scan and deliver no data to the host. Any data read from an ECI-enabled symbol that is delivered by a device that lacks support for the Extended Channel risks being invalid since proper ECI escaping will not have been performed during data transfer.
ISO/IEC 18004 specification issues
The data capacities presented in Section 5.1.e and Table 7 for what is described as Kanji character inputs do not take into account the requisite ECI \000020 indicator that is necessary for the Shift JIS double bytes to be interpreted as Kanji characters.
The values are unchanged from versions of the specification prior to QR Code 2005, when the default character encoding was indeed Shift JIS and therefore no ECI indicator was required.
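The effect on capacity can be checked with simple arithmetic. The Python sketch below assumes the figures for a version 1-L symbol (19 data codewords, i.e. 152 data bits, and an 8-bit Kanji Mode character count indicator) and a 12-bit cost for the ECI header (a 4-bit mode indicator plus an 8-bit designator for ECI values up to 127). The function name is hypothetical.

```python
def kanji_capacity(data_bits: int, count_bits: int, with_eci: bool) -> int:
    """Number of 13-bit Kanji Mode pairs that fit in a single segment."""
    overhead = 4 + count_bits       # Kanji Mode indicator + character count
    if with_eci:
        overhead += 4 + 8           # ECI mode indicator + 8-bit designator
    return (data_bits - overhead) // 13


DATA_BITS_1L = 19 * 8  # assumed: version 1-L carries 19 data codewords

print(kanji_capacity(DATA_BITS_1L, 8, with_eci=False))  # 10
print(kanji_capacity(DATA_BITS_1L, 8, with_eci=True))   # 9
```

Under these assumptions the published figure of 10 characters for version 1-L drops to 9 once the ECI \000020 indicator is included.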
Additionally, whilst the use of Kanji Mode compaction is permitted in Micro QR Code symbol versions M3 and M4, ECI is not permitted, with the default interpretation set as ISO/IEC 8859-1. Therefore it is only practical to encode Kanji characters in M3 and M4 symbols within closed systems in which the character set interpretation is defined to be Shift JIS by mutual agreement between trade partners.
Proposed clarification to Section 5.1 "Basic characteristics"
Consider Clause 5.1 b) "Encodable character set":
For clarity, a note should be added referring to section 7.3.2 (presented above). Whilst it is true that a QR Code symbol can efficiently encode Kanji characters having double bytes in Shift JIS using Kanji Mode compaction, the message must be transferred under ECI \000020 in order to actually express to the host that such bytes are to be interpreted as Shift JIS. Otherwise the Kanji Mode compacted bytes will have regular ISO/IEC 8859-1 interpretation, just as with any other data.
Proposed clarification to Section 7.4.7 "Kanji Mode"
To retain generality with respect to byte-based data that is not of Kanji character origin, throughout this section it should not refer to "[Kanji] input characters" but rather to "double-byte inputs having values within the Shift JIS table given in Figure H.1" (that may or may not represent an actual Kanji input character).
Example language changes:
"The number of input double bytes (having values within the Shift JIS tables given in Figure H.1) is converted to its binary equivalent and added as the character count indicator after the mode indicator and before the binary data sequence."
"For double-byte inputs with values from 8140 to 9FFC within the Shift JIS table:"
Within the worked examples, the first line should be changed to indicate that a double byte might not be intended to represent a Kanji character:
The introduction and key to the length formula should be changed similarly to refer to double bytes / byte pairs:
Worked example
A sender intends to send the following message using a QR Code:
TRANS202404110011看看看16
Aside: Since this message contains the Chinese character 看 it cannot be encoded within a regular (non-ECI) QR Code, which specifies ISO/IEC 8859-1 as the default character encoding. Therefore an alternative character encoding must be chosen (in this case we pick UTF-8), and this choice must be communicated either via an ECI \000026 indicator in the symbol itself (which requires that the recipient has an ECI-compliant barcode reader) or, as is common, by a mutual agreement between sender and recipient to use UTF-8 character interpretation instead of standards-compliant ISO/IEC 8859-1.

The message has the following encoding in UTF-8 (expressed as hex bytes, with spaces for effect):
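The byte sequence can be reproduced with a short Python sketch (the variable names are illustrative):

```python
message = "TRANS202404110011看看看16"

# Encode the whole message as UTF-8 and show it as space-separated hex bytes.
data = message.encode("utf-8")
print(data.hex(" "))

# Each "看" contributes three bytes; the ASCII characters contribute one each.
print(len(data), "bytes")
```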
Note: It should be clear from the above that the UTF-8 encoding of the character "看" is the three-byte sequence { e7 9c 8b }.

An optimising encoder correctly determines that the shortest way to represent these values in QR Code is to use the following mode segments:
Kanji Mode has been selected by the optimiser simply because it compacts a subset of the bytes ({ e7 9c 8b e7 9c 8b e7 9c }) more efficiently than any other mode. What message characters those bytes represent is entirely irrelevant for the purpose of the bitstream encodation process. (They happen to encode parts of the three-byte Chinese characters, and have nothing to do with two-byte Kanji characters.)

These are correct renditions of QR Code symbols, encoded as described above (shown with and without a UTF-8 ECI indicator):
(The rightmost symbol is only suitable where UTF-8 encoding is mutually agreed between sender and recipient.)
A correctly functioning scanner will concatenate the bytes decoded from each mode segment to produce an overall byte sequence:
This overall byte sequence is then interpreted using the character set encoding that is in effect (i.e. UTF-8) to recover the intended message:
Malfunctioning scanners are observed to erroneously interpret each decoded mode segment separately, and then concatenate the individually interpreted messages. When they encounter the Kanji Mode segment they wrongly interpret the four compacted pairs of bytes according to the Shift JIS character encoding (E79C => 逵, 8BE7 => 狗, 9C8B => 恚, E79C => 逵). Recall that Kanji Mode is only a means of data compaction and has nothing whatsoever to do with the data's interpretation.
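The difference between the two behaviours can be sketched in Python. The segmentation below is an assumption made for illustration: only the eight Kanji Mode bytes are taken from the text, while the split of the ASCII portion and the mode carrying the remaining { 8b } byte are guesses. The function names are hypothetical.

```python
message = "TRANS202404110011看看看16"

# Assumed segmentation: (mode, de-compacted bytes). Illustrative only.
segments = [
    ("alphanumeric", b"TRANS"),
    ("numeric", b"202404110011"),
    ("kanji", bytes.fromhex("e79c8be79c8be79c")),
    ("byte", bytes.fromhex("8b")),
    ("numeric", b"16"),
]


def decode_correctly(segments, charset="utf-8"):
    """Concatenate all bytes first, then interpret once with one charset."""
    return b"".join(data for _, data in segments).decode(charset)


def decode_per_segment(segments, charset="utf-8"):
    """Defective behaviour: interpret Kanji Mode segments as Shift JIS."""
    parts = []
    for mode, data in segments:
        segment_charset = "shift_jis" if mode == "kanji" else charset
        # errors="replace" shows undecodable bytes as U+FFFD.
        parts.append(data.decode(segment_charset, errors="replace"))
    return "".join(parts)


assert decode_correctly(segments) == message
assert decode_per_segment(segments) != message
print(decode_per_segment(segments))
```

The first function never looks at the mode of a segment, which is exactly the information a host on the far side of the transmission protocol would not have.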
An incorrect extraction may look something like this:
Note: "�" represents the byte { 8b } which is unused in UTF-8.

Incidentally, if a scanner returns the following message when scanning the non-ECI symbol then it is likely that it is interpreting the encoded byte sequence using ISO/IEC 8859-1:
Note: "�" represents the bytes { 9c } and { 8b } which are unused in ISO/IEC 8859-1.

This decoding is entirely correct according to the standard, but is not the original intended message. Hence the need for ECI or mutually agreed interpretation as UTF-8 between sender and recipient.
However, if a scanner returns the above message when scanning the ECI \000026 symbol then it is defective: It does not support ECI (since it is erroneously decoding the message as ISO/IEC 8859-1 rather than as the ECI-indicated UTF-8) and therefore must not return any scan result at all when encountering an ECI-enabled symbol.