generate QRCode with Chinese character is not right #335

Closed
xinglie opened this issue May 9, 2024 · 23 comments

Comments

@xinglie

xinglie commented May 9, 2024

Reproduction steps

  1. Open http://bwip-js.metafloor.com/demo/demo.html
  2. For Barcode Type, choose QR Code
  3. In Bar Text, enter TRANS202404110011看16
  4. Press the Show Barcode button
  5. Scan the QR Code image
  6. The scan result is TRANS202404110011看q6, which is NOT the same as the input TRANS202404110011看16

I also tried setting eclevel=H, eclevel=M, eclevel=L, and eclevel=Q in Options, but all of them failed.

@metafloor
Owner

Interesting. The utf-8 bytes are correctly preserved, but it appears the transition from one encoding mode to another is adding a bit: 'q' is ASCII 0x71, whereas '1' is 0x31. This will take some effort to track down. Probably won't have time until the weekend.

@metafloor
Owner

@terryburton: Terry, could you take a look at this? I used both ghostscript/BWIPP and bwip-js to generate barcodes, and their output matches. This is what zxing says about the barcodes:

image

Note that the scanned text does not match the OP's reported result, but also doesn't match the input text.

The input text used was: TRANS202404110011看16

For reference, the 16-bit character is U+770B (UTF-8 \xe7\x9c\x8b).

The resulting scanned (zxing) text is: TRANS202404110011逵客6

The UTF-16 sequence is U+9035 U+5BA2 (UTF-8 \xe9\x80\xb5 \xe5\xae\xa2).
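
For anyone following along, the byte values quoted above can be double-checked with a few lines of JavaScript. This is only an illustrative sketch using the standard `TextEncoder` API; it is not part of bwip-js, and the helper name `utf8Hex` is made up for this example.

```javascript
// Hypothetical helper: UTF-8 encode a string and show each byte as \xNN.
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str))
    .map((b) => '\\x' + b.toString(16).padStart(2, '0'))
    .join('');
}

console.log(utf8Hex('\u770b'));       // the input character U+770B
console.log(utf8Hex('\u9035\u5ba2')); // the two characters reported by zxing
```

The first call should print the three bytes of the input character and the second the six bytes of the mis-decoded text, matching the sequences listed above.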

Thanks!

@terryburton

@metafloor I'll investigate. Thanks.

@terryburton

terryburton commented Jun 12, 2024

The decoders are likely affected by a flaw that stems from a misunderstanding of the difference between symbology compaction modes ("encodation modes") and character encodings. And (to a lesser extent) they are ignorant of the fact that the default interpretation of QR Code data must be Latin-1, not UTF-8, unless ECI is used to specify otherwise, but this is a distraction from the main issue.

Internally a QR Code message is a sequence of bytes, encoded compactly into a bitstream based on the fact that some byte sequences are more likely to be encountered than others, which, when read out of the resulting symbol, will result in the same sequence of bytes. The barcode symbology concerns itself with the durability of byte values throughout an encode/decode cycle. It does not concern itself with character set encodings other than to say that the message character encoding should be considered as Latin-1 for compatibility between parties. (See my note at the end for extra detail concerning ECI, which is not at play here.)

The QR Code encoder has been passed input byte data: TRANS202404110011 \347 \234 \213 16 (Here \nnn is octal notation and other characters have their regular Latin-1 ordinal values. Ignore spaces added for clarity.)

The encoder opportunistically represents these byte values as succinctly as possible using a concatenation of multiple encodation segments:

  • Byte Mode: "TRANS"
  • Numeric Mode: "202404110011"
  • Kanji Mode: "\347 \234 \213 1" (Octal; ignore spaces)
  • Numeric Mode: "6"

Note that the penultimate "1" character has been included in the Kanji Mode encoding segment. This is fine. [ Edit: It turned out that the actual issue is that [\213 "1"] is an invalid Kanji Mode byte pairing, since the transformation function requires the second byte to be at least 0x40 (i.e. at least one of its two most-significant bits must be set). See this comment. But the following discussion is generally correct. ] Again, the encoder and decoder are blind to the interpretation of those bytes in accordance with any particular character set. (With certain exceptions, the barcode symbol is a "mere data channel".) The process at play here is data compaction, not character set encoding.

Each segment is written to the bitstream (before being padded, terminated, chunked, and having ECC data added, which forms the basis for the graphical rendition). For a 25x25 symbol, the resulting bitstream is:

0100          => Byte Mode (8 bits into 1 byte)
00000101      => Length = 5
01010100      => "T"
01010010      => "R"
01000001      => "A"
01001110      => "N"
01010011      => "S"
0001          => Numeric Mode (10 bits into 3 bytes)
0000001100    => Length = 12
0011001010    => "202"
0110010100    => "404"
0001101110    => "110"
0000001011    => "011"
1000          => Kanji Mode (13 bits into 2 bytes)
00000010      => Length = 2
1110011011100 => "\347 \234" i.e. 0xE79C (- 0xC140) => 0x265C => 26*C0 + 5C = 0x1CDC
0011110110001 => "\213 1"    i.e. 0x8B31 (- 0x8140) => 0x09F1 => 09*C0 + F1 = 0x07B1  [ Edit: Encoding error! ]
0001          => Numeric Mode
0000000001    => Length = 1
0110          => "6"
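
The listing above can be reproduced with a short script. What follows is a rough sketch, not BWIPP's or bwip-js's actual code: the function names (`bits`, `kanjiPack`, `byteSegment`, `numericSegment`, `kanjiSegment`) are invented for illustration, and the character count widths (8 bits for Byte and Kanji Mode, 10 bits for Numeric Mode) are assumed to be those of a small symbol such as the 25x25 one discussed here.

```javascript
// Render a non-negative integer as a fixed-width binary string.
const bits = (value, width) => value.toString(2).padStart(width, '0');

// Kanji Mode transformation of one byte pair into a 13-bit value.
function kanjiPack(hi, lo) {
  const word = (hi << 8) | lo;
  const diff = word - (word >= 0xE040 ? 0xC140 : 0x8140);
  return (diff >> 8) * 0xC0 + (diff & 0xFF);
}

// Byte Mode: indicator 0100, 8-bit count, then 8 bits per byte.
function byteSegment(text) {
  const body = Array.from(text, (ch) => bits(ch.charCodeAt(0), 8));
  return ['0100', bits(text.length, 8), ...body];
}

// Numeric Mode: indicator 0001, 10-bit count, then 10/7/4 bits
// for each group of 3/2/1 digits.
function numericSegment(digits) {
  const out = ['0001', bits(digits.length, 10)];
  for (let i = 0; i < digits.length; i += 3) {
    const group = digits.slice(i, i + 3);
    out.push(bits(Number(group), [0, 4, 7, 10][group.length]));
  }
  return out;
}

// Kanji Mode: indicator 1000, 8-bit pair count, then 13 bits per pair.
function kanjiSegment(pairs) {
  const body = pairs.map(([hi, lo]) => bits(kanjiPack(hi, lo), 13));
  return ['1000', bits(pairs.length, 8), ...body];
}

const stream = [
  ...byteSegment('TRANS'),
  ...numericSegment('202404110011'),
  ...kanjiSegment([[0xE7, 0x9C], [0x8B, 0x31]]), // second pair is the faulty one
  ...numericSegment('6'),
];
console.log(stream.join('\n'));
```

Each printed line should correspond to one field of the bitstream listing, in the same order, before padding, termination and error correction are applied.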

The role of the decoder is to undo the internal encoding (i.e. de-compact the bitstream) to recover the original data bytes... for subsequent interpretation and transmission. Only once the data bytes are recovered should the overall message be interpreted: By default in accordance with Latin-1, unless ECI specifies otherwise. (Or standards be damned and just interpret it as UTF-8 anyway as everyone seems to!)

In the case of the ZXing decoder, the issue appears to be that it is erroneously interpreting the Kanji Mode compacted segment in accordance with the Shift-JIS character set (noting that 0xE79C is "逵"). This appears to be a fundamental misunderstanding: Kanji Mode compaction does not imply Shift-JIS interpretation of the data in the section. It's all just binary data during symbology encoding and decoding; encodation modes (as applied to a sequence of data values) are not ECI indicators (as applied to the extracted message itself).

In the case of the OP's decoder, there is some other error at play. Nevertheless it seems that BWIPP's encoding is correct, and I would be happy for someone else to step through it manually to assure themselves of this or find my error. [ Edit: The OP's decoder was behaving correctly. Other decoders mentioned in this thread are in error. ]

@metafloor
Owner

@terryburton : Thank you for the detailed investigation!

@xinglie : What scanner / software are you using to decode? Is it possible to ask the vendor to investigate this issue?

@xinglie
Author

xinglie commented Jun 13, 2024

@terryburton Thank you !
@metafloor I use the iPhone Camera app to scan the QR code, and the scan result is still not right.

image

Tapping the yellow area in the image above makes the iPhone Camera app open Safari to search:

image

The image above shows that the iPhone Camera app reports the QR code result as TRANS202404110011逵客6.

When I use Alipay to scan the same QR code, Alipay reports the result as TRANS202404110011看q6.

So interesting.

@metafloor
Owner

Decided to try your example text on a few different generators. I have a Symbol DS6707 handheld 2D scanner (mfg date May 2010 - so old firmware) that I use for testing.

Site: https://bwipjs-api.metafloor.com/?bcid=qrcode&text=TRANS202404110011%E7%9C%8B16&eclevel=L&padding=8&backgroundcolor=fff&scale=4

image

  • The DS6707 scanner returned the same result as Alipay: TRANS202404110011看q6.
  • zxing returned TRANS202404110011逵客6.

Site: https://www.qr-code-generator.com/

image

  • The DS6707 couldn't scan the barcode. Not sure what is going on there!
  • zxing returned the correct text TRANS202404110011看16.

Site: https://qrcode.tec-it.com/en

image

  • The DS6707 returned the correct text.
  • zxing returned the correct text.

I also tried the npm package qrcode-generator using the following code:

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>npm qrcode-generator</title>
<script type="text/javascript" src="node_modules/qrcode-generator/qrcode.js"></script>
</head><body>
<div id="qrcode"></div>
<script type="text/javascript">
let qr = qrcode(0, 'L');
qr.addData('TRANS202404110011\xe7\x9c\x8b16');
qr.make();
document.getElementById('qrcode').innerHTML = qr.createImgTag(6, 20);
</script>
</body>
</html>

The result was:

image

  • The DS6707 returned the correct text.
  • zxing returned the correct text.

All of the barcodes were generated with error correction level L for consistency.

@terryburton : Does any of this make sense?

@terryburton

Take a look at the symbols using the following (legacy) app which lists the modes used: https://play.google.com/store/apps/details?id=de.stefanarnhold.RawCodeScan

You can see in it that neither of the other encoders spots the opportunity to use Kanji Mode.

The sad fact is that if decoders can't be trusted to handle Kanji Mode reliably then we might also choose to avoid it. But that would be sad...

Given that the DS6707 also chokes on the most basic of all encodings (pure Byte Mode) I'm not going to give it much credibility!

@metafloor
Owner

metafloor commented Jun 13, 2024

@terryburton : The app you linked is too old to install on my phone, but I was able to get a version of the quagga.js library to work. Just to document what the above barcodes contain, here are the parse results of each:

// bwip-js:
[
  { type: 'byte', bytes: [ 84, 82, 65, 78, 83 ], text: 'TRANS' },
  { type: 'numeric', text: '202404110011' },
  { type: 'kanji', bytes: [ 231, 156, 139, 113 ], text: '逵客' },
  { type: 'numeric', text: '6' }
]

This decode behavior matches that of zxing. Edit: Actually it doesn't. It's a bad utf-8 interpretation bug in zxing and quagga. The result matches Alipay.

// qr-code-generator.com
[
  { type: 'eci', assignmentNumber: 26 },
  {
    type: 'byte',
    bytes: [
      84, 82, 65,  78,  83,  50, 48,
      50, 52, 48,  52,  49,  49, 48,
      48, 49, 49, 231, 156, 139, 49,
      54
    ],
    text: 'TRANS202404110011看16'
  }
]

I suspect the eci caused the handheld scanner to refuse to play.

// tec-it.com
[
  { type: 'alphanumeric', text: 'TRANS' },
  { type: 'numeric', text: '202404110011' },
  { type: 'byte', bytes: [ 231, 156, 139, 49, 54 ], text: '看16' }
]
// qrcode-generator npm package
[
  {
    type: 'byte',
    bytes: [
      84, 82, 65,  78,  83,  50, 48,
      50, 52, 48,  52,  49,  49, 48,
      48, 49, 49, 231, 156, 139, 49,
      54
    ],
    text: 'TRANS202404110011看16'
  }
]

@metafloor
Owner

metafloor commented Jun 13, 2024

Edit: Please ignore this. Mixing up UTF-16 and Shift JIS double-byte characters.

@terryburton : Possibly beginning to understand the decoding problem. According to this document:

Kanji mode can only encode double-byte Shift JIS characters whose bytes are in the ranges 0x8140 to 0x9FFC and 0xE040 to 0xEBBF (hexadecimal). The characters in this set can be found on Rikai's Shift JIS Kanji Code Table (http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml).

The OP's example text contains the code point U+770B, which is not part of the Shift JIS code table. Could that be the issue?

@terryburton

Actual characters are irrelevant, however an issue is that not every codepoint value within the Shift-JIS ranges 0x8140-0x9FFC and 0xE040-0xEBBF is occupied. There are gaps in the low order bytes.

BWIPP does not take this into account when determining whether a sequence is amenable to Kanji Mode compaction, so the transformation results in clashes, e.g.

"\213 1" = 8B31 => 0x07B1

But also:

"\213 q" = 8B71 => 0x07B1

This explains what @xinglie is observing.
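
To make the clash concrete, here is a small sketch of the Kanji Mode transformation and its inverse as described in this thread. The function names `kanjiPack` and `kanjiUnpack` are invented for illustration; this is not code from BWIPP or from any particular decoder.

```javascript
// Encoder side: map a byte pair to its 13-bit Kanji Mode value.
function kanjiPack(hi, lo) {
  const word = (hi << 8) | lo;
  const diff = word - (word >= 0xE040 ? 0xC140 : 0x8140);
  return (diff >> 8) * 0xC0 + (diff & 0xFF);
}

// Decoder side: map a 13-bit value back to a byte pair.
function kanjiUnpack(value) {
  const diff = (Math.floor(value / 0xC0) << 8) | (value % 0xC0);
  const word = diff + (diff >= 0x1F00 ? 0xC140 : 0x8140);
  return [word >> 8, word & 0xFF];
}

const hex = (n) => '0x' + n.toString(16).toUpperCase().padStart(4, '0');

console.log(hex(kanjiPack(0x8B, 0x31))); // "\213 1", what the encoder was given
console.log(hex(kanjiPack(0x8B, 0x71))); // "\213 q", a different pair
console.log(kanjiUnpack(kanjiPack(0x8B, 0x31)).map((b) => b.toString(16)));
```

Both pairs pack to the same 13-bit value, so a decoder that inverts the transformation can only ever hand back one of them, and it hands back the "\213 q" pair. That is why a "1" goes in and a "q" comes out.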

I'll push a fix tomorrow.

@terryburton

terryburton commented Jun 14, 2024

The OP's issue has been fixed with bwipp/postscriptbarcode@8fcbe7c

This ensures that the low byte of a "Kanji Mode pair" is always >= 64, since this avoids overlap in the transformation function. The standard doesn't mention this requirement.
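
As a rough illustration of the kind of check involved (the real fix is in BWIPP's PostScript, so this JavaScript is only a sketch and the name `isKanjiPair` is hypothetical), a byte pair could be tested for Kanji Mode eligibility like this, using the ranges quoted earlier in the thread:

```javascript
// Hypothetical predicate: may this byte pair be compacted in Kanji Mode?
function isKanjiPair(hi, lo) {
  const word = (hi << 8) | lo;
  const inRange =
    (word >= 0x8140 && word <= 0x9FFC) ||
    (word >= 0xE040 && word <= 0xEBBF);
  // The extra condition: a low byte below 0x40 (64) would collide
  // with another pair after the transformation.
  return inRange && lo >= 0x40;
}

console.log(isKanjiPair(0xE7, 0x9C)); // "\347 \234"
console.log(isKanjiPair(0x8B, 0x31)); // "\213 1"
```

With the low-byte condition in place, the "\347 \234" pair is still accepted, while the "\213 1" pair is rejected and has to be encoded some other way (e.g. Byte Mode).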

However, exercise caution because from the testing perspective the patch will appear to fix the issue for the wrong underlying reason: It causes Kanji Mode to be avoided entirely for the specific example. (The Kanji Mode run would now be too short to overcome the mode switching cost.)

Therefore, for A/B testing (old code versus new code behaviour) please use the following input that will invoke Kanji Mode:

TRANS202404110011看看看16

At least with this you will know that the issue is fixed (or not) for the right reason.

As an aside, I note that the first range of acceptable Kanji Mode pairs should be specified as 0x8140-0x9FFF (not 0x9FFC) as (1) this would make the overall transformation function a bijection between those ranges and 13-bit numbers, and (2) the limits really have nothing whatsoever to do with actual Shift-JIS codepoints. But I bet that some decoders hardcode limits related to the given ranges that would prevent decode of the extension. Pfft.
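
The bijection claim can be checked by brute force. The sketch below (helper name `kanjiPack` invented for this example, not taken from BWIPP) enumerates every pair in the extended ranges whose low byte is >= 0x40 and counts how many distinct 13-bit values come out.

```javascript
// Kanji Mode transformation of one byte pair into a 13-bit value.
function kanjiPack(hi, lo) {
  const word = (hi << 8) | lo;
  const diff = word - (word >= 0xE040 ? 0xC140 : 0x8140);
  return (diff >> 8) * 0xC0 + (diff & 0xFF);
}

const seen = new Set();
let pairs = 0;
// First range extended to 0x9FFF, as suggested in the aside above.
for (const [first, last] of [[0x8140, 0x9FFF], [0xE040, 0xEBBF]]) {
  for (let word = first; word <= last; word++) {
    if ((word & 0xFF) < 0x40) continue; // skip invalid low bytes
    seen.add(kanjiPack(word >> 8, word & 0xFF));
    pairs++;
  }
}
console.log(pairs, seen.size, Math.min(...seen), Math.max(...seen));
```

If the number of pairs equals the number of distinct values, and those values run from 0 to 2^13 - 1 without gaps, then the transformation is one-to-one and onto for the extended ranges.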

If this small patch is reported as working then I'll cut a release.

@metafloor
Owner

metafloor commented Jun 14, 2024

@terryburton : Unfortunately, I have not found a single qrcode reader that correctly decodes the new barcode:

image

According to quagga.js, this is how it is formatted:

[
  { type: 'byte', bytes: [ 84, 82, 65, 78, 83 ], text: 'TRANS' },
  { type: 'numeric', text: '202404110011' },
  { type: 'kanji', bytes: [ 231, 156 ], text: '逵' },
  { type: 'byte', bytes: [ 139, 49, 54 ], text: '�16' }
]

Some readers return the latin-1 character for code 139. (Correction: They're probably using the windows-1252 code page to get that character.)

It appears the de facto standard is to interpret kanji mode as Shift JIS and not as a part of a byte stream.

@xinglie
Author

xinglie commented Jun 14, 2024

@metafloor I used the Alipay app to scan the newest QR code image you uploaded, and now the Alipay scan result is correct!!

@terryburton

terryburton commented Jun 14, 2024

According to quagga.js, this is how it is formatted:

[
  { type: 'byte', bytes: [ 84, 82, 65, 78, 83 ], text: 'TRANS' },
  { type: 'numeric', text: '202404110011' },
  { type: 'kanji', bytes: [ 231, 156 ], text: '逵' },
  { type: 'byte', bytes: [ 139, 49, 54 ], text: '�16' }
]

That's a correct representation of what's in the barcode. BWIPP now does the right thing and the "1" is no longer falsely adopted into an invalid Kanji Mode byte pair [ 139 , 49 ].

Some readers return the latin-1 character for code 139. (Correction: They're probably using the windows-1252 code page to get that character.)

It appears the de facto standard is to interpret kanji mode as Shift JIS and not as a part of a byte stream.

Indeed, it seems as though certain broken decoders are essentially creating the message as the concatenation of the text components from each segment, rather than properly concatenating the bytes from each segment and then interpreting the result as Latin-1 (or something else "agreed between trading partners").
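
The difference between the two approaches can be sketched in a few lines. The segment list below is copied from the quagga.js output quoted above, with the numeric text written out as ASCII bytes. Two simplifications are assumptions of this example: the whole message is interpreted as UTF-8 (matching how the input was produced, rather than the Latin-1 default), and the "broken" path decodes every segment as UTF-8, whereas the decoders in question apply Shift JIS to the Kanji Mode segment.

```javascript
// Segments of the fixed symbol, as raw bytes.
const segments = [
  { type: 'byte', bytes: [84, 82, 65, 78, 83] },
  { type: 'numeric', bytes: Array.from('202404110011', (c) => c.charCodeAt(0)) },
  { type: 'kanji', bytes: [231, 156] },
  { type: 'byte', bytes: [139, 49, 54] },
];

const utf8 = new TextDecoder('utf-8');

// Broken approach: interpret each segment on its own, then join the text.
const perSegment = segments
  .map((s) => utf8.decode(Uint8Array.from(s.bytes)))
  .join('');

// Intended approach: join the bytes of all segments, then interpret once.
const wholeMessage = utf8.decode(
  Uint8Array.from(segments.flatMap((s) => s.bytes))
);

console.log(perSegment);
console.log(wholeMessage);
```

Only the second approach recovers the original text, because the three UTF-8 bytes of the Chinese character straddle the boundary between the Kanji Mode and Byte Mode segments.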

What to do about such fundamental misunderstandings of the protocol?

Short term: We could just add an options flag (possibly the default) that disables Kanji Mode compression. A bit defeatist...

Longer term: Improvements to the ISO standard and education.


The QR Code standard needs to be made clearer. There are several improvements that should be made to sections related to Kanji Mode (and Byte Mode) that are outright misleading. I'll see about proposing changes, but this might be best to wait behind some whitepapers/articles that I intend to write that present the overall framework / themes and gory details common to generic symbology standards. It's surprising how "vague on the details" even symbology experts can be when it comes to encoding and decoding, which makes having productive discussions difficult. If everyone has read up on the same background reading material then there's a good chance of alignment and improvements being made.

Technically the ISO standard is currently open to edits, but unfortunately is in the final draft approval phase and any changes that I propose would likely require debate. Such focused discussions are best had within the AIM Technical Symbology Committee industry group along with invited experts from Denso, with the goal of that group creating an expert submission for consideration by the ISO committee. It'll likely be a while before WG1 reopens the standard after the current round.

@metafloor
Owner

@xinglie : It's good to see that at least one scanner/reader implemented decode correctly...

@terryburton : Since it appears to be a widespread implementation bug, my initial thinking was to disable kanji mode by default. This is potentially problematic since upstream users of BWIPP may expect kanji mode's smaller encoding size, which can affect layout.

So my vote is to add a new option to disable kanji mode, off by default. The bwip-js build script makes dozens of changes to the raw postscript before transpiling to make the code easier to convert to javascript. I can simply add another awk/sed edit to toggle the flag to true before building. Since javascript is a utf-16 environment, the text will always be converted to utf-8, making binary encoding the natural choice.

@terryburton

I've created bwipp/postscriptbarcode#266 which I'll discuss with the Zint project to ensure alignment / knowledge sharing.

Feel free to close this issue if you want since the OP's issue is solved.

@metafloor
Owner

@xinglie : Please try the following with your barcode readers. It is the result with the new suppresskanjimode option enabled:

image

The text is encoded as:

[
  { type: 'alphanumeric', text: 'TRANS202404110011' },
  { type: 'byte', bytes: [ 231, 156, 139, 49, 54 ], text: '看16' }
]

@xinglie
Author

xinglie commented Jun 18, 2024

@metafloor I used iPhone Camera, Alipay and JD, and all results are correct!

@terryburton

FYI, I pushed out BWIPP 2024-06-18 with some related changes (that do not impact any examples provided here).

@terryburton

Please refer the responsible parties for any affected decoders to this write up: bwipp/postscriptbarcode#267

@metafloor
Owner

@xinglie : Please give version 4.4.0 a try.

suppresskanjimode is enabled by default. To disable (default in BWIPP), specify suppresskanjimode : false in the options object or !suppresskanjimode in the URL query string.
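
As a rough sketch of the query-string form, the helper below builds a request URL in the same shape as the bwipjs-api.metafloor.com link used earlier in this thread. The function name `qrUrl` and its `kanji` parameter are invented for illustration, and appending `!suppresskanjimode` as its own `&`-separated item is an assumption based on the description above, so please check it against the bwip-js documentation.

```javascript
// Hypothetical helper: build a QR Code request URL for the online API.
function qrUrl(text, { kanji = false } = {}) {
  const params = [
    'bcid=qrcode',
    'text=' + encodeURIComponent(text), // text is sent as UTF-8 percent-escapes
    'eclevel=L',
  ];
  // Assumed syntax: "!suppresskanjimode" switches the suppression off,
  // which re-enables Kanji Mode compaction (the BWIPP default).
  if (kanji) params.push('!suppresskanjimode');
  return 'https://bwipjs-api.metafloor.com/?' + params.join('&');
}

console.log(qrUrl('TRANS202404110011看16'));
console.log(qrUrl('TRANS202404110011看16', { kanji: true }));
```

The first URL relies on the new 4.4.0 default, and the second asks for the previous behaviour, which is mainly useful for A/B testing against decoders.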

@xinglie
Author

xinglie commented Jun 19, 2024

@metafloor I tested 4.4.0, and all scanner results are OK ~


3 participants