generate QRCode with Chinese character is not right #335
Comments
Interesting. The utf-8 bytes are correctly preserved, but it appears the transition from one encoding mode to another is adding a bit: 'q' is ASCII 0x71, whereas '1' is 0x31. This will take some effort to track down. Probably won't have time until the weekend. |
@terryburton: Terry, could you take a look at this? I used both ghostscript/BWIPP and bwip-js to generate barcodes, and their output matches. This is what zxing says about the barcodes: Note that the scanned text does not match the OP's reported result, but also doesn't match the input text. The input text used was: TRANS202404110011看16 For reference, the 16-bit character is U+770B (UTF-8 \xe7\x9c\x8b). The resulting scanned (zxing) text is: TRANS202404110011逵客6 The UTF-16 sequence is U+9035 U+5BA2 (UTF-8 \xe9\x80\xb5 \xe5\xae\xa2). Thanks! |
@metafloor I'll investigate. Thanks. |
The decoders are likely affected by a flaw that stems from a misunderstanding of the difference between symbology compaction modes ("encodation modes") and character encodings. To a lesser extent, they also ignore the fact that the default interpretation of QR Code data must be Latin-1, not UTF-8, unless ECI is used to specify otherwise, but this is a distraction from the main issue.

Internally, a QR Code message is a sequence of bytes, encoded compactly into a bitstream on the basis that some byte sequences are more likely to be encountered than others. Reading the resulting symbol back out yields the same sequence of bytes. The barcode symbology concerns itself with the durability of byte values throughout an encode/decode cycle. It does not concern itself with character set encodings, other than to say that the message character encoding should be considered as Latin-1 for compatibility between parties. (See my note at the end for extra detail concerning ECI, which is not at play here.)

The QR Code encoder has been passed input byte data: The encoder opportunistically represents these byte values as succinctly as possible using a concatenation of multiple encodation segments:
Note that the penultimate "1" character has been included in the Kanji Mode encoding segment. This is fine. [ Edit: It turned out that the actual issue is that Each segment is written to the bitstream (before being padded, terminated, chunked, and having ECC data added, which forms the basis for the graphical rendition). For a 25x25 symbol, the resulting bitstream is:
The role of the decoder is to undo the internal encoding (i.e. de-compact the bitstream) to recover the original data bytes for subsequent interpretation and transmission. Only once the data bytes are recovered should the overall message be interpreted: by default in accordance with Latin-1, unless ECI specifies otherwise. (Or standards be damned and just interpret it as UTF-8 anyway, as everyone seems to!) In the case of the ZXing decoder, the issue appears to be that it is erroneously interpreting the Kanji Mode compacted segment in accordance with the Shift-JIS character set (noting that 0xE79C is "逵"). This appears to be a fundamental misunderstanding: Kanji Mode compaction does not imply Shift-JIS interpretation of the data in the section. It's all just binary data during symbology encoding and decoding; encodation modes (as applied to a sequence of data values) are not ECI indicators (as applied to the extracted message itself). In the case of the OP's decoder, there is some other error at play. Nevertheless it seems that BWIPP's encoding is correct, and I would be happy for someone else to step through it manually to assure themselves of this or find my error. [ Edit: The OP's decoder was behaving correctly. Other decoders mentioned in this thread are in error. ] |
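To make the distinction between byte recovery and character interpretation concrete, here is a minimal Python sketch. It is not taken from BWIPP or any decoder; the variable names are illustrative, and `recovered` simply stands in for whatever bytes a correct decoder would hand back after de-compacting the bitstream. The Shift-JIS line shows the flawed shortcut described above (the thread reports that the pair E7 9C reads as "逵" in that character set).

```python
# Message bytes from this thread: ASCII text around U+770B, encoded as UTF-8.
message = "TRANS202404110011看16".encode("utf-8")

# Step 1 (symbology): a correct decoder recovers exactly these bytes,
# regardless of which encodation modes carried them.
# Assumption: `recovered` is a stand-in for the de-compacted bitstream.
recovered = bytes(message)

# Step 2 (interpretation): only now is a character set applied.
as_latin1 = recovered.decode("latin-1")  # the default named by the standard
as_utf8 = recovered.decode("utf-8")      # the common de facto choice

# The flawed shortcut: treat a byte pair that happened to travel in
# Kanji Mode as Shift-JIS text. Bytes 17..18 are the pair E7 9C.
pair = recovered[17:19]
as_sjis = pair.decode("shift_jis", errors="replace")

print(as_latin1.encode("latin-1") == recovered)  # Latin-1 round-trips any bytes
print(as_utf8)
print(pair.hex(), as_sjis)
```

The point of the sketch is only the ordering: the mode that carried a byte pair plays no part in step 2.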
@terryburton : Thank you for the detailed investigation! @xinglie : What scanner / software are you using to decode? Is it possible to ask the vendor to investigate this issue? |
@terryburton Thank you ! touch above image yellow area, iPhone above image is iPhone when I use so interesting |
Decided to try your example text on few different generators. I have a Symbol DS6707 handheld 2D scanner (mfg date May 2010 - so old firmware) that I use for testing.
Site: https://www.qr-code-generator.com/
Site: https://qrcode.tec-it.com/en
I also tried the npm package qrcode-generator using the following code:
The result was:
All of the barcodes were generated with error correction level L for consistency. @terryburton : Does any of this make sense? |
Take a look at the symbols using the following (legacy) app which lists the modes used: https://play.google.com/store/apps/details?id=de.stefanarnhold.RawCodeScan You can see in it that both of the other encoders do not spot the opportunity to use Kanji Mode. The sad fact is that if decoders can't be trusted to handle Kanji Mode reliably then we might also choose to avoid it. But that would be sad... Given that the DS6707 also chokes on the most basic of all encodings (pure Byte Mode) I'm not going to give it much credibility! |
@terryburton : The app you linked is too old to install on my phone but was able to get a version of the quagga.js library to work. Just to document what the above barcodes contain, here are the parse results of each:
This decode behavior matches that of zxing. Edit: Actually it doesn't. It's a bad utf-8 interpretation bug in zxing and quagga. The result matches Alipay.
I suspect the eci caused the handheld scanner to refuse to play.
|
Edit: Please ignore this. Mixing up UTF-16 and Shift JIS double-byte characters. @terryburton : Possibly beginning to understand the decoding problem. According to this document:
The OP's example text contains the code point U+770B, which is not part of the Shift JIS code table. Could that be the issue? |
Actual characters are irrelevant, however an issue is that not every codepoint value within the Shift-JIS ranges 0x8140-0x9FFC and 0xE040-0xEBBF is occupied. There are gaps in the low order bytes. BWIPP does not take this into account when determining whether a sequence is amenable to Kanji Mode compaction, so the transformation results in clashes, e.g. "\213 1" = 8B31 => 0x07B1 But also: "\213 q" = 8B71 => 0x07B1 This explains what @xinglie is observing. I'll push a fix tomorrow. |
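The clash can be reproduced with a few lines of Python. This is a sketch of the Kanji Mode transformation as described in this thread (subtract a range offset, then combine the two bytes base 0xC0), not BWIPP's actual code; the function names `kanji_compact` and `kanji_expand` are made up for illustration.

```python
def kanji_compact(hi: int, lo: int) -> int:
    """Kanji Mode transformation of a byte pair into a 13-bit value."""
    word = (hi << 8) | lo
    if 0x8140 <= word <= 0x9FFC:
        word -= 0x8140
    elif 0xE040 <= word <= 0xEBBF:
        word -= 0xC140
    else:
        raise ValueError("pair outside the Kanji Mode ranges")
    # High byte of the difference times 0xC0, plus the low byte.
    return (word >> 8) * 0xC0 + (word & 0xFF)


def kanji_expand(value: int) -> bytes:
    """Inverse applied by a decoder: 13-bit value back to a byte pair."""
    word = ((value // 0xC0) << 8) | (value % 0xC0)
    word += 0x8140 if word < 0x1F00 else 0xC140
    return word.to_bytes(2, "big")


a = kanji_compact(0x8B, 0x31)  # "\213 1", low byte below 0x40
b = kanji_compact(0x8B, 0x71)  # "\213 q"
print(hex(a), hex(b))          # both map to the same 13-bit value
print(kanji_expand(a))         # the decoder can only return one of them
```

Because the low byte 0x31 is below 0x40, the subtraction borrows from the high byte and the pair lands on the value that properly belongs to 8B 71. A decoder inverting the transformation returns 8B 71, which is why the "1" comes back as "q".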
The OP's issue has been fixed with bwipp/postscriptbarcode@8fcbe7c. This ensures that the low byte of a "Kanji Mode pair" is always >= 64, since this avoids overlap in the transformation function. The standard doesn't mention this requirement. However, exercise caution, because from the testing perspective the patch will appear to fix the issue for the wrong underlying reason: it causes Kanji Mode to be avoided entirely for the specific example. (The Kanji Mode run would now be too short to overcome the mode switching cost.) Therefore, for A/B testing (old code versus new code behaviour) please use the following input that will invoke Kanji Mode:
At least with this you will know that the issue is fixed (or not) for the right reason. As an aside, I note that the first range of acceptable Kanji Mode pairs should be specified as 0x8140-0x9FFF (not 0x9FFC) as (1) this would make the overall transformation function a bijection between those ranges and 13-bit numbers, and (2) the limits really have nothing whatsoever to do with actual Shift-JIS codepoints. But I bet that some decoders hardcode limits related to the given ranges that would prevent decode of the extension. Pfft. If this small patch is reported as working then I'll cut a release. |
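Both claims above (that requiring a low byte >= 64 removes the overlap, and that extending the first range to 0x9FFF would make the mapping a bijection onto the 13-bit numbers) can be checked by brute force. The following Python sketch enumerates every candidate pair; the helper names `compact`, `pairs` and `summarize` and the `min_low` parameter are hypothetical and exist only for this check.

```python
def compact(hi: int, lo: int) -> int:
    """Kanji Mode transformation of a byte pair into an integer."""
    word = ((hi << 8) | lo) - (0x8140 if hi < 0xE0 else 0xC140)
    return (word >> 8) * 0xC0 + (word & 0xFF)


def pairs(first_range_end: int, min_low: int):
    """Yield (hi, lo) pairs in both ranges whose low byte is >= min_low."""
    words = list(range(0x8140, first_range_end + 1))
    words += list(range(0xE040, 0xEBBF + 1))
    for word in words:
        hi, lo = word >> 8, word & 0xFF
        if lo >= min_low:
            yield hi, lo


def summarize(first_range_end: int, min_low: int):
    """Return (pair count, distinct outputs, covers all 13-bit values)."""
    values = [compact(hi, lo) for hi, lo in pairs(first_range_end, min_low)]
    distinct = set(values)
    return len(values), len(distinct), distinct == set(range(1 << 13))


print(summarize(0x9FFC, 0x00))  # no low-byte check: outputs collide
print(summarize(0x9FFC, 0x40))  # low byte >= 64: injective, 3 values unused
print(summarize(0x9FFF, 0x40))  # extended range: bijection onto 13 bits
```

With the low-byte restriction each high byte contributes 192 (0xC0) pairs, which is exactly the multiplier used by the transformation, so the outputs of adjacent high bytes no longer overlap.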
@terryburton : Unfortunately, I have not found a single qrcode reader that correctly decodes the new barcode: According to quagga.js, this is how it is formatted:
Some readers return the latin-1 character It appears the de facto standard is to interpret kanji mode as Shift JIS and not as a part of a byte stream. |
@metafloor I use |
That's a correct representation of what's in the barcode. BWIPP now does the right thing and the "1" is no longer falsely adopted into an invalid Kanji Mode byte pair
Indeed, it seems as though certain broken decoders are essentially creating the message as the concatenation of the

What to do about such fundamental misunderstandings of the protocol?

Short term: We could just add an options flag (possibly the default) that disables Kanji Mode compression. A bit defeatist...

Longer term: Improvements to the ISO standard and education. The QR Code standard needs to be made clearer. There are several improvements that should be made to sections related to Kanji Mode (and Byte Mode) that are outright misleading. I'll see about proposing changes, but this might be best to wait behind some whitepapers/articles that I intend to write that present the overall framework / themes and gory details common to generic symbology standards. It's surprising how "vague on the details" even symbology experts can be when it comes to encoding and decoding, which makes having productive discussions difficult. If everyone has read up on the same background reading material then there's a good chance of alignment and improvements being made.

Technically the ISO standard is currently open to edits, but unfortunately it is in the final draft approval phase and any changes that I propose would likely require debate. Such focused discussions are best had within the AIM Technical Symbology Committee industry group along with invited experts from Denso, with the goal of that group creating an expert submission for consideration by the ISO committee. It'll likely be a while before WG1 reopens the standard after the current round. |
@xinglie : It's good to see that at least one scanner/reader implemented decode correctly... @terryburton : Since it appears to be a widespread implementation bug, my initial thinking was to disable kanji mode by default. This is potentially problematic since upstream users of BWIPP may expect kanji mode's smaller encoding size, which can affect layout. So my vote is to add a new option to disable kanji mode, off by default. The bwip-js build script makes dozens of changes to the raw postscript before transpiling to make the code easier to convert to javascript. I can simply add another awk/sed edit to toggle the flag to true before building. Since javascript is a utf-16 environment, the text will always be converted to utf-8, making binary encoding the natural choice. |
I've created bwipp/postscriptbarcode#266 which I'll discuss with the Zint project to ensure alignment / knowledge sharing. Feel free to close this issue if you want since the OP's issue is solved. |
@xinglie : Please try the following with your barcode readers. It is the result with the new The text is encoded as:
|
@metafloor I use iPhone |
FYI, I pushed out BWIPP 2024-06-18 with some related changes (that do not impact any examples provided here). |
Please refer the responsible parties for any affected decoders to this write up: bwipp/postscriptbarcode#267 |
@xinglie : Please give version 4.4.0 a try.
|
@metafloor I tested 4.4.0, and all scanner results are OK.
reproduction steps
Barcode Type: choose QR Code
Bar Text: input TRANS202404110011看16
Click the Show Barcode button
The QR Code image reads TRANS202404110011看q6, NOT the same as the input TRANS202404110011看16
I also tried modifying Options to use eclevel=H, eclevel=M, eclevel=L, or eclevel=Q, but all failed.