Some PUA characters don't show #275

Closed
takumadnl opened this issue Aug 22, 2022 · 5 comments

Comments

@takumadnl

Hi!

Using less, some Private Use Area characters don't show.

OS: macOS 12.5.1 Monterey
less: v590

[Screenshot: ss_echo_unicodes]

Unicode - Private Use Area

Range             Name
E000 - F8FF       Private Use Area
F0000 - FFFFD     Supplementary Private Use Area-A
100000 - 10FFFD   Supplementary Private Use Area-B

It seems that PUA characters are treated as binary.
PUA characters should not be treated as binary, because the Unicode specification does not define their purpose.
I would expect PUA characters to be displayed as they are.

FWIW


Running the following script shows that
there also seems to be a problem with the range definition of binary characters.

#!/bin/bash
#   D800 -   DBFF: High-Surrogate
#   DC00 -   DFFF: Low-Surrogate
#   E000 -   F8FF: Private Use Area
#   F900 -   FAFF: CJK Compatibility Ideographs
#  EFFFE -  EFFFF: noncharacters
#  F0000 -  FFFFD: Supplementary Private Use Area-A
#  FFFFE -  FFFFF: noncharacters
# 100000 - 10FFFD: Supplementary Private Use Area-B
# 10FFFE - 10FFFF: noncharacters
test_chars=(
  "DFFE"
  "DFFF"   "E000"   "E001"   # PUA start boundary
  "F8FE"   "F8FF"   "F900"   # PUA end   boundary
  "EFFFF"  "F0000"  "F0001"  # SPUA-A start boundary
  "FFFFC"  "FFFFD"  "FFFFE"  # SPUA-A end   boundary
  "FFFFF"  "100000" "100001" # SPUA-B start boundary
  "10FFFC" "10FFFD" "10FFFE" # SPUA-B end   boundary
)

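function print_unicodes() {
  # Print each code point in hex followed by the character itself.
  # Note: \U escapes in printf require bash 4.2 or later.
  for c in "${test_chars[@]}"
  do
    printf "%6s: \\U${c}\\n" "$c"
  done
}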

echo "- without less"
print_unicodes

echo
echo "- with less"
print_unicodes | less --quit-if-one-screen

Note: the screenshot uses nerd-fonts.
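
For comparison with what less displays, here is a minimal sketch of the classification each boundary code point should get according to the ranges listed in the script comments. The helper names `in_range` and `classify_cp` are made up for this illustration and are not part of less. The noncharacter test also covers U+FDD0..U+FDEF, which the script above does not exercise.

```bash
#!/bin/bash
# in_range VALUE LOW HIGH: succeed if LOW <= VALUE <= HIGH (hypothetical helper).
in_range() { (( $1 >= $2 && $1 <= $3 )); }

# classify_cp HEX: print the class of a code point given as a hex string
# (hypothetical helper; it mirrors the ranges in the script comments above).
classify_cp() {
  local cp=$((16#$1))
  if in_range "$cp" 0xD800 0xDFFF; then
    echo "surrogate"
  elif in_range "$cp" 0xE000 0xF8FF \
    || in_range "$cp" 0xF0000 0xFFFFD \
    || in_range "$cp" 0x100000 0x10FFFD; then
    echo "private-use"
  elif (( (cp & 0xFFFE) == 0xFFFE )) || in_range "$cp" 0xFDD0 0xFDEF; then
    # The last two code points of every plane, plus U+FDD0..U+FDEF.
    echo "noncharacter"
  else
    echo "other"
  fi
}

# Classify the boundary code points used in the test script.
for c in DFFF E000 F8FF F900 EFFFF F0000 FFFFD FFFFE 100000 10FFFD 10FFFE; do
  printf '%6s: %s\n' "$c" "$(classify_cp "$c")"
done
```

Any code point reported as private-use here but shown by less in its binary notation is affected by this issue.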

@gwsw
Owner

gwsw commented Aug 22, 2022

Well, since Unicode does not define the characteristics of PUA characters, it's not possible to determine the printable size of each character. Any PUA character could be a normal one-space printable character, or it could be a combining or control character (zero width), or a double-width character, or anything else. Treating them as binary seems the safest way to keep the screen display correct. However, I see your point that in most cases the user would want the characters to display directly. Perhaps there could be an extension to the LESSCHARDEF syntax that would allow the user to specify how each PUA character should be treated.

@gwsw
Owner

gwsw commented Sep 25, 2022

Commit dc4fa8c adds environment variable LESSUTFCHARDEF that can be used to set the type of Private Use (or any) characters.
Note that prior to this change there was a bug where only the two characters U+E000 and U+F8FF were treated as control characters, but the intention was for all characters numerically between them to be treated the same way. This has been fixed, so it is now necessary to set LESSUTFCHARDEF if any PUA characters are to be treated as printable.
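
As a minimal sketch of how this might be used, assuming the syntax described in the less man page (a comma-separated list of `HEX` or `HEX-HEX` ranges, each followed by a colon and a type letter: `p` printable, `w` wide, `b` binary, `c` composing), the following marks all three Private Use ranges as printable. The helper `build_chardef` is hypothetical and exists only to assemble the string; it requires a less version that includes the commit above.

```bash
#!/bin/bash
# The three Private Use ranges from the issue description.
pua_ranges=("E000-F8FF" "F0000-FFFFD" "100000-10FFFD")

# build_chardef TYPE RANGE...: join the ranges into "RANGE:TYPE,RANGE:TYPE,..."
# (hypothetical helper, not part of less).
build_chardef() {
  local type=$1 out="" r
  shift
  for r in "$@"; do
    out+="${out:+,}${r}:${type}"   # add a comma before every entry but the first
  done
  printf '%s\n' "$out"
}

# Treat every PUA character as an ordinary one-cell printable character.
LESSUTFCHARDEF=$(build_chardef p "${pua_ranges[@]}")
export LESSUTFCHARDEF
echo "LESSUTFCHARDEF=$LESSUTFCHARDEF"
```

With the variable exported, piping the test script's output through less in the same shell should show the PUA glyphs directly. If a terminal renders some range two cells wide, passing `w` instead of `p` for that range would be the corresponding override.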

@Finii

Finii commented Nov 20, 2024

> Well, since Unicode does not define the characteristics of PUA characters, it's not possible to determine the printable size of each character. Any PUA character could be a normal one-space printable character, or it could be a combining or control character (zero width) or a double-width character, or anything else.

Can you point out where Unicode says a PUA codepoint can be a control character? I believe this is not the case.

I think a much saner approach would be to handle all PUA characters as ordinary printable characters, because that is the usual case. If their client (terminal emulator) handles them differently, people could use LESSUTFCHARDEF. Note that this depends only on the client and not on the advance width in the font. I know of no terminal emulator that treats PUA characters as two cells wide; they all advance one cell, and only that matters for the position counting inside less, right?

@gwsw
Owner

gwsw commented Nov 20, 2024

Section 23.5 in the Unicode Core Specification says

> Private-use characters ... are designated for private use and do not have defined, interpretable semantics except by private agreement.

and

> For example, a private agreement might specify that two private-use characters are to be treated as a case mapping pair, or a private agreement could specify that a private-use character is to be rendered and otherwise treated as a combining mark.

I interpret this to mean that any given PU character might or might not advance one space in the terminal.

Before LESSUTFCHARDEF was implemented, the safe approach was to treat all PU chars as control (that is, with unknown behavior). Now that LESSUTFCHARDEF is available as an override, it may make sense to treat PU chars as normal by default, requiring LESSUTFCHARDEF to be used only when a char is not a one-space printable char. It does seem plausible that in most cases PU chars are printable rather than control, combining, etc.

@Finii

Finii commented Nov 21, 2024

Thanks for the answer!

And I believe you are right! In the paragraph

> In particular, when a private agreement overrides the General_Category of a private-use character from the default value of gc = Co to some other value such as gc = Lu or gc = Nd, such a change [...]

only 'Graphic' ('printable') categories (Lu, Nd) are mentioned, but it seems possible to set gc = Cc 🤔

Out in the wild I have not seen such a thing (which might say something, or maybe it's completely irrelevant); they were always Graphic characters, sometimes ligatures that should have a different codepoint (like putting the 'fi' ligature at F001 instead of FB01 in the "Ubuntu" font).

Thank you again, Fini
