S with caron rendered as tofu by elinks, rendered correctly by links #249

Open
0-issue opened this issue Jul 24, 2023 · 36 comments

Comments

@0-issue

0-issue commented Jul 24, 2023

S with caron (Š) rendered as tofu by elinks, rendered correctly by links

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<head>
<title>Test UTF-8</title>
</head>
<body>
<p class="indent">Jaro Šnajdrov</p>
</body>
</html>

links output:
Screenshot 2023-07-23 at 7 41 26 PM

elinks output:
Screenshot 2023-07-23 at 7 42 14 PM

@0-issue
Author

0-issue commented Jul 24, 2023

There is a similar problem with the Unicode non-breaking space: it is rendered as tofu by elinks, but not by links.

% printf "a 8" | xxd
00000000: 61c2 a038                                a..8
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<head>
<title>Test UTF-8 NBSP</title>
</head>
<body>
<p class="indent">Chapter 8</p>
</body>
</html>

links output:
Screenshot 2023-07-23 at 9 04 35 PM

elinks output:
Screenshot 2023-07-23 at 9 04 21 PM
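For reference, the byte sequences involved can be worked out by hand. The following is a minimal sketch of two-byte UTF-8 encoding, not code from elinks; the function name `utf8_encode_small` is made up for illustration. It shows that Š (U+0160) encodes to `C5 A0` and the non-breaking space (U+00A0) to `C2 A0`, which matches the `xxd` output above. Both sequences end in the byte `0xA0`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a code point below U+0800 as UTF-8.
 * Returns the number of bytes written (1 or 2), or 0 if cp is out of range. */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;                   /* plain ASCII */
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte 110xxxxx */
		out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation 10xxxxxx */
		return 2;
	}
	return 0; /* three- and four-byte forms are not covered by this sketch */
}
```

Calling `utf8_encode_small(0x160, buf)` fills `buf` with `0xC5, 0xA0`; calling it with `0xA0` gives `0xC2, 0xA0`.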

@rkd77
Owner

rkd77 commented Jul 24, 2023

Did you compile elinks with utf-8 enabled?
-Dutf-8=true

@0-issue
Author

0-issue commented Jul 24, 2023

@rkd77 I have never used meson, so I am not sure whether the build options in meson_options.txt are picked up by the script. Here are the steps I followed:

./autogen.sh
./configure
make
sudo make install

I am not sure whether meson_options.txt is picked up by this flow; I would guess not. I couldn't find any meson installation instructions in your INSTALL file. Below is an excerpt from the ./configure output. As far as I can see, it did not pick up the true values for the 256-colors/true-color options from meson_options.txt. Are there two different build flows here? If so, how can I pass those options to configure? (It doesn't accept options like -Dutf-8=true.)

The following feature summary has been saved to features.log
Feature summary:
Documentation Tools ............. AsciiDoc, XmlTo, Pod2HTML
Manual Formats .................. HTML (one file), HTML (multiple files)
Man Page Formats ................ HTML, man (groff)
API Documentation ............... no
gpm ............................. no
terminfo ........................ no
zlib ............................ yes
bzlib ........................... yes
zstd ............................ no
brotli .......................... no
lzma ............................ no
idn2 ............................ no
Bookmarks ....................... yes
XBEL bookmarks .................. yes
ECMAScript (JavaScript) ......... no
Browser scripting ............... no
libev ........................... no
libevent ........................ no
SSL ............................. GNUTLS
Native Language Support ......... yes
System gettext .................. no
Cookies ......................... yes
Form history .................... yes
Global history .................. yes
Mailcap ......................... yes
Mimetypes files ................. yes
IPv6 ............................ yes
BitTorrent protocol ............. no
Data protocol ................... yes
URI rewriting ................... yes
Local CGI ....................... no
DOS Gateway Interface ........... no
Finger protocol ................. no
FSP protocol .................... no
FTP protocol .................... yes
Gemini protocol ................. no
Gopher protocol ................. no
NNTP protocol ................... no
Samba protocol .................. no
Mouse handling .................. yes
BSD sysmouse .................... no
88 colors ....................... no
256 colors ...................... no
true color ...................... no
Exmode interface ................ no
LEDs ............................ yes
Marks ........................... yes
Cascading Style Sheets .......... yes
HTML highlighting ............... no
DOM engine ...................... no
Backtrace ....................... yes
No root exec .................... no
Debug mode ...................... no
Fast mode ....................... no
Own libc stubs .................. no
Small binary .................... no
UTF-8 ........................... yes
Combining characters ............ no
Reproducible builds ............. no
Check codepoints ................ no
Regexp searching ................ no (TRE not found)

@rkd77
Owner

rkd77 commented Jul 24, 2023

Here is a simple build script for meson:


rm -rf /dev/shm/builddir

meson setup /dev/shm/builddir \
-D88-colors=false \
-D256-colors=true \
-Dapidoc=false \
-Dpdfdoc=false
...
and so on

meson compile -C /dev/shm/builddir

and cd /dev/shm/builddir && ninja install

It seems the configure script also built the binary with utf-8 support.
What are your locale settings (LANG, LC_ALL)?
Which terminal?
Which distribution?

On Debian 12, with konsole and LANG=pl_PL.UTF-8, it is displayed fine.

@0-issue
Author

0-issue commented Jul 24, 2023

@rkd77 I am on macOS aarch64. macOS doesn't have pl_PL.UTF-8; my locale is en_US.UTF-8.

% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Terminal: tested on several: iTerm2, Alacritty, WezTerm, Kitty. They all use different fonts too. I wonder if it is the Combining characters feature? The meson config stage passed after I disabled a whole bunch of options like gpm, libcss, etc., but then the compile stage fails with errors I don't understand. I am adding the error output below for your reference. My experience with make has been much smoother (no errors). I will try changing its options and get back to you.

% meson compile -C ~/.build/elinks
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /opt/homebrew/bin/ninja -C /Users/amanmehra/.build/elinks
ninja: Entering directory `/Users/amanmehra/.build/elinks'
[6/185] Compiling C object src/elinks.p/config_cmdline.c.o
FAILED: src/elinks.p/config_cmdline.c.o
cc -Isrc/elinks.p -Isrc -I../../packages/elinks/src -I. -I../../packages/elinks -I/opt/homebrew/Cellar/zlib/1.2.13/include -I/opt/homebrew/Cellar/tre/0.8.0/include -I/opt/homebrew/Cellar/openssl@3/3.1.1_1/include -I/opt/homebrew/Cellar/libidn2/2.3.4_1/include -I/opt/homebrew/opt/icu4c/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/ncurses/include -I/opt/homebrew/opt/ncurses/include/ncursesw -I/opt/homebrew/opt/gdk-pixbuf/include/gdk-pixbuf-2.0 -I/opt/homebrew/opt/zlib/include -fcolor-diagnostics -Wall -Winvalid-pch -O0 -g '-DGETTEXT_PACKAGE="elinks"' '-DBUILD_ID="c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty"' -DHAVE_CONFIG_H -fno-strict-aliasing -Wno-address -MD -MQ src/elinks.p/config_cmdline.c.o -MF src/elinks.p/config_cmdline.c.o.d -o src/elinks.p/config_cmdline.c.o -c ../../packages/elinks/src/config/cmdline.c
../../packages/elinks/src/config/cmdline.c:173:14: error: call to undeclared function 'idn2_to_ascii_lz'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                int code = idn2_to_ascii_lz(idname, &idname2, 0);
                           ^
../../packages/elinks/src/config/cmdline.c:175:15: error: use of undeclared identifier 'IDN2_OK'
                if (code == IDN2_OK) {
                            ^
2 errors generated.
[17/185] Compiling C object src/elinks.p/document_html_parser_forms.c.o
ninja: build stopped: subcommand failed.

UPDATE: configure with --enable-combining doesn't change anything.

@0-issue
Author

0-issue commented Jul 25, 2023

UPDATE: I installed elinks in an Arch VM and opened it in the same tmux session on macOS. The locale, font, terminal, tmux, and terminfo are the same, yet it renders correctly in the Arch VM but not on macOS in the adjacent pane of the same tmux session. For some reason --version does not produce anything on macOS. Here's the --version output:

macOS (+/- Fastmem doesn't matter):

ELinks 0.17.GIT c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty
Built on Jul 24 2023 15:07:24

Features:
Standard, Fastmem, IPv6, gzip(1.2.13), bzip2(1.0.8), UTF-8, Periodic
Saving, Viewer (Search History, Timer, Marks), gettext (ELinks),
Cascading Style Sheets, Protocol (Authentication, File, FTP, HTTP, URI
rewrite, User protocols), SSL (GnuTLS), MIME (Option system, Mailcap,
Mimetypes files), LED indicators, Bookmarks, Cookies, Form History,
Global History, Goto URL History

Arch Linux:

ELinks 0.16.1.1
Built on Jul 24 2023 19:33:32

Features:
Standard, IPv6, gzip(1.2.13), bzip2(1.0.8), zstd(1.5.5), gpm(2.1.0),
UTF-8, Periodic Saving, Viewer (Search History, Timer, Marks), gettext
(ELinks), Cascading Style Sheets, Protocol (Authentication, File, CGI,
FTP, Gemini, HTTP, URI rewrite, User protocols), SSL (OpenSSL), MIME
(Option system, Mailcap, Mimetypes files), LED indicators, Bookmarks,
Cookies, Form History, Global History, Scripting (Lua), Goto URL History

@rkd77
Owner

rkd77 commented Jul 25, 2023

@amanvm Could you confirm whether the same bug (wrong utf-8 letter) occurs in a FreeBSD VM? I have no access to such hardware, but I guess FreeBSD is similar to macOS in this case.

@0-issue
Author

0-issue commented Jul 25, 2023

@rkd77 Just tested: it does not happen in the FreeBSD VM! It renders correctly in both FreeBSD and Linux. Both were tested in the same tmux session on macOS with defaults (no config). I also tried with the default config (no config) on macOS, but the problem persists there, so it is not a config problem either. Mine is an aarch64 macOS machine; not sure if that affects anything. Searching "macos virtual machine on linux" shows a whole bunch of videos and guides...

@0-issue
Author

0-issue commented Jul 25, 2023

@rkd77 One observation: unlike most other systems, where libs/include files are in standard directories such as `/usr/local` or `/usr/`, Homebrew on aarch64 macOS recommends `/opt/homebrew`. I ran `otool -L` (the macOS equivalent of `ldd`) on macOS and `ldd` on Linux, and found that my macOS elinks build was missing a bunch of libs (it still has fewer). It didn't even link to libssl or libiconv. Updating the configure paths for those libs does link it to the respective libs, but the behavior is still the same. Can you eyeball the linked libraries to see if anything more is needed?

On macOS I used this:

./configure --with-openssl=/opt/homebrew/Cellar/openssl@3/3.1.1_1/ --with-libiconv=/opt/homebrew/Cellar/libiconv/1.17/

macOS (otool -L /path/to/elinks):

% otool -L /usr/local/bin/elinks
/usr/local/bin/elinks:
        /opt/homebrew/opt/tre/lib/libtre.5.dylib (compatibility version 6.0.0, current version 6.0.0)
        /opt/X11/lib/libX11.6.dylib (compatibility version 11.0.0, current version 11.0.0)
        /opt/homebrew/opt/openssl@3/lib/libssl.3.dylib (compatibility version 3.0.0, current version 3.0.0)
        /opt/homebrew/opt/openssl@3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
        /usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.8)
        /opt/homebrew/opt/zlib/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.13)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
        /usr/lib/libexpat.1.dylib (compatibility version 7.0.0, current version 8.0.0)
        /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1500.65.0)

ArchLinux (ldd /path/to/elinks):

% ldd /usr/bin/elinks
        linux-vdso.so.1 (0x0000ffffaaf45000)
        libtre.so.5 => /usr/lib/libtre.so.5 (0x0000ffffaacf0000)
        libssl.so.3 => /usr/lib/libssl.so.3 (0x0000ffffaac20000)
        libcrypto.so.3 => /usr/lib/libcrypto.so.3 (0x0000ffffaa780000)
        liblua.so.5.4 => /usr/lib/liblua.so.5.4 (0x0000ffffaa720000)
        libidn.so.12 => /usr/lib/libidn.so.12 (0x0000ffffaa6d0000)
        libzstd.so.1 => /usr/lib/libzstd.so.1 (0x0000ffffaa600000)
        libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x0000ffffaa5d0000)
        libz.so.1 => /usr/lib/libz.so.1 (0x0000ffffaa5a0000)
        libgpm.so.2 => /usr/lib/libgpm.so.2 (0x0000ffffaa580000)
        libexpat.so.1 => /usr/lib/libexpat.so.1 (0x0000ffffaa540000)
        libc.so.6 => /usr/lib/libc.so.6 (0x0000ffffaa380000)
        /lib/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffffaaf0c000)
        libm.so.6 => /usr/lib/libm.so.6 (0x0000ffffaa2d0000)
        libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x0000ffffaa240000)

@rkd77
Owner

rkd77 commented Jul 26, 2023

An unsigned char -> char conversion made in the past is suspected.
There are not too many released tarballs, so checking them one by one or by "bisection" could prove this hypothesis. If 0.13.0 fails, I have no idea.
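To illustrate why such a conversion is a plausible suspect: whether plain `char` is signed depends on the platform ABI, so a UTF-8 byte such as `0xC5` may become a negative `int` when promoted, and passing a negative value (other than `EOF`) to a `<ctype.h>` function is undefined behavior. The sketch below shows the usual defensive cast; the helper names `byte_value` and `is_space_byte` are hypothetical and are not taken from the elinks sources.

```c
#include <ctype.h>

/* Hypothetical helper: return the byte value 0..255 of a char,
 * regardless of whether plain char is signed on this platform. */
int byte_value(char c)
{
	return (unsigned char)c;
}

/* Hypothetical helper: <ctype.h> functions require an argument that is
 * representable as unsigned char (or EOF), so cast before calling. */
int is_space_byte(char c)
{
	return isspace((unsigned char)c) != 0;
}
```

With the cast in place, `byte_value((char)0xC5)` is 197 on every platform, whereas a bare `(int)c` would be -59 wherever `char` is signed.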

@0-issue
Author

0-issue commented Jul 26, 2023

@rkd77 Tested on 0.13.0, and it's the same there! I wonder if there is another Unicode shaping library that you could use, or how links manages to display these glyphs.

@rkd77
Owner

rkd77 commented Jul 27, 2023

What if charset is added in meta?

<!DOCTYPE html>
<head>
<meta charset="utf-8"/>
<title>Test UTF-8 NBSP</title>
</head>
<body>
<p class="indent">Jaro Šnajdrov</p>
</body>
</html>

Also, what does the dump look like:
elinks -dump file.html
and
links -dump file.html
?

@0-issue
Author

0-issue commented Jul 27, 2023

@rkd77 For 1) I get a "Bad url syntax" error with elinks, but not with links.

For 2) (the -dump option): same problem. The output from links has the correct S with caron; the dump from elinks has tofu there.

It's a little late here, so any further follow-up will come after a while. Thanks!

Screenshot 2023-07-27 at 12 43 11 AM

@rkd77
Owner

rkd77 commented Jul 27, 2023

I guess it has something to do with detection of the encoding. If this ^ commit did not resolve it, I have no idea.

@0-issue
Author

0-issue commented Jul 27, 2023

@rkd77 It didn't resolve it. One comment: a lot of Unicode renders just fine; only a subset doesn't. If you can think of a patch that logs the interesting function arguments and return values as text for an input test document like this one, I can volunteer to run it.

@rkd77
Owner

rkd77 commented Jul 27, 2023

@amanvm You can prepare test cases and save dumps (elinks --dump), then show a hex view of these dumps.

@0-issue
Author

0-issue commented Jul 27, 2023

@rkd77 I am not a Unicode/UTF-8 expert, and we might end up doing a lot of back and forth that way. Wouldn't you rather add an fprintf or two to the important functions that shape/process Unicode data, so we converge faster? A branch or patch with some fprintfs would help.

@rkd77
Owner

rkd77 commented Jul 28, 2023

@amanvm There are many places where it can break. First I want to know what it "looks" like.
In elinks: F9 -> File -> Save formatted document (save with a .txt extension). Please create a tarball with a few cases, containing the original files and the formatted documents. BTW, in one of the previous messages there was:
ELinks 0.17.GIT c09b5da-dirty
"dirty" means that you modified the sources. What was changed?

@rkd77
Owner

rkd77 commented Jul 28, 2023

Another question: how is plain text with this character rendered? Also as "tofu", or OK?

@0-issue
Author

0-issue commented Jul 29, 2023

@rkd77 OK, I have something for you! I used the test document from pragmatapro's repository, since that is one of the most comprehensive terminal fonts and the test file lists all of its glyphs with their Unicode code points. There is a free clone called pragmasevka; you are welcome to try viewing the documents with that.

There are 3 files, and 2 screenshots attached here:

  1. All_chars.txt: This is the unmodified test document mentioned above.
  2. All_chars_elinks.txt: This is the output of elinks -dump All_chars.txt.
  3. All_chars_links.txt: This is the output of links -dump All_chars.txt.

Files:
All_chars.txt
All_chars_elinks.txt
All_chars_links.txt

Screenshots:

  1. Compares All_chars.txt with All_chars_elinks.txt. As you can see, the first tofu appears at U+00C5. Compare the glyph near the cursor in the right window to the glyph in the left window.
Screenshot 2023-07-28 at 5 24 35 PM
  2. Compares All_chars.txt with All_chars_links.txt. As you can see, no tofu appears at U+00C5; links doesn't have the problem. Compare the glyph near the cursor in the right window to the glyph in the left window.
Screenshot 2023-07-28 at 5 26 19 PM

More tofu can be seen on the screen, or by downloading the txt files and looking for yourself.


Regarding your other question about "ELinks 0.17.GIT c09b5da-dirty": I just tried installing the Homebrew version from the head branch using brew install -s --HEAD, and --version on Homebrew's head build also had -dirty at the end. So something other than manual changes to the files is causing it.

@rkd77
Owner

rkd77 commented Jul 29, 2023

@amanvm On the utf branch I added debug statements and test/chars.txt.
Please compile, run elinks -dump chars.txt 2> log, and show the log.

@0-issue
Author

0-issue commented Jul 29, 2023

@rkd77 Here we go, this is what the log file has:

goto charsets.c:758:utf8_to_unicode
charsets.c:751:utf8_to_unicode

rkd77 added a commit that referenced this issue Jul 30, 2023
@rkd77
Owner

rkd77 commented Jul 30, 2023

Added more debug statements. Could you rerun the test?
Which compiler is it?

@0-issue
Author

0-issue commented Jul 30, 2023

@rkd77 Here's the output:

goto charsets.c:759:utf8_to_unicode
str[0]=195
str[1]=32
str[1] & 0xc0 = 0
charsets.c:751:utf8_to_unicode
str[0]=195

Clang, as gcc/g++ is actually clang/clang++ on Apple macOS. This is also the aarch64 (ARM64) version.

% /usr/bin/gcc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

% /usr/bin/g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
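The log above can be read against the UTF-8 rules: Å (U+00C5) is encoded as `C3 85`, so after the lead byte 195 (`0xC3`) a decoder expects a continuation byte of the form `10xxxxxx`. The logged `str[1]=32` is an ASCII space, for which `str[1] & 0xc0` is 0, so the sequence is rejected. The following is a standalone sketch of that check for two-byte sequences; it is not the `utf8_to_unicode` function from elinks' charsets.c, and the name `utf8_decode2` is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: decode a two-byte UTF-8 sequence.
 * Returns the code point, or -1 if the two bytes are not a valid sequence. */
int32_t utf8_decode2(const unsigned char *s)
{
	/* Lead byte must be 110xxxxx; 0xC0 and 0xC1 would be overlong forms. */
	if ((s[0] & 0xE0) != 0xC0 || s[0] < 0xC2)
		return -1;
	/* Continuation byte must be 10xxxxxx. */
	if ((s[1] & 0xC0) != 0x80)
		return -1;
	return ((int32_t)(s[0] & 0x1F) << 6) | (int32_t)(s[1] & 0x3F);
}
```

Given the bytes `{0xC3, 0x85}` this returns `0xC5`; given `{0xC3, 0x20}`, the pair seen in the log, it returns -1.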

@rkd77
Owner

rkd77 commented Jul 31, 2023

Added another commit to the utf branch. I disabled maybe_preformat_hook in dump to rule it out as a suspect.
The second issue is str[1]=32 (space). In chars.txt I replaced the spaces with digits, so this time I guess there will be some digit instead of 32. If there is no error, then the preformat hook is the culprit.
Please git pull, compile, and rerun elinks -dump chars.txt

@0-issue
Author

0-issue commented Jul 31, 2023

@rkd77 It's the same stderr output as before:

goto charsets.c:759:utf8_to_unicode
str[0]=195
str[1]=32
str[1] & 0xc0 = 0
charsets.c:751:utf8_to_unicode
str[0]=195

stdout has this:

   U+00C0 1À2Á3Â4Ã5Ä6�7Æ8Ç9È

@0-issue
Author

0-issue commented Jul 31, 2023

@rkd77 One observation: if I open the document without -dump, the character renders correctly (no error is seen). Are you sure the code path taken by the application for -dump is the same?

Actually, the error wasn't seen for this Å on a previous version of elinks either when not opened with -dump. But the error is always seen with S with caron (Š). So there is some difference between -dump and non-dump behavior. Perhaps we should try S with caron (Š) and its neighbors?

rkd77 added a commit that referenced this issue Jul 31, 2023
rkd77 added a commit that referenced this issue Jul 31, 2023
@rkd77
Owner

rkd77 commented Jul 31, 2023

-dump interprets the document as HTML, while the normal view treats it as plain text. You can check the latest commits and show the log.
I'm slowly running out of ideas.

@0-issue
Author

0-issue commented Jul 31, 2023

@rkd77 Here's the log: chars.log. Not sure if it matters, but here's the entire build log too: build.log.

@rkd77
Owner

rkd77 commented Jul 31, 2023

@amanvm, thanks, could you continue? I made a mistake in the commit log, but we are getting closer.
Please git pull, compile, and produce the same log.

rkd77 added a commit that referenced this issue Jul 31, 2023
@0-issue
Author

0-issue commented Jul 31, 2023

@rkd77 Sure, I am here to help. Here's the output: chars.log. There are new compiler warnings; not sure if they matter: build.log.

@rkd77
Owner

rkd77 commented Jul 31, 2023

I guess isspace returns different results here than on Linux. The warnings don't matter here, at least not yet.
Please rerun the test.

@0-issue
Author

0-issue commented Jul 31, 2023

@rkd77 Here it is: chars.log, and build.log.

@rkd77
Owner

rkd77 commented Jul 31, 2023

I added code for isspace. Could you check whether it works? You can redirect stderr to /dev/null
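As an illustration of the idea (not the actual commit), a whitespace test that ignores the C library's locale tables could look like the sketch below. The name `ascii_isspace` is hypothetical. It accepts only the six ASCII whitespace characters, so bytes at or above `0x80`, such as the `0x85` and `0xA0` continuation bytes of Å, Š and the non-breaking space, are never classified as spaces.

```c
/* Hypothetical helper: locale-independent whitespace test.
 * Only the six ASCII whitespace characters count; any byte >= 0x80,
 * including UTF-8 continuation bytes, is never treated as a space. */
int ascii_isspace(unsigned char c)
{
	return c == ' ' || c == '\t' || c == '\n'
	    || c == '\v' || c == '\f' || c == '\r';
}
```

The design choice here is to classify raw bytes without consulting the locale, which keeps the result identical across platforms; whether additional characters should count as spaces is a separate decision.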

@0-issue
Author

0-issue commented Jul 31, 2023

@rkd77 It works well everywhere now! No tofu!

Btw, I would recommend mentioning somewhere that users should close all other instances of elinks before trying a new version. If an old version is still open, the old behavior persists for some reason, even with the new binary. Once I closed all old instances, the new binary's behavior kicked in. I know you have a socket file for communication between elinks instances, but I am not sure how it is used and couldn't find much about it in the documentation.

@rkd77
Owner

rkd77 commented Aug 1, 2023

This commit was added to the master branch. More characters likely need to be added to isspace.

The docs have info about sessions and elinks instances.
