[66_13] Reasonable herk->utf8 and utf8->herk #2150
Rendering of U+00A9 will be fixed in later pull requests. This pull request aims to pin down the definition of the Herk encoding.
First, thank you for your efforts to improve TeXmacs. As I said before, I have a fair amount of experience in encoding and i18n but not much in TeXmacs itself, so please take what I say below with a grain of salt.
I do have a few questions here:
1. From my understanding, TeXmacs Cork is a variation of the TeX Cork encoding used in early TeX T1 fonts. However, since the internals of Mogan are now (are they?) all in Unicode (UTF-8/16?), there is no need to convert anymore.
2. If (1) is true, is the new encoding just UTF-8 (maybe with escapes for <> tags)?
3. If (1) is false, what is preventing the internals from being Unicode? After reviewing the TM source code, I suppose it is because TeXmacs still uses the Cork-based hyphenation rules, which require the Cork encoding. Is this true?
4. If (3) is true, is it possible to take the same approach as https://hyphenation.org/pdf/tb93miklavec.pdf and get rid of the Cork-specific code in the hyphenation engine, and ideally just use the same LaTeX hyphenation rules as in https://hyphenation.org and https://ctan.org/pkg/hyph-utf8?
The internals of Mogan are still in the TeXmacs Cork encoding. The TMU format tries to get rid of the TeXmacs Cork encoding within the scope of the file format. I believe it is the first step towards getting rid of the TeXmacs Cork encoding in the codebase.
The full codebase is based on Cork encoding, for example:
No one is preventing the internals from being Unicode. But if you want to use Unicode, you first have to support Unicode in S7 Scheme. S7 Scheme supports neither Unicode strings nor Unicode chars. GNU TeXmacs uses GNU Guile 1.8.x, which does not support Unicode strings either. GNU Guile 3 does support Unicode strings, but if we adopted GNU Guile 3, making it work on Windows would be a nightmare. I started the Goldfish Scheme project. Even assuming that I've completed Unicode support (string and char) in Goldfish Scheme, there is still a lot to do to make the codebase Unicode-based rather than TeXmacs Cork-based. And first of all, we have to introduce a UTF-8 format: TMU. The TM format uses the ISO-8859 series, depending on the natural language:
Thanks for pointing out the relation between the Cork encoding and the hyphenation engine.
Cork and TeXmacs Cork and Herk
See https://en.wikipedia.org/wiki/Cork_encoding
Cork -> Unicode: Cork+0000 to Cork+00FF
You can run
Cork to Unicode: > Cork+00FF
For characters beyond the Cork scope, we encode them in a hex format. For example:
The leading 0 can be stripped. This is the same in the TeXmacs Cork encoding and in the Herk encoding.
Unicode to Cork
Unicode to Cork conversion is much more complicated. There are several reasons why I had to introduce the Herk encoding:
For the Herk encoding, the rule is much simpler:
Herk is named after the first two letters of Hex and the last two letters of Cork. Like Cork, it is the name of a city in Europe.
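The hex format mentioned above for characters beyond the Cork range can be sketched as follows. This is only an illustration, not the code in this pull request: the `<#HEX>` spelling with uppercase digits is assumed from the usual TeXmacs convention, and the function names `hex_escape` and `parse_hex_escape` are hypothetical.

```python
def hex_escape(code_point: int) -> str:
    """Write a code point above 0xFF as a hex escape (assumed form: <#HEX>)."""
    if code_point <= 0xFF:
        raise ValueError("code points up to 0xFF are not hex-escaped in this sketch")
    # format(..., "X") emits uppercase hex without leading zeros,
    # so U+0100 becomes "<#100>" rather than "<#0100>".
    return "<#" + format(code_point, "X") + ">"


def parse_hex_escape(token: str) -> int:
    """Read a hex escape back into a code point; leading zeros are accepted."""
    if not (token.startswith("<#") and token.endswith(">")):
        raise ValueError("not a hex escape: " + token)
    return int(token[2:-1], 16)


print(hex_escape(0x4E2D))           # <#4E2D>
print(parse_hex_escape("<#0100>"))  # 256
```

Because the parser accepts both `<#0100>` and `<#100>`, stripping the leading 0 does not change which character is meant.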
I'm moving on to improve font rendering on top of the now well-defined Herk encoding. That's why I merged this pull request in a hurry.
Why
Try to solve the Cork encoding defects by introducing the Herk encoding with minimal changes.
Herk encoding is adopted in TMU serialization and deserialization. It is much better than `utf8->cork` and `cork->utf8`, because with `utf8->cork` and `cork->utf8` two Unicode code points may map to the same Cork code.
It does bring breaking changes to the TMU format, which is why we need to bump the version. But it is not a big change.
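To see why two code points sharing one code is a problem for serialization, here is a minimal sketch that looks for such collisions in an encoding table. The table below is a made-up toy, not the real `utf8->cork` mapping, and `find_collisions` is a hypothetical helper.

```python
from typing import Dict, List


def find_collisions(encode_table: Dict[str, int]) -> Dict[int, List[str]]:
    """Return every target code that more than one source character maps to."""
    by_code: Dict[int, List[str]] = {}
    for char, code in encode_table.items():
        by_code.setdefault(code, []).append(char)
    return {code: chars for code, chars in by_code.items() if len(chars) > 1}


# Hypothetical toy table: "x" and "y" share the code 0x10.
toy_table = {"a": 0x61, "x": 0x10, "y": 0x10}
collisions = find_collisions(toy_table)
print(collisions)
```

Any code that shows up in the result cannot be decoded back unambiguously, so a file written through such a table may not read back as the text that was saved.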
What
How to test
Unit tests on branch-1.2
Before
Now TeXmacs/tests/66_13.scm should work fine!
Test doc
Several test cases are listed in TeXmacs/tests/tmu/unicode_256.tmu
The bug lies in the TMU reader.