Text encoding detection #580

Closed
vranki opened this issue May 17, 2018 · 7 comments

Comments

@vranki
Contributor

vranki commented May 17, 2018

Traditional IRC networks don't specify which text encoding is used. This has led to a situation where some users use UTF-8 while others still use Latin-1 or other encodings. This is a pain for non-English-speaking users, because clients must detect which encoding each of the other users is using.

I suggest:

  • Add an option to automatically recode other character sets (most importantly Latin-1, others as needed) to UTF-8. This could be enabled by default on selected networks.
  • Add an option to set the outgoing character set to something other than UTF-8, per channel.

Irssi, for example, does this really well.
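
The per-channel outgoing charset from the second suggestion could look roughly like the sketch below. It is only an illustration in Python: the names `CHANNEL_CHARSETS`, `DEFAULT_CHARSET` and `encode_outgoing`, and the channel name, are hypothetical and are not part of the bridge or of Irssi.

```python
DEFAULT_CHARSET = "utf-8"

# Hypothetical per-channel overrides; the channel name is made up.
CHANNEL_CHARSETS = {"#suomi-legacy": "iso-8859-15"}


def encode_outgoing(channel: str, text: str) -> bytes:
    """Encode an outgoing message with the charset configured for its channel."""
    charset = CHANNEL_CHARSETS.get(channel.lower(), DEFAULT_CHARSET)
    # Characters the target charset cannot represent are sent as "?".
    return text.encode(charset, errors="replace")


legacy = encode_outgoing("#suomi-legacy", "päivä")  # one byte per letter
modern = encode_outgoing("#other", "päivä")         # UTF-8, two bytes per "ä"
```

Replacing unencodable characters with "?" is one possible policy; transliteration or refusing to send would be alternatives.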

@vranki
Contributor Author

vranki commented Aug 2, 2018

For the implementation there are two approaches:

  • A KISS lookup table that replaces invalid UTF-8 bytes with their UTF-8 counterparts.
  • A library such as iconv-lite that converts strings which are not valid UTF-8 to UTF-8.

I believe a single fallback (ISO-8859-15) would be enough for almost everyone, but an enhanced version could let the user define the fallback charset per channel.
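
A minimal sketch of the single-fallback idea, assuming the raw bytes of each IRC line are available before any decoding. It uses Python's built-in codecs as a stand-in for iconv-lite; the function name `decode_irc_line` is hypothetical.

```python
def decode_irc_line(raw: bytes, fallback: str = "iso-8859-15") -> str:
    """Decode raw IRC bytes as UTF-8, falling back to a legacy 8-bit charset."""
    try:
        # Strict decoding raises instead of silently inserting
        # replacement characters.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the sender used the fallback charset.
        return raw.decode(fallback, errors="replace")


from_utf8_client = decode_irc_line("päivä".encode("utf-8"))
from_legacy_client = decode_irc_line(b"p\xe4iv\xe4")
```

Both calls return the same text, "päivä". The `fallback` parameter is where a per-channel setting would plug in.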

@kaiyou
Contributor

kaiyou commented Dec 23, 2018

We implemented this using the node-irc encoding option: TeDomum@6f556eb

It is far from perfect and the heuristics sometimes backfire, but it does 99% of the job.

@vranki
Contributor Author

vranki commented Dec 23, 2018

If I understand correctly, it sets the encoding for the server connection. It does not prevent clients from using some other encoding, unless the server has some logic for that.

@kaiyou
Contributor

kaiyou commented Dec 25, 2018

The option is weirdly named in node-irc, but it does enable heuristic detection of other clients' encodings and automatic transcoding to UTF-8 (the default encoding Node.js uses when converting buffers to strings).
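
To illustrate why any byte-level heuristic can backfire, here is a sketch in Python of the simplest possible one, a UTF-8 validity check, together with two inputs it gets wrong. This is not how node-irc detects encodings; `is_valid_utf8` and both sample strings are made up for the example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Simplest possible heuristic: does the byte string decode as UTF-8?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: Latin-1 text whose bytes happen to form valid UTF-8.
# The sender typed the two characters U+00C3 U+00A4 in Latin-1.
latin1_bytes = "\u00c3\u00a4".encode("latin-1")  # b"\xc3\xa4"
misread = latin1_bytes.decode("utf-8")           # becomes a single "ä"

# Case 2: text in a different 8-bit charset than the configured fallback.
koi8_bytes = "привет".encode("koi8-r")
wrong_guess = koi8_bytes.decode("iso-8859-15")   # decodes without error, but to the wrong letters
```

The first case is contrived but shows that validity alone cannot prove the sender used UTF-8. The second shows that a fixed fallback only helps users of that one charset.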

@vranki
Contributor Author

vranki commented Apr 17, 2019

Just checked how this could be done client-side. It looks like it's not possible: all invalid bytes arrive as the same replacement character (U+FFFD, code point 65533), so it's not possible to distinguish between ä's, ö's and other problem characters.
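
The information loss described here can be reproduced in a few lines. This is a sketch in Python rather than in the client's own code: once bytes have been decoded as UTF-8 with replacement, two different Latin-1 words become identical strings.

```python
# Two different Finnish words as a Latin-1 client would send them.
tama = b"t\xe4m\xe4"  # "tämä"
tomo = b"t\xf6m\xf6"  # "tömö"

# A lenient UTF-8 decoder swaps every invalid byte for U+FFFD.
decoded_a = tama.decode("utf-8", errors="replace")
decoded_o = tomo.decode("utf-8", errors="replace")

# The original bytes are gone, so the two words can no longer be told apart.
same = decoded_a == decoded_o
```

This is why the recoding has to happen where the raw bytes are still available, before the first lossy decode.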

vranki added a commit to vranki/matrix-appservice-irc that referenced this issue Apr 26, 2020
vranki added a commit to vranki/matrix-appservice-irc that referenced this issue Apr 27, 2020
Half-Shot added a commit that referenced this issue May 6, 2020
Added ability to set fallback encoding for non-UTF-8 strings. Implements #580.
@vranki
Contributor Author

vranki commented May 16, 2020

This has now been implemented and works fine. Thanks!

@vranki vranki closed this as completed May 16, 2020
@zouppen

zouppen commented Jun 3, 2020

It has difficulties with IRC messages containing "mIRC" colour codes, which leads to a double-encoding issue (e.g. ä turns into Ã€).

See the following Matrix message ($1591180123573904vFEHU:irc.snt.utwente.nl). It contains doubly encoded UTF-8 because the message contains some colour codes.

{
  "content": {
    "body": "/// Nytsoi pÃ€ivitetty: Kaaosradio 24h Ke - Techno/Electro",
    "format": "org.matrix.custom.html",
    "formatted_body": "<font color=\"#7F0000\">///</font> Nytsoi pÃ€ivitetty: <b>Kaaosradio 24h Ke - Techno/Electro</b>",
    "msgtype": "m.text"
  },
  "event_id": "$1591180123573904vFEHU:irc.snt.utwente.nl",
  "origin_server_ts": 1591180123623,
  "sender": "@_ircnet_kaaosradio:irc.snt.utwente.nl",
  "type": "m.room.message",
  "unsigned": {
    "age": 122
  },
  "room_id": "!rPDKWxyLDIMEdLFtXF:irc.snt.utwente.nl"
}
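
The symptom can be reproduced outside the bridge with the sketch below, written in Python and assuming the fallback charset is ISO-8859-15, as suggested earlier in this thread. It only shows how already valid UTF-8 ends up garbled when the fallback is applied to it; it does not explain why colour codes trigger the fallback.

```python
original = "päivitetty"

# The sender's client already produced valid UTF-8 ...
utf8_bytes = original.encode("utf-8")

# ... but the bytes were decoded with the fallback charset anyway.
# "ä" (bytes C3 A4) becomes U+00C3 followed by U+20AC (the euro sign).
garbled = utf8_bytes.decode("iso-8859-15")

# As long as no character was dropped, the damage is reversible.
repaired = garbled.encode("iso-8859-15").decode("utf-8")
```

Here `garbled` is one character longer than `original`, and `repaired` equals `original` again.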


3 participants