IRC formatting codes break UTF-8 auto detection and force encodingFallback use #44

ilmari · 2020-05-18T11:19:47Z

On the IRCnet bridge, actions appear to always be decoded using the fallback encoding (ISO-8859-15).

On the IRC side (client configured to send UTF-8):

<@ilmari> æøå
* ilmari æøå
[notice(#chan)] æøå

On the Matrix side:

ilmari (@_ircnet_ilmari:irc.snt.utwente.nl)
æøå
* ilmari (@_ircnet_ilmari:irc.snt.utwente.nl) ÃŠÃžÃ¥
æøå

While my notice was correctly decoded as UTF-8 here, on another channel a bot's UTF-8 notices are decoded as ISO-8859-15:

IRC:

-lorelai- [1] spot boston dynamics horse mask (0:13) by Мария Прокопенко [+24/-0, 1519 views]

Matrix:

[1] spot boston dynamics horse mask (0:13) by Ð�Ð°Ñ�ÐžÑ� Ð�Ñ�ÐŸÐºÐŸÐ¿ÐµÐœÐºÐŸ [+24/-0, 1519 views]
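For illustration, the garbling of the action above can be reproduced by decoding UTF-8 bytes with the fallback charset. This is a minimal Python sketch, not the bridge's code; the variable names are made up for the example.

```python
# Illustration only: decode the same bytes two ways.
text = "æøå"
wire_bytes = text.encode("utf-8")          # what a UTF-8 IRC client sends

correct = wire_bytes.decode("utf-8")       # the intended decoding
garbled = wire_bytes.decode("iso8859_15")  # the fallback decoding

print(correct)
print(garbled)
```

Each non-ASCII letter is two bytes in UTF-8, and ISO-8859-15 maps every byte to its own character, so each letter comes out as two unrelated characters.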


ilmari · 2020-05-18T11:24:07Z

On another channel where my client is configured to send ISO-8859-15, it works correctly:

IRC:

< ilmari> skjærgårdsøl?
* ilmari liker skjærgårdsøl
[notice(#otherchan)] skjærgårdsøl!

Matrix:

ilmari (@_ircnet_ilmari:irc.snt.utwente.nl)
skjærgårdsøl?
* ilmari (@_ircnet_ilmari:irc.snt.utwente.nl) liker skjærgårdsøl
skjærgårdsøl!

ilmari · 2020-06-03T11:28:26Z

matrix-org/matrix-appservice-irc#580 (comment) made me realise that the notices that get mangled have text formatting codes in them:

{
  "content": {
    "body": "[1] Dagfinn Ilmari MannsÃ¥ker",
    "format": "org.matrix.custom.html",
    "formatted_body": "<b>[1]</b> Dagfinn Ilmari MannsÃ¥ker",
    "msgtype": "m.notice"
  },
  "event_id": "$1591183492575491NNmPv:irc.snt.utwente.nl",
  "origin_server_ts": 1591183492274,
  "sender": "@_ircnet_lorelai:irc.snt.utwente.nl",
  "type": "m.room.message",
  "unsigned": {
    "age": 3376
  },
  "room_id": "!WNiVmWxmsBkMsusLnT:irc.snt.utwente.nl"
}

ilmari · 2020-06-03T12:04:28Z

Explanation by @leonerd:

PRIVMSG contents can swap between regular text and "CTCP", client-to-client protocol, with a single \x01 byte
An action is the CTCP ACTION command, so in full it looks like :[email protected] PRIVMSG #target :\x01ACTION the action here
it probably doesn't strip CTCPs apart properly
on IRC you have to strip out CTCP -before- you apply Unicode decoding
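A rough Python sketch of "strip CTCP before decoding", assuming the simplest case where the whole payload is a single CTCP message. The helper names (`split_ctcp`, `decode_text`) are hypothetical and do not come from the bridge; CTCP segments mixed into regular text are not handled.

```python
CTCP_DELIM = b"\x01"

def split_ctcp(payload: bytes):
    """Return (command, text_bytes); command is None for a plain message."""
    if not payload.startswith(CTCP_DELIM):
        return None, payload
    inner = payload[1:]
    if inner.endswith(CTCP_DELIM):
        inner = inner[:-1]              # drop the closing delimiter if present
    command, _, rest = inner.partition(b" ")
    return command.decode("ascii", errors="replace"), rest

def decode_text(raw: bytes, fallback: str = "iso8859_15") -> str:
    """Try UTF-8 first and use the fallback charset only if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

# The CTCP framing is removed on the byte level, then only the text is decoded.
payload = b"\x01ACTION " + "æøå".encode("utf-8") + b"\x01"
command, raw_text = split_ctcp(payload)
action_text = decode_text(raw_text)
print(command, action_text)
```

Because the \x01 bytes never reach the charset detection, a UTF-8 action and an ISO-8859-15 action are each decoded with the right charset.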

zouppen · 2020-06-03T12:50:33Z

Control codes are not only used in CTCPs; the payload itself may also contain them for colours and text formatting. So a proper implementation should parse the message codes first and then recode.
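One way to read "parse the codes first, then recode" is to split the raw bytes at the formatting codes and decode only the text runs between them. The Python sketch below assumes the usual mIRC-style code bytes; the function names are hypothetical, and it picks one charset per message.

```python
import re

# Bold, colour, hex colour, reset, monospace, reverse, italics,
# strikethrough, underline (mIRC-style formatting bytes).
FORMAT_CODES = re.compile(rb"([\x02\x03\x04\x0f\x11\x16\x1d\x1e\x1f])")

def pick_codec(runs, fallback: str) -> str:
    """Use UTF-8 only if every text run is valid UTF-8."""
    try:
        for run in runs:
            run.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback

def decode_with_codes(payload: bytes, fallback: str = "iso8859_15") -> str:
    # With one capture group, split() alternates text runs (even indexes)
    # and the control bytes themselves (odd indexes).
    parts = FORMAT_CODES.split(payload)
    codec = pick_codec(parts[0::2], fallback)
    return "".join(
        part.decode("ascii") if i % 2 else part.decode(codec)
        for i, part in enumerate(parts)
    )

notice = b"\x02[1]\x02 Dagfinn Ilmari Manns\xc3\xa5ker"
print(repr(decode_with_codes(notice)))
```

Colour arguments such as the digits after \x03 are plain ASCII, so they can stay inside the text runs without affecting the charset decision.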

vranki · 2020-06-08T20:07:17Z

Yep, this becomes a bit complex, as the CTCP messages need quite a lot more tuning.

Perhaps we could just disable fallback encoding on any CTCP strings - this way at least UTF-8 actions would work as expected.

ilmari · 2020-06-08T20:13:24Z

> Perhaps we could just disable fallback encoding on any CTCP strings - this way at least UTF-8 actions would work as expected.

That still doesn't solve it for messages with colour/formatting codes.

vranki · 2020-06-10T19:57:25Z

Discussed this today and got a new possible solution idea:

1. Before checking isUtf8(), replace \x01 (and other possible formatting codes) with valid UTF-8 placeholders.
2. Check isUtf8() and recode if needed.
3. Replace the placeholders with the original codes.

Implementing this should be relatively simple. Although a bit hacky, it should do the trick.
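The three steps above might look roughly like this in Python. Everything here is an assumption made for the sketch: `is_utf8` is a hypothetical stand-in for a detector that rejects control bytes, and the placeholders are private-use code points, which would collide with a message that already contains them.

```python
FORMAT_BYTES = b"\x01\x02\x03\x04\x0f\x11\x16\x1d\x1e\x1f"

def is_utf8(raw: bytes) -> bool:
    """Hypothetical detector: rejects control bytes other than TAB, LF, CR."""
    if any(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in raw):
        return False
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def placeholder(code: int) -> bytes:
    """Valid UTF-8 bytes (a private-use code point) standing in for a code."""
    return chr(0xE000 + code).encode("utf-8")

def decode_with_placeholders(payload: bytes, fallback: str = "iso8859_15") -> str:
    # Step 1: swap each control byte for its placeholder.
    masked = payload
    for code in FORMAT_BYTES:
        masked = masked.replace(bytes([code]), placeholder(code))
    # Step 2: detect the charset on the masked bytes and decode.
    codec = "utf-8" if is_utf8(masked) else fallback
    text = masked.decode(codec)
    # Step 3: put the original control codes back. The placeholder is
    # decoded with the same codec so it matches what ended up in the text.
    for code in FORMAT_BYTES:
        text = text.replace(placeholder(code).decode(codec), chr(code))
    return text

action = b"\x01ACTION \xc3\xa6\xc3\xb8\xc3\xa5\x01"
print(is_utf8(action))
print(repr(decode_with_placeholders(action)))
```

The detector alone rejects the action because of the \x01 bytes, while the masked copy passes and the text is decoded as UTF-8.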

ilmari · 2020-06-15T10:21:29Z

This turns out to be because is-utf8 mistakenly classifies strings with C0 control characters (except TAB, CR and LF) as non-UTF-8.

An alternative mentioned in the above ticket is utf-8-validate.
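To make the difference concrete, here is a small Python sketch comparing a check that rejects C0 control characters with a strict UTF-8 validity check. It only models the behaviour described above and does not show the API of is-utf8 or utf-8-validate; both function names are hypothetical.

```python
def naive_is_utf8(raw: bytes) -> bool:
    """Models the described behaviour: C0 controls (except TAB, LF, CR) fail."""
    if any(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in raw):
        return False
    return strict_is_utf8(raw)

def strict_is_utf8(raw: bytes) -> bool:
    """Checks only whether the bytes form well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A notice with bold codes (\x02) around "[1]", as in the event above.
sample = b"\x02[1]\x02 Dagfinn Ilmari Manns\xc3\xa5ker"
print(naive_is_utf8(sample), strict_is_utf8(sample))
```

Control characters are ordinary one-byte UTF-8 sequences, so a strict validator accepts the formatted notice and the fallback charset is never tried.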

vranki · 2020-06-15T10:34:41Z

Good find, thanks.

vranki · 2020-06-17T19:51:32Z

PR #49 made.

Half-Shot · 2020-08-10T12:34:55Z

Believe this is fixed now.

ilmari changed the title ~~encodingFallback always used for actions and (some) notices~~ IRC formatting codes break UTF-8 auto detection and force encodngFallback use Jun 3, 2020

ilmari changed the title ~~IRC formatting codes break UTF-8 auto detection and force encodngFallback use~~ IRC formatting codes break UTF-8 auto detection and force encodingFallback use Jun 3, 2020

Half-Shot closed this as completed Aug 10, 2020
