Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IRC formatting codes break UTF-8 auto detection and force encodingFallback use #44

Closed
ilmari opened this issue May 18, 2020 · 11 comments
Closed

Comments

@ilmari
Copy link

ilmari commented May 18, 2020

On the IRCnet bridge, actions appear to always be interpreted as the fallback encoding (ISO-8859-15)

On the IRC side (client configured to send UTF-8)

<@ilmari> æøå
* ilmari æøå
[notice(#chan)] æøå

On the Matrix side:

ilmari (@_ircnet_ilmari:irc.snt.utwente.nl)
æøå
* ilmari (@_ircnet_ilmari:irc.snt.utwente.nl) ÊÞå
æøå

Screenshot from 2020-05-18 12-14-43

While my notice was correctly decoded as UTF-8 here, on another channel a bot's UTF-8 notices are decoded as ISO-8859-15:

IRC:

-lorelai- [1] spot boston dynamics horse mask (0:13) by Мария Прокопенко [+24/-0, 1519 views]

Matrix:

[1] spot boston dynamics horse mask (0:13) by �а�О� ��ПкПпеМкП [+24/-0, 1519 views]

Screenshot from 2020-05-18 12-18-47

@ilmari
Copy link
Author

ilmari commented May 18, 2020

On another channel where my client is configured to send ISO-8859-15, it works correctly:

IRC:

< ilmari> skjærgårdsøl?
* ilmari liker skjærgårdsøl
[notice(#otherchan)] skjærgårdsøl!

Matrix:

ilmari (@_ircnet_ilmari:irc.snt.utwente.nl)
skjærgårdsøl?
* ilmari (@_ircnet_ilmari:irc.snt.utwente.nl) liker skjærgårdsøl
skjærgårdsøl!

Screenshot from 2020-05-18 12-23-24

@ilmari
Copy link
Author

ilmari commented Jun 3, 2020

matrix-org/matrix-appservice-irc#580 (comment) made me realise that the notices that get mangled have text formatting codes in them:

{
  "content": {
    "body": "[1] Dagfinn Ilmari Mannsåker",
    "format": "org.matrix.custom.html",
    "formatted_body": "<b>[1]</b> Dagfinn Ilmari Mannsåker",
    "msgtype": "m.notice"
  },
  "event_id": "$1591183492575491NNmPv:irc.snt.utwente.nl",
  "origin_server_ts": 1591183492274,
  "sender": "@_ircnet_lorelai:irc.snt.utwente.nl",
  "type": "m.room.message",
  "unsigned": {
    "age": 3376
  },
  "room_id": "!WNiVmWxmsBkMsusLnT:irc.snt.utwente.nl"
}

@ilmari ilmari changed the title encodingFallback always used for actions and (some) notices IRC formatting codes break UTF-8 auto detection and force encodngFallback use Jun 3, 2020
@ilmari ilmari changed the title IRC formatting codes break UTF-8 auto detection and force encodngFallback use IRC formatting codes break UTF-8 auto detection and force encodingFallback use Jun 3, 2020
@ilmari
Copy link
Author

ilmari commented Jun 3, 2020

Explanation by @leonerd:

PRIVMSG contents can swap between regular text and "CTCP", client-to-client protocol, with a single \x01 byte
An action is the CTCP ACTION command, so in full it looks like :[email protected] PRIVMSG #target :\x01ACTION the action here
it probably doesn't strip CTCPs apart properly
on IRC you have to strip out CTCP -before- you apply Unicode decoding

@zouppen
Copy link

zouppen commented Jun 3, 2020

Not only used in CTCP's but the payload itself may contain codes as well in case of colours and text formatting. So, proper implementation should parse the message codes first and then recode.

@vranki
Copy link

vranki commented Jun 8, 2020

Yep, this becomes a bit complex as the CTCP messages need quite lot more tuning.

Perhaps we could just disable fallback encoding on any CTCP strings - this way at least UTF-8 actions would work as expected.

@ilmari
Copy link
Author

ilmari commented Jun 8, 2020

Perhaps we could just disable fallback encoding on any CTCP strings - this way at least UTF-8 actions would work as expected.

That still doesn't solve it for messages with colour/formatting codes.

@vranki
Copy link

vranki commented Jun 10, 2020

Discussed this today and got a new possible solution idea:

  • Before checking isUtf8() replace \x01 (and other possible formatting codes) with valid utf-8 placeholders
  • Check isUtf8() and recode if needed
  • Replace placeholders back with original codes

Implementing this should be relatively simple. Although a bit hacky, it should do the trick.

@ilmari
Copy link
Author

ilmari commented Jun 15, 2020

This turns out to be because is-utf8 mistakenly classifies strings with C0 control characters (except TAB, CR and LF) as non-UTF-8.

An alternative mentioned in the above ticket is utf-8-validate.

@vranki
Copy link

vranki commented Jun 15, 2020

Good find, thanks.

@vranki
Copy link

vranki commented Jun 17, 2020

PR #49 made.

@Half-Shot
Copy link

Believe this is fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants