Improve encoding detection heuristics #257

fdelapena · 2018-10-05T01:49:10Z

Some games have encoding detection issues that generate a lot of bug reports. Only a few games are affected, but they are widely popular, and most of them are translations from Japanese.

Currently, liblcf detects the encoding by passing strings from the database terms, in a rather crude way, to the ICU encoding detector. This is done in ReaderUtil::DetectEncodings and works for most games. The heuristics have already been improved to mitigate issues with the most popular games, however:

- Some RPG Maker translations contain a few untranslated Japanese strings in the terms tab of the default RPG_RT.ldb.dat (the database template for new games), and quite a few games never translated them, mostly because they don't use those texts in-game. This makes the encoding heuristics fail, because they guess the game is encoded in Shift_JIS.
- Several games translated from Japanese are poorly translated and still contain several Japanese strings in the terms tab. These games don't use the default menu or battle system, or they leave the terms empty or only use English there, so very few terms (if any) are reliable.

So database terms are not a reliable source of text strings to pass to the ICU ucsdet_* functions. Detection should instead use the existing game maps to find in-game strings; "show message" lines, for example, are a far more reliable source. Getting them requires traversing the maps, though, which involves parsing the .lmt and picking .lmu files from it. This means the loading time for encoding detection may vary, especially on ports with slow file I/O, if the first .lmu picked does not contain enough events and show message commands.
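To make the I/O concern concrete, here is a minimal sketch of bounded sampling, assuming maps can be loaded one at a time through a callback. `MapMessages`, `DetectionSample`, `CollectSample`, and both limits are hypothetical names made up for illustration; none of them are liblcf API. The collected bytes would then be handed to the detector in place of the database terms.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one parsed .lmu file: the raw, undecoded bytes of
// every "show message" line found in its events. Not a liblcf type.
struct MapMessages {
    std::vector<std::string> lines;
};

struct DetectionSample {
    std::string text;          // raw message bytes, one line per message
    std::size_t maps_read = 0; // number of maps that had to be loaded
};

// Loads maps one at a time, in index order, and stops as soon as min_bytes of
// text were gathered or max_maps maps were read, so file I/O stays bounded.
DetectionSample CollectSample(
        std::size_t map_count,
        const std::function<MapMessages(std::size_t)>& load_map,
        std::size_t min_bytes,
        std::size_t max_maps) {
    assert(load_map && "a map loader callback is required");
    DetectionSample sample;
    for (std::size_t i = 0; i < map_count; ++i) {
        if (sample.maps_read >= max_maps || sample.text.size() >= min_bytes) {
            break;
        }
        const MapMessages map = load_map(i);
        ++sample.maps_read;
        for (const std::string& line : map.lines) {
            sample.text += line;
            sample.text += '\n';
        }
    }
    return sample;
}
```

The map cap keeps the worst case predictable on slow ports, at the cost of a weaker sample when the first maps contain few messages; both limits would need tuning against real games.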

But there's more (if this is too much, it may be worth splitting it into smaller issues):

- Some game translations have broken filename encoding. This can happen when a non-Unicode zip of the original translation is unpacked, producing mojibake filenames (though this avoids having to rename references in the .ldb, which is convenient for lazy game translators). So it would make sense to handle separate encodings for the filesystem and for .ldb/.lmt/.lmu.
- Additionally, it may even make sense to use yet another encoding for .lsd, because save data is loaded and saved using the current game encoding. Some games bundle savegames for various purposes, so a few strings could break, e.g. actor names, which can be customized in-game.
- However, .lsd may also contain strings that refer to filenames, e.g. the currently playing music or something like that, if I recall correctly, which might be related to several issues where silence plays on savegame load 🤕. I think hero names are not as important as filename references; another option is to handle these strings separately even though they come from the same file.
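One possible shape for this, as a rough sketch only: keep one encoding name per data source and fall back to the game encoding where nothing more specific is known. `TextSource`, `EncodingSet`, and `EncodingFor` are hypothetical names for illustration, not existing liblcf types, and the fallback rule is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical: the places a string can come from.
enum class TextSource {
    GameData,   // .ldb / .lmt / .lmu
    Filesystem, // names of files on disk
    Savegame    // .lsd
};

// Hypothetical: one encoding name per source. An empty override means
// "same as the game data encoding".
struct EncodingSet {
    std::string game;
    std::string filesystem;
    std::string savegame;
};

// Returns the encoding to use for a source, falling back to the game
// encoding when no override is set.
const std::string& EncodingFor(const EncodingSet& set, TextSource source) {
    assert(!set.game.empty() && "the game data encoding must be known");
    if (source == TextSource::Filesystem && !set.filesystem.empty()) {
        return set.filesystem;
    }
    if (source == TextSource::Savegame && !set.savegame.empty()) {
        return set.savegame;
    }
    return set.game;
}
```

Filename references stored inside .lsd are deliberately left out here; they could get their own enumerator if they turn out to need different handling from the other savegame strings.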

fdelapena added the Encoding label Oct 5, 2018
