Improve encoding detection heuristics #257

Open
fdelapena opened this issue Oct 5, 2018 · 0 comments

fdelapena commented Oct 5, 2018

A few games have encoding detection problems that generate a lot of bug reports. There are not many of them, but they are widely popular, and most are translations from Japanese.

Currently, liblcf detects the encoding by passing strings from the database terms, in a rather crude way, to the ICU encoding detector. This is done in ReaderUtil::DetectEncodings and works for most games. The heuristics have been improved to mitigate issues in the most popular games, however:

  • Some RPG Maker translations ship a few untranslated Japanese strings in the terms tab of the default RPG_RT.ldb.dat (the database template for new games), and quite a few games never translated them, mostly because they do not use those texts in-game. This makes the encoding heuristics fail because they guess the game is encoded in Shift_JIS.
  • Several games translated from Japanese are poorly translated and still contain several Japanese strings in the terms tab. These games do not use the default menu or battle system, or their terms are empty or just English, so very few terms (if any) are reliable.
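
A minimal sketch of why a handful of leftover strings can dominate the guess, assuming an English translation whose translated terms are pure ASCII. The term list is invented for illustration and is not taken from any real RPG_RT.ldb. The point is only that ASCII-only strings look the same in the legacy code pages involved, so the untranslated strings end up being the only evidence available.

```python
# Hypothetical database terms from a translated game (illustrative only).
terms = [
    b"Attack",
    b"Defense",
    b"Items",
    "こうげき".encode("shift_jis"),  # leftover, untranslated default term
]


def has_high_bytes(raw: bytes) -> bool:
    """Return True if the string contains any byte outside the ASCII range."""
    return any(byte >= 0x80 for byte in raw)


# Only strings with non-ASCII bytes can help tell legacy code pages apart.
informative = [raw for raw in terms if has_high_bytes(raw)]
print(len(informative), "of", len(terms), "terms carry encoding evidence")
```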

So database terms are not a reliable source of text strings to pass to the ICU ucsdet_* functions. Detection should use the existing game maps to find in-game strings instead; "show message" lines, for example, are a far more reliable source. Collecting them, however, means parsing the .lmt, picking .lmu files from it, and traversing their events. As a result, the time needed to detect the encoding may vary, especially on ports with slow file I/O, if the first .lmu picked does not contain enough events and show message command data.
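
One way to keep the extra I/O bounded could be to read maps lazily and stop as soon as the sample holds enough non-ASCII bytes. The sketch below assumes a hypothetical iterable that yields the raw "show message" lines of one map at a time. The function name, the threshold, and the map data are all made up for illustration and are not part of liblcf.

```python
from typing import Iterable, List


def collect_sample(maps: Iterable[List[bytes]], min_high_bytes: int = 64) -> bytes:
    """Gather message lines map by map until enough non-ASCII bytes are seen.

    `maps` is a hypothetical lazy iterable yielding, per map file, the raw
    "show message" lines found in its events. Because it is lazy, a map is
    only read if the earlier ones did not provide enough data.
    """
    sample: List[bytes] = []
    high_bytes = 0
    for lines in maps:
        for line in lines:
            sample.append(line)
            high_bytes += sum(1 for byte in line if byte >= 0x80)
        if high_bytes >= min_high_bytes:
            break  # stop early so slow ports do not read every map
    return b"\n".join(sample)


loaded = []


def lazy_maps():
    # Stand-in for reading .lmu files in the order given by the .lmt tree.
    for name, lines in [
        ("Map0001", [b"Hello!"]),
        ("Map0002", ["¡Adiós!".encode("cp1252")] * 4),
        ("Map0003", [b"never reached"]),
    ]:
        loaded.append(name)
        yield lines


sample = collect_sample(lazy_maps(), min_high_bytes=8)
```

The resulting byte string is what would then be handed to the detector. The threshold trades detection confidence against loading time and would need tuning on real games.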

But there's more (if this is too much, it may be worth splitting it into smaller issues):

  • There are game translations with broken filename encoding. This can happen when a non-Unicode zip of the original translation is unpacked, which produces mojibake filenames (leaving them that way avoids having to rename the references in the ldb, which is convenient for lazy game translators). It would therefore make sense to handle separate encodings for the filesystem and for ldb/lmt/lmu.

  • Additionally, it may even make sense to use yet another encoding for .lsd, because savegame data is loaded and saved using the current game encoding. Some games bundle savegames for various purposes, so a few strings could break, e.g. actor names, which can be customized in-game.
    However, .lsd may also contain strings that refer to filenames, e.g. the currently playing music or something like that, if I recall correctly. This might be related to several issues where silence plays after loading a savegame 🤕. I think hero names are not as important as filename references; alternatively, these strings from the same file could even be handled separately.
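
The two points above could be sketched roughly as follows. This assumes the archive stored Shift_JIS names that the unpacker decoded as CP437, and that file references inside a savegame should follow the database encoding while free text follows its own. Both are assumptions made for illustration. `EncodingSet`, `repair_name`, `resolve`, and `decode_save_field` are hypothetical names, not existing liblcf API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EncodingSet:
    """Hypothetical per-source encodings; not an existing liblcf structure."""
    database: str     # strings in .ldb / .lmt / .lmu
    unpacked_as: str  # code page the unpacker wrongly applied to file names
    savegame: str     # free text in .lsd, e.g. actor names


def repair_name(name_on_disk: str, enc: EncodingSet) -> str:
    """Undo a wrong decode: recover the raw bytes, then decode them properly."""
    try:
        return name_on_disk.encode(enc.unpacked_as).decode(enc.database)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name_on_disk  # round trip impossible, keep the name as found


def resolve(reference: str, names_on_disk: List[str], enc: EncodingSet) -> Optional[str]:
    """Map a database reference to a file on disk, tolerating mojibake names."""
    for name in names_on_disk:
        if name == reference or repair_name(name, enc) == reference:
            return name
    return None


def decode_save_field(raw: bytes, is_file_reference: bool, enc: EncodingSet) -> str:
    """Decode one .lsd string according to what it refers to (assumed policy)."""
    codec = enc.database if is_file_reference else enc.savegame
    return raw.decode(codec, errors="replace")


enc = EncodingSet(database="shift_jis", unpacked_as="cp437", savegame="cp1252")
original = "タイトル"
mojibake = original.encode("shift_jis").decode("cp437")  # name as seen on disk
found = resolve(original, ["Title2", mojibake], enc)
hero = decode_save_field("José".encode("cp1252"), False, enc)
music = decode_save_field("戦闘".encode("shift_jis"), True, enc)
```

Which encoding each kind of savegame string should really use is an open question in this issue; the flag-based split here only shows that the two kinds can be treated independently.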

Labels
Development
