Reduce size impact of editor and class reference translations on editor binaries #3421

Open
akien-mga opened this issue Oct 13, 2021 · 12 comments

Comments

@akien-mga
Member

akien-mga commented Oct 13, 2021

Describe the project you are working on

Godot editor localization

Describe the problem or limitation you are having in your project

We discussed this on Rocket.Chat #translation; I'm summarizing the findings here so that we can work on solving some or all of the inefficiencies in our current internationalization workflow.

We currently have two engine resources which can be localized using gettext PO files:

The current process is that we embed the PO files directly in the editor binary by generating a header with zlib compressed contents of each file: https://github.com/godotengine/godot/blob/d742dcd3ceaa614d2688caed59ec0c75d4041985/editor/editor_builders.py#L75-L122

These embedded PO files are then loaded by the editor using the PO resource loader (as if they were external .po files included with the editor binary).

This worked OK while we had only the editor translations, but now with the much bigger class reference resource, we're starting to see a big impact on binary size:

$ ls -lh editor/*_translation*
-rw-r--r-- 1 akien akien 90M Oct 12 13:11 editor/doc_translations.gen.h
-rw-r--r-- 1 akien akien 26M Oct  7 11:19 editor/editor_translations.gen.h

Once compiled and optimized, the 90M doc_translations.gen.h leads to a 16M increase of the size of the Windows editor binary:

$ ls -lh godot.*exe
-rwxr-xr-x. 1 akien akien 88M Oct 12 12:35 godot.windows.opt.tools.64.docsl10n.exe
-rwxr-xr-x. 1 akien akien 72M Oct 12 12:35 godot.windows.opt.tools.64.nodocsl10n.exe

The 26M of editor_translations.gen.h must similarly account for a handful of MBs too, didn't check.

Some problems identified:

  • We include each PO file as is in the header, and even though we compress the data with zlib, it's written as an array of bytes/integers and that seems to defeat some of the gains of compression.
  • Including all files as is means that we embed the msgids (source strings) and PO metadata (comments), and that for every single PO file. So even languages which are only 1% translated add a whopping 2.5M to editor/doc_translations.gen.h for example, as it's the size of the classes.pot file once compressed and written as a byte array.
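To illustrate the first point, here is a minimal Python sketch comparing the size of zlib-compressed data with the size of the same data written out as header text. The one-value-per-line formatting (tab, decimal value, comma, newline) and the sample catalog are assumptions made for illustration, not a copy of what editor/editor_builders.py does. Note that the inflation applies to the generated header's text; a compiler normally stores such an array as raw bytes.

```python
import zlib


def as_c_array_lines(data: bytes) -> str:
    """Render bytes as C array body text: tab, decimal value, comma, newline."""
    return "".join(f"\t{byte},\n" for byte in data)


# Hypothetical stand-in for a PO file; real catalogs are far larger.
po_like = "".join(
    f'#: editor/file_{i}.cpp:{i}\nmsgid "String {i}"\nmsgstr "Chaîne {i}"\n\n'
    for i in range(5000)
).encode("utf-8")

compressed = zlib.compress(po_like, 9)
header_text = as_c_array_lines(compressed)

# Every byte costs 3 fixed characters plus 1 to 3 decimal digits of text.
ratio = len(header_text) / len(compressed)
print(
    f"raw: {len(po_like)} B, zlib: {len(compressed)} B, "
    f"header text: {len(header_text)} B, text/payload ratio: {ratio:.2f}"
)
```

Under this assumed formatting, the text is between 4 and 6 times larger than the payload it encodes, so a header can be bigger than the original PO file even when compression itself works well.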

Describe the feature / enhancement and how it helps to overcome the problem or limitation

I don't have a full-fledged proposal to change this yet, finding one will be the aim of this proposal.

There are a few low-hanging fruits we can work on though:

  • Add a filter in editor/SCsub to only include translations with a high enough completion ratio. This could be done using gettext to get a percentage, but that would add a dependency on our buildsystem, so the simplest is probably to just hardcode a list based on the completion ratios on Weblate. This will significantly reduce the cost of embedding near empty translations (that we need to keep in the repo for Weblate itself so that they can be worked on by translators).

  • Strip comments (aside from #, fuzzy as we need it in the PO loader to skip fuzzy msgstrs) in editor/editor_builders.py before compressing the file and writing to the header. This should save a significant amount of bytes as there are a lot of comments indicating the provenance of strings in the source code.

  • Check if the compressed file contents could be saved in a more size-optimized format than an endless array of integers. It's weird that zlib compressed data ends up taking more space than the original files.
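The comment-stripping idea from the list above could look roughly like the following sketch. The function name `strip_po_comments` and the sample catalog are hypothetical; the only behavior taken from the proposal is that every comment goes except the flag lines that mark an entry as fuzzy.

```python
def strip_po_comments(po_text: str) -> str:
    """Drop PO comment lines, keeping only flag lines that mention 'fuzzy'."""
    kept = []
    for line in po_text.splitlines():
        is_comment = line.startswith("#")
        # Flag lines look like "#, fuzzy" or "#, fuzzy, c-format".
        is_fuzzy_flag = line.startswith("#,") and "fuzzy" in line
        if is_comment and not is_fuzzy_flag:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical sample catalog.
sample = """# Translator comment
#: editor/foo.cpp:12
msgid "Open"
msgstr "Ouvrir"

#: editor/bar.cpp:34
#, fuzzy
msgid "Close"
msgstr "Fermer"
"""

stripped = strip_po_comments(sample)
print(stripped)
print(f"{len(sample)} -> {len(stripped)} characters")
```

Obsolete entries (lines starting with `#~`) are dropped as well, since they are also comments. Whether the PO loader relies on any other comment type is not checked here.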

Later on, we might want to rethink how we handle those translations so that we don't need to bundle the whole PO files (which is a source format, not meant for direct consumption, though that's the workflow we use also for game translations).

We could write a Python parser that converts the gettext PO file contents to data structures that Godot can consume readily. So instead of using the PO resource loader, we could write a new Translation format directly with the contents optimized for minimal size usage (e.g. in a Map<StringName msgid, Map<StringName lang, StringName msgstr>>, so we "pay" only once for msgids, and not for each language). @reduz also suggested using a md5 hash of the source strings as keys.
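As a rough illustration of that layout, the sketch below merges per-language catalogs into a single table keyed by msgid, then swaps the keys for MD5 digests. It uses plain Python dicts as a stand-in for the `Map<StringName, ...>` structure; `build_shared_table`, `hashed_key`, and the sample data are hypothetical names chosen for this example.

```python
import hashlib


def build_shared_table(catalogs):
    """Merge per-language {msgid: msgstr} dicts into {msgid: {lang: msgstr}}."""
    table = {}
    for lang, entries in catalogs.items():
        for msgid, msgstr in entries.items():
            if msgstr:  # untranslated strings are simply left out
                table.setdefault(msgid, {})[lang] = msgstr
    return table


def hashed_key(msgid: str) -> bytes:
    """Fixed-size 16-byte key derived from the source string."""
    return hashlib.md5(msgid.encode("utf-8")).digest()


# Hypothetical sample data: "Close" is not translated to German yet.
catalogs = {
    "fr": {"Open": "Ouvrir", "Close": "Fermer"},
    "de": {"Open": "Öffnen", "Close": ""},
}

table = build_shared_table(catalogs)
hashed_table = {hashed_key(msgid): langs for msgid, langs in table.items()}

print(table)
print({key.hex(): langs for key, langs in hashed_table.items()})
```

With this shape, each msgid is stored once regardless of the number of languages, and a nearly empty translation only costs the few strings it actually contains. Hashed keys make the per-entry key size constant; a real implementation would also need to decide how to handle hash collisions.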

To be discussed further, but I'd first start by doing some of the low hanging fruits above so we see how much size is left taken by translations. A few MBs is a small price to pay for internationalization, but if it's 30% of the binary it starts being more problematic.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

To be determined based on discussion.

If this enhancement will not be used often, can it be worked around with a few lines of script?

No, it's about optimizing the editor binary size.

Is there a reason why this should be core and not an add-on in the asset library?

See above.

@akien-mga
Member Author

akien-mga commented Oct 13, 2021

Some numbers on the actual impact of each resource on binary size. Builds made with scons p=x11 tools=yes target=release_debug production=yes on latest 3.x branch (4186c5e75).

Editor l10n only

(All files in doc/translations/ removed.)

| Test \ Object | Editor binary | Generated header | Binary size diff |
|---|---|---|---|
| No l10n | 69.39 MiB | 304 B | n/a |
| msgids only [1] | 69.44 MiB | 242 KiB | 44 KiB |
| zh_CN translation only [2] | 69.49 MiB | 546 KiB | 96 KiB |
| All translations (68) | 74.01 MiB | 25.7 MiB | 4.62 MiB |

Classref l10n only

(All files in editor/translations/ removed.)

| Test \ Object | Editor binary | Generated header | Binary size diff |
|---|---|---|---|
| No l10n | 69.39 MiB | 289 B | n/a |
| msgids only [1] | 69.87 MiB | 2.65 MiB | 473 KiB |
| zh_CN translation only [2] | 70.32 MiB | 5.13 MiB | 922 KiB |
| All translations (23) | 85.44 MiB | 89.38 MiB | 16.04 MiB |

All included

| Editor binary | Generated headers | Binary size diff |
|---|---|---|
| 90.06 MiB | 115.13 MiB | 20.66 MiB |

Footnotes

  1. Tested by removing all .po files, then copying the .pot file as fr.po (i.e. we include one translation with no actual msgstrs, only the msgids). That tells us the base cost of any added translation file, even empty (msgids + comments and other non-content PO formatting).

  2. Taking zh_CN as test point as it's 100% complete for the editor and around 70% complete for the classref. Encoding Chinese glyphs as bytes might also take significantly more space than equivalent English strings. Note that this includes the size of msgids + the translated msgstrs.

@bruvzg
Member

bruvzg commented Oct 13, 2021

Maybe we should remove all translations from the editor binary, distribute them as separate "language packs", and automatically download the required pack when the language is selected or on first start.

In this case, we can also move all extra editor fonts (about 7 MB) to the language packs as well (and use better CJK fonts, with better coverage and regional variants). Currently, we use DroidSansFallback + DroidSansJapanese (4 MB). Language packs could use NotoSansCJK-sc/tc/kr/jp (4 x 16 MB) instead, which would be more consistent with the rest of the editor fonts.

@reduz
Member

reduz commented Oct 13, 2021

I don't really think it's a problem; the editor gets bigger, but we could probably improve the translation compression code. One obvious thing that comes to mind is using an MD5 or even a 64-bit hash rather than the English key.

@akien-mga
Member Author

In this case, we can also move all extra editor fonts (about 7 MB) to the language packs as well (and use better CJK fonts, with better coverage and regional variants).

That could be interesting, but it would need significant changes to the way we package and distribute releases, and ensure that the process for fetching new translations and fonts is seamless (which is not trivial if e.g. fonts need to be imported in the project to be usable in the editor).

It's also worth noting that translations and fonts for a given language don't necessarily go together. You do need the fonts for e.g. Arabic to use the Arabic editor translations, but you do not need the translations for all the languages that you want to support for game i18n (i.e. you may develop a game using the editor in pt-BR while you want to support game localization to Arabic and CJK: you don't need their editor translations, only their fonts).

@Calinou
Member

Calinou commented Oct 18, 2021

As for reducing editor binary size due to font embedding, loading system fonts is worth investigating too: #306

This will result in a slightly different appearance across platforms, but I think this is acceptable for non-Latin languages.

@Feniks-Gaming

Editor size isn't a huge problem to me, but if we can figure out a way to keep it as small as possible, that is obviously a bonus. A 20% increase in editor size due to a feature that the majority of users won't need, and of which those who do need it will only use a tiny fraction of the available translations anyway, stands in big contrast to how Godot normally avoids bloat.

Is 90 MB an acceptable size? Yes, it is, but 70 MB is obviously a better size. So if I were to vote, I would vote for something similar to the way we handle Godot exports: you don't have all exports available until you need them, and then you just download the one you need, which is a nice solution. And exporting to mobile is likely used by more people than the German translation, for example.

@akien-mga
Member Author

akien-mga commented Oct 20, 2021

  • Add a filter in editor/SCsub to only include translations with a high enough completion ratio. This could be done using gettext to get a percentage, but that would add a dependency on our buildsystem, so the simplest is probably to just hardcode a list based on the completion ratios on Weblate. This will significantly reduce the cost of embedding near empty translations (that we need to keep in the repo for Weblate itself so that they can be worked on by translators).

Started with this, which brings the size down to 75.12 MiB for the same build conditions as #3421 (comment), i.e. a relative increase of 1.11 MiB compared to pre-classref translation builds. That's a total of 5.73 MiB used for translations in the binary (compared to 30 MiB before).

akien-mga added a commit to akien-mga/godot that referenced this issue Oct 20, 2021
This reduces the size of the editor binaries significantly, as we otherwise
embed all WIP translations, including ones with very low completion ratios,
and end up paying for the size of all `msgid`s for each locale.

Cf. godotengine/godot-proposals#3421 for details.

The thresholds used are:
- 30% for the editor interface (should already include most common strings
  while more obscure ones like UndoRedo action names might be untranslated).
- 10% for the class reference: this is a HUGE resource and 10% is already
  a lot of useful content, especially if focused on the most used APIs.

For 3.x, we also exclude languages that require complex text layout support
to be displayed properly.

This currently reduces the size of the editor binary by 17% on Linux.

The list will be synced manually every now and then.
akien-mga added a commit to akien-mga/godot that referenced this issue Oct 20, 2021
This reduces the size of the editor binaries significantly, as we otherwise
embed all WIP translations, including ones with very low completion ratios,
and end up paying for the size of all `msgid`s for each locale.

Cf. godotengine/godot-proposals#3421 for details.

The thresholds used are:
- 30% for the editor interface (should already include most common strings
  while more obscure ones like UndoRedo action names might be untranslated).
- 10% for the class reference: this is a HUGE resource and 10% is already
  a lot of useful content, especially if focused on the most used APIs.

This currently reduces the size of the editor binary by 17% on Linux.

The list will be synced manually every now and then.

(cherry picked from commit 8425c58)
@bruvzg
Member

bruvzg commented Oct 21, 2021

which is not trivial if e.g. fonts need to be imported in the project to be usable in the editor.

There's no need to add fonts to the project or import them; using the same global config folder as export templates do is probably a better way.

It's also worth noting that translations and fonts for a given language don't necessarily go together.

Fonts and translations can be separate downloads ("download full language pack" or "download only fonts" options). It also gives users more control over fonts in general (unlike the custom font editor setting, it allows more than one file in the font stack, e.g. adding emoji fonts).

I'm currently experimenting with moving stuff out of the binary in this branch: https://github.com/bruvzg/godot/tree/lang_packs_poc. Right now it's done for the fonts and translations (no installation/download UI), and it seems to work fine.

(Screenshot, 2021-10-21 at 13:06:37)

@Riteo

Riteo commented Oct 24, 2021

@bruvzg's idea sounds interesting, although it risks becoming way too complicated IMO. One nice thing about Godot is that it's pretty much self-contained, with the exception of release templates (which you technically don't need).

Anyways, modular language packs or not, @reduz' idea seems like the next logical step to do IMO.

@Calinou
Member

Calinou commented Nov 3, 2021

We could remove the Pirate locale to save some space in the editor binary. I know it's no fun, but it's one way to further reduce the impact on binary size 🙂

@bruvzg
Member

bruvzg commented Mar 18, 2022

Some other potential ways to decrease editor size:

@akien-mga
Member Author

akien-mga commented Mar 18, 2022

We could remove the Pirate locale to save some space in the editor binary. I know it's no fun, but it's one way to further reduce the impact on binary size

It's already excluded since godotengine/godot#54020 (same with all translations that don't go over the threshold for inclusion).
