Future Direction for i18n of Web Applications #50

Closed
turquoiseowl opened this issue Apr 6, 2013 · 12 comments

@turquoiseowl
Owner

No doubt we have all come to this i18n project looking for a better way to internationalize our web applications. We see that doing the old .NET resource look-up is backward. I expect we also see that leveraging the PO infrastructure for getting messages translated is the way forward.

Unfortunately, the PO infrastructure (i.e. the GNU Portable Object file format specification and the world of tools for translating the files) is very much tethered to GetText, the latter being very backward IMO.

Whoever invented GetText had a brain-wave: we can encode strings in our source code in such a way that A) they can be hooked at run-time, and B) we can find and extract those strings from the source code. A very nice duality! So he or she wrote a library of functions that can look-up and swap message strings, and a tool for scanning source code files for those function calls and extracting message strings to be translated. It therefore assumes that all your message strings are contained in source code files which it can parse, and that they can be encoded as an argument to a function call e.g. _("Translate me!");

For someone facing the problem of how to internationalize a GUI app written in C, GetText is a good approach. For a back-end server program (like a web application), I suggest it is also reasonable, but not the best. With a back-end application, we have access to the output stream, and with a web application it is very easy to get at the HTTP response and do our translations there.

Now, as soon as we drop one side of the duality, one might start to wonder about the other side.

The question is, why bother with all those _() functions when we only need them to mark the message strings (given that we can hook into the HTTP response body). The reason, of course, is that we still need to mark the message strings so they can be extracted into the PO file. Okay, but if we were going to choose a method for marking message strings for extraction, unhindered by any considerations other than it needs to be reliable, would we choose prefixing the string with _(" and suffixing with ")?

There must be a better way to mark message strings, so that they can be easily picked up in source code and the HTTP response. The same algorithm can be used for both. Better still would be compatibility with SQL LIKE so that they can be extracted from database tables too e.g. product descriptions.
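As a rough illustration of the database case, here is a minimal sketch in Python with SQLite (not the project's own .NET stack). It assumes the `[[[...]]]` markers mentioned further down; the `products` table and its rows are hypothetical. In SQLite's LIKE only `%` and `_` are wildcards, so the brackets match literally. Other engines may differ: SQL Server, for one, treats `[` specially in LIKE and would need the marker escaped.

```python
import sqlite3

# Hypothetical table and rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO products (description) VALUES (?)",
    [
        ("[[[Hand-made oak table]]]",),
        ("SKU-1234",),
        ("Price: [[[On request]]]",),
    ],
)


def rows_with_marked_strings(connection):
    """Return the ids of rows whose description contains a marked message."""
    cursor = connection.execute(
        "SELECT id FROM products WHERE description LIKE ? ORDER BY id",
        ("%[[[%]]]%",),
    )
    return [row[0] for row in cursor]


marked_ids = rows_with_marked_strings(conn)
```

The LIKE filter only narrows down candidate rows; the messages themselves would still be pulled out of each row with the same parser used for source files.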

The marking can be done in the string itself, so message strings can be written straight into source files without the need to call any helper functions. Very useful for const strings such as C# attributes and data annotations. They would be entirely language independent: C#, Razor, JavaScript, HTML. They can also be written straight into database fields. No need to think "how do I access that helper function?"

Performing the translations at the HTTP response layer has the advantage of confining message look-up and patching to a single place, hence efficiency gains. It reduces dependency on any particular web development platform; we can forget about MVC and drop down the stack to ASP.NET (or even lower).

So where are we with this? With Issue #37 I have taken a stab at defining a suitable message marking syntax, called the Nugget syntax. There will be scope for improvement on the syntax I have no doubt (and a better name). It would be great to have a discussion with you guys on this. I'm sure we can come up with a syntax that is easy to remember and use, and yet robust. Support for string formatting is essential (i.e. {0} substitution), and pluralization would be nice.

With the marking syntax defined, the only outstanding work is to swap out (or augment) the GetText-dependent post-build task with new logic for extracting the marked message strings and adding them to the PO output. My preference here would be to drop GetText altogether (along with the _() calls), but that would mean dropping backward compatibility for projects.

The v2.0 branch includes all the other support necessary for post-processing the HTTP response. At the moment it has support for processing the Nugget marking syntax, and changing that to support any new syntax would be trivial.

We then get to keep the best bits of the GNU translation project:

  • PO message file format
  • PO editor tools including collaborative ones

It has been a few months now that I have been developing a web app using i18n v2.0 branch, where there is the option to encode a message string as either _("Translate Me") or "[[[Translate Me]]]". Given the latter takes no extra thought other than including the [[[ and ]]] it wins every time.
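To make the idea concrete, here is a minimal sketch of post-processing a response body, written in Python for brevity rather than in .NET. The regular expression, the `translate_response` name and the catalogue contents are assumptions for illustration, not the code in the v2.0 branch.

```python
import re

# Matches [[[message]]]; DOTALL lets a message span several lines.
NUGGET = re.compile(r"\[\[\[(.+?)\]\]\]", re.DOTALL)


def translate_response(body, catalog):
    """Replace each nugget with its translation, or with the bare message."""
    def swap(match):
        message = match.group(1)
        # Untranslated messages are emitted as written, minus the markers.
        return catalog.get(message, message)
    return NUGGET.sub(swap, body)


catalog_de = {"Translate Me": "Übersetze mich"}  # hypothetical catalogue
html = "<h1>[[[Translate Me]]]</h1><p>[[[Not in catalogue]]]</p>"
translated = translate_response(html, catalog_de)
```

Because the look-up happens on the finished output, it makes no difference whether a nugget came from C#, Razor, JavaScript or a database field.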

Martin Connell

@danielcrenna
Collaborator

Very interesting. You say you would like to remove GetText; I tend to agree, from the standpoint of it requiring yet another dependency with cross-platform concerns. Instead, this is much more suitable as a standard with platform-specific implementations or, in our case, Mono/CIL.

Is your proposal to replace it with a new parser that outputs nuggets into PO format? I would be in favor of this as I am biased to the localization happening in the HTTP pipeline and irrespective of tier. _("") supports neither.

@turquoiseowl
Owner Author

Yes, drop GetText and replace with our own parser to extract nuggets from sources in a Visual Studio project and/or folder branch. This could be called GetMessages to keep with PO terminology, or GetNuggets :)

I'm not so sure about dropping msgmerge, however. This seems to live more in the PO world and does a good job as far as I can see, though of course perfectly includable in GetMessages.

Back to GetText: the current regex in the v2.0 branch for post-processing the [[[message]]] format nuggets in the HTTP response is reliable for starters (though there may be issues with strings spread over separate lines etc.) It is then a case of parsing the source files line-by-line, building a dictionary Keyed by each nugget and storing the location as the Value, and finally outputting the dictionary to a PO format. That format is potentially quite complex but for immediate needs is simple enough.
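The extraction step described above could look roughly like the following sketch (Python for brevity; the function names and sample sources are hypothetical). It scans line by line, so it shares the limitation already noted for nuggets spread over several lines. It writes only the simplest PO entries: a `#:` reference comment per location, the `msgid`, and an empty `msgstr`.

```python
import re

NUGGET = re.compile(r"\[\[\[(.+?)\]\]\]")


def collect_nuggets(files):
    """Map each message to the list of 'path:line' places where it occurs.

    `files` is a dict of {path: text}; a real tool would read from disk.
    """
    found = {}
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in NUGGET.finditer(line):
                found.setdefault(match.group(1), []).append(f"{path}:{line_no}")
    return found


def to_po(found):
    """Emit minimal PO entries: reference comments, msgid, empty msgstr."""
    entries = []
    for msgid, locations in found.items():
        escaped = msgid.replace("\\", "\\\\").replace('"', '\\"')
        refs = "\n".join(f"#: {loc}" for loc in locations)
        entries.append(f'{refs}\nmsgid "{escaped}"\nmsgstr ""\n')
    return "\n".join(entries)


sources = {  # hypothetical project files
    "Views/Home.cshtml": "<h1>[[[Welcome]]]</h1>\n<p>[[[Sign in]]]</p>",
    "Models/User.cs": '[Display(Name = "[[[Sign in]]]")]',
}
po_text = to_po(collect_nuggets(sources))
```

A message used in several places ends up as one entry with several `#:` lines, which is what keying the dictionary by nugget buys us.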

My plan at the moment is to write this parser when my current web project needs to go international. Not sure when that will be and it could be a year or so away; but in the meantime I'm coding translatable messages in the app as nuggets. Given they don't exist in the PO file at present (because of course GetText ignores them), they are being output as they are with the markers removed (by the v2.0 post processing).

@rickardliljeberg
Contributor

I just want to add that [ and ] are not super comfortable to type on a Swedish keyboard: it takes Alt Gr + 8 and Alt Gr + 9 to make them.

On the other hand, all a Swedish person has access to without pressing a secondary key is
',.-

So that sort of sucks... but I write _() faster than I write [].

@danielcrenna
Collaborator

As Martin mentioned, it's less about what the nugget tokens are and more about the approach. We could easily allow you to change the nugget you use, and so we might parse string content for ---text-- or similar. Though the keys you mentioned are problematic inside code that isn't wrapped in a string literal.

@rickardliljeberg
Contributor

Agreed. I think post-processing sounds lovely, since it should make arbitrary attributes and the like work better without needing overloads.

While someone is here: I can't find how to change the language programmatically. LanguageFilter.RedirectWithLanguage is marked protected but would do the trick. Am I missing something?
It must be common to expose the language choice to the user, and thereby to need to change it programmatically.

@turquoiseowl
Owner Author

Good point about the markers/tokens being user-configurable. I can't see any reason why not, and that would be very nice.

It might be slightly more tricky when it comes down to formatted nuggets. An example of the syntax at the moment for these is:

[[[Welcome %0, you last signed in on %1|||{0}|||{1}]]]

which is used, say, in Razor like:

<text>@string.Format("[[[Welcome %0, you last signed in on %1|||{0}|||{1}]]]", userName, lastDate)</text>

The extra level of indirection is required to pass the userName and lastDate values through to the post processor where they are passed through formatting once more (with any message string got from the PO file, the translator thus having the freedom to put %0 and %1 anywhere they want in the message).
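The two formatting passes can be sketched as follows (Python for brevity; `render_nugget`, `post_process` and the German catalogue entry are hypothetical, and Python's `str.format` stands in for `string.Format`). It assumes the substituted values contain neither `|||` nor `]]]`.

```python
import re

NUGGET = re.compile(r"\[\[\[(.+?)\]\]\]", re.DOTALL)
PLACEHOLDER = re.compile(r"%(\d+)")


def render_nugget(inner, catalog):
    """Translate the message part, then substitute %0, %1, ... with the values."""
    msgid, *values = inner.split("|||")
    template = catalog.get(msgid, msgid)

    def fill(match):
        index = int(match.group(1))
        # Leave the placeholder untouched if no value was supplied for it.
        return values[index] if index < len(values) else match.group(0)

    return PLACEHOLDER.sub(fill, template)


def post_process(body, catalog):
    """Run every nugget in the response body through render_nugget."""
    return NUGGET.sub(lambda m: render_nugget(m.group(1), catalog), body)


# First pass (the application): ordinary formatting fills {0} and {1}.
emitted = "[[[Welcome %0, you last signed in on %1|||{0}|||{1}]]]".format(
    "Anna", "6 April"
)
# Second pass (the post-processor): translation, then %N substitution.
catalog_de = {  # hypothetical translation that reorders the values
    "Welcome %0, you last signed in on %1": "Letzte Anmeldung: %1. Willkommen, %0",
}
result = post_process(emitted, catalog_de)
```

The message part before the first `|||` never changes between requests, so it can serve as the look-up key, while the values ride along behind it.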

So we have:

  • Nugget start string, defaults to '[[[' at present
  • Nugget end string, defaults to ']]]'
  • Nugget formatting delimiter, defaults to '|||'

Making the start and end strings different, so that they don't overlap, eases the parsing of nuggets considerably. For instance, with '###' for both start and end, any parser needs to keep track of whether it is on a start or an end marker, and checking that markers are closed becomes necessary, etc. That is why I went for square brackets: they naturally form open and close pairs, aren't HTML/XML markup, and are less common than (). Oh, and they aren't used by the C# string formatter, i.e. {}.
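One way the configurable markers might be modelled is sketched below in Python. The `NuggetSyntax` class is hypothetical. Rejecting identical start and end markers is a simplification chosen for this sketch, not a decision made in this thread.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NuggetSyntax:
    start: str = "[[["
    end: str = "]]]"
    delimiter: str = "|||"

    def pattern(self):
        """Build a regex for start...end; re.escape makes any marker literal."""
        if self.start == self.end:
            # Simplification for this sketch: insist on distinct open/close markers.
            raise ValueError("start and end markers must differ")
        return re.compile(
            re.escape(self.start) + r"(.+?)" + re.escape(self.end), re.DOTALL
        )

    def find(self, text):
        """Return each nugget's contents split on the formatting delimiter."""
        return [
            match.group(1).split(self.delimiter)
            for match in self.pattern().finditer(text)
        ]


default = NuggetSyntax()
guillemets = NuggetSyntax(start="«««", end="»»»")

default_hits = default.find("a [[[Hello]]] b [[[Hi %0|||Bob]]]")
guillemet_hits = guillemets.find("<p>«««Translate me»»»</p>")
```

Since the pattern is built from escaped marker strings, swapping in different markers does not touch the rest of the parser.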

Personally I would have used «««Translate me»»» as I have a macro set up for those, but I appreciate not many others will. Perhaps Germans? How about Swedes, Rickard?

@rickardliljeberg
Contributor

No, sorry, I have no idea how to create those on my keyboard, nor on my girlfriend's German one ;-)

But you have a valid point that macros are a sweet way to go.
User-configurable characters would be nice, but regardless, a simple VS plugin that allows you to tie whatever characters you select to a macro on your keyboard would be amazing.

I have never written a VS plugin, but several ideas pop into my head... such as select any text and double-tap Ctrl to wrap it... or similar... a VS plugin would make macros the ultimate way to go, I think.

@rickardliljeberg
Contributor

And then, when the VS plugin supports inline translation, it will be absolutely amazing.

I think big parts of the world are in my situation... they have a "small" language like Swedish... Swedish is required, but English is usually needed soon thereafter. After that, though, it's usually fine for a while.

Now, the point here is that programmers in my situation usually speak both their native tongue and English... so I can translate everything myself into the first language (Swedish, since the default is English). And it would be pretty slick to be able to do that "inline", with a popup of some sort from the VS plugin, as soon as I have typed a line.

oh well, one can dream :-)

@turquoiseowl
Owner Author

This might be mission creep, but I agree it would be nice if the nugget syntax allowed for inline translations. E.g.

[[[Welcome|||de|||Willkommen]]]

IIRC, the post processor at the moment stops at the first ||| when it extracts the identity of the token, so only Welcome would want to go in the PO file.

The post-processor would then check for any PO translations first, then any inline ones, and fall back to the token/default.
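That look-up order might be sketched as follows (Python for brevity; `parse_inline` and `resolve` are hypothetical names). The sketch reads everything after the first `|||` as language/translation pairs. It leaves open how inline translations would coexist with the `|||` formatting values.

```python
def parse_inline(inner, delimiter="|||"):
    """Split 'Welcome|||de|||Willkommen' into ('Welcome', {'de': 'Willkommen'})."""
    msgid, *rest = inner.split(delimiter)
    # The remaining parts are read as language/translation pairs.
    inline = dict(zip(rest[0::2], rest[1::2]))
    return msgid, inline


def resolve(inner, language, po_catalog):
    """PO translation first, then inline translation, then the default text."""
    msgid, inline = parse_inline(inner)
    if msgid in po_catalog:
        return po_catalog[msgid]
    if language in inline:
        return inline[language]
    return msgid


from_inline = resolve("Welcome|||de|||Willkommen", "de", {})
from_po = resolve("Welcome|||de|||Willkommen", "de", {"Welcome": "Herzlich willkommen"})
from_default = resolve("Welcome|||de|||Willkommen", "fr", {})
```

Only the part before the first `|||` (here `Welcome`) is used as the key, which matches what would go into the PO file.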

@rickardliljeberg
Contributor

Actually, I am not sure I would like to have multiples in the file. I would rather have a tool come up and merge the translation into the PO file... but details.

Martin, do you have a minute? I have two questions on multiple projects and postbuild... I input both projects with inputpaths but it does not seem to parse the second project.
It's a shame git does not have live chat, but if you have a moment to help me, I have my firstname at liljeberg.se both for email and gtalk.

@raulvejar
Contributor

This all sounds good. I'm a bit worried about supporting backwards compatibility; maybe we should just make a clean break to avoid having to package gettext at all.
Allowing people to change the way strings are encoded makes it a much harder problem when you consider that multiple projects could use different markers, and we would need to keep that configuration on a per-project basis to be able to generate the correct translation or even build the PO database.
It does feel more and more that we need to build a VS plugin that will include a general configuration tab and allow people to generate the PO file on demand and not just on project build. Have I told you guys what a mess it makes of my Subversion repository to have the catalogues touched every time I build?

@rickardliljeberg
Contributor

I agree, I think it is time to drop all backwards compatibility.

So both gettext/msgmerge, but also all the classes, overloads and interfaces that were there simply to handle the _() function call.
