Translations source copy check #13419

daschuer · 2024-06-30T12:42:53Z

When a source string is copied into the translation, it is listed as translated=yes and will likely never be translated later.
This is the issue this PR tries to fix.

On the other hand, there are strings like "1/4" that are the same in all languages, and terms that happen to be identical in the target language.

It is solved by checking all new translations for source == translation. If this is the case, an allow list is consulted, and the commit is rejected if the string is not on it. This then needs to be fixed at Transifex or, if it is a valid untranslated string, the allow list has to be updated.

Over the last few days I have used the script to get Transifex into good shape, but that was really tedious work, especially because these bogus translations pop up again from the translation memory unless they are explicitly deleted for each language.
This script will hopefully prevent future faults.
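
The check described above can be sketched roughly as follows. This is not the script from this PR; it is a minimal illustration that assumes the usual Qt Linguist TS layout (`<TS language="...">` containing `<message>` elements with `<source>` and `<translation>` children). The sample strings and the helper name `find_source_copies` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the Qt Linguist TS layout; the entries are made up.
SAMPLE_TS = """<TS version="2.1" language="de">
  <context>
    <name>Example</name>
    <message>
      <source>Loudness</source>
      <translation>Loudness</translation>
    </message>
    <message>
      <source>Track</source>
      <translation>Titel</translation>
    </message>
    <message>
      <source>Relink</source>
      <translation type="unfinished"></translation>
    </message>
  </context>
</TS>
"""


def find_source_copies(ts_xml: str) -> tuple[str, list[str]]:
    """Return the file's language and every source copied verbatim."""
    root = ET.fromstring(ts_xml)
    language = root.get("language", "")
    copies = []
    for message in root.iter("message"):
        source = message.findtext("source", default="")
        translation = message.findtext("translation", default="")
        # An empty translation is merely untranslated, not a source copy.
        if translation and source == translation:
            copies.append(source)
    return language, copies


language, copies = find_source_copies(SAMPLE_TS)
print(language, copies)
```

Each reported string would then either be looked up in the allow list or cause the commit to be rejected.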

.pre-commit-config.yaml

…le multi threading

daschuer · 2024-07-07T22:35:04Z

I have now finished cleaning the false source copies out of the translations. The resulting source_copy_allow_list.xml is up to date with all allowed source copies, as far as I was able to check.

Swiftb0y · 2024-10-15T17:58:06Z

.pre-commit-config.yaml

-        exclude: ^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp|.*\.(?:pot?|(?<!d\.)ts|wxl|svg))$
+        exclude: ^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp|.*\.(?:pot?|ts|wxl|svg))$


this excludes .d.ts files used for the controller API

same issues below

all ts files are excluded but not "d.ts" files so it should be OK

That's not true. It matches both:

You need the negative lookbehind

Ah, I see. Thank you. fixed.
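
The effect of the negative lookbehind can be checked with Python's `re` module, which pre-commit uses to evaluate `exclude` patterns. The two patterns below are the ones from the diff above; the two file paths are only illustrative assumptions.

```python
import re

# Pattern with the negative lookbehind (the original line in the diff).
WITH_LOOKBEHIND = re.compile(
    r"^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp"
    r"|.*\.(?:pot?|(?<!d\.)ts|wxl|svg))$"
)
# Pattern without it (the proposed line in the diff).
WITHOUT_LOOKBEHIND = re.compile(
    r"^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp"
    r"|.*\.(?:pot?|ts|wxl|svg))$"
)

# Illustrative paths, assumed for this example only.
translation = "res/translations/mixxx_de.ts"
declaration = "res/controllers/engine-api.d.ts"


def is_excluded(pattern: re.Pattern, path: str) -> bool:
    """True if the hook would skip this path."""
    return pattern.search(path) is not None


for name, pattern in (("with", WITH_LOOKBEHIND), ("without", WITHOUT_LOOKBEHIND)):
    print(name, is_excluded(pattern, translation), is_excluded(pattern, declaration))
```

Without the lookbehind, both paths are excluded; with it, only the translation file is.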

Swiftb0y · 2024-10-15T18:00:04Z

res/translations/source_copy_allow_list.xml

+    <source>Param EQ</source>
+    <allow_all_languages>true</allow_all_languages>


questionable

What do you mean? Any suggestion?

Well, in other languages, it may be translated differently. I don't think we can make the blanket statement that it's "Param EQ" in all languages. This concern applies to many strings here.

Swiftb0y · 2024-10-15T18:00:12Z

res/translations/source_copy_allow_list.xml

+    <source>Loudness</source>
+    <allow_all_languages>true</allow_all_languages>


questionable

Swiftb0y · 2024-10-15T18:01:05Z

res/translations/source_copy_allow_list.xml

I'm not sure I like this allowlist approach. There are many strings which are debatable and having to maintain this giant list is not great either.

Also, XML diffs are painful to review.

As I understand it, XML is painful to review when a tool restructures it. This should not happen here, because additional texts are only appended.
I picked XML because the ts files are also XML.
Any suggestions?

@Swiftb0y what could be the alternative?

If we really want an allow list (not sure if we really want this), let's use a plain text file with one source per line, followed by a tab character and then a comma separated list of fnmatch expressions:

    Phase allowed for en and German variants	en,de*
    This is allowed for all languages.	*

That would be much less verbose than a huge XML.

I have little interest in writing a custom parser. Can we decide on an established format?

> not sure if we really want this

Can you confirm the issue? Is there an alternative way to distinguish wanted from unwanted source copies?
I have checked the ts format and transifex but there is nothing we can use as a flag.

Custom parser would be straightforward:

    import fnmatch
    import pathlib
    import typing


    def parse_allowlist(
        path: pathlib.Path,
    ) -> typing.Iterable[tuple[str, list[str]]]:
        with path.open(mode="r") as fp:
            for line in fp:
                # Each line: "<source>\t<comma-separated fnmatch patterns>"
                source, _, langstr = line.rstrip("\n").partition("\t")
                assert source
                langs = langstr.split(",")
                assert langs
                yield source, langs


    allowlist = dict(parse_allowlist(pathlib.Path("path/to/allowlist")))

    # Check if (current_source, current_lang) is on allowlist
    is_allowed = any(
        fnmatch.fnmatchcase(current_lang, lang)
        for lang in allowlist.get(current_source, [])
    )

(wrote this on my phone, so it's untested)

Btw, I can't open the allowlist in the GitHub app on my phone because it is already too large.

The allow list currently has 3755 lines. With ~4 lines per string, we will still have almost 2000 lines. Still long.

Any custom format is not well defined, not extensible, and lacks escape rules, while all these issues are solved with XML, because this format is also used for the source TS files.

Since the script extends this file automatically, there is no need to build it by hand. We just need to confirm new entries, which are only new lines and do not suffer from any review issues.

I am not convinced to replace XML.

If the lack of definition and the unwillingness to write a parser are the problem, just use CSV with two columns... The tree nature of XML is overkill.

> The allow list currently has 3755 lines. With ~4 lines per string, we will still have almost 2000 lines. Still long.

Where does the 4 lines per string come from? With my proposal it would be 1 line per allowed string (you can list multiple allowed languages in the same line)

@Swiftb0y actually the format I proposed is already CSV with tab delimiter (or TSV). You could probably already use the stdlib csv module with delimiter='\t' if there are multiline strings instead of str.partition.

TOML would also work (see tomllib in the Python stdlib), but that would not be much shorter than XML (though far more readable).

The nice thing about TSV is that it's rendered as a table by GitHub when you use the web UI (this unfortunately does not work in the Android app).

Example: https://github.com/Holzhaus/helicon/blob/main/mapping.tsv
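
The stdlib `csv` suggestion above can be sketched like this. It assumes the proposed column order (source first, then a comma-separated list of language patterns); the sample rows and the helper name `load_allowlist` are made up for illustration.

```python
import csv
import io
from typing import TextIO

# Made-up sample rows: "<source>\t<comma-separated language patterns>".
SAMPLE = "Phase\ten,de*\n1/4\t*\n"


def load_allowlist(fp: TextIO) -> dict[str, list[str]]:
    """Read a two-column TSV allow list into {source: [patterns]}."""
    reader = csv.reader(fp, delimiter="\t")
    return {source: langs.split(",") for source, langs in reader}


allowlist = load_allowlist(io.StringIO(SAMPLE))
print(allowlist)
```

With the default dialect, a source string containing a tab or a line break would have to be quoted in the file, which is what the `csv` module adds over `str.partition`.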

Holzhaus · 2024-10-15T18:26:24Z

Missing context here. Under which circumstances are English source strings copied to the translated target language? Is it a manual thing from a transifex user? Or why does this happen?

daschuer · 2024-10-15T22:37:06Z

Yes, it is a manual thing done by a user. If they are too lazy and want to gain some more percentage, they seem to just copy the source strings.

Holzhaus · 2024-10-15T22:49:41Z

In that case I'm questioning if we really want to check it on pull/during committing.

Can this somehow be prevented on Transifex? Or maybe a monthly check which opens a GitHub issue if necessary?

Because you cannot really fix the commit locally anyway. Instead you need to go to Transifex, remove the translations and then perform a fresh tx pull, or am I misunderstanding this?

daschuer · 2024-10-15T23:09:12Z

pre-commit takes care that the check is only done if one is committing changes to the ts files. This is the right moment to reject false translations.

> Because you cannot really fix the commit locally anyway. Instead you need to go to Transifex, remove the translations and then perform a fresh tx pull, or am I misunderstanding this?

Correct.

I think the CI of the new automatically created PR will also fail in that case, which is desired, right?
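
For context, pre-commit passes the matching staged file names to the hook as command-line arguments and treats a non-zero exit code as a failure. A minimal sketch of such an entry point follows; `main` and the injected `has_disallowed_copies` callback are hypothetical names and not taken from the script in this PR.

```python
import sys
from typing import Callable, Sequence


def main(
    argv: Sequence[str],
    has_disallowed_copies: Callable[[str], bool],
) -> int:
    """Return a non-zero exit code if any given file fails the check."""
    failed = [path for path in argv if has_disallowed_copies(path)]
    for path in failed:
        print(f"{path}: contains disallowed source copies", file=sys.stderr)
    return 1 if failed else 0


# Stand-in check for the example: pretend only "b.ts" has a problem.
exit_code = main(["a.ts", "b.ts"], lambda path: path == "b.ts")
print(exit_code)
```

The same exit code makes the CI job fail, so an automatically created translation PR would be flagged as well.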

…translated strings

daschuer · 2024-10-27T23:07:52Z

Done.

Holzhaus · 2024-11-02T17:07:45Z

res/translations/source_copy_allow_list.tsv

@@ -0,0 +1,436 @@
+lang	source


If you swap the columns, it's much more readable IMHO.

Unfortunately not. This was my first version and I have swapped columns to have languages aligned.

Holzhaus · 2024-11-02T17:08:57Z

res/translations/source_copy_allow_list.tsv

+nl	is
+nl	Cover
+nl	Track BPM: 
+nl	Artist + Title


Isn't "Titel" the Dutch word?

Probably yes. I will remove the translation.

Holzhaus · 2024-11-02T17:10:30Z

res/translations/source_copy_allow_list.tsv

+vi	Shuffle
+vi	Relink
+nl	Lossy
+nl	Lossless


Isn't this "exact omkeerbaar"?

Holzhaus · 2024-11-02T17:11:24Z

res/translations/source_copy_allow_list.tsv

+de,nl	Decks
+de,nl	Track
+de,nl	Tracks
+de	Add Crate as Track Source


Holzhaus · 2024-11-02T17:18:51Z

tools/ts_source_copy_check.py

+            if source == source_text:
+                if lang == "*":
+                    return True
+                if language in lang:


This means that en will also match if the string is de,en_US,fr although en is not in the list.

Suggested change

-                if language in lang:
+                if language in lang.split(","):

Even better would be to use fnmatch as I suggested before, because if you really want to match all English dialects, you could write de,en*,fr without having to list each and every one of them.
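
The difference between the three variants can be shown in a few lines. The function names are made up for the example; `lang` is the raw comma-separated language field from one allow-list row.

```python
import fnmatch


def substring_match(language: str, lang: str) -> bool:
    # Variant from the diff: a plain substring test on the raw field.
    return language in lang


def exact_match(language: str, lang: str) -> bool:
    # Suggested change: compare against the individual list entries.
    return language in lang.split(",")


def wildcard_match(language: str, lang: str) -> bool:
    # fnmatch variant: every entry may contain shell-style wildcards.
    return any(
        fnmatch.fnmatchcase(language, pattern) for pattern in lang.split(",")
    )


print(substring_match("en", "de,en_US,fr"))
print(exact_match("en", "de,en_US,fr"))
print(wildcard_match("en_US", "de,en*,fr"))
```

With `fnmatchcase`, the special case `lang == "*"` no longer needs separate handling, because `*` matches every language code.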

Holzhaus · 2024-11-02T17:21:20Z

tools/ts_source_copy_check.py

+    return False
+
+
+def add_to_allow_list(source_text, language):


This function basically duplicates the parsing logic from above.

Holzhaus · 2024-11-02T17:21:47Z

tools/ts_source_copy_check.py

+    if ret:
+        print(
+            "\n"
+            "All not allowed copied source translations need to be removed"


Suggested change

-            "All not allowed copied source translations need to be removed"
+            "All disallowed copied source translations need to be removed"

Holzhaus · 2024-11-02T17:22:25Z

tools/ts_source_copy_check.py

+
+    if ret:
+        print(
+            "\n"


Suggested change

-            "\n"

This is intentional, to keep some distance from the individual complaints.

…e code.

daschuer · 2024-11-02T22:55:58Z

Done

github-actions bot added the code quality label Jun 30, 2024

daschuer force-pushed the ts_source_copy_check branch 4 times, most recently from d83d292 to 7afcc71 Compare June 30, 2024 23:43

daschuer changed the base branch from 2.5 to 2.4 June 30, 2024 23:43

ronso0 reviewed Jul 1, 2024

daschuer force-pushed the ts_source_copy_check branch from 3da94ca to a2cdac2 Compare July 3, 2024 06:40

daschuer added 5 commits July 8, 2024 00:26

Add a pre-commit hook to check for copied source texts

9e1e7ef

Integrate it to .pre-commit-config.yaml

5879a40

use an allow list for intended source copies

65e24f5

for English languages all source copies are allowed

223b736

Add exception handling and prevent reformatting of ts files and disab…

255c7d6

…le multi threading

daschuer force-pushed the ts_source_copy_check branch from a2cdac2 to a26bc01 Compare July 7, 2024 22:31

JoergAtGithub added the needs review label Aug 6, 2024

Swiftb0y reviewed Oct 15, 2024

daschuer added 2 commits October 24, 2024 07:53

Add initial source_copy_allow_list that contains all alowed copied/Un…

17b9206

…translated strings

Transition to a tsv based allow list

cd83745

daschuer force-pushed the ts_source_copy_check branch 2 times, most recently from 9a47c31 to a6e2409 Compare October 27, 2024 15:51

daschuer added 2 commits November 2, 2024 17:44

Use tsv files for the allow list

dcb606b

Don't exclude *.d.ts files

0dee602

daschuer force-pushed the ts_source_copy_check branch from a6e2409 to f97a0c7 Compare November 2, 2024 16:45

Holzhaus reviewed Nov 2, 2024

Allow a wild card at any position in laguage strings. deduplictaed th…

233777b

…e code.

daschuer force-pushed the ts_source_copy_check branch from f97a0c7 to 233777b Compare November 2, 2024 22:54
