Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translations source copy check #13419

Open
wants to merge 10 commits into
base: 2.4
Choose a base branch
from

Conversation

daschuer
Copy link
Member

When a source string is copied to translation it is listed as translated=yes and will be likely not become translated later.
This is the issue, this PR tries to fix.

On the other hand there are strings like "1/4" which are the same in all languages or terms that are by luck the same as the target language.

It is solved by checking all new translations for source == translation. If this is the case an allow list is consulted and than the commit is rejected. This need to be fixed at Transiflex or if it is a valid untranslated string the allow list has to be maintained.

The last days I have used the script to put Transiflex into a good shape. But that was a really tedious work. Especially because these bogus translations pop up again from the translation memory when not explicit deleted for each language.
This script will hopefully prevent future faults.

@daschuer daschuer force-pushed the ts_source_copy_check branch 4 times, most recently from d83d292 to 7afcc71 Compare June 30, 2024 23:43
@daschuer daschuer changed the base branch from 2.5 to 2.4 June 30, 2024 23:43
@daschuer
Copy link
Member Author

daschuer commented Jul 7, 2024

Now I am finished to clean up the translations form false source copies. The resulting source_copy_allow_list.xml is up to date with all allowed source copies as far as I was able to check it.

exclude: ^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp|.*\.(?:pot?|(?<!d\.)ts|wxl|svg))$
exclude: ^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp|.*\.(?:pot?|ts|wxl|svg))$
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this excludes .d.ts files used for the controller API

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issues below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all ts files are excluded but not "d.ts" files so it should be OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats not true. It matches both:
image
You need the negative lookbehind
image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thank you. fixed.

Comment on lines 873 to 874
<source>Param EQ</source>
<allow_all_languages>true</allow_all_languages>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

questionable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? Any suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in other languages, it may be translated differently. I don't think we can make this blanket statement that its "Param EQ" in all languages. This concern applies to many strings here.

Comment on lines 861 to 862
<source>Loudness</source>
<allow_all_languages>true</allow_all_languages>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

questionable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like this allowlist approach. There are many strings which are debatable and having to maintain this giant list is not great either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, XML diffs are painful to review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding the reasons why xml is painful to review is if a tool restructures it. This should not happen here, because additional texts are appended.
I have picked XML, because the ts files are also XML.
Any suggestions?

@Swiftb0y what could be the alternative?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really want an allow list (not sure if we really want this), let's use a plain text file with one source per line, followed by a tab character and then a comma separated list of fnmatch expressions:

Phase allowed for en and German variants    en,de*
This is allowed for all languages.    *

That would be much less verbose than a huge XML.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not much interest to write a custom parser. Can we decide for an established format.

not sure if we really want this

Can you confirm the issue? Is there a alternative to distinguish wanted form unwanted source copies?
I have checked the ts format and transifex but there is nothing we can use as a flag.

Copy link
Member

@Holzhaus Holzhaus Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Custom parser would be straightforward:

def parse_allowlist(path: pathlib.Path) -> typing.Iterable[tuple[str, list[str]]:
    with path.open(mode="r") as fp:
        for line in fp:
            source, _, langstr = line.partition("\t")
            assert source
            langs = langstr.split(",")
            assert langs
            yield source, langs

allowlist = dict(parse_allowlist(pathlib.Path("path/to/allowlist"))) 

# Check if (current_source, current_lang) is on allowlist
is_allowed = any(fnmatch.fnmatchcase(current_lang, lang) for lang in allowlist.get(current_source, []))

(wrote this on my phone, so it's untested)

Btw, I can't open the allowlist in the GitHub app on my phone because it is already too large.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allow list has currently 3755 lines with ~4 lines per string we will still almost 2000 lines. Still long.

Any custom format is not well defined, not extensible, without escape rules. While all these isues are solved with xml because this format is also used for source TS files.

Since the script extends this file automatically, there is no need to build it by hand. We need just confirm new entries which are only new lines, not suffreing any review issues.

I am not convinced to replace XML.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the lack of definition and the unwillingness to write a parser is the problem, just use CSV with two columns... The tree nature of XML is overkill.

Copy link
Member

@Holzhaus Holzhaus Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allow list has currently 3755 lines with ~4 lines per string we will still almost 2000 lines. Still long.

Where does the 4 lines per string come from? With my proposal it would be 1 line per allowed string (you can list multiple allowed languages in the same line)

@Swiftb0y actually the format I proposed is already CSV with tab delimiter (or TSV). You could probably already use the stdlib csv module with delimiter='\t' if there are multiline strings instead of str.partition.

TOML would also work (see tomllib in the python stdlib), but that would not be much shorter than XML (but way more readable).

Copy link
Member

@Holzhaus Holzhaus Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nice thing about TSV is that it's rendered as a table by github when you use the Web UI (does not work in the Android app unfortunately).

Example: https://github.com/Holzhaus/helicon/blob/main/mapping.tsv

@Holzhaus
Copy link
Member

Missing context here. Under which circumstances are English source strings copied to the translated target language? Is it a manual thing from a transifex user? Or why does this happen?

@daschuer
Copy link
Member Author

Yes, it is a manual think from a user. If they are too lazy and like to gain some more percentage they seems to just copy the source strings.

@Holzhaus
Copy link
Member

Holzhaus commented Oct 15, 2024

In that case I'm questioning if we really want to check it on pull/during committing.

Can this somehow be prevented on transifex? Or maybe a monthly check which opens a github issue if necessary?

Because you cannot really fix the commit locally anyway. Instead you need to go to transifex and remove translations and the perform a fresh tx pull, or am I misunderstanding this?

@daschuer
Copy link
Member Author

pre-commit takes care that the check is only done if one is committing changes to the ts files. This is the right moment to reject false translations.

Because you cannot really fix the commit locally anyway. Instead you need to go to transifex and remove translations and the perform a fresh tx pull, or am I misunderstanding this?

Correct.

I think the CI of the new automatic created PR will also fail in that case, which is desired, right?

@daschuer daschuer force-pushed the ts_source_copy_check branch 2 times, most recently from 9a47c31 to a6e2409 Compare October 27, 2024 15:51
@daschuer
Copy link
Member Author

Done.

@@ -0,0 +1,436 @@
lang source
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you swap the columns, it's much more readable IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. This was my first version and I have swapped columns to have languages aligned.

nl is
nl Cover
nl Track BPM:
nl Artist + Title
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't "Titel" the durch word?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yes. I will remove the translation.

vi Shuffle
vi Relink
nl Lossy
nl Lossless
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this "exact omkeerbaar"?

de,nl Decks
de,nl Track
de,nl Tracks
de Add Crate as Track Source
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if source == source_text:
if lang == "*":
return True
if language in lang:
Copy link
Member

@Holzhaus Holzhaus Nov 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that en will also match if the string is de,en_US,fr although en is not in the list.

Suggested change
if language in lang:
if language in lang.split(","):

Even better would be to use fnmatch as I suggested before, because if you really want to match all English dialects, you could write de,en*,fr without having to list each and every one of them.

return False


def add_to_allow_list(source_text, language):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function basically duplicates the parsing logic from above.

if ret:
print(
"\n"
"All not allowed copied source translations need to be removed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"All not allowed copied source translations need to be removed"
"All disallowed copied source translations need to be removed"


if ret:
print(
"\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"\n"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is desired to have a distance to the individual complains.

@daschuer
Copy link
Member Author

daschuer commented Nov 2, 2024

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants