New `lt-merge` command to merge LU's from BEG to END tag #193

unhammer · 2024-12-10T20:26:40Z

The reason for this is to be able to use CG (or any tool inside the stream really) to mark which parts of the stream should be shielded from translation, e.g. quotes (but perhaps not all quotes).

In the pipeline, lt-merge runs after the tagger:

$ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ ^veldig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$' | lt-merge

^ikke/ikke<adv>$ ^«så veldig»/«så veldig»<MERGED>$ ^bra/bra<adj>$

If bidix does <w/><s n="MERGED"/> then this passes through unharmed (similarly generator <w/><l/><r><s n="MERGED"/></r>), and we get

^ikkje<adv>/ikkje$ ^«så veldig»<MERGED>/«så veldig»$ ^bra<adj>/bra$

after generation. Then unmerge

$ echo '^ikkje<adv>/ikkje$ ^«så veldig»<MERGED>/«så veldig»$ ^bra<adj>/bra$' |lt-merge --unmerge

^ikkje<adv>/ikkje$ «så veldig» ^bra<adj>/bra$

drops it into the stream before cg-proc genprefs.

The tags are hardcoded: MERGE_BEG, MERGE_END → MERGED. I don't see a reason to make them configurable (unless someone else starts using this and has a use case); they're only used within the lt-merge tool, so this should have no effect on existing pairs.
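The merge step described above can be sketched in Python. This is only an illustration, not the implementation in this PR: the function name `merge` and the regex are made up for the example, and it assumes the input has no escaped characters and no word-bound blanks (escaping is covered further down).

```python
import re

# One lexical unit in post-tagger format: ^surface/analysis$
# (assumption for this sketch: no escaped ^ / $ inside the LU)
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")

def merge(stream: str) -> str:
    """Join every LU from <MERGE_BEG> to <MERGE_END> into one <MERGED> LU."""
    out = []        # finished output pieces
    buf = None      # surface forms and blanks collected since MERGE_BEG
    beg_pos = 0     # where the current MERGE_BEG LU started
    pos = 0         # end of the previous LU
    for m in LU.finditer(stream):
        blank = stream[pos:m.start()]
        pos = m.end()
        surface, analysis = m.group(1), m.group(2)
        if buf is None:
            out.append(blank)
            if "<MERGE_BEG>" in analysis:
                buf = [surface]
                beg_pos = m.start()
            else:
                out.append(m.group(0))
        else:
            # inside a merge span: keep blanks and surface forms only
            buf.append(blank)
            buf.append(surface)
            if "<MERGE_END>" in analysis:
                form = "".join(buf)
                out.append(f"^{form}/{form}<MERGED>$")
                buf = None
    # an unterminated span is passed through untouched in this sketch
    out.append(stream[beg_pos:] if buf is not None else stream[pos:])
    return "".join(out)

print(merge("^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ "
            "^veldig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$"))
```

Run on the first example, this should print the same merged line as the lt-merge output shown above.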

We need to be able to pass MERGED stuff unchanged through biltrans and generator – this PR adds ANY_CHAR (lsx <w/>) support to lt-proc -b. This is the only change here to existing code, it should have no effect unless you for some reason named your bidix tag ANY_CHAR :)

Effects on people not using this

* ANY_CHAR is now treated specially when using lt-proc -b (just like it is in lsx)
* there is a new binary called lt-merge (should that name be used for something else?)

Example language pair usage

apertium/apertium-nno-nob@8ca111d
implements the necessary language pair changes for using lt-merge to protect anything between «» (if the user has requested so with AP_SETVAR / style preferences). In summary:

* an rlx file adds MERGE_BEG / MERGE_END tags
* two lsx files create a simple pass-through entry for bidix/generator
* the mode needs three new entries: merge-quotes.rlx, lt-merge, lt-merge --unmerge
* makefile needs to build the rlx and lt-append the lsx onto bidix/generator

Escaping details

Most of the above is simple, but escaping can look a bit messy (this is why we need the --unmerge).

If any of the LU's have word-bound blanks, the [] need escaping:

$ echo '^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^veldig/veldig<adv>$[[/]]^»/»<rquot><MERGE_END>$' | lttoolbox/lt-merge

^«\[\[tf:i:a\]\]veldig\[\[\/\]\]»/«\[\[tf:i:a\]\]veldig\[\[\/\]\]»<MERGED>$

to ensure we have legal stream format.

If any of the forms contain already escaped chars, these now need double-escaping. Why? We need to know the difference between a \[ meaning word-blank and a \\[ meaning literal [.

We run an "unmerge" step towards the end of the pipeline, while still outputting Apertium Stream Format, which extracts merged forms and drops one level of escaping.
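The escaping on the merge side can be sketched as adding one backslash in front of every reserved character, including backslashes that are already there. The set `RESERVED` and the name `escape_once` are assumptions for this example; the actual tool may escape a different set of characters.

```python
# Characters assumed to be reserved for this sketch (not taken from the PR)
RESERVED = set("[]/@^$<>{}\\")

def escape_once(text: str) -> str:
    """Add one level of backslash escaping to every reserved character."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

# Word-bound blank delimiters get a single backslash ...
print(escape_once("[[tf:i:a]]veldig[[/]]"))
# ... while an already escaped @ ends up with three: \\ for the old
# backslash plus \@ for the @ itself
print(escape_once(r"x\@y.com"))
```

After this step, a single backslash before [ can only come from a word-bound blank, while a literal, previously escaped [ carries three backslashes, so the two cases stay distinguishable.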

Example of typical quoted char @:

$ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^til/til<pr>$ ^x\@y.com/x\@y.com<email>$^»/»<rquot><MERGE_END>$ ^da/da<adv>$' | lttoolbox/lt-merge

^ikke/ikke<adv>$ ^«til x\\\@y.com»/«til x\\\@y.com»<MERGED>$ ^da/da<adv>$

If we run lt-merge between analysis and wblank-attach, then after the lt-proc -b generator.bin step we should have e.g.

^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$

We extract the merged form:

$ echo '^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$'|lt-merge --unmerge

^ikkje<adv>/ikkje$ «til x\@y.com» ^då<adv>/då$

Here \\\@ turned into \@ – we removed one layer of quoting, but this is still in the apertium stream so special chars stay quoted, e.g. cg-proc -g leaves it alone:

$ echo '^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$'|lt-merge --unmerge |cg-proc -1ng nob-nno.genprefs.rlx.bin

ikkje «til x\@y.com» då

until the final tf-inject removes the last escape.

And the word-blanks only have a single \ so lt-merge --unmerge ensures they take effect:

$ echo '^ikkje<adv>/ikkje$ ^«\[\[tf:i:a\]\]s\\\^å»<MERGED>/«\[\[tf:i:a\]\]s\\\^å»$' | lt-merge --unmerge

^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å»

which after cg-proc -g becomes

$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å»' |cg-proc -1ng nob-nno.genprefs.rlx.bin

ikkje «[[tf:i:a]]s\^å»

which tf-inject is happy to handle.
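The unmerge step can be sketched along the same lines: find each MERGED unit in the post-generation stream, take its surface form, and drop one level of escaping. The names `unmerge` and `unescape_once` and the regex are hypothetical. The sketch only reproduces the outputs shown in this description and does not attempt the empty ^$ workaround for cg-proc mentioned in the commit notes.

```python
import re

# Either an escaped character outside a merged LU (left alone), or a
# post-generation LU of the form ^lemma<MERGED>/surface$ where both
# parts may contain backslash-escaped characters.
MERGED_LU = re.compile(
    r"\\.|\^((?:\\.|[^\\$/<])*)<MERGED>/((?:\\.|[^\\$])*)\$"
)

def unescape_once(text: str) -> str:
    """Drop one level of backslash escaping."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 1  # skip the escaping backslash, keep what it protected
        out.append(text[i])
        i += 1
    return "".join(out)

def unmerge(stream: str) -> str:
    """Replace every MERGED LU by its surface form, one escape level down."""
    def repl(m: re.Match) -> str:
        if m.group(2) is None:      # plain escaped char, keep as-is
            return m.group(0)
        return unescape_once(m.group(2))
    return MERGED_LU.sub(repl, stream)

print(unmerge(r"^ikkje<adv>/ikkje$ ^«\[\[tf:i:a\]\]s\\\^å»<MERGED>"
              r"/«\[\[tf:i:a\]\]s\\\^å»$"))
```

Other lexical units pass through unchanged, since the pattern only matches units whose first tag is MERGED.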

$ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ ^veldig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$' | lt-merge

^ikke/ikke<adv>$ ^«så veldig»/«så veldig»<MERGED>$ ^bra/bra<adj>$

Mostly simple, but escaping can look a bit messy.

If any of the LU's have word-bound blanks, the [] need escaping:

$ echo '^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^veldig/veldig<adv>$[[/]]^»/»<rquot><MERGE_END>$' | lttoolbox/lt-merge

^«\[\[tf:i:a\]\]veldig\[\[\/\]\]»/«\[\[tf:i:a\]\]veldig\[\[\/\]\]»<MERGED>$

to ensure we have legal stream format.

If any of the forms contain already escaped chars, these now need double-escaping. Why? Because we need to run an "unmerge" step towards the end of the pipeline, while still outputting Apertium Stream Format, and need to know the difference between a \[ meaning word-blank or \\[ meaning literal [.

$ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^til/til<pr>$ ^x\@y.com/x\@y.com<email>$^»/»<rquot><MERGE_END>$ ^da/da<adv>$' | lttoolbox/lt-merge

^ikke/ikke<adv>$ ^«til x\\\@y.com»/«til x\\\@y.com»<MERGED>$ ^da/da<adv>$

If we run lt-merge between analysis and wblank-attach, then after the `lt-proc -b generator.bin` step we should have e.g.

^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$

which after `cg-proc -1 -n -g genprefs.bin` would turn into

ikkje «til x\@y.com» då

Note how \\\@ turned into \@ – we removed one layer of quoting, but this is still in the apertium stream so special chars stay quoted until the final tf-inject.

TODO:

* We need to be able to pass MERGED stuff unchanged through biltrans and generator, would like to `<w/><s n="MERGED"/>` but `ANY_CHAR` isn't supported yet in `lt-proc -b`.
* We need an `lt-merge --unmerge` to undo the merge:

$ echo '^ikkje<adv>/ikkje$ ^«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»<MERGED>/«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»$' | lt-merge --unmerge

^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»

which then becomes

$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»' |cg-proc -1ng nob-nno.genprefs.rlx.bin

ikkje «[[tf:i:a]]s\^å[[/]]»

which `tf-inject` is happy to handle.

cf. HEAD^

I thought we were good because

```sh
$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»' |cg-proc -1ng nob-nno.genprefs.rlx.bin
ikkje «[[tf:i:a]]s\^å[[/]]»
```

worked, but if there's an analysis after the unmerged word blank, cg-proc errors out:

```sh
$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å» ^.<sent>/.$' |cg-proc -1ng nob-nno.genprefs.rlx.bin
Error: Word-bound blank was not immediately prior to token on line 0
```

Fair enough, so lt-merge must put an empty ^$ after word blanks to appease cg-proc. Seems like tf-inject finds the right point at which to end it anyway?

unhammer force-pushed the sitat branch 2 times, most recently from 4f2f60c to 1137e78 Compare December 19, 2024 15:17

unhammer changed the title ~~WIP: new lt-merge command to merge LU's from BEG to END tag~~ New lt-merge command to merge LU's from BEG to END tag Dec 19, 2024

unhammer marked this pull request as ready for review December 19, 2024 15:19

unhammer force-pushed the sitat branch 2 times, most recently from acfcc4d to 51898cc Compare December 19, 2024 21:50

unhammer force-pushed the sitat branch from 51898cc to c7b2eb2 Compare December 19, 2024 22:25

unhammer marked this pull request as draft December 19, 2024 22:26

unhammer force-pushed the sitat branch from 3463317 to 8ebdfc2 Compare December 19, 2024 23:03

unhammer marked this pull request as ready for review December 20, 2024 10:29

unhammer added 3 commits December 20, 2024 16:09

Implement lt-merge --unmerge

1925a7e

cf. HEAD^

Let lt-proc -b handle special ANY_CHAR tag (<w/> from lsx)

e7c8379

unhammer force-pushed the sitat branch from 1c95eeb to 648471e Compare December 20, 2024 15:09

unhammer merged commit 648471e into main Dec 20, 2024
2 checks passed

unhammer deleted the sitat branch December 20, 2024 15:10
