New lt-merge command to merge LU's from BEG to END tag #193

Merged (4 commits) on Dec 20, 2024
@unhammer commented Dec 10, 2024

The reason for this is to be able to use CG (or any tool inside the stream really) to mark which parts of the stream should be shielded from translation, e.g. quotes (but perhaps not all quotes).

The pipeline then becomes something like this:

[image: pipeline diagram]

So lt-merge runs after tagger:

$ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ ^veldig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$' | lt-merge
^ikke/ikke<adv>$ ^«så veldig»/«så veldig»<MERGED>$ ^bra/bra<adj>$

If bidix does <i><w/><s n="MERGED"/></i> then this passes through unharmed (similarly generator <i><w/></i><p><l/><r><s n="MERGED"/></r></p>), and we get

^ikkje<adv>/ikkje$ ^«så veldig»<MERGED>/«så veldig»$ ^bra<adj>/bra$

after generation. Then unmerge

$ echo '^ikkje<adv>/ikkje$ ^«så veldig»<MERGED>/«så veldig»$ ^bra<adj>/bra$' |lt-merge --unmerge
^ikkje<adv>/ikkje$ «så veldig» ^bra<adj>/bra$

drops it into the stream before cg-proc genprefs.

The tags are hardcoded: MERGE_BEG, MERGE_END and MERGED. I don't see a reason to make them configurable (unless someone else starts using this and has a use case); they're only used within the lt-merge tool, so this should have no effect on existing pairs.

We need to be able to pass MERGED stuff unchanged through biltrans and generator – this PR adds ANY_CHAR (lsx <w/>) support to lt-proc -b. This is the only change here to existing code, it should have no effect unless you for some reason named your bidix tag ANY_CHAR :)

Effects on people not using this

  • ANY_CHAR is now treated specially when using lt-proc -b (just like it is in lsx)
  • there is a new binary called lt-merge (should that name be used for something else?)

Example language pair usage

apertium/apertium-nno-nob@8ca111d
implements the necessary language pair changes for using lt-merge to protect anything between «» (if the user has requested so with AP_SETVAR / style preferences). In summary:

  • an rlx file adds MERGE_BEG / MERGE_END tags
  • two lsx files create a simple pass-through entry for bidix/generator
  • the mode needs three new entries: merge-quotes.rlx, lt-merge, lt-merge --unmerge
  • makefile needs to build the rlx and lt-append the lsx onto bidix/generator

Escaping details

Most of the above is simple, but escaping can look a bit messy (this is why we need the --unmerge).

If any of the LUs have word-bound blanks, the [] need escaping:

$ echo '^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^veldig/veldig<adv>$[[/]]^»/»<rquot><MERGE_END>$' | lttoolbox/lt-merge
^«\[\[tf:i:a\]\]veldig\[\[\/\]\]»/«\[\[tf:i:a\]\]veldig\[\[\/\]\]»<MERGED>$

to ensure we have legal stream format.

If any of the forms contain already escaped chars, these now need double-escaping. Why? We need to be able to tell the difference between \[ meaning a word-bound blank and \\[ meaning a literal [.

We run an "unmerge" step towards the end of the pipeline, while still outputting Apertium Stream Format, which extracts merged forms and drops one level of escaping.

Example with a typically escaped char, @:

$ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^til/til<pr>$ ^x\@y.com/x\@y.com<email>$^»/»<rquot><MERGE_END>$ ^da/da<adv>$' | lttoolbox/lt-merge
^ikke/ikke<adv>$ ^«til x\\\@y.com»/«til x\\\@y.com»<MERGED>$ ^da/da<adv>$
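
The merge and escaping rules described so far can be sketched in a few lines of Python. This is only an illustration of the behaviour shown in the examples, not the actual lt-merge implementation (which is part of lttoolbox); the set of reserved characters in `SPECIAL`, the helper names, and the assumption of well-formed input are mine:

```python
import re

# Assumed set of reserved characters in Apertium Stream Format.
SPECIAL = set("[]{}^$/\\@<>")

# One token is either a lexical unit ^...$ or a run of blank text between LUs.
# Backslash-escaped pairs are consumed whole so an escaped ^ or $ is never a delimiter.
TOKEN = re.compile(r"\^(?P<lu>(?:\\.|[^\\$])*)\$|(?P<blank>(?:\\.|[^\\^])+)", re.DOTALL)
# Surface form of an LU: everything before the first unescaped "/".
SURFACE = re.compile(r"(?:\\.|[^\\/])*", re.DOTALL)


def escape(text):
    """Add one level of escaping: every reserved char (including backslash) gets a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


def merge(stream):
    """Merge everything from a MERGE_BEG LU to the next MERGE_END LU into one MERGED LU."""
    out, pending = [], None
    for m in TOKEN.finditer(stream):
        lu = m.group("lu")
        piece = m.group("blank") if lu is None else SURFACE.match(lu).group(0)
        if pending is None and lu is not None and "<MERGE_BEG>" in lu:
            pending = [escape(piece)]          # start collecting
        elif pending is None:
            out.append(m.group(0))             # outside a merge span: pass through unchanged
        else:
            pending.append(escape(piece))      # blanks and surface forms, one escape level deeper
            if lu is not None and "<MERGE_END>" in lu:
                form = "".join(pending)
                out.append(f"^{form}/{form}<MERGED>$")
                pending = None
    # Limitation of this sketch: a MERGE_BEG without a matching MERGE_END is silently dropped.
    return "".join(out)


print(merge(r"^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^veldig/veldig<adv>$[[/]]^»/»<rquot><MERGE_END>$"))
```

Escaping blanks and surface forms uniformly is what makes the later unmerge step unambiguous: a single backslash before `[` can only have come from a word-bound blank, while an already escaped char ends up with three.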

If we run lt-merge between analysis and wblank-attach, then after the lt-proc -b generator.bin step we should have e.g.

^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$

We extract the merged form:

$ echo '^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$'|lt-merge --unmerge
^ikkje<adv>/ikkje$ «til x\@y.com» ^då<adv>/då$

Here \\\@ turned into \@ – we removed one layer of quoting, but this is still in the apertium stream so special chars stay quoted, e.g. cg-proc -g leaves it alone:

$ echo '^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$'|lt-merge --unmerge |cg-proc -1ng nob-nno.genprefs.rlx.bin
ikkje «til x\@y.com» då

until the final tf-inject removes the last escape.

And the word-blanks only have a single \ so lt-merge --unmerge ensures they take effect:

$ echo '^ikkje<adv>/ikkje$ ^«\[\[tf:i:a\]\]s\\\^å»<MERGED>/«\[\[tf:i:a\]\]s\\\^å»$' | lt-merge --unmerge
^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å»

which after cg-proc -g becomes

$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å»' |cg-proc -1ng nob-nno.genprefs.rlx.bin
ikkje «[[tf:i:a]]s\^å»

which tf-inject is happy to handle.
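
The unmerge step can be sketched the same way. Again this is a rough Python illustration of the behaviour in the examples above, not the real lt-merge code; the regexes and function names are assumptions, and the sketch does not emit the empty ^$ after word blanks that a later commit in this PR mentions:

```python
import re

# Either an escaped char outside any LU (left untouched), or one LU of the form ^left/right$.
# Escaped pairs are consumed whole, so an escaped "/" or "$" never acts as a delimiter.
TOKEN = re.compile(
    r"\\.|\^(?P<left>(?:\\.|[^\\/$])*)/(?P<right>(?:\\.|[^\\$])*)\$",
    re.DOTALL,
)
ESCAPED = re.compile(r"\\(.)", re.DOTALL)


def unescape(text):
    # Drop exactly one level of backslash escaping, scanning left to right.
    return ESCAPED.sub(r"\1", text)


def unmerge(stream):
    """Replace each generated MERGED LU with its form, one escape level shallower."""
    def replace(m):
        left = m.group("left")
        if left is not None and left.endswith("<MERGED>"):
            return unescape(m.group("right"))  # no ^...$ around it: it is plain stream text now
        return m.group(0)                      # any other LU or escaped char is kept as-is
    return TOKEN.sub(replace, stream)


print(unmerge(r"^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$"))
```

Only MERGED LUs are touched, so the rest of the stream stays valid Apertium Stream Format for cg-proc and tf-inject.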

@unhammer force-pushed the sitat branch 2 times, most recently from 4f2f60c to 1137e78 (December 19, 2024 15:17)
@unhammer changed the title from "WIP: new lt-merge command to merge LU's from BEG to END tag" to "New lt-merge command to merge LU's from BEG to END tag" (Dec 19, 2024)
@unhammer marked this pull request as ready for review (December 19, 2024 15:19)
@unhammer force-pushed the sitat branch 2 times, most recently from acfcc4d to 51898cc (December 19, 2024 21:50)
    $ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ ^veldig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$' | lt-merge
    ^ikke/ikke<adv>$ ^«så veldig»/«så veldig»<MERGED>$ ^bra/bra<adj>$

Mostly simple, but escaping can look a bit messy. If any of the LU's
have word-bound blanks, the [] need escaping:

    $ echo '^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^veldig/veldig<adv>$[[/]]^»/»<rquot><MERGE_END>$' | lttoolbox/lt-merge
    ^«\[\[tf:i:a\]\]veldig\[\[\/\]\]»/«\[\[tf:i:a\]\]veldig\[\[\/\]\]»<MERGED>$

to ensure we have legal stream format.

If any of the forms contain already escaped chars, these now need
double-escaping. Why? Because we need to run an "unmerge" step towards
the end of the pipeline, while still outputting Apertium Stream
Format, and need to know the difference between a \[ meaning
word-blank or \\[ meaning literal [.

    $ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^til/til<pr>$ ^x\@y.com/x\@y.com<email>$^»/»<rquot><MERGE_END>$ ^da/da<adv>$' | lttoolbox/lt-merge
    ^ikke/ikke<adv>$ ^«til x\\\@y.com»/«til x\\\@y.com»<MERGED>$ ^da/da<adv>$

If we run lt-merge between analysis and wblank-attach, then after the
`lt-proc -b generator.bin` step we should have e.g.

    ^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$

which after `cg-proc -1 -n -g genprefs.bin` would turn into

    ikkje «til x\@y.com» då

Note how \\\@ turned into \@ – we removed one layer of quoting, but
this is still in the apertium stream so special chars stay quoted
until the final tf-inject.

TODO:

* We need to be able to pass MERGED stuff unchanged through biltrans
  and generator, would like to `<i><w/><s n="MERGED"/></i>` but
  `ANY_CHAR` isn't supported yet in `lt-proc -b`.

* We need an `lt-merge --unmerge` to undo the merge:

    $ echo '^ikkje<adv>/ikkje$ ^«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»<MERGED>/«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»$' | lt-merge --unmerge
    ^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»

  which then becomes

    $ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»' |cg-proc -1ng nob-nno.genprefs.rlx.bin
    ikkje «[[tf:i:a]]s\^å[[/]]»

  which `tf-inject` is happy to handle.
@unhammer marked this pull request as draft (December 19, 2024 22:26)
@unhammer marked this pull request as ready for review (December 20, 2024 10:29)
I thought we were good because

```sh
$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»' |cg-proc -1ng nob-nno.genprefs.rlx.bin
ikkje «[[tf:i:a]]s\^å[[/]]»
```

worked, but if there's an analysis after the unmerged word blank, cg-proc errors out:

```sh
$ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å» ^.<sent>/.$' |cg-proc -1ng nob-nno.genprefs.rlx.bin
Error: Word-bound blank was not immediately prior to token on line 0
```

Fair enough, so lt-merge must put an empty ^$ after word blanks to
appease cg-proc. Seems like tf-inject finds the right point at which
to end it anyway?
@unhammer merged commit 648471e into main (Dec 20, 2024)
2 checks passed
@unhammer deleted the sitat branch (December 20, 2024 15:10)