New lt-merge command to merge LU's from BEG to END tag #193
The reason for this is to be able to use CG (or any tool inside the stream really) to mark which parts of the stream should be shielded from translation, e.g. quotes (but perhaps not all quotes).
The pipeline then becomes something like this:
So lt-merge runs after the tagger. If bidix does <i><w/><s n="MERGED"/></i> then this passes through unharmed (similarly generator <i><w/></i><p><l/><r><s n="MERGED"/></r></p>), and we get the merged unit back after generation. Then unmerge drops it into the stream before cg-proc genprefs.
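As a rough illustration of the merging itself, here is a toy Python sketch. It is not the lt-merge implementation: it works on hypothetical (form, tags) pairs instead of parsing the stream format, and building the merged form by joining surface forms with a single space is an assumption made for the example. Only the tag names MERGE_BEG, MERGE_END and MERGED come from this PR.

```python
from dataclasses import dataclass

# Tag names as given in this PR; everything else below is illustrative.
MERGE_BEG, MERGE_END, MERGED = "MERGE_BEG", "MERGE_END", "MERGED"


@dataclass
class LU:
    """Hypothetical stand-in for a lexical unit: a surface form plus its tags."""
    form: str
    tags: tuple = ()


def merge_units(units, sep=" "):
    """Collapse every MERGE_BEG..MERGE_END run into one unit tagged MERGED.

    Assumption: the merged form is the surface forms joined by `sep`;
    the real tool works on the stream and has to deal with blanks and escaping.
    """
    out, pending = [], None  # pending is None while outside a merge span
    for lu in units:
        if pending is None:
            if MERGE_BEG not in lu.tags:
                out.append(lu)
                continue
            pending = []
        pending.append(lu)
        if MERGE_END in lu.tags:
            out.append(LU(sep.join(p.form for p in pending), (MERGED,)))
            pending = None
    if pending:
        # Toy choice for a span that never closes: leave the units unmerged.
        out.extend(pending)
    return out


units = [
    LU("Han", ("prn",)),
    LU("«", ("lquot", MERGE_BEG)),
    LU("hei", ("ij",)),
    LU("»", ("rquot", MERGE_END)),
    LU("sa", ("vblex",)),
]
merged = merge_units(units)
```

In this toy run the three units from « to » end up as a single unit carrying only the MERGED tag, which is the thing bidix and generator would then have to pass through.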
The tags are hardcoded: MERGE_BEG, MERGE_END → MERGED. I don't see a reason for making them configurable (unless someone else starts using this and has a use-case); they're only used within the lt-merge tool, so that should have no effect on existing pairs.

We need to be able to pass MERGED stuff unchanged through biltrans and generator, so this PR adds ANY_CHAR (lsx <w/>) support to lt-proc -b. This is the only change here to existing code; it should have no effect unless you for some reason named your bidix tag ANY_CHAR :)

Effects on people not using this
ANY_CHAR is now treated specially when using lt-proc -b (just like it is in lsx)
lt-merge (should that name be used for something else?)

Example language pair usage
apertium/apertium-nno-nob@8ca111d implements the necessary language pair changes for using lt-merge to protect anything between «» (if the user has requested so with AP_SETVAR / style preferences). In summary: MERGE_BEG / MERGE_END tags

Detailed escaping details
Most of the above is simple, but escaping can look a bit messy (this is why we need the --unmerge step). If any of the LU's have word-bound blanks, the [] need escaping to ensure we have legal stream format.
If any of the forms contain already escaped chars, these now need double-escaping. Why? We need to know the difference between a \[ meaning word-blank and a \\[ meaning literal [.

We run an "unmerge" step towards the end of the pipeline, while still outputting Apertium Stream Format, which extracts merged forms and drops one level of escaping.
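The two escaping levels can be sketched in a few lines of Python. This is only a model of the idea, not the code in lt-merge: the function names are made up for the example, and the set of reserved characters is an assumption about the stream format.

```python
# Assumption: characters that need a backslash in the Apertium stream format.
RESERVED = set("[]{}^$/\\@<>")


def escape_once(text):
    """Add one level of escaping: put a backslash before each reserved char."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def unescape_once(text):
    """Drop one level of escaping: remove each backslash, keep the char after it."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # the escaped char is kept verbatim
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


once = escape_once("@")      # \@   : how a literal @ looks in the stream
twice = escape_once(once)    # \\\@ : the same char inside a merged form

# A word-bound blank inside a merged form has only one level of escaping.
# "[[wb]]" is a hypothetical placeholder, not a real word-blank payload.
wblank_in_merged = escape_once("[[wb]]")   # \[\[wb\]\]
```

Dropping one level from `twice` gives back `once`, which is still a legally escaped @, while dropping one level from `wblank_in_merged` gives back bare brackets that the stream reads as a word-bound blank again.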
Example of a typical quoted char, @: if we run lt-merge between analysis and wblank-attach, then after the lt-proc -b generator.bin step the merged form should contain \\\@. We then extract the merged form.
Here \\\@ turned into \@: we removed one layer of quoting, but this is still in the Apertium stream so special chars stay quoted, e.g. cg-proc -g leaves it alone until the final tf-inject removes the last escape.

And the word-blanks only have a single \ so lt-merge --unmerge ensures they take effect, giving output which, after cg-proc -g, tf-inject is happy to handle.