Outputs extra chars #26

Closed
ftyers opened this issue Jul 14, 2020 · 14 comments

@ftyers
Member

ftyers commented Jul 14, 2020

fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "rumal rech che" | apertium -d . quc-spa
porque
??fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "rumal rech che." | apertium -d . quc-spa-tagger
^umal<n><rel><px3sg>$ ^ech<n><rel><px3sg>$ ^chi<pr>+ech<n><rel><px3sg>$^.<sent>$^.<sent>$
fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "rumal rech che." | apertium -d . quc-spa-separable
^rumal rech<cnjadv>$ ^chi<pr>$ ^ech<n><rel><px3sg>$^.<sent>$^.<sent>$
?
$ echo "rumal rech che." | apertium -d . quc-spa-tagger | apertium-pretransfer | lsx-proc quc-spa.autoseq.bin | unidump 
      0    005E 0072 0075 006D 0061 006C 0020 0072 0065 0063 0068 0020 0063 0068 0065 003C    ^rumal.rech.che<
     16    0063 006E 006A 0061 0064 0076 003E 0024 005E 002E 003C 0073 0065 006E 0074 003E    cnjadv>$^.<sent>
     32    0024 005E 002E 003C 0073 0065 006E 0074 003E 0024 000A 003F                        $^.<sent>$.?
$ echo "rumal rech che" | hfst-proc quc-spa.automorf.hfst | cg-proc quc-spa.rlx.bin | apertium-tagger -u 2 -g quc-spa.prob|  apertium-pretransfer | lsx-proc quc-spa.autoseq.bin | hexdump -xc
0000000    725e    6d75    6c61    7220    6365    2068    6863    3c65
0000000   ^   r   u   m   a   l       r   e   c   h       c   h   e   <
0000010    6e63    616a    7664    243e    3f0a                        
0000010   c   n   j   a   d   v   >   $  \n   ?                        
000001a
@khannatanmai
Member

$ echo "rumal rech che" | apertium -d . quc-spa-separable
^rumal rech<cnjadv>$ ^chi<pr>$ ^ech<n><rel><px3sg>$^.<sent>$

Seems to work fine for me.

@marcriera
Member

Happening on my side as well, with a different pair:

$ echo "hello" | apertium -d . eng-cat-autoseq
^hello<n><sg>$^.<sent>$
?

@unhammer
Member

unhammer commented Sep 2, 2020

If we combine it with apertium/lttoolbox#104 we'll be back to zero =D

@hectoralos
Member

It is happening in both apertium-fra-cat and apertium-fra-frp. For instance, in the first one:

$ echo "^maison<n><f><sg>$" | od -An -vtu1
  94 109  97 105 115 111 110  60 110  62  60 102  62  60 115 103
  62  36  10
$ echo "^maison<n><f><sg>$" | lsx-proc fra-cat.autosep.bin | od -An -vtu1
  94 109  97 105 115 111 110  60 110  62  60 102  62  60 115 103
  62  36  10  63

(apertium-separable has nothing to do with "maison" and should simply pass it through to the output; instead, it appends an extra byte, 63, which is "?".)

@unhammer
Member

unhammer commented Apr 23, 2021

I'm getting the ? if I use a locale other than C.UTF-8 (even other UTF-8 locales):

$ locale -a|while read -r LANG; do echo "$LANG"; echo '^.<sent>$[][\n]' |  lsx-proc nob-nno.autoseq.bin | grep -c '?' ;done
C
1
C.UTF-8
0
POSIX
1
en_AG
1
en_GB.utf8
1
en_US.utf8
1
nn_NO.utf8
1

(since the ? is at the very end of the output without any trailing newline, my terminal doesn't always show it, but grep gives the correct answer)

@unhammer
Member

unhammer commented Apr 24, 2021

So it seems like it's a -1? If I do

diff --git a/src/lsx_processor.cc b/src/lsx_processor.cc
index 5c9aec7..555a47b 100644
--- a/src/lsx_processor.cc
+++ b/src/lsx_processor.cc
@@ -172,6 +172,12 @@ LSXProcessor::processWord(FILE* input, FILE* output)
   if(lu_queue.size() == 0)
   {
     readNextLU(input);
+    while(!feof(input)) {
+        wchar_t c = fgetwc_unlocked(input);
+        wprintf(L"{0x%04x}", c);
+        fputwc_unlocked((int)c, output);
+        fputwc_unlocked(L'\n', output);
+    }
   }
   if(at_end && lu_queue.size() == 1 && lu_queue.back().size() == 0)
   {

and echo '^.<sent>$' | src/lsx-proc nob-nno.autoseq.bin, I see

{0xffffffff}?
^.<sent>$

@unhammer
Member

I notice NUL handling is different in separable from other tools, not sure if that's relevant; most of the other tools use

  while(!feof(input) && val != 0)
  {
    val = fgetwc_unlocked(input);

while separable uses

  while(!feof(input))
  {
    wchar_t c = fgetwc_unlocked(input);
    if(null_flush && c == L'\0')
    {
      at_end = true;
      at_null = true;
      break;
    }

@mr-martian
Contributor

That difference is because I was copying the structure from -recursive when I wrote it. I don't remember whether the original reason was that -recursive is more complicated or just that I liked that structure better.

Oh, is the -1 WEOF?

If you insert this in the loop, does it change anything? (line 77-ish)

if (c == WEOF) { break; }

@mr-martian
Contributor

Also, inserting an extra EOF character into the stream could explain the difference in behavior between a pipe and a single tool. If lt-proc -b has a similar issue and lsx-proc is inserting an extra EOF into the stream, then lt-proc could be reading EOF while feof(input) is still false, trying to output it, and having it cast to ? because -1 isn't an actual character.

unhammer added a commit that referenced this issue Apr 25, 2021
unhammer added a commit that referenced this issue Apr 25, 2021
unhammer added a commit that referenced this issue Apr 25, 2021
@unhammer
Member

unhammer commented Apr 26, 2021

@ftyers @marcriera @hectoralos is it fixed for you in the newest version?

@marcriera
Member

> @ftyers @marcriera @hectoralos is it fixed for you in the newest version?

Yes, after building from source I no longer get the extra ?. Thanks!

@TinoDidriksen
Member

You don't need to build from source; it's already in the nightly builds.

@hectoralos
Member

> @ftyers @marcriera @hectoralos is it fixed for you in the newest version?

Yes, there's no extra "?" now. Thanks a lot, @unhammer!
