Strange lines in eng.tagged corpus #20

Open
AMR-KELEG opened this issue May 21, 2019 · 4 comments
Comments

@AMR-KELEG

I am currently using the texts/eng.tagged file to test the new weighting algorithms.
While using it, I noticed that some lines contain nothing but a single double-quote character!
(Example: https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L823)

^the/the<det><def><sp>$
"
^golden/golden<adj>$
^axe/axe<n><sg>$
"
^competition/competition<n><sg>$

Should these lines be fixed?
I don't want to handle this in my script if it's a bug in the tagged corpus, and I believe fixing these lines is a simple find-and-replace that any text editor can do.

@AMR-KELEG changed the title from "Strange lines in eng.tagged data" to "Strange lines in eng.tagged corpus" on May 21, 2019
@unhammer
Member

I'm guessing the analyser had neither " in its alphabet nor any analysis of "; in those cases, lt-proc simply outputs the symbol as-is without wrapping it in ^"/"…$.
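
To see how many such unwrapped symbols a tagged file contains, something like the following could work. It is only a rough sketch: it assumes the file has one item per line, as in the excerpt from the issue, and the helper name `split_tagged_lines` is made up for illustration and is not part of any Apertium tool.

```python
def split_tagged_lines(lines):
    """Separate ^...$ lexical-unit lines from everything else.

    Assumes one item per line; a line counts as a lexical unit
    only if it starts with ^ and ends with $.
    """
    units, stray = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith("^") and line.endswith("$"):
            units.append(line)
        else:
            stray.append((number, line))  # keep the line number for reporting
    return units, stray


# The excerpt quoted in the issue.
sample = [
    "^the/the<det><def><sp>$",
    '"',
    "^golden/golden<adj>$",
    "^axe/axe<n><sg>$",
    '"',
    "^competition/competition<n><sg>$",
]
units, stray = split_tagged_lines(sample)
print(len(units), stray)
```

For a real file, the list could be replaced with an open file handle, since the function only iterates over lines.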

If you want to handle the apertium stream format, you should expect to see this kind of thing all the time. You could use
http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream/
to get the relevant stuff out:

$ echo '^foo/bar<fie>$ " [hippopotamus] \["^ga/ga<ga>$'|apertium-cleanstream -n

^foo/bar<fie>$

^ga/ga<ga>$
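
If pulling in another tool is not an option, the same idea can be approximated in a few lines of Python. This is a sketch of the general approach, not the apertium-cleanstream implementation: the function name `lexical_units` is hypothetical, and it assumes only that `^...$` delimits lexical units, `[...]` delimits superblanks, and a backslash escapes the next character.

```python
def lexical_units(stream):
    """Return every ^...$ unit found in an Apertium-style stream.

    Unwrapped characters, [superblanks], and backslash-escaped
    characters outside units are skipped.
    """
    units = []
    i, n = 0, len(stream)
    while i < n:
        ch = stream[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
        elif ch == "[":
            # superblank: advance to the closing unescaped ]
            i += 1
            while i < n and stream[i] != "]":
                i += 2 if stream[i] == "\\" else 1
            i += 1
        elif ch == "^":
            # lexical unit: advance to the closing unescaped $
            start = i
            i += 1
            while i < n and stream[i] != "$":
                i += 2 if stream[i] == "\\" else 1
            if i < n:
                units.append(stream[start:i + 1])
            i += 1
        else:
            i += 1  # anything else is an unwrapped character
    return units


sample = r'^foo/bar<fie>$ " [hippopotamus] \["^ga/ga<ga>$'
for unit in lexical_units(sample):
    print(unit)
```

On the sample input from the command above, this prints the two lexical units and drops the stray quote, the superblank, and the escaped bracket.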

@unhammer
Member

(speaking of, we should probably get apertium-cleanstream into https://github.com/apertium/apertium/ )

@AMR-KELEG
Author

> (speaking of, we should probably get apertium-cleanstream into https://github.com/apertium/apertium/ )

I should open an issue there, shouldn't I?

@unhammer
Member

unhammer commented May 21, 2019 via email
