opus_read fails to extract CCMatrix #32
Comments
I guess it's because CCMatrix is so big that reading just the sentence IDs and links to be retrieved from the monolingual corpora takes too much memory to run efficiently. We don't have a good solution for this at the moment, but we should make the tools more robust so that they also run on bigger data sets. The workaround with moses files is the only solution I can recommend at the moment.
Why can't the "preprocess" parameter be set to "Moses" directly? I mean, "xml", "raw" and "parsed" are all relatively time-consuming.
There is now Regarding moses, it is also possible to download moses files with the I'm still leaving this issue open until we find a better solution for processing huge documents.
I tried to extract the aligned sentence pairs from CCMatrix, previously downloaded using opus_express. The command I used was
The command runs for several days at 100% CPU, without producing any output. Perhaps expat is choking on some error in the data. To rule out package corruption after download, I allowed opus_read to download it again, with the same hanging result.
Workaround: (re)download the corpus directly in moses format from https://opus.nlpl.eu/CCMatrix.php
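For anyone using the workaround, here is a minimal sketch of how the moses files could be read without loading them into memory. It assumes the download unpacks to two plain-text files that are aligned line by line (one sentence per line, line N of one file matching line N of the other). The function name and the file names in the usage comment are made up for illustration and are not part of OpusTools.

```python
def iter_moses_pairs(src_path, tgt_path, encoding="utf-8"):
    """Lazily yield (source, target) sentence pairs from two line-aligned files.

    Hypothetical helper, not part of OpusTools. Both files are read one line
    at a time, so memory use stays flat however large the corpus is.
    """
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        # zip() stops at the end of the shorter file; a length mismatch
        # would mean the two files are not properly aligned.
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


# Example usage (file names are assumptions, adjust to the unpacked archive):
# for en, de in iter_moses_pairs("CCMatrix.de-en.en", "CCMatrix.de-en.de"):
#     ...
```

Since this is a generator, it can be combined with itertools.islice to inspect the first few pairs before committing to a full pass over the data.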