opus_read fails to extract CCMatrix #32

Waino · 2021-11-18T07:26:57Z

I tried to extract the aligned sentence pairs from CCMatrix, which I had previously downloaded using opus_express. The command I used was:

opus_read --source en --target fi --directory CCMatrix --preprocess xml --leave_non_alignments_out --write_mode moses --write CCMatrix.raw.en CCMatrix.raw.fi --write_ids CCMatrix.raw.ids

The command ran for several days at 100% CPU without producing any output. Perhaps expat is choking on some error in the data. To rule out a corrupted download, I let opus_read download the corpus again; it hung in the same way.

Traceback when killed:
  File "/home/stiggronroos/venvs/opustools/bin/opus_read", line 135, in <module>
    OpusRead(**vars(args)).printPairs()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/opus_read.py", line 214, in printPairs
    self.alignmentParser.collect_links()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/alignment_parser.py", line 107, in collect_links
    blocks = self.bp.get_complete_blocks()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 98, in get_complete_blocks
    self.parse_line(line)
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 82, in parse_line
    self.p.Parse(line)
KeyboardInterrupt

Workaround: (re)download the corpus directly in moses format from https://opus.nlpl.eu/CCMatrix.php


jorgtied · 2021-11-23T20:08:48Z

I guess it's because CCMatrix is so big that just reading the sentence IDs and the links to be retrieved from the monolingual corpora takes too much memory to run efficiently. We don't have a good solution for this at the moment, but we should make the tools more robust so that they also run on bigger data sets. The workaround with the moses files is the only solution I can recommend at the moment.

shaoyangxu · 2021-11-27T13:03:16Z

Why can't the "preprocess" parameter be set to "moses" directly? I mean, "xml", "raw" and "parsed" are all relatively time-consuming.

miau1 · 2022-11-29T15:23:44Z

There is now a --chunk_size parameter to control memory consumption, although the current implementation is still slow for corpora with huge documents.
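To illustrate the general idea of processing alignment links in bounded chunks, here is a minimal sketch. It is not the opustools implementation and says nothing about how --chunk_size works internally. It assumes an OPUS-style alignment file whose `link` elements carry an `xtargets="src ids;tgt ids"` attribute; the function name `iter_link_chunks` and the default chunk size are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def iter_link_chunks(source, chunk_size=1000):
    """Yield lists of at most chunk_size (source_ids, target_ids) tuples.

    Hypothetical helper: assumes <link xtargets="1 2;3"/> style elements.
    """
    chunk = []
    parents = []  # stack of elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            parents.append(elem)
            continue
        parents.pop()  # elem itself has just been closed
        if elem.tag == "link":
            src, _, tgt = elem.get("xtargets", "").partition(";")
            chunk.append((src.split(), tgt.split()))
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if parents:
            # Detach the finished element from its parent so that a single
            # huge link group does not keep every parsed link in memory.
            parents[-1].remove(elem)
    if chunk:
        yield chunk


# Tiny in-memory example; real alignment files would be opened from disk.
sample = (
    b'<cesAlign><linkGrp fromDoc="a" toDoc="b">'
    b'<link xtargets="1;1" id="SL1"/>'
    b'<link xtargets="2 3;2" id="SL2"/>'
    b'</linkGrp></cesAlign>'
)
for links in iter_link_chunks(io.BytesIO(sample), chunk_size=1):
    print(links)
```

Detaching each finished element keeps the parsed tree small even when the whole corpus is stored as one document, but the XML still has to be scanned from start to end, so this bounds memory rather than run time.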

Regarding moses, it is also possible to download moses files with the opus_get script. For example:
to list available files:
opus_get -s en -t fi -d CCMatrix -p moses -l
to download the files:
opus_get -s en -t fi -d CCMatrix -p moses
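Once the moses files are unpacked, the sentence pairs can be read line by line without any XML parsing. A minimal sketch, assuming two plain-text files with one sentence per line and the same number of lines; the helper name `iter_moses_pairs` and the file names in the usage comment are hypothetical, not taken from the download:

```python
from itertools import zip_longest


def iter_moses_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    missing = object()  # sentinel marking the shorter file running out
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        pairs = zip_longest(src, tgt, fillvalue=missing)
        for line_no, (s, t) in enumerate(pairs, start=1):
            if s is missing or t is missing:
                raise ValueError(
                    f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Hypothetical usage; the file names are assumptions:
# for en, fi in iter_moses_pairs("CCMatrix.en-fi.en", "CCMatrix.en-fi.fi"):
#     ...
```

Both files are streamed, so memory use stays constant regardless of corpus size.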

I'm still leaving this issue open until we find a better solution for processing huge documents.

svirpioj mentioned this issue Apr 5, 2022

Process Killed Helsinki-NLP/OpusFilter#46

Closed

aflueckiger mentioned this issue Apr 26, 2022

Memory Issue: opus_read fails to extract MultiCCAligned #21

Closed
