Memory Issue: opus_read fails to extract MultiCCAligned #21

Closed
aflueckiger opened this issue Mar 17, 2021 · 1 comment

aflueckiger commented Mar 17, 2021

Using v1.2.1, the following command successfully downloads the MultiCCAligned resources. After the download, however, the conversion to Moses format fails without any error message because the process runs out of memory (RAM).

opus_read --directory MultiCCAligned -r v1 --source en --target de --write en-de.en en-de.de --write_mode moses

opus_read seems to read the entire dataset into memory: memory usage climbs above 60 GB before the process dies.

A similar command for the WMT-News dataset works:

opus_read --directory WMT-News -r v2019 --source en --target de --write en-de.en en-de.de --write_mode moses
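
To illustrate what a constant-memory conversion could look like, here is a minimal sketch of writing Moses format (two parallel plain-text files, one sentence per line) from a lazily evaluated stream of sentence pairs. This is not how opus_read is implemented; `iter_pairs` and `write_moses` are hypothetical names, and `iter_pairs` merely stands in for a parser that yields aligned pairs one at a time instead of collecting them in a list.

```python
import os
import tempfile
from typing import Iterable, Iterator, Tuple


def iter_pairs(n: int) -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for a parser that yields aligned pairs lazily."""
    for i in range(n):
        yield f"source sentence {i}", f"target sentence {i}"


def write_moses(pairs: Iterable[Tuple[str, str]], src_path: str, tgt_path: str) -> int:
    """Write each pair as soon as it arrives, so only one pair is held in memory."""
    count = 0
    with open(src_path, "w", encoding="utf-8") as src:
        with open(tgt_path, "w", encoding="utf-8") as tgt:
            for source, target in pairs:
                # Moses format needs exactly one sentence per line in each file,
                # so embedded newlines are flattened to spaces.
                src.write(source.replace("\n", " ") + "\n")
                tgt.write(target.replace("\n", " ") + "\n")
                count += 1
    return count


# Small demonstration in a temporary directory (file names mirror the commands above).
with tempfile.TemporaryDirectory() as tmp:
    written = write_moses(
        iter_pairs(3),
        os.path.join(tmp, "en-de.en"),
        os.path.join(tmp, "en-de.de"),
    )
    print(written, "sentence pairs written")
```

Because the generator is consumed pair by pair, peak memory in this sketch depends on the longest sentence pair rather than on the size of the corpus.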

Thanks for this library. A tool to collect and filter the ever-increasing datasets is of great use.

@aflueckiger

The issue still exists, but I am closing this in favor of #32.
