-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve commoncrawl components #403
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RobbeSneyders. Lgtm, beside the change regarding metadata writing. Maybe we can exclude this change before we merge the PR.
@@ -261,7 +261,6 @@ def _create_write_task( | |||
schema=schema, | |||
overwrite=False, | |||
compute=False, | |||
write_metadata_file=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep in mind to investigate the impact of this change later. In particular to answer the question which impact the writing metadata file affects memory release.
@PhilippeMoussalli do you think this change has an impact on existing components/pipelines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was False for a a long time and just recently was introduced. Don't have the full context on why it was removed again. What does it offer?
Improved commoncrawl download components for the license-free image use case.
We currently don't preserve the divisions of the data when writing and reading again, which leads to errors when merging datasets with a low and high amount of partitions. This PR enables the writing of a metadata file which should fix this. This was originally introduced in #391, which contains more information, but then later reverted in #403 without a clear reasoning. Let's reactivate it, and if there's a reason to remove it again, let's document it properly.
This is the version currently running (check the commits in the specs)