Improve commoncrawl components #403

Merged
merged 12 commits into from
Sep 13, 2023


RobbeSneyders
Member

This is the version currently running (check the commits in the specs)

Contributor

@mrchtr mrchtr left a comment


Thanks @RobbeSneyders. LGTM, apart from the change regarding metadata writing. Maybe we can exclude this change before we merge the PR.

@@ -261,7 +261,6 @@ def _create_write_task(
 schema=schema,
 overwrite=False,
 compute=False,
-write_metadata_file=True,
Contributor


We should remember to investigate the impact of this change later, in particular how writing the metadata file affects memory release.

@PhilippeMoussalli do you think this change has an impact on existing components/pipelines?

Contributor


I think this was False for a long time and was only recently introduced. I don't have the full context on why it was removed again. What does it offer?

@mrchtr mrchtr merged commit b72e252 into main Sep 13, 2023
@mrchtr mrchtr deleted the feature/improve-commoncrawl branch September 13, 2023 06:04
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Improved commoncrawl download components for the license-free image use case.
RobbeSneyders added a commit that referenced this pull request Feb 20, 2024
We currently don't preserve the divisions of the data when writing and
reading again, which leads to errors when merging datasets with a low
and a high number of partitions. This PR enables the writing of a metadata
file, which should fix this.

This was originally introduced in #391, which contains more information,
but was later reverted in #403 without clear reasoning.

Let's reactivate it, and if there's a reason to remove it again, let's
document it properly.
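
To make the point about divisions concrete, here is a toy model in plain Python. It assumes "divisions" means Dask DataFrame divisions: a tuple of `npartitions + 1` sorted index boundaries, or a tuple of `None` values when they are unknown. The helper `partition_for` is hypothetical and is not code from this repository or from Dask; it only illustrates why two datasets with different partition counts can be aligned when divisions are known, and why that is not possible once they are lost.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def partition_for(index_value: int, divisions: Sequence[Optional[int]]) -> int:
    """Return the number of the partition that holds `index_value`.

    Hypothetical helper for illustration only. Partition i covers
    [divisions[i], divisions[i + 1]), except the last partition,
    whose upper bound is inclusive.
    """
    if any(d is None for d in divisions):
        # Unknown divisions: the boundaries were not preserved, so the
        # partition cannot be located without scanning the data itself.
        raise ValueError("divisions are unknown")
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(index_value)
    # Clamp so that the inclusive final boundary maps to the last partition.
    return min(bisect_right(divisions, index_value) - 1, len(divisions) - 2)


few = (0, 50, 100)            # dataset with 2 partitions
many = (0, 25, 50, 75, 100)   # dataset with 4 partitions
unknown = (None, None, None)  # divisions lost after a write/read round trip

# With known divisions, the same index value can be located in both datasets,
# even though their partition counts differ.
print(partition_for(60, few), partition_for(60, many))
```

With `unknown`, the same call raises `ValueError`, which mirrors the situation described in the commit message above: without preserved divisions there are no boundaries to align on.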