Extend Support for Grouped Datasets, Masks, Heatmaps, and Thumbnails #4566

Merged

merged 11 commits from hf-hub-group-datasets into develop on Jul 18, 2024

Conversation

jacobmarks
Contributor

@jacobmarks jacobmarks commented Jul 11, 2024

What changes are proposed in this pull request?

@harpreetsahota204 pointed out that the HF Hub integration did not work with grouped datasets, and that led me to dig into some edge cases.

Now both push_to_hub and load_from_hub should support:

  • grouped datasets
  • all dataset.app_config.media_fields
  • preservation of dataset.app_config.grid_media_field
  • upload and download of segmentation masks stored via mask_path and heatmaps stored via map_path
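The mask_path/map_path support above means an export must collect media files referenced by label fields, not just each sample's primary filepath. A minimal illustration of that idea, using plain dicts as stand-ins for FiftyOne samples (the helper name and traversal here are hypothetical, not the library's actual implementation):

```python
# Illustration only: FiftyOne samples are stood in for by plain dicts, and
# collect_label_media_paths is a hypothetical helper, not the library's code.
def collect_label_media_paths(sample):
    """Gather on-disk media referenced by label fields, i.e. segmentation
    masks stored via mask_path and heatmaps stored via map_path."""
    paths = []
    for value in sample.values():
        if isinstance(value, dict):
            for key in ("mask_path", "map_path"):
                path = value.get(key)
                if path is not None:
                    paths.append(path)
    return paths

sample = {
    "filepath": "/data/images/001.png",
    "ground_truth": {"mask_path": "/data/masks/001.png"},
    "confidence": {"map_path": "/data/heatmaps/001.npy"},
}
print(collect_label_media_paths(sample))
# ['/data/masks/001.png', '/data/heatmaps/001.npy']
```

The point is that a round-trip is only lossless if these secondary files ride along with the primary media during both push and load.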

I've tested this in a bunch of scenarios but could use some more beta testing.


How is this patch tested? If it is not, please explain why.

(Details)

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • Documentation

    • Improved meta tag and anchor tag formatting in HTML templates.
    • Updated import statements in Hugging Face integration documentation for consistency.
  • New Features

    • Added support for handling app_media_fields and grid_media_field attributes in datasets.
  • Bug Fixes

    • Corrected list item closures in HTML templates.
    • Fixed import and chunk size handling in Hugging Face utility functions.
  • Chores

    • Updated database package version to 1.1.4 and added MongoDB download links for Ubuntu 24.

Copy link
Contributor

coderabbitai bot commented Jul 11, 2024

Walkthrough

The recent updates encompass three main areas: CI/CD, documentation, and Hugging Face integration. Pipeline adjustments ensure artifacts are uploaded conditionally by platform type. HTML and RST documentation saw syntax simplifications and clarifications for better readability and accuracy. Enhancements to the Hugging Face utility include improved sample counting, handling media fields, and more efficient dataset processing.

Changes

  • .github/workflows/build-db.yml: Added platform-based conditional checks and refined artifact upload paths.
  • docs/source/_templates/layout.html: Simplified meta tag formatting, adjusted script tag placement, fixed list item closures, and refined anchor tags.
  • docs/source/integrations/huggingface.rst: Updated import statements and usage of the push_to_hub function for better clarity and simplicity.
  • fiftyone/utils/huggingface.py: Enhanced sample counting, media field handling, and dataset processing mechanisms.
  • package/db/setup.py: Updated MongoDB version and incremented package version.

Poem

In the land of code, changes unfold,
New checks and tags, more robust, bold.
Hugging Face whispers, datasets refined,
Clearer docs, easier to find.
A rabbit cheers, in digital delight,
For every update, beamed to light.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ea8774b and 615c708.

Files selected for processing (5)
  • .github/workflows/build-db.yml (1 hunks)
  • docs/source/_templates/layout.html (7 hunks)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
  • package/db/setup.py (2 hunks)
Files skipped from review due to trivial changes (1)
  • docs/source/_templates/layout.html
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (38)
.github/workflows/build-db.yml (1)

69-72: Conditional check for artifact upload is correct.

The addition of the conditional check ensures that artifacts are uploaded only when the platform is 'sdist', and the upload path is correctly restricted to .tar.gz files.

package/db/setup.py (2)

98-101: New MongoDB download links are valid.

The addition of new MongoDB download links for Ubuntu version "24" appears correct.


146-146: Version update is consistent.

The VERSION constant update to "1.1.4" is consistent with the changes made.

fiftyone/utils/huggingface.py (5)

12-12: Import of itertools is appropriate.

The addition of the itertools import is necessary for the chunking logic and other functionalities.


318-319: New attributes are correctly initialized.

The app_media_fields and grid_media_field attributes are correctly initialized in the HFHubDatasetConfig class.


457-458: Correct chunk size handling.

The logic for handling chunk sizes is correct and ensures efficient uploads.


1425-1435: Handling of new attributes is correct.

The logic for handling app_media_fields and grid_media_field attributes is correctly implemented.


670-676: Recursive sample counting is correct.

The logic for recursively counting samples in grouped datasets is correctly implemented.
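The recursive counting under review can be sketched in miniature. The classes below are toy stand-ins for FiftyOne collections (the real _count_samples operates on sample collections and group slices); only the recursion pattern is the point:

```python
# Toy stand-ins for FiftyOne collections; only the recursion pattern
# mirrors the reviewed _count_samples logic.
class FlatView:
    """A non-grouped collection that knows its own sample count."""
    media_type = "image"

    def __init__(self, num_samples):
        self._num_samples = num_samples

    def count(self):
        return self._num_samples


class GroupedView:
    """A grouped collection whose samples live in named slices."""
    media_type = "group"

    def __init__(self, slices):
        self.slices = slices  # dict: slice name -> collection


def count_samples(collection):
    # Grouped collections cannot be counted directly; sum over their slices
    if collection.media_type == "group":
        return sum(count_samples(v) for v in collection.slices.values())
    return collection.count()


grouped = GroupedView({"left": FlatView(4), "right": FlatView(4), "pcd": FlatView(4)})
print(count_samples(grouped))  # 12
```

This is why a plain dataset.count() is insufficient for grouped datasets: the total must be aggregated across slices.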

docs/source/integrations/huggingface.rst (30)

1014, 1046, 1069, 1106, 1142, 1156, 1201, 1233: Import statements updated.

The import statements have been correctly updated to directly import push_to_hub, and the corresponding function calls (1016, 1052, 1077, 1109, 1144, 1160, 1207, 1237) have been updated to match.


1267, 1297, 1333, 1361, 1566, 1604, 1621: Import statements updated.

The import statements have been correctly updated to directly import load_from_hub, and the corresponding function calls (1269, 1299, 1335, 1363, 1568, 1606, 1623) have been updated to match.

Contributor

@benjaminpkane benjaminpkane left a comment


The code update LGTM. Defer to others 🚀

There is a fair amount of nuance, it seems. Adding Hugging Face unit tests in ./tests/unittest might be valuable if the integration is extended in the future.

Contributor

@brimoor brimoor left a comment


@jacobmarks looks like you need to rebase on the latest develop to solve the merge conflict.

Also left a couple tiny comments for you 🤓

@jacobmarks
Contributor Author

> The code update LGTM. Defer to others 🚀
>
> There is a fair amount of nuance, it seems. Adding Hugging Face unit tests in ./tests/unittest might be valuable if the integration is extended in the future.

Sorry about that! Rebasing now!

@jacobmarks jacobmarks force-pushed the hf-hub-group-datasets branch from 615c708 to 7f57f9b on July 17, 2024 at 03:13
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (1)
fiftyone/utils/huggingface.py (1)

Line range hint 318-342: Add documentation for new attributes in HFHubDatasetConfig.

The new attributes app_media_fields and grid_media_field are added to the class but not documented in the class docstring. It's important for maintainability and usability to document these, especially given their significance in the context of this PR.

class HFHubDatasetConfig(Config):
    """Config for a Hugging Face Hub dataset.

    Args:
        name: the name of the dataset
        repo_type: the type of the repository
        repo_id: the identifier of the repository
        revision: the revision of the dataset
        filename: the name of the file
        format: the format of the dataset
        tags: the tags of the dataset
        license: the license of the dataset
        description: the description of the dataset
+       app_media_fields: the media fields visible in the App
+       grid_media_field: the media field to use in the grid view
        fiftyone: the fiftyone version requirement of the dataset
    """
Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 615c708 and 7f57f9b.

Files selected for processing (5)
  • .github/workflows/build-db.yml (1 hunks)
  • docs/source/_templates/layout.html (7 hunks)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
  • package/db/setup.py (2 hunks)
Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/build-db.yml
  • docs/source/_templates/layout.html
  • docs/source/integrations/huggingface.rst
  • package/db/setup.py
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (2)
fiftyone/utils/huggingface.py (2)

Line range hint 1425-1494: Ensure correct handling of media fields in _load_fiftyone_dataset_from_config.

The method now handles app_media_fields and grid_media_field, which are crucial for the integration of new media types. Ensure that these fields are correctly used and that their handling does not introduce any issues, especially when the fields are not set to their default values.

Verification successful

Verified: Correct handling of app_media_fields and grid_media_field in _load_fiftyone_dataset_from_config.

The fields app_media_fields and grid_media_field are correctly set and checked within the method, ensuring proper integration of new media types.

  • app_media_fields and grid_media_field are initialized based on the configuration or default values.
  • They are then applied to the dataset's app configuration if they differ from the default values.
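The verified behavior — apply stored settings only when they differ from the defaults — can be sketched with plain objects. The class and function names below are stand-ins for illustration; the real logic lives in _load_fiftyone_dataset_from_config:

```python
# Stand-in objects; attribute names mirror the PR, but this is not
# FiftyOne's actual code.
DEFAULT_MEDIA_FIELDS = ["filepath"]
DEFAULT_GRID_MEDIA_FIELD = "filepath"


class DatasetAppConfig:
    """Toy stand-in for a dataset's App configuration."""

    def __init__(self):
        self.media_fields = list(DEFAULT_MEDIA_FIELDS)
        self.grid_media_field = DEFAULT_GRID_MEDIA_FIELD


def apply_hub_media_config(app_config, hub_config):
    """Apply stored media-field settings only when they are non-default."""
    fields = hub_config.get("app_media_fields") or DEFAULT_MEDIA_FIELDS
    grid = hub_config.get("grid_media_field") or DEFAULT_GRID_MEDIA_FIELD
    if set(fields) != set(DEFAULT_MEDIA_FIELDS):
        app_config.media_fields = list(fields)
    if grid != DEFAULT_GRID_MEDIA_FIELD:
        app_config.grid_media_field = grid


cfg = DatasetAppConfig()
apply_hub_media_config(
    cfg,
    {
        "app_media_fields": ["filepath", "thumbnail_path"],
        "grid_media_field": "thumbnail_path",
    },
)
print(cfg.media_fields, cfg.grid_media_field)
# ['filepath', 'thumbnail_path'] thumbnail_path
```

A config that omits both keys leaves the App configuration untouched, which is the default-preserving behavior the review confirms.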
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correct handling of `app_media_fields` and `grid_media_field`.

# Test: Search for the usage of these fields in the method. Expect: Correct handling and usage.
ast-grep --lang python --pattern $'def _load_fiftyone_dataset_from_config($_) { $$$ app_media_fields $$$ grid_media_field $$$ }'

Length of output: 474


Script:

#!/bin/bash
# Description: Verify the usage of `app_media_fields` and `grid_media_field` within the `_load_fiftyone_dataset_from_config` method.

# Search for the definition of `_load_fiftyone_dataset_from_config` method
ast-grep --lang python --pattern 'def _load_fiftyone_dataset_from_config($_) { $$$ }' -A 50

# Search for the usage of `app_media_fields` within the method
rg 'app_media_fields' -A 5

# Search for the usage of `grid_media_field` within the method
rg 'grid_media_field' -A 5

Length of output: 14066


133-133: Ensure proper chunk size handling in push_to_hub.

The logic to automatically set the chunk size based on the number of samples is a good addition for performance optimization. However, ensure that the _count_samples function is thoroughly tested, especially since it affects how datasets are chunked and uploaded.

Verification successful

Ensure proper chunk size handling in push_to_hub.

The logic to automatically set the chunk size based on the number of samples is a good addition for performance optimization. The _count_samples function is defined and used appropriately within the fiftyone/utils/huggingface.py file. Ensure that this function is thoroughly tested, especially since it affects how datasets are chunked and uploaded.

  • Definition: fiftyone/utils/huggingface.py, line with def _count_samples(sample_collection):
  • Usage: fiftyone/utils/huggingface.py, multiple lines including the one under review
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.

# Test: Search for the function implementation and usage. Expect: Proper implementation and usage in `push_to_hub`.
ast-grep --lang python --pattern $'def _count_samples($_, $_) { $$$ }'

Length of output: 528


Script:

#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.

# Test: Search for the function implementation and usage. Expect: Proper implementation and usage in `push_to_hub`.
ast-grep --lang python --pattern $'def _count_samples($_) { $$$ }'

Length of output: 520


Script:

#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.

# Test: Search for the function implementation and usage using a simpler pattern.
ast-grep --lang python --pattern $'def _count_samples($_)'

# Fallback: Use `rg` to search for the function definition and its usage.
rg '_count_samples'

Length of output: 770

Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7f57f9b and 661ad48.

Files selected for processing (2)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
Files skipped from review as they are similar to previous changes (1)
  • docs/source/integrations/huggingface.rst
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (3)
fiftyone/utils/huggingface.py (3)

12-12: Approved addition of itertools import.

This import is used effectively in the file for handling complex iterations, such as in the _get_files_to_download function.


133-133: Approved modification to chunk size logic.

The updated logic correctly sets a default chunk size for large datasets, which is crucial for managing file system limitations on the Hugging Face Hub.
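The approved rule — chunk automatically only for large FiftyOneDataset exports with no explicit chunk_size — can be sketched as below. The 10,000-sample threshold appears in a suggestion diff elsewhere in this conversation; the default chunk size of 1,000 is an illustrative assumption, not necessarily the value in fiftyone/utils/huggingface.py:

```python
# Illustrative sketch; AUTO_CHUNK_THRESHOLD matches the 10,000-sample check
# discussed in review, but DEFAULT_CHUNK_SIZE is an assumed placeholder.
AUTO_CHUNK_THRESHOLD = 10_000
DEFAULT_CHUNK_SIZE = 1_000  # assumption, not FiftyOne's actual default


def resolve_chunk_size(num_samples, chunk_size=None):
    """Respect an explicit chunk_size; otherwise enable chunking only for
    datasets large enough to strain the Hub's per-directory file limits."""
    if chunk_size is not None:
        return chunk_size
    if num_samples > AUTO_CHUNK_THRESHOLD:
        return DEFAULT_CHUNK_SIZE
    return None  # small dataset: no chunking needed


print(resolve_chunk_size(50_000))                  # 1000
print(resolve_chunk_size(500))                     # None
print(resolve_chunk_size(50_000, chunk_size=250))  # 250
```

Splitting a large export into fixed-size chunks keeps any single repository directory from accumulating tens of thousands of files.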

Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)


Line range hint 318-342: Approved addition of media field attributes in HFHubDatasetConfig.

The new attributes app_media_fields and grid_media_field are well-implemented and align with the PR's objectives to handle dataset configurations more effectively.

@jacobmarks
Contributor Author

@brimoor let me know if this is acceptable, or if I should just cherry-pick the commits on a new branch

@jacobmarks jacobmarks requested a review from brimoor July 17, 2024 03:39
@brimoor brimoor force-pushed the hf-hub-group-datasets branch from 661ad48 to 2c9d5c6 on July 17, 2024 at 20:22
Contributor

@brimoor brimoor left a comment


LGTM

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 661ad48 and 4f89f18.

Files selected for processing (2)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
Files skipped from review as they are similar to previous changes (1)
  • docs/source/integrations/huggingface.rst
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (12)
fiftyone/utils/huggingface.py (12)

133-133: Optimize nested if statements.

Use a single if statement instead of nested if statements for better readability.

-  if dataset_type == fot.FiftyOneDataset and chunk_size is None:
-    if _count_samples(dataset) > 10000:
+  if dataset_type == fot.FiftyOneDataset and chunk_size is None and _count_samples(dataset) > 10000:
Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)


133-133: LGTM!

The use of _count_samples instead of dataset.count() is appropriate for handling grouped datasets.

Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)


Line range hint 318-341: LGTM!

The new attributes app_media_fields and grid_media_field are well-defined and initialized correctly.


457-462: LGTM!

The changes ensure proper handling of chunk sizes when uploading data to the repository.


546-552: LGTM!

The changes ensure that the configuration file includes the new attributes app_media_fields and grid_media_field if they are different from the default values.


630-630: LGTM!

The use of _count_samples instead of dataset.count() is appropriate for handling grouped datasets.


670-676: LGTM!

The function _count_samples is well-implemented and ensures proper counting of samples in grouped datasets.


929-934: LGTM!

The changes ensure correct formatting of the API URL.


945-950: LGTM!

The changes ensure correct formatting of the API URL.


1350-1357: LGTM!

The function _get_media_fields is well-implemented and ensures proper retrieval of media fields.


1360-1412: LGTM!

The changes ensure proper handling of different media fields and exceptions while retrieving files to download.

Tools
Ruff

1410-1410: Do not use bare except

(E722)


1410-1410: Avoid bare except statements.

Replace the bare except statement with a specific exception to avoid catching unintended exceptions.

-  except:
+  except Exception:
Tools
Ruff

1410-1410: Do not use bare except

(E722)

@jacobmarks jacobmarks merged commit cac9c1a into develop Jul 18, 2024
13 checks passed
@jacobmarks jacobmarks deleted the hf-hub-group-datasets branch July 18, 2024 18:00