Extend Support for Grouped Datasets, Masks, Heatmaps, and Thumbnails #4566

Merged

merged 11 commits from hf-hub-group-datasets into develop on Jul 18, 2024

Conversation

jacobmarks
Contributor

@jacobmarks jacobmarks commented Jul 11, 2024

What changes are proposed in this pull request?

@harpreetsahota204 pointed out that the HF Hub integration did not work with grouped datasets, and that led me to dig into some edge cases.

Now both push_to_hub and load_from_hub should support:

  • grouped datasets
  • all dataset.app_config.media_fields
  • preservation of dataset.app_config.grid_media_field
  • upload and download of segmentation masks stored via mask_path and heatmaps stored via map_path
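The mask_path/map_path support above means an export must collect media files referenced by label fields, not just each sample's primary filepath. A minimal illustration of that idea, using plain dicts as stand-ins for FiftyOne samples (the helper name and traversal here are hypothetical, not the library's actual implementation):

```python
# Illustration only: FiftyOne samples are stood in for by plain dicts, and
# collect_label_media_paths is a hypothetical helper, not the library's code.
def collect_label_media_paths(sample):
    """Gather on-disk media referenced by label fields, i.e. segmentation
    masks stored via mask_path and heatmaps stored via map_path."""
    paths = []
    for value in sample.values():
        if isinstance(value, dict):
            for key in ("mask_path", "map_path"):
                path = value.get(key)
                if path is not None:
                    paths.append(path)
    return paths

sample = {
    "filepath": "/data/images/001.png",
    "ground_truth": {"mask_path": "/data/masks/001.png"},
    "confidence": {"map_path": "/data/heatmaps/001.npy"},
}
print(collect_label_media_paths(sample))
# ['/data/masks/001.png', '/data/heatmaps/001.npy']
```

The point is that a round-trip is only lossless if these secondary files ride along with the primary media during both push and load.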

I've tested this in a bunch of scenarios but could use some more beta testing.


How is this patch tested? If it is not, please explain why.

(Details)

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • Documentation

    • Improved meta tag and anchor tag formatting in HTML templates.
    • Updated import statements in Hugging Face integration documentation for consistency.
  • New Features

    • Added support for handling app_media_fields and grid_media_field attributes in datasets.
  • Bug Fixes

    • Corrected list item closures in HTML templates.
    • Fixed import and chunk size handling in Hugging Face utility functions.
  • Chores

    • Updated database package version to 1.1.4 and added MongoDB download links for Ubuntu 24.

Copy link
Contributor

coderabbitai bot commented Jul 11, 2024

Walkthrough

The recent updates encompass three main areas: CI/CD, documentation, and Hugging Face integration. Pipeline adjustments ensure artifacts are uploaded conditionally by platform type. HTML and RST documentation saw syntax simplifications and clarifications for better readability and accuracy. Enhancements to the Hugging Face utility include improved sample counting, handling media fields, and more efficient dataset processing.

Changes

  • .github/workflows/build-db.yml: Added platform-based conditional checks and refined artifact upload paths.
  • docs/source/_templates/layout.html: Simplified meta tag formatting, adjusted script tag placement, fixed list item closures, and refined anchor tags.
  • docs/source/integrations/huggingface.rst: Updated import statements and usage of the push_to_hub function for better clarity and simplicity.
  • fiftyone/utils/huggingface.py: Enhanced sample counting, media field handling, and dataset processing mechanisms.
  • package/db/setup.py: Updated MongoDB version and incremented package version.

Poem

In the land of code, changes unfold,
New checks and tags, more robust, bold.
Hugging Face whispers, datasets refined,
Clearer docs, easier to find.
A rabbit cheers, in digital delight,
For every update, beamed to light.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ea8774b and 615c708.

Files selected for processing (5)
  • .github/workflows/build-db.yml (1 hunks)
  • docs/source/_templates/layout.html (7 hunks)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
  • package/db/setup.py (2 hunks)
Files skipped from review due to trivial changes (1)
  • docs/source/_templates/layout.html
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (38)
.github/workflows/build-db.yml (1)

69-72: Conditional check for artifact upload is correct.

The addition of the conditional check ensures that artifacts are uploaded only when the platform is 'sdist', and the upload path is correctly restricted to .tar.gz files.

package/db/setup.py (2)

98-101: New MongoDB download links are valid.

The addition of new MongoDB download links for Ubuntu version "24" appears correct.


146-146: Version update is consistent.

The VERSION constant update to "1.1.4" is consistent with the changes made.

fiftyone/utils/huggingface.py (5)

12-12: Import of itertools is appropriate.

The addition of the itertools import is necessary for the chunking logic and other functionalities.


318-319: New attributes are correctly initialized.

The app_media_fields and grid_media_field attributes are correctly initialized in the HFHubDatasetConfig class.


457-458: Correct chunk size handling.

The logic for handling chunk sizes is correct and ensures efficient uploads.


1425-1435: Handling of new attributes is correct.

The logic for handling app_media_fields and grid_media_field attributes is correctly implemented.


670-676: Recursive sample counting is correct.

The logic for recursively counting samples in grouped datasets is correctly implemented.
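The recursive counting under review can be sketched in miniature. The classes below are toy stand-ins for FiftyOne collections (the real _count_samples operates on sample collections and group slices); only the recursion pattern is the point:

```python
# Toy stand-ins for FiftyOne collections; only the recursion pattern
# mirrors the reviewed _count_samples logic.
class FlatView:
    """A non-grouped collection that knows its own sample count."""
    media_type = "image"

    def __init__(self, num_samples):
        self._num_samples = num_samples

    def count(self):
        return self._num_samples


class GroupedView:
    """A grouped collection whose samples live in named slices."""
    media_type = "group"

    def __init__(self, slices):
        self.slices = slices  # dict: slice name -> collection


def count_samples(collection):
    # Grouped collections cannot be counted directly; sum over their slices
    if collection.media_type == "group":
        return sum(count_samples(v) for v in collection.slices.values())
    return collection.count()


grouped = GroupedView({"left": FlatView(4), "right": FlatView(4), "pcd": FlatView(4)})
print(count_samples(grouped))  # 12
```

This is why a plain dataset.count() is insufficient for grouped datasets: the total must be aggregated across slices.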

docs/source/integrations/huggingface.rst (30)

1014, 1046, 1069, 1106, 1142, 1156, 1201, 1233: Import statements updated.

The import statements have been correctly updated to directly import push_to_hub, and the corresponding function calls (1016, 1052, 1077, 1109, 1144, 1160, 1207, 1237) have been updated to match.


1267, 1297, 1333, 1361, 1566, 1604, 1621: Import statements updated.

The import statements have been correctly updated to directly import load_from_hub, and the corresponding function calls (1269, 1299, 1335, 1363, 1568, 1606, 1623) have been updated to match.

Contributor

@benjaminpkane benjaminpkane left a comment


The code update LGTM. Defer to others 🚀

There is a fair amount of nuance, it seems. Adding Hugging Face unit tests in ./tests/unittest might be valuable if the integration is extended in the future.

Contributor

@brimoor brimoor left a comment


@jacobmarks looks like you need to rebase on the latest develop to solve the merge conflict.

Also left a couple tiny comments for you 🤓

@jacobmarks
Contributor Author

> The code update LGTM. Defer to others 🚀
>
> There is a fair amount of nuance, it seems. Adding Hugging Face unit tests in ./tests/unittest might be valuable if the integration is extended in the future.

Sorry about that! Rebasing now!

@jacobmarks jacobmarks force-pushed the hf-hub-group-datasets branch from 615c708 to 7f57f9b on July 17, 2024 at 03:13
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (1)
fiftyone/utils/huggingface.py (1)

Line range hint 318-342: Add documentation for new attributes in HFHubDatasetConfig.

The new attributes app_media_fields and grid_media_field are added to the class but not documented in the class docstring. It's important for maintainability and usability to document these, especially given their significance in the context of this PR.

class HFHubDatasetConfig(Config):
    """Config for a Hugging Face Hub dataset.

    Args:
        name: the name of the dataset
        repo_type: the type of the repository
        repo_id: the identifier of the repository
        revision: the revision of the dataset
        filename: the name of the file
        format: the format of the dataset
        tags: the tags of the dataset
        license: the license of the dataset
        description: the description of the dataset
+       app_media_fields: the media fields visible in the App
+       grid_media_field: the media field to use in the grid view
        fiftyone: the fiftyone version requirement of the dataset
    """
Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 615c708 and 7f57f9b.

Files selected for processing (5)
  • .github/workflows/build-db.yml (1 hunks)
  • docs/source/_templates/layout.html (7 hunks)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
  • package/db/setup.py (2 hunks)
Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/build-db.yml
  • docs/source/_templates/layout.html
  • docs/source/integrations/huggingface.rst
  • package/db/setup.py
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (2)
fiftyone/utils/huggingface.py (2)

Line range hint 1425-1494: Ensure correct handling of media fields in _load_fiftyone_dataset_from_config.

The method now handles app_media_fields and grid_media_field, which are crucial for the integration of new media types. Ensure that these fields are correctly used and that their handling does not introduce any issues, especially when the fields are not set to their default values.

Verification successful

Verified: Correct handling of app_media_fields and grid_media_field in _load_fiftyone_dataset_from_config.

The fields app_media_fields and grid_media_field are correctly set and checked within the method, ensuring proper integration of new media types.

  • app_media_fields and grid_media_field are initialized based on the configuration or default values.
  • They are then applied to the dataset's app configuration if they differ from the default values.
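The verified behavior — apply stored settings only when they differ from the defaults — can be sketched with plain objects. The class and function names below are stand-ins for illustration; the real logic lives in _load_fiftyone_dataset_from_config:

```python
# Stand-in objects; attribute names mirror the PR, but this is not
# FiftyOne's actual code.
DEFAULT_MEDIA_FIELDS = ["filepath"]
DEFAULT_GRID_MEDIA_FIELD = "filepath"


class DatasetAppConfig:
    """Toy stand-in for a dataset's App configuration."""

    def __init__(self):
        self.media_fields = list(DEFAULT_MEDIA_FIELDS)
        self.grid_media_field = DEFAULT_GRID_MEDIA_FIELD


def apply_hub_media_config(app_config, hub_config):
    """Apply stored media-field settings only when they are non-default."""
    fields = hub_config.get("app_media_fields") or DEFAULT_MEDIA_FIELDS
    grid = hub_config.get("grid_media_field") or DEFAULT_GRID_MEDIA_FIELD
    if set(fields) != set(DEFAULT_MEDIA_FIELDS):
        app_config.media_fields = list(fields)
    if grid != DEFAULT_GRID_MEDIA_FIELD:
        app_config.grid_media_field = grid


cfg = DatasetAppConfig()
apply_hub_media_config(
    cfg,
    {
        "app_media_fields": ["filepath", "thumbnail_path"],
        "grid_media_field": "thumbnail_path",
    },
)
print(cfg.media_fields, cfg.grid_media_field)
# ['filepath', 'thumbnail_path'] thumbnail_path
```

A config that omits both keys leaves the App configuration untouched, which is the default-preserving behavior the review confirms.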
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correct handling of `app_media_fields` and `grid_media_field`.

# Test: Search for the usage of these fields in the method. Expect: Correct handling and usage.
ast-grep --lang python --pattern $'def _load_fiftyone_dataset_from_config($_) { $$$ app_media_fields $$$ grid_media_field $$$ }'

Length of output: 474


Script:

#!/bin/bash
# Description: Verify the usage of `app_media_fields` and `grid_media_field` within the `_load_fiftyone_dataset_from_config` method.

# Search for the definition of `_load_fiftyone_dataset_from_config` method
ast-grep --lang python --pattern 'def _load_fiftyone_dataset_from_config($_) { $$$ }' -A 50

# Search for the usage of `app_media_fields` within the method
rg 'app_media_fields' -A 5

# Search for the usage of `grid_media_field` within the method
rg 'grid_media_field' -A 5

Length of output: 14066


133-133: Ensure proper chunk size handling in push_to_hub.

The logic to automatically set the chunk size based on the number of samples is a good addition for performance optimization. However, ensure that the _count_samples function is thoroughly tested, especially since it affects how datasets are chunked and uploaded.

Verification successful

Ensure proper chunk size handling in push_to_hub.

The logic to automatically set the chunk size based on the number of samples is a good addition for performance optimization. The _count_samples function is defined and used appropriately within the fiftyone/utils/huggingface.py file. Ensure that this function is thoroughly tested, especially since it affects how datasets are chunked and uploaded.

  • Definition: fiftyone/utils/huggingface.py, line with def _count_samples(sample_collection):
  • Usage: fiftyone/utils/huggingface.py, multiple lines including the one under review
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.

# Test: Search for the function implementation and usage. Expect: Proper implementation and usage in `push_to_hub`.
ast-grep --lang python --pattern $'def _count_samples($_, $_) { $$$ }'

Length of output: 528


Script:

#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.

# Test: Search for the function implementation and usage. Expect: Proper implementation and usage in `push_to_hub`.
ast-grep --lang python --pattern $'def _count_samples($_) { $$$ }'

Length of output: 520


Script:

#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.

# Test: Search for the function implementation and usage using a simpler pattern.
ast-grep --lang python --pattern $'def _count_samples($_)'

# Fallback: Use `rg` to search for the function definition and its usage.
rg '_count_samples'

Length of output: 770

Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7f57f9b and 661ad48.

Files selected for processing (2)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
Files skipped from review as they are similar to previous changes (1)
  • docs/source/integrations/huggingface.rst
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (3)
fiftyone/utils/huggingface.py (3)

12-12: Approved addition of itertools import.

This import is used effectively in the file for handling complex iterations, such as in the _get_files_to_download function.


133-133: Approved modification to chunk size logic.

The updated logic correctly sets a default chunk size for large datasets, which is crucial for managing file system limitations on the Hugging Face Hub.
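The approved rule — chunk automatically only for large FiftyOneDataset exports with no explicit chunk_size — can be sketched as below. The 10,000-sample threshold appears in a suggestion diff elsewhere in this conversation; the default chunk size of 1,000 is an illustrative assumption, not necessarily the value in fiftyone/utils/huggingface.py:

```python
# Illustrative sketch; AUTO_CHUNK_THRESHOLD matches the 10,000-sample check
# discussed in review, but DEFAULT_CHUNK_SIZE is an assumed placeholder.
AUTO_CHUNK_THRESHOLD = 10_000
DEFAULT_CHUNK_SIZE = 1_000  # assumption, not FiftyOne's actual default


def resolve_chunk_size(num_samples, chunk_size=None):
    """Respect an explicit chunk_size; otherwise enable chunking only for
    datasets large enough to strain the Hub's per-directory file limits."""
    if chunk_size is not None:
        return chunk_size
    if num_samples > AUTO_CHUNK_THRESHOLD:
        return DEFAULT_CHUNK_SIZE
    return None  # small dataset: no chunking needed


print(resolve_chunk_size(50_000))                  # 1000
print(resolve_chunk_size(500))                     # None
print(resolve_chunk_size(50_000, chunk_size=250))  # 250
```

Splitting a large export into fixed-size chunks keeps any single repository directory from accumulating tens of thousands of files.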

Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)


Line range hint 318-342: Approved addition of media field attributes in HFHubDatasetConfig.

The new attributes app_media_fields and grid_media_field are well-implemented and align with the PR's objectives to handle dataset configurations more effectively.

@jacobmarks
Contributor Author

@brimoor let me know if this is acceptable, or if I should just cherry-pick the commits on a new branch

@jacobmarks jacobmarks requested a review from brimoor July 17, 2024 03:39
@brimoor brimoor force-pushed the hf-hub-group-datasets branch from 661ad48 to 2c9d5c6 on July 17, 2024 at 20:22
Contributor

@brimoor brimoor left a comment


LGTM

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 661ad48 and 4f89f18.

Files selected for processing (2)
  • docs/source/integrations/huggingface.rst (29 hunks)
  • fiftyone/utils/huggingface.py (16 hunks)
Files skipped from review as they are similar to previous changes (1)
  • docs/source/integrations/huggingface.rst
Additional context used
Ruff
fiftyone/utils/huggingface.py

132-133: Use a single if statement instead of nested if statements

(SIM102)


1410-1410: Do not use bare except

(E722)

Additional comments not posted (12)
fiftyone/utils/huggingface.py (12)

133-133: Optimize nested if statements.

Use a single if statement instead of nested if statements for better readability.

-  if dataset_type == fot.FiftyOneDataset and chunk_size is None:
-    if _count_samples(dataset) > 10000:
+  if dataset_type == fot.FiftyOneDataset and chunk_size is None and _count_samples(dataset) > 10000:
Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)


133-133: LGTM!

The use of _count_samples instead of dataset.count() is appropriate for handling grouped datasets.

Tools
Ruff

132-133: Use a single if statement instead of nested if statements

(SIM102)


Line range hint 318-341: LGTM!

The new attributes app_media_fields and grid_media_field are well-defined and initialized correctly.


457-462: LGTM!

The changes ensure proper handling of chunk sizes when uploading data to the repository.


546-552: LGTM!

The changes ensure that the configuration file includes the new attributes app_media_fields and grid_media_field if they are different from the default values.


630-630: LGTM!

The use of _count_samples instead of dataset.count() is appropriate for handling grouped datasets.


670-676: LGTM!

The function _count_samples is well-implemented and ensures proper counting of samples in grouped datasets.


929-934: LGTM!

The changes ensure correct formatting of the API URL.


945-950: LGTM!

The changes ensure correct formatting of the API URL.


1350-1357: LGTM!

The function _get_media_fields is well-implemented and ensures proper retrieval of media fields.


1360-1412: LGTM!

The changes ensure proper handling of different media fields and exceptions while retrieving files to download.

Tools
Ruff

1410-1410: Do not use bare except

(E722)


1410-1410: Avoid bare except statements.

Replace the bare except statement with a specific exception to avoid catching unintended exceptions.

-  except:
+  except Exception:
Tools
Ruff

1410-1410: Do not use bare except

(E722)

@jacobmarks jacobmarks merged commit cac9c1a into develop Jul 18, 2024
13 checks passed
@jacobmarks jacobmarks deleted the hf-hub-group-datasets branch July 18, 2024 18:00