Extend Support for Grouped Datasets, Masks, Heatmaps, and Thumbnails #4566
Walkthrough
The recent updates encompass three main areas: CI/CD, documentation, and Hugging Face integration. Pipeline adjustments ensure artifacts are uploaded conditionally by platform type. HTML and RST documentation saw syntax simplifications and clarifications for better readability and accuracy. Enhancements to the Hugging Face utility include improved sample counting, handling media fields, and more efficient dataset processing.
Actionable comments posted: 2
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (5)
- .github/workflows/build-db.yml (1 hunks)
- docs/source/_templates/layout.html (7 hunks)
- docs/source/integrations/huggingface.rst (29 hunks)
- fiftyone/utils/huggingface.py (16 hunks)
- package/db/setup.py (2 hunks)
Files skipped from review due to trivial changes (1)
- docs/source/_templates/layout.html
Additional context used
Ruff
fiftyone/utils/huggingface.py
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
1410-1410: Do not use bare `except` (E722)
Additional comments not posted (38)
.github/workflows/build-db.yml (1)
69-72: Conditional check for artifact upload is correct.
The addition of the conditional check ensures that artifacts are uploaded only when the platform is 'sdist', and the upload path is correctly restricted to `.tar.gz` files.
package/db/setup.py (2)
98-101: New MongoDB download links are valid.
The addition of new MongoDB download links for Ubuntu version "24" appears correct.
146-146: Version update is consistent.
The `VERSION` constant update to "1.1.4" is consistent with the changes made.
fiftyone/utils/huggingface.py (5)
12-12: Import of `itertools` is appropriate.
The addition of the `itertools` import is necessary for the chunking logic and other functionalities.
318-319: New attributes are correctly initialized.
The `app_media_fields` and `grid_media_field` attributes are correctly initialized in the `HFHubDatasetConfig` class.
457-458: Correct chunk size handling.
The logic for handling chunk sizes is correct and ensures efficient uploads.
1425-1435: Handling of new attributes is correct.
The logic for handling `app_media_fields` and `grid_media_field` attributes is correctly implemented.
670-676: Recursive sample counting is correct.
The logic for recursively counting samples in grouped datasets is correctly implemented.
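To illustrate why a plain count can under-report on a grouped dataset, here is a minimal, self-contained sketch. It does not use FiftyOne: `ToyCollection`, its `slices` layout, and `count_samples` are hypothetical stand-ins, and the sketch assumes a plain count only sees the active slice. The actual `_count_samples` in `fiftyone/utils/huggingface.py` may be implemented differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a sample collection (not the FiftyOne API)."""

    media_type: str
    # Slice name -> sample IDs stored in that slice
    slices: Dict[str, List[str]] = field(default_factory=dict)
    active_slice: str = ""

    def count(self) -> int:
        # Assumption: a plain count only sees the active slice
        return len(self.slices.get(self.active_slice, []))


def count_samples(collection: ToyCollection) -> int:
    """Count every sample, including all slices of a grouped collection."""
    if collection.media_type == "group":
        return sum(len(ids) for ids in collection.slices.values())
    return collection.count()


grouped = ToyCollection(
    media_type="group",
    slices={"left": ["a", "b"], "right": ["c", "d"], "depth": ["e"]},
    active_slice="left",
)
flat = ToyCollection(media_type="image", slices={"": ["a", "b", "c"]})
```

In this toy setup `grouped.count()` reports only the `left` slice, while `count_samples(grouped)` covers all three slices, which is the number that matters when deciding how to chunk an upload.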
docs/source/integrations/huggingface.rst (30)
1014-1014: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1016-1016: Function call updated. The function call has been correctly updated to match the new import statement.
1046-1046: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1052-1052: Function call updated. The function call has been correctly updated to match the new import statement.
1069-1069: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1077-1077: Function call updated. The function call has been correctly updated to match the new import statement.
1106-1106: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1109-1109: Function call updated. The function call has been correctly updated to match the new import statement.
1142-1142: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1144-1144: Function call updated. The function call has been correctly updated to match the new import statement.
1156-1156: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1160-1160: Function call updated. The function call has been correctly updated to match the new import statement.
1201-1201: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1207-1207: Function call updated. The function call has been correctly updated to match the new import statement.
1233-1233: Import statement updated. The import statement has been correctly updated to directly import `push_to_hub`.
1237-1237: Function call updated. The function call has been correctly updated to match the new import statement.
1267-1267: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1269-1269: Function call updated. The function call has been correctly updated to match the new import statement.
1297-1297: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1299-1299: Function call updated. The function call has been correctly updated to match the new import statement.
1333-1333: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1335-1335: Function call updated. The function call has been correctly updated to match the new import statement.
1361-1361: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1363-1363: Function call updated. The function call has been correctly updated to match the new import statement.
1566-1566: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1568-1568: Function call updated. The function call has been correctly updated to match the new import statement.
1604-1604: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1606-1606: Function call updated. The function call has been correctly updated to match the new import statement.
1621-1621: Import statement updated. The import statement has been correctly updated to directly import `load_from_hub`.
1623-1623: Function call updated. The function call has been correctly updated to match the new import statement.
The code update LGTM. Defer to others 🚀
There seems to be a fair amount of nuance here. Adding Hugging Face unit tests in `./tests/unittest` might be valuable if the integration is extended in the future.
@jacobmarks looks like you need to rebase on the latest `develop` to resolve the merge conflict.
Also left a couple tiny comments for you 🤓
Sorry about that! Rebasing now!
615c708 to 7f57f9b
Actionable comments posted: 1
Outside diff range, codebase verification and nitpick comments (1)
fiftyone/utils/huggingface.py (1)
Line range hint 318-342: Add documentation for new attributes in `HFHubDatasetConfig`.
The new attributes `app_media_fields` and `grid_media_field` are added to the class but not documented in the class docstring. It's important for maintainability and usability to document these, especially given their significance in the context of this PR.
 class HFHubDatasetConfig(Config):
     """Config for a Hugging Face Hub dataset.
     Args:
         name: the name of the dataset
         repo_type: the type of the repository
         repo_id: the identifier of the repository
         revision: the revision of the dataset
         filename: the name of the file
         format: the format of the dataset
         tags: the tags of the dataset
         license: the license of the dataset
         description: the description of the dataset
+        app_media_fields: the media fields visible in the App
+        grid_media_field: the media field to use in the grid view
         fiftyone: the fiftyone version requirement of the dataset
     """
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (5)
- .github/workflows/build-db.yml (1 hunks)
- docs/source/_templates/layout.html (7 hunks)
- docs/source/integrations/huggingface.rst (29 hunks)
- fiftyone/utils/huggingface.py (16 hunks)
- package/db/setup.py (2 hunks)
Files skipped from review as they are similar to previous changes (4)
- .github/workflows/build-db.yml
- docs/source/_templates/layout.html
- docs/source/integrations/huggingface.rst
- package/db/setup.py
Additional context used
Ruff
fiftyone/utils/huggingface.py
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
1410-1410: Do not use bare `except` (E722)
Additional comments not posted (2)
fiftyone/utils/huggingface.py (2)
Line range hint 1425-1494: Ensure correct handling of media fields in `_load_fiftyone_dataset_from_config`.
The method now handles `app_media_fields` and `grid_media_field`, which are crucial for the integration of new media types. Ensure that these fields are correctly used and that their handling does not introduce any issues, especially when the fields are not set to their default values.
Verification successful
Verified: Correct handling of `app_media_fields` and `grid_media_field` in `_load_fiftyone_dataset_from_config`.
The fields `app_media_fields` and `grid_media_field` are correctly set and checked within the method, ensuring proper integration of new media types.
- `app_media_fields` and `grid_media_field` are initialized based on the configuration or default values.
- They are then applied to the dataset's app configuration if they differ from the default values.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correct handling of `app_media_fields` and `grid_media_field`.
# Test: Search for the usage of these fields in the method. Expect: Correct handling and usage.
ast-grep --lang python --pattern $'def _load_fiftyone_dataset_from_config($_) { $$$ app_media_fields $$$ grid_media_field $$$ }'
Length of output: 474
Script:
#!/bin/bash
# Description: Verify the usage of `app_media_fields` and `grid_media_field` within the `_load_fiftyone_dataset_from_config` method.
# Search for the definition of `_load_fiftyone_dataset_from_config` method
ast-grep --lang python --pattern 'def _load_fiftyone_dataset_from_config($_) { $$$ }' -A 50
# Search for the usage of `app_media_fields` within the method
rg 'app_media_fields' -A 5
# Search for the usage of `grid_media_field` within the method
rg 'grid_media_field' -A 5
Length of output: 14066
133-133: Ensure proper chunk size handling in `push_to_hub`.
The logic to automatically set the chunk size based on the number of samples is a good addition for performance optimization. However, ensure that the `_count_samples` function is thoroughly tested, especially since it affects how datasets are chunked and uploaded.
Verification successful
Ensure proper chunk size handling in `push_to_hub`.
The logic to automatically set the chunk size based on the number of samples is a good addition for performance optimization. The `_count_samples` function is defined and used appropriately within the `fiftyone/utils/huggingface.py` file. Ensure that this function is thoroughly tested, especially since it affects how datasets are chunked and uploaded.
- Definition: `fiftyone/utils/huggingface.py`, line with `def _count_samples(sample_collection):`
- Usage: `fiftyone/utils/huggingface.py`, multiple lines including the one under review
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.
# Test: Search for the function implementation and usage. Expect: Proper implementation and usage in `push_to_hub`.
ast-grep --lang python --pattern $'def _count_samples($_, $_) { $$$ }'
Length of output: 528
Script:
#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.
# Test: Search for the function implementation and usage. Expect: Proper implementation and usage in `push_to_hub`.
ast-grep --lang python --pattern $'def _count_samples($_) { $$$ }'
Length of output: 520
Script:
#!/bin/bash
# Description: Verify the implementation of `_count_samples` function.
# Test: Search for the function implementation and usage using a simpler pattern.
ast-grep --lang python --pattern $'def _count_samples($_)'
# Fallback: Use `rg` to search for the function definition and its usage.
rg '_count_samples'
Length of output: 770
Tools
Ruff
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
Actionable comments posted: 0
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (2)
- docs/source/integrations/huggingface.rst (29 hunks)
- fiftyone/utils/huggingface.py (16 hunks)
Files skipped from review as they are similar to previous changes (1)
- docs/source/integrations/huggingface.rst
Additional context used
Ruff
fiftyone/utils/huggingface.py
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
1410-1410: Do not use bare `except` (E722)
Additional comments not posted (3)
fiftyone/utils/huggingface.py (3)
12-12: Approved addition of `itertools` import.
This import is used effectively in the file for handling complex iterations, such as in the `_get_files_to_download` function.
133-133: Approved modification to chunk size logic.
The updated logic correctly sets a default chunk size for large datasets, which is crucial for managing file system limitations on the Hugging Face Hub.
Tools
Ruff
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
Line range hint 318-342: Approved addition of media field attributes in `HFHubDatasetConfig`.
The new attributes `app_media_fields` and `grid_media_field` are well-implemented and align with the PR's objectives to handle dataset configurations more effectively.
@brimoor let me know if this is acceptable, or if I should just cherry-pick the commits on a new branch
661ad48 to 2c9d5c6
LGTM
Actionable comments posted: 0
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (2)
- docs/source/integrations/huggingface.rst (29 hunks)
- fiftyone/utils/huggingface.py (16 hunks)
Files skipped from review as they are similar to previous changes (1)
- docs/source/integrations/huggingface.rst
Additional context used
Ruff
fiftyone/utils/huggingface.py
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
1410-1410: Do not use bare `except` (E722)
Additional comments not posted (12)
fiftyone/utils/huggingface.py (12)
133-133: Optimize nested if statements.
Use a single `if` statement instead of nested `if` statements for better readability.
-    if dataset_type == fot.FiftyOneDataset and chunk_size is None:
-        if _count_samples(dataset) > 10000:
+    if dataset_type == fot.FiftyOneDataset and chunk_size is None and _count_samples(dataset) > 10000:
Tools
Ruff
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
133-133: LGTM!
The use of `_count_samples` instead of `dataset.count()` is appropriate for handling grouped datasets.
Tools
Ruff
132-133: Use a single `if` statement instead of nested `if` statements (SIM102)
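One point worth checking when collapsing the nested `if` is that the potentially expensive sample count is still skipped when the cheaper conditions fail. Because `and` short-circuits, it is. The sketch below demonstrates this with hypothetical stand-ins: `count_samples`, `needs_default_chunk_size`, and the `calls` log are illustrative names, not the PR's code; only the 10000 threshold is taken from the suggested diff.

```python
from typing import List, Optional, Sequence

# Records every dataset that was actually counted, to observe short-circuiting
calls: List[Sequence] = []


def count_samples(dataset: Sequence) -> int:
    """Hypothetical stand-in for an expensive counting routine."""
    calls.append(dataset)
    return len(dataset)


def needs_default_chunk_size(
    is_fiftyone_dataset: bool,
    chunk_size: Optional[int],
    dataset: Sequence,
    threshold: int = 10000,
) -> bool:
    """Single combined condition, equivalent to the nested form."""
    return (
        is_fiftyone_dataset
        and chunk_size is None
        and count_samples(dataset) > threshold
    )


# An explicit chunk size means the count is never evaluated
explicit = needs_default_chunk_size(True, 100, range(20000))
calls_after_explicit = len(calls)

# With no chunk size, the count runs once and decides the outcome
implicit = needs_default_chunk_size(True, None, range(20000))
```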
Line range hint 318-341: LGTM!
The new attributes `app_media_fields` and `grid_media_field` are well-defined and initialized correctly.
457-462: LGTM!
The changes ensure proper handling of chunk sizes when uploading data to the repository.
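As an illustration of chunked uploads, the sketch below splits an iterable of file names into fixed-size batches with `itertools.islice`. The helper name `iter_chunks`, the file names, and the chunk size are assumptions for illustration; the PR's actual upload logic is not reproduced here.

```python
import itertools
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most ``chunk_size`` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(items)
    while True:
        # islice consumes up to chunk_size items from the shared iterator
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical file names; each chunk could map to one upload batch
files = [f"image_{i}.jpg" for i in range(7)]
chunks = list(iter_chunks(files, 3))
```

Working from a single shared iterator keeps memory use bounded by the chunk size, which matters more as the number of files grows.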
546-552: LGTM!
The changes ensure that the configuration file includes the new attributes `app_media_fields` and `grid_media_field` if they are different from the default values.
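The "only write non-default values" pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper `add_media_settings` is hypothetical, and the default values used here (`["filepath"]` and `"filepath"`) are assumptions that should be checked against the FiftyOne documentation.

```python
from typing import Any, Dict, List

# Assumed defaults, for illustration only
DEFAULT_MEDIA_FIELDS = ["filepath"]
DEFAULT_GRID_MEDIA_FIELD = "filepath"


def add_media_settings(
    config: Dict[str, Any],
    media_fields: List[str],
    grid_media_field: str,
) -> Dict[str, Any]:
    """Return a copy of ``config`` with non-default media settings added."""
    result = dict(config)
    if media_fields != DEFAULT_MEDIA_FIELDS:
        result["app_media_fields"] = list(media_fields)
    if grid_media_field != DEFAULT_GRID_MEDIA_FIELD:
        result["grid_media_field"] = grid_media_field
    return result


base = {"name": "my-dataset"}
unchanged = add_media_settings(base, ["filepath"], "filepath")
customized = add_media_settings(
    base, ["filepath", "thumbnail_path"], "thumbnail_path"
)
```

Omitting defaults keeps the written config small and lets the loader fall back to its own defaults when the keys are absent.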
630-630: LGTM!
The use of `_count_samples` instead of `dataset.count()` is appropriate for handling grouped datasets.
670-676: LGTM!
The function `_count_samples` is well-implemented and ensures proper counting of samples in grouped datasets.
929-934: LGTM!
The changes ensure correct formatting of the API URL.
945-950: LGTM!
The changes ensure correct formatting of the API URL.
1350-1357: LGTM!
The function `_get_media_fields` is well-implemented and ensures proper retrieval of media fields.
1360-1412: LGTM!
The changes ensure proper handling of different media fields and exceptions while retrieving files to download.
Tools
Ruff
1410-1410: Do not use bare `except` (E722)
1410-1410: Avoid bare except statements.
Replace the bare `except` statement with a specific exception to avoid catching unintended exceptions.
-    except:
+    except Exception:
Tools
Ruff
1410-1410: Do not use bare `except` (E722)
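The practical difference behind E722 is that a bare `except:` also catches `BaseException` subclasses such as `KeyboardInterrupt` and `SystemExit`, while `except Exception:` lets them propagate. The following self-contained sketch shows this; all function names are illustrative and unrelated to the code under review.

```python
from typing import Callable


def run_with_bare_except(func: Callable[[], None]) -> str:
    """Swallows everything, including KeyboardInterrupt and SystemExit."""
    try:
        func()
    except:  # noqa: E722 (deliberately bare, to show the problem)
        return "swallowed"
    return "ok"


def run_with_except_exception(func: Callable[[], None]) -> str:
    """Swallows ordinary errors but lets interrupts propagate."""
    try:
        func()
    except Exception:
        return "swallowed"
    return "ok"


def interrupt() -> None:
    # Simulates the user pressing Ctrl+C during a long download
    raise KeyboardInterrupt


def fail() -> None:
    raise ValueError("bad file")
```

With the bare form, a user pressing Ctrl+C inside the guarded block would be silently ignored; with `except Exception:`, the interrupt reaches the caller as expected.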
What changes are proposed in this pull request?
@harpreetsahota204 pointed out that the HF Hub integration did not work with grouped datasets, and that led me to dig into some edge cases.
Now both `push_to_hub` and `load_from_hub` should support:
- `dataset.app_config.media_fields`
- `dataset.app_config.grid_media_field`
Masks stored with `mask_path` and heatmaps stored with `map_path` should also be supported.
I've tested this in a bunch of scenarios but could use some more beta testing.
(Please fill in changes proposed in this fix)
How is this patch tested? If it is not, please explain why.
(Details)
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
notes for FiftyOne users.
(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)
What areas of FiftyOne does this PR affect?
`fiftyone` Python library changes
Summary by CodeRabbit
Documentation
New Features
`app_media_fields` and `grid_media_field` attributes in datasets.
Bug Fixes
Chores