Add support for summary fields #4765
Conversation
Walkthrough

The pull request introduces new methods for managing summary fields in the `Dataset` class.
… fo.StringField
2. Index created for counts mode must have classifications.label
3. Updated docstring
Actionable comments posted: 1
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (7)
- fiftyone/core/collections.py (6 hunks)
- fiftyone/core/dataset.py (10 hunks)
- fiftyone/core/fields.py (15 hunks)
- fiftyone/core/odm/embedded_document.py (1 hunks)
- fiftyone/core/odm/mixins.py (3 hunks)
- fiftyone/core/view.py (6 hunks)
- tests/unittests/dataset_tests.py (1 hunks)
Additional context used
Ruff
tests/unittests/dataset_tests.py
766-766: Do not assign a `lambda` expression, use a `def`. Rewrite `to_sets` as a `def` (E731)
766-766: Ambiguous variable name: `l` (E741)
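The Ruff-compliant rewrite it suggests would look something like this (the body is hypothetical; only the name `to_sets` comes from the lint message):

```python
# Flagged pattern (E731: lambda assignment; E741: ambiguous name "l"):
#     to_sets = lambda l: [set(x) for x in l]

# Ruff-compliant rewrite: a def with a descriptive parameter name
def to_sets(lists):
    """Convert each inner list to a set."""
    return [set(values) for values in lists]

print(to_sets([[1, 2, 2], [3]]))  # → [{1, 2}, {3}]
```

Using a `def` also gives the function a proper name in tracebacks, which a lambda lacks.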
Additional comments not posted (19)
fiftyone/core/odm/embedded_document.py (1)
51-52: LGTM! The addition of the `__hash__` method is a useful enhancement.

By implementing the `__hash__` method, instances of the `DynamicEmbeddedDocument` class become hashable. This allows them to be used as keys in dictionaries or elements in sets, expanding their usability in various scenarios. The implementation, which converts the instance to a string and computes its hash value, is a common and reasonable approach. However, it's important to ensure that the string representation of the instance is unique and consistent across different instances to minimize the risk of hash collisions.
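A minimal, hypothetical sketch of this hash-by-string approach (`HashableDoc` is an illustrative stand-in, not FiftyOne's actual class):

```python
class HashableDoc:
    """Illustrative sketch: hash an object via its string representation."""

    def __init__(self, **fields):
        self._fields = fields

    def __str__(self):
        # Sort keys so logically equal documents render identically,
        # which keeps __hash__ consistent with __eq__
        items = ", ".join(f"{k}={v!r}" for k, v in sorted(self._fields.items()))
        return f"<HashableDoc: {items}>"

    def __eq__(self, other):
        return isinstance(other, HashableDoc) and self._fields == other._fields

    def __hash__(self):
        # The approach described above: hash the string form of the instance
        return hash(str(self))


a = HashableDoc(label="cat", confidence=0.9)
b = HashableDoc(label="cat", confidence=0.9)
print(len({a, b}))  # → 1; equal documents collapse to one set element
```

The sorted-key rendering is the key detail: if equal documents could render differently, equal objects could hash differently, violating the hash invariant.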
Overall, this change enhances the functionality and flexibility of the `DynamicEmbeddedDocument` class.

fiftyone/core/fields.py (5)
Line range hint
26-107
: LGTM!

The new parameters `info_keys` and `created_after` are properly validated. The function raises appropriate `ValueError`s if the provided values do not meet the expected types.
Line range hint
113-168
: LGTM!

The function correctly checks the new `info_keys` and `created_after` constraints against the corresponding attributes of the field. It returns `False` if any of the constraints are not met.

Tools
Ruff
163-170: Return the negated condition directly. Inline condition (SIM103)
Line range hint
270-362
: LGTM!

The function properly incorporates the new `info_keys` and `created_after` parameters. It passes them to the `validate_constraints` function for validation and includes them in the `kwargs` dictionary when applying the constraints to filter the schema.
Line range hint
368-417
: LGTM!

The `flatten_schema` function has been updated to include the new `info_keys` and `created_after` parameters. It passes these parameters to the `validate_constraints` function for validation and to the `_flatten` helper function to apply the constraints while flattening the schema.
Line range hint
429-467
: LGTM!

The `_flatten` helper function has been properly updated to handle the new `info_keys` and `created_after` parameters. It passes these parameters to the `matches_constraints` function to filter the fields based on the specified constraints and propagates them recursively when flattening embedded document fields.

fiftyone/core/odm/mixins.py (2)
Line range hint
182-206
: LGTM!

The new optional parameters `info_keys` and `created_after` expand the functionality of the `get_field_schema` function, allowing for more granular control over the schema retrieval process. The implementation looks good.
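The constraint semantics described in these comments can be sketched in isolation. This hypothetical `matches_constraints` helper mirrors the behavior as described in the review, not FiftyOne's actual implementation:

```python
from datetime import datetime


def matches_constraints(field_info, field_created_at, info_keys=None, created_after=None):
    """Hypothetical sketch of the schema-filtering constraints described above."""
    if info_keys is not None:
        # every required key must appear in the field's info dict
        if any(key not in (field_info or {}) for key in info_keys):
            return False
    if created_after is not None:
        # the field must have been created after the cutoff timestamp
        if field_created_at is None or field_created_at <= created_after:
            return False
    return True


cutoff = datetime(2024, 1, 1)
print(matches_constraints({"summary": True}, datetime(2024, 6, 1),
                          info_keys=["summary"], created_after=cutoff))  # → True
```

A field passes only if it satisfies every supplied constraint, which is why the review notes that `False` is returned as soon as any check fails.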
Line range hint
1-38
: Looks good!

The function correctly extracts updates for filtered list fields and generates extra updates using the element ID and array filter syntax. It also handles the case where the `$set` operator becomes empty after extracting the updates. The validation for illegal modifications to the root of filtered list fields is a nice touch.

fiftyone/core/view.py (2)
918-919
: LGTM!

The new `info_keys` and `created_after` parameters enhance the filtering capabilities of the `get_field_schema` function by allowing users to specify required keys in the field's `info` dictionary and a minimum field creation date. This provides more control over the returned schema. The changes are backwards compatible.

Also applies to: 937-940
977-978
: LGTM!

The new `info_keys` and `created_after` parameters enhance the filtering capabilities of the `get_frame_field_schema` function by allowing users to specify required keys in the frame field's `info` dictionary and a minimum frame field creation date. This provides more control over the returned frame schema. The changes are backwards compatible.

Also applies to: 998-1001
tests/unittests/dataset_tests.py (1)
715-931
: Comprehensive test for frame summaries functionality.

The `test_frame_summaries` method thoroughly tests the creation, management, and validation of frame summaries in the `Dataset` class. It covers various scenarios, including:
- Generating frame summaries at dataset and frame levels with different configurations.
- Asserting the correctness of the generated summaries against expected values.
- Verifying the read-only status of summary fields.
- Checking the creation of appropriate database indexes.
- Updating dataset fields and ensuring frame summaries are updated accordingly.
- Dropping frame summaries and confirming their removal.
The test method is well-structured, follows good practices for unit testing, and provides comprehensive coverage of the frame summaries functionality.
Tools
Ruff
766-766: Do not assign a `lambda` expression, use a `def`. Rewrite `to_sets` as a `def` (E731)
766-766: Ambiguous variable name: `l` (E741)
fiftyone/core/dataset.py (6)
1647-1658
: LGTM!

This is a straightforward and useful function to list the frame summaries that have been created on the dataset. Using `get_field_schema` with `flat=True` and `info_keys=_FRAME_SUMMARY_KEY` to retrieve just the relevant frame summary fields is the right approach.
1660-1738
: Excellent work!

This is a very comprehensive and well-designed function for creating frame summaries on a dataset. It covers all the key functionality and customization options one would expect, including:
- Handling both categorical and numeric field types
- Customizing the summary field name, sidebar group, counts, grouping, read-only status, indexing, and overwriting behavior
- Using efficient aggregation pipelines to compute the summaries
- Robust sidebar group management to ensure the summary field is added to an appropriate location
- Safely defaulting to making the summary fields read-only
The implementation looks solid and I don't see any issues. Great job!
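To illustrate what a categorical summary computes conceptually, here is a pure-Python sketch; the hypothetical `summarize_frame_labels` stands in for the actual MongoDB aggregation pipelines, and the counts-mode record shape is an assumption for illustration:

```python
from collections import Counter


def summarize_frame_labels(frame_labels, include_counts=False):
    """Collect the distinct labels observed across a sample's frames.

    `frame_labels` is a list of per-frame label lists,
    e.g. [["cat"], ["cat", "dog"]].
    """
    counts = Counter(label for labels in frame_labels for label in labels)
    if include_counts:
        # counts mode: one record per distinct label with its frequency
        return [
            {"label": label, "count": count}
            for label, count in sorted(counts.items())
        ]
    # default mode: just the sorted distinct labels
    return sorted(counts)


print(summarize_frame_labels([["cat"], ["cat", "dog"]]))
# → ['cat', 'dog']
print(summarize_frame_labels([["cat"], ["cat", "dog"]], include_counts=True))
# → [{'label': 'cat', 'count': 2}, {'label': 'dog', 'count': 1}]
```

The real implementation performs this rollup in the database via aggregation pipelines rather than in Python, which is why the review highlights their efficiency.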
2026-2050
: LGTM!

This function properly deletes a frame summary field from the dataset. Unsetting the read-only status before deleting is the correct approach, and the early return if the field doesn't exist is a nice optimization.
2052-2087
: Looks good!

This is a handy function to identify frame summaries that may be outdated due to modifications made to the source frames or samples since the summary was created. Comparing the creation timestamp of the summary to the last modification timestamp of the frames/samples is a sensible approach.
Properly handling both frame-level and sample-level fields is important since frame summaries can be created from either.
No issues found - good work!
Line range hint
2088-2097
: LGTM!

This is a straightforward extension of the existing `_add_implied_sample_field` to allow adding implied frame fields too. The check that the dataset supports frame fields is good defensive programming. Reloading the dataset to update its schema after adding the field is the right step.
Looks good!
Line range hint
2099-2114
: Looks good to me!

This function extends `_merge_sample_field_schema` to allow merging frame field schemas into the dataset as well. The check that the dataset supports frame fields is a good safeguard. Only calling `_reload()` if the schema actually expanded is a nice optimization to avoid unnecessary work. No issues found, good implementation!
Tools
Ruff
48-48: `fiftyone.core.odm.dataset.SampleFieldDocument` imported but unused. Remove unused import: `fiftyone.core.odm.dataset.SampleFieldDocument` (F401)
fiftyone/core/collections.py (2)
1406-1407
: LGTM!

The new `info_keys` and `created_after` parameters provide useful ways to filter the returned schema based on field metadata and creation time.
1449-1450
: Looks good!

The `info_keys` and `created_after` parameters have been added to `get_frame_field_schema` as well, providing a consistent way to filter frame-level fields across the API.
update_fields = dataset.check_summary_fields()

for field_name in update_fields:
    dataset.update_summary_field(field_name)
nice!
field_name,
fof.ListField,
subfield=fof.EmbeddedDocumentField,
embedded_doc_type=foo.DynamicEmbeddedDocument,
Noob question - what is dynamic embedded document?
Haha nice, the Fiftyone nested document doll ~
Thanks @brimoor - this is awesome! I have a couple of thoughts:
Actionable comments posted: 1
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (2)
- docs/source/user_guide/app.rst (2 hunks)
- docs/source/user_guide/using_datasets.rst (1 hunks)
Additional comments not posted (6)
docs/source/user_guide/app.rst (4)
366-394
: Excellent addition of the dataset filtering feature!

The new "Filtering sample fields" section provides clear instructions on using the App's UI to conveniently filter datasets by various fields.
Automatically making custom attributes filterable is a great usability enhancement. The ability to convert filters to view stages also seems very handy.
Overall, this feature will make it much easier to quickly drill down into datasets of interest. Nice work!
512-518
: Resolved: Unassigned fields behavior is documented

The note on lines 512-518 directly addresses the suggestion from the previous comment.
It clearly states that fields not explicitly assigned to any sidebar group will be dynamically added to default groups in the App at runtime.
This provides the missing information and nicely complements the sidebar groups documentation. No further clarification is needed.
395-445
: Great guidance on optimizing filters with indexes!

The new section provides clear recommendations and code examples for leveraging dataset indexes to speed up sidebar filters in the App.
The advice to create separate indexes for group datasets to optimize filters in both "group" and "slice" modes is very helpful.
Mentioning the use of summary fields for efficient filtering on video datasets is also a nice tip.
Let's verify that the recommended indexes are being created properly:
#!/bin/bash
# Description: Verify that the recommended indexes are created.

# Test: Search for the `ground_truth.detections.label` index
rg --type python -A 5 $'dataset.create_index("ground_truth.detections.label")'

# Test: Search for the compound index on `group.name` and `detections.detections.label`
rg --type python -A 5 $'dataset.create_index([("group.name", 1), ("detections.detections.label", 1)])'
451-511
: Sidebar groups customization is a great addition!

The new "Sidebar groups" section thoroughly explains the feature for visually creating, modifying, and persisting custom sidebar groups in the App.
Being able to configure sidebar groups programmatically through the dataset's App config is also very useful for more advanced customization.
The code examples are clear and easy to follow.
One suggestion for improvement:
It would be helpful to clarify what happens to dataset fields that are not explicitly assigned to any sidebar group. Are they automatically added to a default group?
Let's check if this behavior is documented anywhere:
#!/bin/bash
# Description: Check for documentation on behavior of fields not assigned to sidebar groups.

# Test: Search for any comments related to the behavior of unassigned fields
rg --type python -i $'fields that do not appear in the dataset\'s sidebar_groups'

docs/source/user_guide/using_datasets.rst (2)
1636-1679
: LGTM!

This is a clear and informative introduction to summary fields in FiftyOne. The explanations of the concept and benefits are easy to understand. The examples showing how to create summary fields for categorical and numeric input fields are helpful.
1757-1810
: Looks good!

This section provides a nice set of examples demonstrating how to manage summary fields:
- Listing summary fields with `list_summary_fields()`
- Checking if summary fields need updating with `check_summary_fields()`
- Updating summary fields with `update_summary_field()`
- Deleting summary fields with `delete_summary_field()` and `delete_summary_fields()`
The explanations are clear and the code snippets are easy to understand.
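The check-then-update pattern these methods enable can be sketched with a hypothetical mock (`MockDataset` is illustrative only; the real methods live on FiftyOne's `Dataset` class):

```python
class MockDataset:
    """Hypothetical stand-in for a dataset exposing the summary-field API."""

    def __init__(self, stale_fields):
        self._stale = list(stale_fields)
        self.updated = []

    def check_summary_fields(self):
        # names of summary fields whose source data changed since generation
        return list(self._stale)

    def update_summary_field(self, field_name):
        self.updated.append(field_name)
        self._stale.remove(field_name)


def refresh_summaries(dataset):
    # refresh only the summaries flagged as potentially outdated
    for field_name in dataset.check_summary_fields():
        dataset.update_summary_field(field_name)


ds = MockDataset(["frames_gt_label"])
refresh_summaries(ds)
print(ds.updated, ds.check_summary_fields())  # → ['frames_gt_label'] []
```

Updating only the fields returned by the check avoids regenerating every summary unconditionally.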
Thank you for the detailed review @minhtuev 🙌🏼

If a sidebar group becomes empty, it will remain in the App sidebar with an
Actionable comments posted: 1
Outside diff range and nitpick comments (2)
fiftyone/core/dataset.py (2)
1660-1734
: Add a code comment explaining the purpose of `_SUMMARY_FIELD_KEY`.

The `create_summary_field` function looks good overall and handles the creation and population of summary fields based on the provided configuration. One suggestion is to add a code comment explaining the purpose and usage of the `_SUMMARY_FIELD_KEY` info key, which is used to identify and store metadata about summary fields. This will make the code more self-explanatory and easier to maintain.
1915-1917
: Add a comment explaining the purpose of unwinding list fields.

Consider adding a code comment to explain why the pipeline unwinds list fields using the `$unwind` stage. This will make it clearer to future readers that unwinding is necessary to handle summary fields that are nested within list fields in the dataset.
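For readers unfamiliar with `$unwind`, its basic effect can be mimicked in pure Python (an illustrative sketch of the default behavior, not the actual pipeline code):

```python
def unwind(documents, list_field):
    """Mimic MongoDB's $unwind: emit one output document per element of
    `list_field`, with the list replaced by that element. Documents with a
    missing or empty list produce no output, as in $unwind's default mode."""
    for doc in documents:
        for value in doc.get(list_field, []):
            out = dict(doc)
            out[list_field] = value
            yield out


docs = [{"_id": 1, "tags": ["a", "b"]}, {"_id": 2, "tags": ["c"]}]
print(list(unwind(docs, "tags")))
# → [{'_id': 1, 'tags': 'a'}, {'_id': 1, 'tags': 'b'}, {'_id': 2, 'tags': 'c'}]
```

This flattening is what lets later pipeline stages group and count individual list elements rather than whole lists.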
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (1)
- fiftyone/core/dataset.py (12 hunks)
Additional comments not posted (7)
fiftyone/core/dataset.py (7)
1647-1658
: LGTM!

The `list_summary_fields` function looks good. It retrieves the summary fields by filtering the dataset's field schema for fields with the `_SUMMARY_FIELD_KEY` info key, sorts the field names, and returns them as a list.
1894-2037
: Efficient use of aggregation pipelines to compute summary field values.

The `_populate_summary_field` function makes good use of MongoDB aggregation pipelines to efficiently compute the summary field values based on the provided configuration. The pipelines handle various scenarios, such as categorical and numeric fields, sample-level and frame-level fields, and optional grouping and counting. Using aggregation pipelines allows for optimized computation of the summaries directly in the database, avoiding the need to load all the data into memory.
2038-2050
: Effective use of last modified timestamps to determine summary field updates.

The `check_summary_fields` function implements a useful mechanism to determine which summary fields may need to be updated based on the last modified timestamps of the samples and frames in the dataset. By comparing the timestamps, the function identifies summary fields that were last generated before the most recent modifications to the dataset, indicating that they might be outdated and need to be refreshed.
This approach helps optimize the summary field update process by only updating the necessary fields instead of regenerating all summaries unconditionally.
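The timestamp comparison described here amounts to the following pure-Python sketch (a hypothetical helper with illustrative field names):

```python
from datetime import datetime


def stale_summary_fields(summary_created_at, last_modified_at):
    """Return names of summary fields generated before the most recent edit.

    `summary_created_at` maps field name -> generation timestamp;
    `last_modified_at` is the latest modification time of the source data.
    """
    return sorted(
        name
        for name, created_at in summary_created_at.items()
        if created_at < last_modified_at
    )


summaries = {
    "frames_gt_label": datetime(2024, 1, 1),
    "frames_num_objects": datetime(2024, 3, 1),
}
print(stale_summary_fields(summaries, datetime(2024, 2, 1)))
# → ['frames_gt_label']
```

Only summaries older than the source data's last modification are flagged, so up-to-date summaries are skipped.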
2086-2107
: Appropriate handling of updating a specific summary field.

The `update_summary_field` function provides a straightforward way to update a single summary field based on the current values of its source field. The function takes the necessary steps to ensure data integrity and consistency:

- It reloads the dataset to mitigate potential concurrency issues.
- It validates that the provided field name corresponds to an existing summary field.
- It updates the `last_modified_at` timestamp to reflect the update operation.
- It saves the updated field document, allowing the read-only status to be overridden.
- It calls the `_populate_summary_field` method to recompute the summary field values.

Overall, the function handles the update process effectively and ensures that the summary field is refreshed with the latest data.
2109-2120
: Straightforward wrapper function for deleting a summary field.

The `delete_summary_field` function serves as a simple wrapper around the `_delete_summary_fields` method, providing a convenient way to delete a single summary field from all samples in the dataset. It directly passes the provided `field_name` and `error_level` arguments to the underlying method, allowing for consistent behavior and error handling. The function's purpose is clear, and its implementation is concise and readable.
2122-2133
: Straightforward wrapper function for deleting multiple summary fields.

The `delete_summary_fields` function serves as a simple wrapper around the `_delete_summary_fields` method, providing a convenient way to delete multiple summary fields from all samples in the dataset. It allows for flexibility by accepting either a single field name or an iterable of field names through the `field_names` argument. This enables the caller to delete one or more summary fields in a single operation. The function directly passes the `field_names` and `error_level` arguments to the underlying method, ensuring consistent behavior and error handling. Overall, the function's purpose is clear, and its implementation is concise and readable.
2135-2155
: Robust implementation for deleting summary fields with error handling.

The `_delete_summary_fields` function provides a robust implementation for deleting summary fields from all samples in the dataset. It handles various scenarios and incorporates error handling based on the provided `error_level`. Key aspects of the function:

- It converts the `field_names` argument to a list, allowing for consistent handling of single field names and iterables.
- It validates each field name to ensure it corresponds to an existing summary field. If a field is not a valid summary field, it handles the error based on the `error_level` using the `fou.handle_error` function.
- It sets the `read_only` attribute of
* Added support for summary fields which allow users to collect frame-level information at the sample level
* Added documentation and unit tests for adding, querying, and deleting summary fields
* Updated retrieving field schema by info_keys and created_after
* Updated drop_index to not throw error if the index does not exist

Co-authored-by: brimoor <[email protected]>
Co-authored-by: Stuart Wheaton <[email protected]>
Co-authored-by: minhtuevo <[email protected]>
What changes are proposed in this pull request?
Adds support for creating and managing summary fields on datasets.
How is this patch tested? If it is not, please explain why.
Example usage
Summary by CodeRabbit
New Features
Bug Fixes
- `drop_index` no longer throws an error if the index does not exist.

Tests