Add caching to Torch Datasets pipeline #123
Conversation
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (2)
sleap_nn/data/custom_datasets.py (2)
26-26: Remove duplicate import of `apply_normalization`

The function `apply_normalization` is imported at line 26 but was already imported earlier at line 18. Remove the duplicate import to clean up the code. Apply this diff:

```diff
 from sleap_nn.data.instance_cropping import make_centered_bboxes
-from sleap_nn.data.normalization import apply_normalization
 from sleap_nn.data.resizing import apply_pad_to_stride, apply_resizer
```
270-271: Simplify nested `if` statements when filtering user instances

At lines 270-271, the nested `if` statements can be combined into a single condition to enhance readability. Apply this diff:

```diff
-if self.data_config.user_instances_only:
-    if lf.user_instances is not None and len(lf.user_instances) > 0:
+if self.data_config.user_instances_only and lf.user_instances:
     lf.instances = lf.user_instances
```
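Note that the combined condition relies on list falsiness: `lf.user_instances` being `None` or an empty list both evaluate false, so the explicit checks can be dropped, assuming the attribute is always a list or `None`. A quick illustration of the equivalence:

```python
# Falsiness covers both the None and empty-list cases (assuming
# lf.user_instances is always a list or None):
for value in (None, [], ["instance"]):
    assert bool(value) == (value is not None and len(value) > 0)
```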
🧰 Tools
🪛 Ruff (0.8.2)

270-271: Use a single `if` statement instead of nested `if` statements (SIM102)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (1)

- sleap_nn/data/custom_datasets.py (7 hunks)
🧰 Additional context used

🪛 Ruff (0.8.2)

sleap_nn/data/custom_datasets.py
270-271: Use a single `if` statement instead of nested `if` statements (SIM102)
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
sleap_nn/data/custom_datasets.py (1)
272-273: Simplify nested `if` statements into a single condition

The nested `if` statements at lines 272-273 can be combined into a single `if` statement for improved readability. Apply this diff:

```diff
-if self.data_config.user_instances_only:
-    if lf.user_instances is not None and len(lf.user_instances) > 0:
+if self.data_config.user_instances_only and lf.user_instances and len(lf.user_instances) > 0:
```

🧰 Tools
🪛 Ruff (0.8.2)

272-273: Use a single `if` statement instead of nested `if` statements (SIM102)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (2)

- sleap_nn/data/custom_datasets.py (7 hunks)
- tests/data/test_custom_datasets.py (2 hunks)
🧰 Additional context used

🪛 Ruff (0.8.2)

sleap_nn/data/custom_datasets.py
272-273: Use a single `if` statement instead of nested `if` statements (SIM102)
🔇 Additional comments (5)
sleap_nn/data/custom_datasets.py (4)
341-342: Avoid closing videos within data loading

Closing videos at lines 341-342 inside the dataset initialization may lead to issues if the dataset accesses video frames afterward. Ensure that videos remain open for the duration of their needed use, or manage video resources appropriately.

Run the following script to check for any subsequent usage of video objects after they are closed:

```bash
#!/bin/bash
# Description: Search for usage of 'video' objects after they are closed.

# Find all instances where 'video' is used after being closed.
rg -A5 'video\.close\(\)' --no-heading --line-number | grep -A5 'video\.'
```
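One way to address this is to scope video lifetimes to the cache-filling pass. Below is a minimal sketch, assuming the dataset only needs each video while filling the cache and that the video objects expose a `close()` method (an assumption about the backend API; `dataset.labels.videos` is likewise illustrative):

```python
from contextlib import ExitStack

def fill_cache_then_release(dataset):
    """Fill the in-memory cache, then close every video exactly once."""
    with ExitStack() as stack:
        for video in dataset.labels.videos:
            # Register each close() so videos stay open for the whole
            # cache-filling pass, even if an exception is raised mid-way.
            stack.callback(video.close)
        dataset._fill_cache()
    # All videos are closed here; __getitem__ now reads only from the cache.
```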
120-122: 🛠️ Refactor suggestion

Consider refactoring caching logic to reduce duplication

The initialization of `self.cache` and the call to `self._fill_cache()` are repeated across multiple dataset classes. To improve maintainability and reduce code redundancy, consider moving the caching mechanism into the `BaseDataset` class or creating a shared utility method.
123-164: 🛠️ Refactor suggestion

Refactor repeated `_fill_cache` methods into a shared method

The `_fill_cache` method in `BottomUpDataset` is similar to those in other dataset classes. Refactoring common logic into a shared method in the `BaseDataset` class or a utility function can promote code reuse and simplify future maintenance, as sketched below.
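A minimal sketch of this refactor, assuming each subclass differs only in how it turns a labeled frame into a sample (`_make_sample` and `self.labels` are hypothetical names; only `_fill_cache` appears in this PR):

```python
from torch.utils.data import Dataset

class BaseDataset(Dataset):
    def _fill_cache(self):
        """Shared cache-filling loop; subclasses supply _make_sample()."""
        self.cache = {}
        for index, lf in enumerate(self.labels):
            self.cache[index] = self._make_sample(lf)

    def _make_sample(self, lf):
        raise NotImplementedError

class BottomUpDataset(BaseDataset):
    def _make_sample(self, lf):
        # Only the class-specific sample construction lives here.
        ...
```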
262-264: 🛠️ Refactor suggestion

Consolidate caching logic across dataset classes

The caching logic implemented in `CenteredInstanceDataset`, `CentroidDataset`, and `SingleInstanceDataset` duplicates code from `BottomUpDataset`. Consolidating this logic can reduce redundancy and improve code maintainability.
tests/data/test_custom_datasets.py (1)

167-169: Verify expected output shapes after changing max_stride

The expected shapes of `image`, `confidence_maps`, and `part_affinity_fields` have been updated at lines 167-169. Confirm that these changes are correct and that the model architecture and loss functions are compatible with these new shapes.
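For reference, padding to a stride rounds each spatial dimension up to the nearest multiple of `max_stride`, so the expected shapes can be derived rather than hard-coded. A small sketch (the exact semantics of `apply_pad_to_stride` are assumed from its name):

```python
import math

def expected_padded_shape(height, width, max_stride):
    """Round spatial dims up to the nearest multiple of max_stride."""
    return (
        math.ceil(height / max_stride) * max_stride,
        math.ceil(width / max_stride) * max_stride,
    )

# e.g. a 384x384 frame with max_stride=32 stays 384x384,
# while 390x390 would pad to 416x416.
assert expected_padded_shape(384, 384, 32) == (384, 384)
assert expected_padded_shape(390, 390, 32) == (416, 416)
```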
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #123      +/-   ##
==========================================
+ Coverage   96.64%   97.62%   +0.98%
==========================================
  Files          23       39      +16
  Lines        1818     4003    +2185
==========================================
+ Hits         1757     3908    +2151
- Misses         61       95      +34
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)

sleap_nn/data/custom_datasets.py (1)

64-111: Consider implementing cache memory management

The current caching implementation loads all samples into memory without any size limits or cleanup mechanism. For large datasets, this could lead to memory issues.

Consider implementing:
- A maximum cache size limit
- LRU (Least Recently Used) cache eviction policy
- Cache cleanup in the `__del__` method

```python
from functools import lru_cache
import sys

class BaseDataset(Dataset):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.cache = {}
        self.cache_size = 0
        self.max_cache_size = 1024 * 1024 * 1024  # 1GB default

    def __del__(self):
        self.cache.clear()
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (1)

- sleap_nn/data/custom_datasets.py (8 hunks)

🧰 Additional context used

🪛 Ruff (0.8.2)

sleap_nn/data/custom_datasets.py
272-273: Use a single `if` statement instead of nested `if` statements (SIM102)
🔇 Additional comments (3)
sleap_nn/data/custom_datasets.py (3)
168-172: LGTM: Efficient caching implementation

The caching implementation in `BottomUpDataset` correctly initializes the cache and efficiently retrieves samples in `__getitem__`.

589-593: LGTM: Consistent caching implementation

The caching implementation in `SingleInstanceDataset` follows the established pattern and correctly handles sample retrieval.
463-514: 🛠️ Refactor suggestion

Reduce code duplication in _fill_cache methods

The `_fill_cache` implementation duplicates preprocessing logic from `BaseDataset._fill_cache`.

Consider extracting common preprocessing steps to a utility method in `BaseDataset`:

```python
class BaseDataset(Dataset):
    def _preprocess_sample(self, sample):
        """Common preprocessing steps for all dataset types."""
        sample["image"] = apply_normalization(sample["image"])
        if self.data_config.preprocessing.is_rgb:
            sample["image"] = convert_to_rgb(sample["image"])
        else:
            sample["image"] = convert_to_grayscale(sample["image"])
        sample["image"], eff_scale = apply_sizematcher(
            sample["image"],
            max_height=self.max_hw[0],
            max_width=self.max_hw[1],
        )
        return sample, eff_scale
```
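Each subclass's `_fill_cache` could then call `sample, eff_scale = self._preprocess_sample(sample)` before its class-specific steps, keeping the normalization, color-conversion, and size-matching logic in one place.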
This is the second PR for #119. We implement in-memory caching by pre-loading the samples into a dictionary and applying augmentations in the `__getitem__()` function to ensure randomness. A minimal sketch of this pattern is shown below.
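The sketch assumes an augmentation callable is available; names like `labels`, `make_sample`, and `augment` are illustrative, not the PR's actual helpers:

```python
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Pre-loads samples into a dict; augments lazily for randomness."""

    def __init__(self, labels, make_sample, augment=None):
        self.augment = augment
        # Eagerly fill the in-memory cache once, at construction time.
        self.cache = {i: make_sample(lf) for i, lf in enumerate(labels)}

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, index):
        sample = self.cache[index]
        # Augment on every access so each epoch sees fresh random transforms.
        if self.augment is not None:
            sample = self.augment(sample)
        return sample
```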
Summary by CodeRabbit

New Features
- In-memory caching in the `CenteredInstanceDataset` class.
- In-memory caching in the `SingleInstanceDataset` class.

Bug Fixes

Tests