-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Sampler Plugin #115
Add Sampler Plugin #115
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the contribution!
Please try to keep lines no longer than 80 characters. Consider using enums instead of string constants.
temp_rank = temp_rank[-k:] | ||
elif method == "randk": | ||
return self.data.sample(n=k).reset_index(drop=True) | ||
elif method in ["mixk", "randtopk"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using an enum for instead of string constants.
datumaro/plugins/sampler/sampler.py
Outdated
def build_cmdline_parser(cls, **kwargs): | ||
parser = super().build_cmdline_parser(**kwargs) | ||
parser.add_argument( | ||
"-algo", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using only -a
and --long-name
variants.
datumaro/plugins/sampler/sampler.py
Outdated
raise Exception(msg) | ||
|
||
# 1. check the empty of the data set | ||
if len(extractor) < 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't consider this an error.
datumaro/plugins/sampler/sampler.py
Outdated
|
||
# 2. Import data into a subset name and convert it | ||
# to a format that will be used in the sampler algorithm with the inference result. | ||
data_df, infer_df = self._load_inference_from_subset(extractor, subset_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the operation to be traversing the input dataset only when __iter__
is called.
datumaro/plugins/sampler/sampler.py
Outdated
for data in subset: | ||
data_df["ImageID"].append(data.id) | ||
|
||
if data.image is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use item.has_image
.
datumaro/plugins/sampler/sampler.py
Outdated
if data.image is None: | ||
msg = f"Invalid data, some data.image is None" | ||
raise Exception(msg) | ||
width, height = data.image.size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size
can return None
in some cases (only path
provided, but the image file is not available).
datumaro/plugins/sampler/sampler.py
Outdated
# Checking and creating algorithms | ||
algorithms = ["entropy"] | ||
if algorithm == "entropy": | ||
from datumaro.plugins.sampler.algorithm.entropy import SampleEntropy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use relative imports in plugins for intra-plugin imports.
@zhiltsov-max Thank you for your review. I modified the contents in commit And CI checking is currently failing because the sampler uses pandas. If there is a guide for the external library, please reply. Currently, it is only added to requirements.txt. |
Plugin dependencies are considered optional for Datumaro, so putting them to the |
|
||
# check the existence of "ImageID" in data & inference | ||
if "ImageID" not in data: | ||
msg = "Invalid Data, ImageID not found in data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to avoid this pattern, because it spreads reader's attention. Just do raise Exception("some text")
datumaro/plugins/sampler/sampler.py
Outdated
- Requesting a sample larger than the number of all images will return all images.|n | ||
|n | ||
Example:|n | ||
|s|s%(prog)s -algo entropy -subset_name train -sample_name sample -unsampled_name unsampled -m topk -k 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update this line.
datumaro/plugins/sampler/sampler.py
Outdated
"--algorithm", | ||
type=str, | ||
default="entropy", | ||
choices=["entropy"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also introduce an enum for this.
datumaro/plugins/sampler/sampler.py
Outdated
infer_df = defaultdict(list) | ||
|
||
# 2. Fill the data_df and infer_df to fit the sampler algorithm input format. | ||
for data in subset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest calling it item
.
datumaro/plugins/sampler/sampler.py
Outdated
data_df["ImageID"].append(data.id) | ||
|
||
if not data.has_image or data.image.size is None: | ||
msg = "Invalid data, the image file is not available" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the error messages here it would be nice to also print data.id
.
- `randtopk`: First, select 3 times the number of k randomly, and return the topk among them. | ||
|
||
``` bash | ||
datum transform -t sampler -- \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the example.
@@ -0,0 +1,501 @@ | |||
ImageID,ClassProbability1,ClassProbability2,ClassProbability3,Uncertainty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reduce the example to few lines?
@zhiltsov-max Thank you for review.
If you have any questions, please comment. |
Summary
This PR includes
How to test
Unittest
python3 -m unittest -v tests/test_sampler.py
Testing sampling with dataset (After obtaining the inference result)
Checklist
develop
branchLicense
Feel free to contact the maintainers if that's a concern.