Merge back Releases/1.4.0 #1111
Merged
Conversation
…-life COCO2017 object detection dataset (#1093)

### Summary

1. Fix the broken `is_stream` test.
2. Fix `StreamDataset.import_from(path, format="coco_instances")` actually creating a `Dataset` instead of a `StreamDataset`.
3. Speed up: allow the initial length of `_CocoBase(stream=True)` to be obtained from `COCOPageMapper`.
4. Speed up parsing of the `"categories"` section when it is at the end of the JSON file.
5. Speed up: slightly change the caching logic and cache size.
6. Fix `_CocoBase(stream=False)` abruptly raising an error when a progress reporter is given.
7. Add `COCOExtractorMerger` to handle the `"coco"` import, which should merge extractors across the tasks.

### How to test

I manually tested the following code on the real-life COCO2017 object detection dataset.

```python
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()

    start = time()
    dataset = (
        StreamDataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
        if args.stream
        else Dataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")
```

**Results:**

- No stream

![no_stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/8951145b-4181-4b04-bd84-b38c7529754a)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|██████████| 5000/5000 [00:00<00:00, 77041.13it/s]
Parsing annotations in 'instances_val2017.json': 100%|██████████| 36781/36781 [00:01<00:00, 32391.89it/s]
Parsing image info in 'instances_train2017.json': 100%|██████████| 118287/118287 [00:01<00:00, 83969.43it/s]
Parsing annotations in 'instances_train2017.json': 100%|██████████| 860001/860001 [00:27<00:00, 31262.75it/s]
Done. Elapsed time: 38.17s
```

- Stream

![stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/98379589-5c7c-411e-946c-3925dddd8e7a)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|██████████| 5000/5000 [00:01<00:00, 3650.78it/s]
Parsing image info in 'instances_train2017.json': 100%|██████████| 118287/118287 [00:39<00:00, 3027.10it/s]
Done. Elapsed time: 238.61s
```

Signed-off-by: Kim, Vinnam <[email protected]>
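The streaming speedups above rest on scanning the COCO JSON file without materializing the whole document. As a rough illustration of the idea behind knowing the dataset length up front (this is a simplified sketch, not the actual `COCOPageMapper` implementation; `count_section_items` is a hypothetical helper), a section's item count can be read with Python's incremental JSON decoder:

```python
import json


def count_section_items(text: str, section: str) -> int:
    """Count the items of a top-level array section without parsing the rest.

    Simplified: assumes the section key appears once and maps to an array.
    """
    decoder = json.JSONDecoder()
    idx = text.index(f'"{section}"')
    idx = text.index("[", idx) + 1  # step inside the array
    count = 0
    while True:
        # Skip whitespace and item separators.
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if text[idx] == "]":
            return count
        # Decode exactly one array element, then continue from its end.
        _, idx = decoder.raw_decode(text, idx)
        count += 1
```

For example, on a COCO-style document this returns the number of entries in `"images"` while never constructing the annotation objects.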
### Summary

- Fix `_get_dm_format_version()` to be faster when there is no `dm_format_version` field in the file.
- Fix `_load_media_type()` to be faster when there is no `media_type` field in the file.
- Fix `TQDMProgressReporter` when `total` is not given (`total = None`).

### How to test

I manually tested the following code on the real-life COCO2017 object detection dataset, converted to the Datumaro (JSON) data format.

```python
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-f", "--format", choices=["coco", "datumaro", "yolo", "voc"], help="Choose format"
)
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()

    path, format = args.format, args.format
    if format == "coco":
        format = "coco_instances"  # Set specific format

    start = time()
    dataset = (
        StreamDataset.import_from(path, format=format, progress_reporter=TQDMProgressReporter())
        if args.stream
        else Dataset.import_from(path, format=format, progress_reporter=TQDMProgressReporter())
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")
```

**Results:**

- No stream

![no_stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/8777f340-9ccf-47f4-8bde-0e36efb5c389)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py -f datumaro
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
100%|██████████| 5000/5000 [00:01<00:00, 4728.92it/s]
100%|██████████| 118287/118287 [00:30<00:00, 3919.85it/s]
Done. Elapsed time: 42.14s
```

- Stream

![stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/eca19caa-73d5-45c3-81a1-1fae17c85e34)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py -f datumaro --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
4999it [00:06, 732.11it/s]
118286it [02:41, 731.49it/s]
Done. Elapsed time: 168.55s
```

Signed-off-by: Kim, Vinnam <[email protected]>
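The `total = None` fix matters because a progress reporter must branch on whether the iterable's length is known: notice how the stream run above prints bare counters (`4999it`) instead of `5000/5000` bars. A minimal stdlib-only sketch of that branching (a hypothetical stand-in, not the real `TQDMProgressReporter`):

```python
import sys


class SimpleProgressReporter:
    """Hypothetical stand-in: reports progress with or without a known total."""

    def iter(self, iterable, desc="progress", total=None):
        if total is None:
            # Fall back to len() when the iterable supports it; a generator
            # does not, so total may legitimately remain None.
            try:
                total = len(iterable)
            except TypeError:
                total = None
        for i, item in enumerate(iterable, start=1):
            if total is not None:
                sys.stderr.write(f"\r{desc}: {i}/{total}")
            else:
                # Unknown total: report only a running count, similar to
                # what tqdm shows when constructed with total=None.
                sys.stderr.write(f"\r{desc}: {i}it")
            yield item
        sys.stderr.write("\n")
```

Iterating a list reports `i/total`; iterating a generator degrades gracefully to a bare counter instead of raising.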
Update 3rd-party.txt for release 1.4.0

Co-authored-by: Vinnam Kim <[email protected]>
…s() is stacked on the top (#1101)

- Ticket no. 115725
- Fix: `Dataset.infos()` can be broken if a transform not redefining `infos()` is stacked on the top.
- Enhance the `StreamDatasetStorage` transform tests added in #1077.
- Also test `call_count` in the tests to validate stacked transforms.

Signed-off-by: Kim, Vinnam <[email protected]>
Co-authored-by: Wonju Lee <[email protected]>
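The bug described above (a stacked transform that does not redefine `infos()` losing the dataset infos) can be modeled with a tiny class hierarchy. This is a simplified sketch, not Datumaro's actual classes:

```python
class Extractor:
    """Toy data source carrying dataset-level infos."""

    def infos(self):
        return {"source": "coco", "version": "1.4.0"}


class Transform(Extractor):
    """Base transform: delegates infos() to the wrapped extractor.

    If this delegation were missing, any subclass that does not
    redefine infos() would fall back to the base Extractor's value,
    and stacking it on top would silently drop the real infos.
    """

    def __init__(self, extractor):
        self._extractor = extractor

    def infos(self):
        return self._extractor.infos()


class RenameLabels(Transform):
    # A transform that does NOT redefine infos(): it must still
    # propagate the wrapped extractor's infos through the stack.
    pass
```

Stacking two such transforms still surfaces the source's infos, which is the invariant the added `call_count` tests guard.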
<!-- Contributing guide: https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md -->

### Summary

<!-- Resolves #111 and #222. Depends on #1000 (for series of dependent commits).

This PR introduces this capability to make the project better in this and that.

- Added this feature
- Removed that feature
- Fixed the problem #1234 -->

### How to test

<!-- Describe the testing procedure for reviewers, if changes are not fully covered by unit tests or manual testing can be complicated. -->

### Checklist

<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [x] I have added the description of my changes into [CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the [documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs) accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```
- Because `train_path` did not match `dataset._source_path`, the `query` was treated as a plain `string` input: it was not converted to a `DatasetItem`, and `dataset.get_datasetitem_by_path(args.query)` could not retrieve the `DatasetItem` matching the `path`.
- Match the CLI `train_path` as `'project/source-1/images/train/1.jpg'`.
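The mismatch above is essentially a path-normalization problem: the CLI query and the stored item path must be compared in the same form. A hedged sketch of that idea (the helper name and dictionary layout are illustrative, not Datumaro's API):

```python
import os.path as osp


def find_item_by_path(items_by_path, query_path, source_root):
    """Resolve a CLI query path against items keyed by source-relative paths.

    Both sides are made absolute and normalized so that separators and
    redundant components do not defeat the comparison.
    """
    norm_query = osp.normpath(osp.abspath(query_path))
    for rel_path, item in items_by_path.items():
        candidate = osp.normpath(osp.abspath(osp.join(source_root, rel_path)))
        if candidate == norm_query:
            return item
    return None
```

With items keyed as `"images/train/1.jpg"` under a source root of `"project/source-1"`, a query of `"project/source-1/images/train/1.jpg"` resolves to the right item.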
* updated version string and changelog
…1102)

- Ticket no. 114762

Signed-off-by: Kim, Vinnam <[email protected]>
* update changelog & release note
- Resolve #1108

Signed-off-by: Vinnam Kim <[email protected]>
update version string to "1.4.0"
There are no conflicts, so the merge can be done smoothly.
- Ticket no. 116090
- It is needed for the Geti dataset-ie MS.

Signed-off-by: Kim, Vinnam <[email protected]>
Codecov Report

Patch coverage:

Additional details and impacted files:
```diff
@@             Coverage Diff             @@
##           develop    #1111      +/-   ##
===========================================
+ Coverage    78.87%   80.45%    +1.58%
===========================================
  Files          239      258       +19
  Lines        27184    30002     +2818
  Branches      5418     6059      +641
===========================================
+ Hits         21441    24139     +2698
- Misses        4479     4509       +30
- Partials      1264     1354       +90
```
wonjuleee approved these changes on Jul 27, 2023
jihyeonyi approved these changes on Jul 27, 2023
Signed-off-by: Kim, Vinnam <[email protected]>