Merge back Releases/1.4.0 #1111
Merged
Conversation
…-life COCO2017 object detection dataset (#1093)

### Summary

1. Fix the broken `is_stream` test.
2. Fix `StreamDataset.import_from(path, format="coco_instances")` actually creating a `Dataset` instead of a `StreamDataset`.
3. Speed up: allow the initial length of `_CocoBase(stream=True)` to be obtained from `COCOPageMapper`.
4. Speed up parsing of the `"categories"` section when it is at the end of the JSON file.
5. Speed up: slightly change the caching logic and cache size.
6. Fix `_CocoBase(stream=False)` abruptly raising an error when a progress reporter is given.
7. Add `COCOExtractorMerger` to handle the `"coco"` import, which should merge extractors across the tasks.

### How to test

I manually tested the following code on the real-life COCO2017 object detection dataset.

```python
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()

    start = time()
    dataset = (
        StreamDataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
        if args.stream
        else Dataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")
```

**Results:**

- No stream

![no_stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/8951145b-4181-4b04-bd84-b38c7529754a)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|██████████| 5000/5000 [00:00<00:00, 77041.13it/s]
Parsing annotations in 'instances_val2017.json': 100%|██████████| 36781/36781 [00:01<00:00, 32391.89it/s]
Parsing image info in 'instances_train2017.json': 100%|██████████| 118287/118287 [00:01<00:00, 83969.43it/s]
Parsing annotations in 'instances_train2017.json': 100%|██████████| 860001/860001 [00:27<00:00, 31262.75it/s]
Done. Elapsed time: 38.17s
```

- Stream

![stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/98379589-5c7c-411e-946c-3925dddd8e7a)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|██████████| 5000/5000 [00:01<00:00, 3650.78it/s]
Parsing image info in 'instances_train2017.json': 100%|██████████| 118287/118287 [00:39<00:00, 3027.10it/s]
Done. Elapsed time: 238.61s
```

Signed-off-by: Kim, Vinnam <[email protected]>
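The streaming speedups above rest on scanning the COCO JSON file without materializing the whole document. As a rough illustration of the idea behind knowing the dataset length up front (this is a simplified sketch, not the actual `COCOPageMapper` implementation; `count_section_items` is a hypothetical helper), a section's item count can be read with Python's incremental JSON decoder:

```python
import json


def count_section_items(text: str, section: str) -> int:
    """Count the items of a top-level array section without parsing the rest.

    Simplified: assumes the section key appears once and maps to an array.
    """
    decoder = json.JSONDecoder()
    idx = text.index(f'"{section}"')
    idx = text.index("[", idx) + 1  # step inside the array
    count = 0
    while True:
        # Skip whitespace and item separators.
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if text[idx] == "]":
            return count
        # Decode exactly one array element, then continue from its end.
        _, idx = decoder.raw_decode(text, idx)
        count += 1
```

For example, on a COCO-style document this returns the number of entries in `"images"` while never constructing the annotation objects.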
### Summary

- Fix `_get_dm_format_version()` to be faster when there is no `dm_format_version` field in the file.
- Fix `_load_media_type()` to be faster when there is no `media_type` field in the file.
- Fix `TQDMProgressReporter` when `total` is not given (`total = None`).

### How to test

I manually tested the following code on the real-life COCO2017 object detection dataset, converted to the Datumaro (JSON) data format.

```python
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-f", "--format", choices=["coco", "datumaro", "yolo", "voc"], help="Choose format"
)
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()

    path, format = args.format, args.format
    if format == "coco":
        format = "coco_instances"  # Set specific format

    start = time()
    dataset = (
        StreamDataset.import_from(path, format=format, progress_reporter=TQDMProgressReporter())
        if args.stream
        else Dataset.import_from(path, format=format, progress_reporter=TQDMProgressReporter())
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")
```

**Results:**

- No stream

![no_stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/8777f340-9ccf-47f4-8bde-0e36efb5c389)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py -f datumaro
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
100%|██████████| 5000/5000 [00:01<00:00, 4728.92it/s]
100%|██████████| 118287/118287 [00:30<00:00, 3919.85it/s]
Done. Elapsed time: 42.14s
```

- Stream

![stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/eca19caa-73d5-45c3-81a1-1fae17c85e34)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py -f datumaro --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
4999it [00:06, 732.11it/s]
118286it [02:41, 731.49it/s]
Done. Elapsed time: 168.55s
```

Signed-off-by: Kim, Vinnam <[email protected]>
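The `total = None` fix matters because a progress reporter must branch on whether the iterable's length is known: notice how the stream run above prints bare counters (`4999it`) instead of `5000/5000` bars. A minimal stdlib-only sketch of that branching (a hypothetical stand-in, not the real `TQDMProgressReporter`):

```python
import sys


class SimpleProgressReporter:
    """Hypothetical stand-in: reports progress with or without a known total."""

    def iter(self, iterable, desc="progress", total=None):
        if total is None:
            # Fall back to len() when the iterable supports it; a generator
            # does not, so total may legitimately remain None.
            try:
                total = len(iterable)
            except TypeError:
                total = None
        for i, item in enumerate(iterable, start=1):
            if total is not None:
                sys.stderr.write(f"\r{desc}: {i}/{total}")
            else:
                # Unknown total: report only a running count, similar to
                # what tqdm shows when constructed with total=None.
                sys.stderr.write(f"\r{desc}: {i}it")
            yield item
        sys.stderr.write("\n")
```

Iterating a list reports `i/total`; iterating a generator degrades gracefully to a bare counter instead of raising.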
Update 3rd-party.txt for release 1.4.0

Co-authored-by: Vinnam Kim <[email protected]>
…s() is stacked on the top (#1101)

- Ticket no. 115725
- Fix: `Dataset.infos()` can be broken if a transform not redefining `infos()` is stacked on the top.
- Enhance the `StreamDatasetStorage` transform tests added in #1077.
- Also test `call_count` in the tests to validate stacked transforms.

Signed-off-by: Kim, Vinnam <[email protected]>
Co-authored-by: Wonju Lee <[email protected]>
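The bug described above (a stacked transform that does not redefine `infos()` losing the dataset infos) can be modeled with a tiny class hierarchy. This is a simplified sketch, not Datumaro's actual classes:

```python
class Extractor:
    """Toy data source carrying dataset-level infos."""

    def infos(self):
        return {"source": "coco", "version": "1.4.0"}


class Transform(Extractor):
    """Base transform: delegates infos() to the wrapped extractor.

    If this delegation were missing, any subclass that does not
    redefine infos() would fall back to the base Extractor's value,
    and stacking it on top would silently drop the real infos.
    """

    def __init__(self, extractor):
        self._extractor = extractor

    def infos(self):
        return self._extractor.infos()


class RenameLabels(Transform):
    # A transform that does NOT redefine infos(): it must still
    # propagate the wrapped extractor's infos through the stack.
    pass
```

Stacking two such transforms still surfaces the source's infos, which is the invariant the added `call_count` tests guard.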
<!-- Contributing guide: https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md -->

### Summary

<!-- Resolves #111 and #222. Depends on #1000 (for series of dependent commits).

This PR introduces this capability to make the project better in this and that.

- Added this feature
- Removed that feature
- Fixed the problem #1234 -->

### How to test

<!-- Describe the testing procedure for reviewers, if changes are not fully covered by unit tests or manual testing can be complicated. -->

### Checklist

<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [x] I have added the description of my changes into [CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the [documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs) accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```
- Because `train_path` did not match `dataset._source_path`, the `query` was treated as a plain `string` input: it was not converted to a `DatasetItem`, and `dataset.get_datasetitem_by_path(args.query)` could not retrieve the `DatasetItem` matching the `path`.
- Match the CLI `train_path` as `'project/source-1/images/train/1.jpg'`.
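The mismatch above is essentially a path-normalization problem: the CLI query and the stored item path must be compared in the same form. A hedged sketch of that idea (the helper name and dictionary layout are illustrative, not Datumaro's API):

```python
import os.path as osp


def find_item_by_path(items_by_path, query_path, source_root):
    """Resolve a CLI query path against items keyed by source-relative paths.

    Both sides are made absolute and normalized so that separators and
    redundant components do not defeat the comparison.
    """
    norm_query = osp.normpath(osp.abspath(query_path))
    for rel_path, item in items_by_path.items():
        candidate = osp.normpath(osp.abspath(osp.join(source_root, rel_path)))
        if candidate == norm_query:
            return item
    return None
```

With items keyed as `"images/train/1.jpg"` under a source root of `"project/source-1"`, a query of `"project/source-1/images/train/1.jpg"` resolves to the right item.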
* updated version string and changelog
…1102)

- Ticket no. 114762

Signed-off-by: Kim, Vinnam <[email protected]>
* update changelog & release note
- Resolve #1108

Signed-off-by: Vinnam Kim <[email protected]>
update version string to "1.4.0"
There are no conflicts, so the merge can be done smoothly.
- Ticket no. 116090
- It is needed for the Geti dataset-ie MS.

Signed-off-by: Kim, Vinnam <[email protected]>
Codecov Report

Patch coverage:

Additional details and impacted files:
```diff
@@             Coverage Diff             @@
##           develop    #1111      +/-   ##
===========================================
+ Coverage    78.87%   80.45%    +1.58%
===========================================
  Files          239      258       +19
  Lines        27184    30002     +2818
  Branches      5418     6059      +641
===========================================
+ Hits         21441    24139     +2698
- Misses        4479     4509       +30
- Partials      1264     1354       +90
```
wonjuleee approved these changes on Jul 27, 2023
jihyeonyi approved these changes on Jul 27, 2023
Signed-off-by: Kim, Vinnam <[email protected]>