Merge back Releases/1.4.0 #1111

Merged
merged 13 commits into from
Jul 27, 2023

Conversation

vinnamkim
Contributor

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added the description of my changes into CHANGELOG.
  • I have updated the documentation accordingly

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

vinnamkim and others added 11 commits July 14, 2023 14:00
…-life COCO2017 object detection dataset (#1093)

### Summary

1. Fix the broken `is_stream` test.
2. Fix `StreamDataset.import_from(path, format="coco_instances")`
creating a `Dataset` instead of a `StreamDataset`.
3. Speed up `_CocoBase(stream=True)` by obtaining its initial length
from `COCOPageMapper`.
4. Speed up parsing of the `"categories"` section when it is at the end
of the JSON file.
5. Slightly change the caching logic and cache size to speed things up.
6. Fix `_CocoBase(stream=False)` unexpectedly raising an error when a
progress reporter is given.
7. Add `COCOExtractorMerger` to handle the `"coco"` import, which should
merge extractors across the tasks.
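
The idea behind item 3 (knowing the length of a streamed dataset before any item is parsed) can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the actual `COCOPageMapper` code: `OffsetIndex` and `StreamingRecords` are hypothetical names, and a JSON Lines buffer stands in for the COCO JSON file.

```python
import io
import json
from typing import BinaryIO, Iterator, List


class OffsetIndex:
    """Hypothetical stand-in for a page mapper: one pass that records where each record starts."""

    def __init__(self, stream: BinaryIO) -> None:
        self.offsets: List[int] = []
        stream.seek(0)
        while True:
            pos = stream.tell()
            line = stream.readline()
            if not line:
                break
            if line.strip():  # ignore blank lines
                self.offsets.append(pos)

    def __len__(self) -> int:
        return len(self.offsets)


class StreamingRecords:
    """Parses one record at a time; the length comes from the index, not from parsed items."""

    def __init__(self, stream: BinaryIO) -> None:
        self._stream = stream
        self._index = OffsetIndex(stream)

    def __len__(self) -> int:
        return len(self._index)

    def __iter__(self) -> Iterator[dict]:
        for pos in self._index.offsets:
            self._stream.seek(pos)
            yield json.loads(self._stream.readline())


raw = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'
records = StreamingRecords(io.BytesIO(raw))
ids = [record["id"] for record in records]
```

Here `len(records)` is available right after indexing, while each record is decoded only when the iterator reaches it, which is the memory/time trade-off the results below are measuring.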

### How to test
I manually tested the following code on the real-life COCO2017 object
detection dataset.
```python
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()
    start = time()

    dataset = (
        StreamDataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
        if args.stream
        else Dataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")
```

**Results:**

- No stream


![no_stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/8951145b-4181-4b04-bd84-b38c7529754a)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|██████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 77041.13it/s]
Parsing annotations in 'instances_val2017.json': 100%|███████████████████████████████████████████████| 36781/36781 [00:01<00:00, 32391.89it/s]
Parsing image info in 'instances_train2017.json': 100%|████████████████████████████████████████████| 118287/118287 [00:01<00:00, 83969.43it/s]
Parsing annotations in 'instances_train2017.json': 100%|███████████████████████████████████████████| 860001/860001 [00:27<00:00, 31262.75it/s]
Done. Elapsed time: 38.17s
```
- Stream


![stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/98379589-5c7c-411e-946c-3925dddd8e7a)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|███████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 3650.78it/s]
Parsing image info in 'instances_train2017.json': 100%|█████████████████████████████████████████████| 118287/118287 [00:39<00:00, 3027.10it/s]
Done. Elapsed time: 238.61s
```

Signed-off-by: Kim, Vinnam <[email protected]>
### Summary
- Make `_get_dm_format_version()` faster when there is no
`dm_format_version` field in the file.
- Make `_load_media_type()` faster when there is no `media_type` field in
the file.
- Fix `TQDMProgressReporter` when `total` is not given (`total = None`).
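
For the last item, the general shape of a progress reporter that tolerates an unknown total might look like the sketch below. It is a standard-library illustration only, not the `TQDMProgressReporter` implementation; `iter_with_progress` is a hypothetical helper.

```python
import io
import sys
from typing import Iterable, Iterator, Optional, TextIO, TypeVar

T = TypeVar("T")


def iter_with_progress(
    items: Iterable[T],
    total: Optional[int] = None,
    out: TextIO = sys.stderr,
) -> Iterator[T]:
    """Yield items unchanged while printing a count, plus a percentage when total is known."""
    for count, item in enumerate(items, start=1):
        if total:  # None (unknown) or 0 would make a percentage meaningless
            out.write(f"\r{count}/{total} ({100 * count / total:.0f}%)")
        else:
            out.write(f"\r{count}it")
        yield item
    out.write("\n")


# A streamed source has no known length, so total stays None.
buf = io.StringIO()
consumed = list(iter_with_progress(iter(range(3)), total=None, out=buf))
```

With `total=None` the reporter falls back to a plain item counter, similar in spirit to the `4999it [...]` style of output in the stream run below.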

### How to test
I manually tested the following code on the real-life COCO2017 object
detection dataset which is converted to Datumaro (JSON) data format.

```python
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--format", required=True, choices=["coco", "datumaro", "yolo", "voc"], help="Choose format")
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()
    # The dataset directory is expected to be named after the chosen format.
    path, format = args.format, args.format
    if format == "coco":
        format = "coco_instances"  # Set specific format
    start = time()

    dataset = (
        StreamDataset.import_from(path, format=format, progress_reporter=TQDMProgressReporter())
        if args.stream
        else Dataset.import_from(path, format=format, progress_reporter=TQDMProgressReporter())
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")
```

**Results:**

- No stream


![no_stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/8777f340-9ccf-47f4-8bde-0e36efb5c389)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py -f datumaro
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 4728.92it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 118287/118287 [00:30<00:00, 3919.85it/s]
Done. Elapsed time: 42.14s
```
- Stream


![stream](https://github.com/openvinotoolkit/datumaro/assets/26541465/eca19caa-73d5-45c3-81a1-1fae17c85e34)

```
(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py -f datumaro --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
4999it [00:06, 732.11it/s]
118286it [02:41, 731.49it/s]
Done. Elapsed time: 168.55s
```

Signed-off-by: Kim, Vinnam <[email protected]>
Update 3rd-party.txt for release 1.4.0

---------

Co-authored-by: Vinnam Kim <[email protected]>
…s() is stacked on the top (#1101)

- Ticket no. 115725
- Fix: Dataset `infos()` can be broken if a transform that does not
redefine `infos()` is stacked on top.
- Enhance the `StreamDatasetStorage` transform tests added in #1077.
- Test `call_count` as well in the tests to validate stacked transforms.
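
A rough sketch of the behaviour being fixed and tested, assuming a simple wrapper design: a transform that does not redefine `infos()` should fall through to whatever it wraps. The classes below (`Source`, `Transform`, `AddVersion`, `Double`) are hypothetical and are not Datumaro's classes; only the use of `call_count` mirrors the tests mentioned above.

```python
from typing import Any, Dict, Iterator
from unittest.mock import patch


class Source:
    """Hypothetical base dataset that owns the metadata."""

    def infos(self) -> Dict[str, Any]:
        return {"author": "example"}

    def __iter__(self) -> Iterator[int]:
        return iter([1, 2, 3])


class Transform:
    """Base wrapper: anything a subclass does not override is delegated to the wrapped object."""

    def __init__(self, wrapped: Any) -> None:
        self._wrapped = wrapped

    def infos(self) -> Dict[str, Any]:
        return self._wrapped.infos()

    def __iter__(self) -> Iterator[int]:
        return iter(self._wrapped)


class AddVersion(Transform):
    """Redefines infos() by extending the wrapped metadata."""

    def infos(self) -> Dict[str, Any]:
        return {**self._wrapped.infos(), "version": "1"}


class Double(Transform):
    """Does not redefine infos(); relies on the delegation in Transform."""

    def __iter__(self) -> Iterator[int]:
        return (2 * x for x in self._wrapped)


source = Source()
stacked = Double(AddVersion(source))

# Spy on the source to check that the call travels down the whole stack exactly once.
with patch.object(source, "infos", wraps=source.infos) as spy:
    infos = stacked.infos()
    source_calls = spy.call_count
```

Checking `call_count` in addition to the returned value guards against a stacked transform silently skipping, or repeating, the call to the layer beneath it.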

Signed-off-by: Kim, Vinnam <[email protected]>
Co-authored-by: Wonju Lee <[email protected]>
<!-- Contributing guide:
https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md
-->

### Summary

<!--
Resolves #111 and #222.
Depends on #1000 (for series of dependent commits).

This PR introduces this capability to make the project better in this
and that.

- Added this feature
- Removed that feature
- Fixed the problem #1234
-->

### How to test
<!-- Describe the testing procedure for reviewers, if changes are
not fully covered by unit tests or manual testing can be complicated.
-->

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [x] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```
- Because `train_path` did not match `dataset._source_path`, `query` was treated as a `string` input: it was either not converted to a `DatasetItem`, or `dataset.get_datasetitem_by_path(args.query)` could not return the `DatasetItem` matching `path`.
- Match `train_path` with the CLI path, as in `project/source-1/images/train/1.jpg`.
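
To illustrate the failure mode, here is a minimal sketch of a lookup that falls back to treating the query as text when the path does not match. `resolve_query` and the `items_by_path` dictionary are hypothetical and only mimic the behaviour described above; they are not the CLI's actual code.

```python
import posixpath
from typing import Dict, Union


def resolve_query(query: str, items_by_path: Dict[str, dict]) -> Union[dict, str]:
    """Return the item stored under the normalized path, or the raw query as a text input."""
    key = posixpath.normpath(query.replace("\\", "/"))
    # If the key does not match exactly, the query silently stays a plain string.
    return items_by_path.get(key, query)


items = {"project/source-1/images/train/1.jpg": {"id": "1"}}

by_path = resolve_query("./project/source-1/images/train/1.jpg", items)
mismatched = resolve_query("source-1/images/train/1.jpg", items)
```

The second call shows why the prefix matters: a path that is rooted differently from the keys never matches, so the caller receives the original string instead of an item.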
* updated version string and changelog
* update changelog & release note
- Resolve #1108 

Signed-off-by: Vinnam Kim <[email protected]>
update version string to "1.4.0"
@vinnamkim marked this pull request as ready for review July 26, 2023 02:18
@vinnamkim requested review from a team as code owners July 26, 2023 02:18
@vinnamkim requested review from jihyeonyi and removed request for a team July 26, 2023 02:18
@vinnamkim
Contributor Author

There are no conflicts, so the merge can be done smoothly.

 - Ticket no. 116090
 - It is needed for Geti dataset-ie MS

Signed-off-by: Kim, Vinnam <[email protected]>
@codecov-commenter

codecov-commenter commented Jul 26, 2023

Codecov Report

Patch coverage: 83.93% and project coverage change: +1.58% 🎉

Comparison is base (3e77b31) 78.87% compared to head (6d65cec) 80.45%.
Report is 63 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1111      +/-   ##
===========================================
+ Coverage    78.87%   80.45%   +1.58%     
===========================================
  Files          239      258      +19     
  Lines        27184    30002    +2818     
  Branches      5418     6059     +641     
===========================================
+ Hits         21441    24139    +2698     
- Misses        4479     4509      +30     
- Partials      1264     1354      +90     
| Flag | Coverage Δ |
|---|---|
| macos-11_Python-3.8 | ? |
| ubuntu-20.04_Python-3.8 | 80.43% <83.93%> (+1.58%) ⬆️ |
| windows-2019_Python-3.8 | ? |

Flags with carried forward coverage won't be shown.

| Files Changed | Coverage Δ |
|---|---|
| `src/datumaro/cli/__main__.py` | 0.00% <ø> (ø) |
| `src/datumaro/cli/commands/__init__.py` | 80.00% <ø> (ø) |
| `src/datumaro/cli/commands/convert.py` | 19.69% <0.00%> (ø) |
| `src/datumaro/cli/commands/detect_format.py` | 21.42% <ø> (ø) |
| `src/datumaro/cli/commands/download.py` | 17.85% <ø> (ø) |
| `src/datumaro/cli/commands/explain.py` | 14.13% <ø> (ø) |
| `src/datumaro/cli/commands/filter.py` | 20.00% <ø> (ø) |
| `src/datumaro/cli/commands/info.py` | 17.24% <ø> (ø) |
| `src/datumaro/cli/commands/merge.py` | 21.17% <ø> (ø) |
| `src/datumaro/cli/commands/patch.py` | 26.92% <ø> (ø) |
... and 202 more

... and 80 files with indirect coverage changes


Signed-off-by: Kim, Vinnam <[email protected]>
@yunchu temporarily deployed to pypi July 27, 2023 08:26, with GitHub Actions (Inactive)
@vinnamkim merged commit 2015d4c into develop Jul 27, 2023
13 checks passed
6 participants