Add num_devices in Engine for multi-gpu training #3778

Merged
merged 5 commits into from
Jul 31, 2024

harimkang
Contributor

@harimkang harimkang commented Jul 31, 2024

Summary

I noticed that multi-GPU settings are not available through Engine; this PR fixes that.
https://jira.devtools.intel.com/browse/CVS-148420
This allows us to set up multi-gpu in the API and CLI.

  • Add num_devices property and setter function in Engine
  • Update Docs to use multi-gpu training
  • Update CHANGELOG
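
For readers unfamiliar with the property-plus-setter pattern named in the first bullet, here is a minimal sketch. `EngineSketch` is a hypothetical stand-in written for illustration; it is not the code merged in this PR, and the validation rule (a positive integer) is an assumption.

```python
class EngineSketch:
    """Hypothetical stand-in for Engine; NOT the actual OTX implementation."""

    def __init__(self, num_devices: int = 1) -> None:
        self._num_devices = 1
        # Assign through the setter so the constructor argument is validated too.
        self.num_devices = num_devices

    @property
    def num_devices(self) -> int:
        """Number of devices requested for training."""
        return self._num_devices

    @num_devices.setter
    def num_devices(self, value: int) -> None:
        # Assumed rule: accept only positive integers (bool is rejected explicitly,
        # because bool is a subclass of int in Python).
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            msg = f"num_devices must be a positive integer, got {value!r}"
            raise ValueError(msg)
        self._num_devices = value


engine = EngineSketch(num_devices=2)
engine.num_devices = 4  # the setter also allows changing the value after construction
```

Routing the constructor argument through the setter keeps the validation in one place, so both `EngineSketch(num_devices=...)` and later assignments behave the same way.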

How to test

  1. API
     engine = Engine(..., num_devices=2)
  2. CLI
     otx train ... --engine.num_devices 2

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions github-actions bot added TEST Any changes in tests OTX 2.0 labels Jul 31, 2024
@github-actions github-actions bot added the DOC Improvements or additions to documentation label Jul 31, 2024
@eunwoosh
Contributor

I'm not sure OTX already supports multi-GPU training. As far as I know, we haven't checked that all models can be trained on multiple GPUs. I also think we should add an integration test to validate distributed training if we really support it. If this is just preparation, then I think it's OK to merge after reverting the documentation.

@harimkang
Contributor Author

harimkang commented Jul 31, 2024

Yes, you are right, but this PR fixes the issue that there is currently no way to use multi-GPU training through Engine. I agree that we should validate all models, but we should at least make this work in OTX. (In any case, it all works for Classification.)

@harimkang harimkang enabled auto-merge July 31, 2024 09:18
@eunwoosh
Contributor

Understood. But honestly, I think it's hard to say that OTX supports multi-GPU training. Currently, YOLOX models can't be trained on multiple GPUs (please refer to #3635). I propose that the documentation at least state that only classification has been validated.

Collaborator

@kprokofi kprokofi left a comment

I agree with Eunwoo's comment. We need to validate all tasks/models and handle the errors when training on multiple GPUs is not possible.
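
As an illustration of the kind of error handling suggested here (not code from this PR or from OTX), a guard could warn or fail when more than one device is requested for a task that has not been validated. The function name, the task strings, and the `strict` flag are all hypothetical; the validated set contains only classification because that is the only task the thread reports as working.

```python
import warnings

# Hypothetical registry: the conversation only reports classification as working.
VALIDATED_MULTI_GPU_TASKS = frozenset({"classification"})


def check_multi_gpu(task: str, num_devices: int, strict: bool = False) -> int:
    """Return num_devices, warning (or raising if strict) for unvalidated tasks."""
    if num_devices < 1:
        msg = f"num_devices must be at least 1, got {num_devices}"
        raise ValueError(msg)
    if num_devices > 1 and task not in VALIDATED_MULTI_GPU_TASKS:
        msg = f"Multi-GPU training has not been validated for task '{task}'."
        if strict:
            # Fail early instead of letting distributed training crash later.
            raise RuntimeError(msg)
        warnings.warn(msg, UserWarning, stacklevel=2)
    return num_devices


check_multi_gpu("classification", 2)  # validated task: passes silently
```

Whether an unvalidated combination should warn or fail outright is a design choice for the maintainers; the `strict` flag in this sketch only shows both options.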

@harimkang harimkang added this pull request to the merge queue Jul 31, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 31, 2024
@sovrasov sovrasov added this pull request to the merge queue Jul 31, 2024
Merged via the queue into openvinotoolkit:develop with commit ca13765 Jul 31, 2024
18 checks passed