Fix Trainer with a parallel model #9578
Conversation
```diff
@@ -426,7 +426,6 @@ def __post_init__(self):
         if is_torch_available() and self.device.type != "cuda" and self.fp16:
             raise ValueError("Mixed precision training with AMP or APEX (`--fp16`) can only be used on CUDA devices.")
-        self._n_gpu = torch.cuda.device_count()
```
Removing this from here, since this is going to be completely set up in `_setup_devices`.
```diff
@@ -381,9 +381,11 @@ def test_data_is_not_parallelized_when_model_is_parallel(self):
         # Make the Trainer believe it's a parallelized model
         model.is_parallelizable = True
         model.model_parallel = True
-        trainer = Trainer(model=model, train_dataset=RegressionDataset(), eval_dataset=RegressionDataset())
+        args = TrainingArguments("./regression", per_device_train_batch_size=16, per_device_eval_batch_size=16)
```
Make sure the test uses batch sizes of 16.
```python
        # Check the Trainer was fooled
        self.assertTrue(trainer.is_model_parallel)
        self.assertEqual(trainer.args.n_gpu, 1)
```
This was still set to 2 before, so this checks it is indeed 1.
LGTM, thanks @sgugger
* Fix Trainer with a parallel model
* More clean up
What does this PR do?

The test introduced in #9566 wasn't actually working, as the default batch size is 8, not 16... So the problem was still there. The reason is that `_setup_devices` in `TrainingArguments` is a `cached_property`, so its result is computed once and for all at init. Had to change the behavior slightly, but it should be okay since it's a private method.

Fixes #9577 (the model was getting wrapped in `DataParallel` because the value of `self.args.n_gpu` was not updated).