[WIP] Fix Trainer.test in ddp before running Trainer.fit #2790

Closed
wants to merge 198 commits into from
Changes from 54 commits
Commits
198 commits
ca98878
do not force
Aug 1, 2020
b0dbc28
debug
Aug 3, 2020
c9f91e0
debug
Aug 3, 2020
e64a56f
debug
Aug 3, 2020
602809f
debug
Aug 3, 2020
fb2e0c8
debug
Aug 3, 2020
f238dc3
debug
Aug 3, 2020
81c5255
debug
Aug 3, 2020
d48e147
debug
Aug 3, 2020
885c1d7
debug
Aug 3, 2020
c1da18b
debug
Aug 3, 2020
98cb6bb
debug
Aug 3, 2020
87e4a78
debug
Aug 3, 2020
8279449
debug
Aug 3, 2020
3fbdf76
debug
Aug 3, 2020
34ac16b
debug
Aug 3, 2020
24fb056
debug
Aug 3, 2020
8d9c49e
Merge branch 'master' into bugfix/test-before-fit
awaelchli Aug 3, 2020
9fce421
merge
awaelchli Aug 3, 2020
7b42a0f
debug
Aug 3, 2020
223136b
debug
Aug 3, 2020
463dfbb
debug
Aug 3, 2020
8395e14
debug
Aug 3, 2020
3453ee2
debug
Aug 4, 2020
69804b7
debug
Aug 4, 2020
60241b5
debug
Aug 4, 2020
700d881
debug
Aug 4, 2020
9928148
debug
Aug 4, 2020
f3c4404
debug
Aug 4, 2020
d95cc46
debug
Aug 4, 2020
50ab31e
debug
Aug 4, 2020
43f2d65
debug
Aug 4, 2020
752dbf1
debug
Aug 4, 2020
fc15ea7
debug
Aug 4, 2020
61e90f5
debug
Aug 4, 2020
bf30a98
debug
Aug 4, 2020
703c1c9
debug
Aug 4, 2020
414e6cc
debug
Aug 4, 2020
3a75faf
debug
Aug 4, 2020
6d8cd81
debug
Aug 4, 2020
85f8929
debug
Aug 4, 2020
f3bb93d
debug
Aug 4, 2020
53a7338
debug
Aug 4, 2020
7a35761
debug
Aug 4, 2020
79358fc
debug
Aug 4, 2020
f106dfb
debug
Aug 4, 2020
3d01604
debug
Aug 4, 2020
f97a8ed
debug
Aug 4, 2020
b426258
debug
Aug 4, 2020
cf09642
debug
Aug 4, 2020
138c906
debug
Aug 4, 2020
b3665d7
debug
Aug 4, 2020
47c4800
debug
Aug 4, 2020
6a9750f
debug
Aug 4, 2020
4e39510
ddptest
Aug 6, 2020
7d82e6b
ddptest
Aug 6, 2020
6c4e4c9
ddptest
Aug 6, 2020
111633d
ddptest
Aug 6, 2020
87ee614
ddptest
Aug 6, 2020
1a26952
ddptest
Aug 6, 2020
6354b21
ddptest
Aug 6, 2020
e7b6ea4
ddptest
Aug 6, 2020
ab94100
ddptest
Aug 6, 2020
18e47c7
ddptest
Aug 6, 2020
d396f7f
ddptest
Aug 6, 2020
5024dcc
ddptest
Aug 6, 2020
26d49c8
ddptest
Aug 6, 2020
bd8f762
ddptest
Aug 6, 2020
f3fe1bc
ddptest
Aug 6, 2020
e4d1823
ddptest
Aug 6, 2020
924b26a
ddptest
Aug 6, 2020
38b89d8
ddptest
Aug 6, 2020
6bd3cec
ddptest
Aug 6, 2020
4431213
ddptest
Aug 6, 2020
28ab5cd
ddptest
Aug 6, 2020
dc16a1f
add ddp script variations
Aug 6, 2020
9031558
add ddp test
Aug 6, 2020
b5bc4d6
rename
Aug 7, 2020
13fc64a
shell
Aug 7, 2020
3163db8
test
Aug 7, 2020
bd189a9
test
Aug 7, 2020
ce4274f
try call
Aug 7, 2020
886ce19
try without subprocess
Aug 7, 2020
884e759
test
Aug 7, 2020
65c1cff
display the error
Aug 7, 2020
d6c57eb
list all variations
awaelchli Aug 8, 2020
3be75ba
try string
awaelchli Aug 9, 2020
25a2748
try copy env
Aug 9, 2020
0911f31
debug
Aug 9, 2020
e700f81
pythonpath
Aug 9, 2020
83bd213
path
Aug 9, 2020
1cecde9
update test
Aug 9, 2020
1316c55
change
Aug 9, 2020
30ad2e7
Merge branch 'ddp_testing' into bugfix/test-before-fit
Aug 9, 2020
61a80ec
remove old file
Aug 9, 2020
462776b
debug
Aug 9, 2020
764c06a
try new
Aug 9, 2020
69fe561
port
Aug 9, 2020
844f106
debug
Aug 9, 2020
a44b9e3
debug
Aug 9, 2020
e712eb9
debug
Aug 9, 2020
5c21884
debug
Aug 10, 2020
2fe51fa
debug
Aug 10, 2020
59c0173
debug
Aug 10, 2020
f3d0190
debug
Aug 10, 2020
5c06679
debug
Aug 10, 2020
5ba3962
debug
Aug 10, 2020
e74cb9c
debug
Aug 10, 2020
a7c732d
debug
Aug 10, 2020
fa5d177
debug
Aug 10, 2020
01a8f11
debug
Aug 10, 2020
3ac5609
debug
Aug 10, 2020
0531f11
debug
Aug 10, 2020
2431333
debug
Aug 10, 2020
7b40fc0
debug
Aug 10, 2020
a293da0
debug
Aug 10, 2020
ee393bd
debug
Aug 10, 2020
a4c546a
debug
Aug 10, 2020
ba517bd
debug
Aug 10, 2020
308ed14
debug
Aug 10, 2020
9f34b2c
debug
Aug 10, 2020
49ed09d
debug
Aug 10, 2020
1874b8a
debug
Aug 10, 2020
b22bd74
debug
Aug 10, 2020
46915c6
cleanup
Aug 10, 2020
c3f9c86
cleanup
Aug 10, 2020
454d4cf
cleanup
Aug 10, 2020
27a815f
move class
Aug 10, 2020
748a963
cleanup
Aug 10, 2020
ce2f31e
cleanup
Aug 10, 2020
76fe75b
cleanup
Aug 10, 2020
6c45ebc
cleanup
Aug 10, 2020
0530234
cleanup
Aug 10, 2020
fe59656
cleanup
Aug 10, 2020
cbab095
cleanup
Aug 10, 2020
9c3dde5
cleanup
Aug 10, 2020
02a5070
cleanup
Aug 10, 2020
0c5592c
cleanup
Aug 10, 2020
f1c5edc
cleanup
Aug 10, 2020
59c95ac
cleanup
Aug 10, 2020
75d4085
Merge branch 'master' into bugfix/test-before-fit
Aug 10, 2020
ed4058f
merge
Aug 10, 2020
1aa0591
cleanup
Aug 10, 2020
c81138f
cleanup
Aug 10, 2020
528381b
try atexit handler
Aug 10, 2020
a0dca5b
cleanup
Aug 10, 2020
7a16c32
cleanup
Aug 10, 2020
473f004
add note about teardown
Aug 10, 2020
c7365fd
cleanup
Aug 10, 2020
d432f56
cleanup
Aug 10, 2020
dbac944
cleanup
Aug 10, 2020
ce1de36
cleanup
Aug 10, 2020
f6dfab9
cleanup
Aug 10, 2020
c527ab5
cleanup
Aug 10, 2020
48263a8
repair
Aug 10, 2020
f393d46
repair
Aug 10, 2020
d8b7d66
repair
Aug 10, 2020
569fe0e
repair
Aug 10, 2020
3d66bac
repair
Aug 11, 2020
ce59c5f
repair
Aug 11, 2020
e53dbe0
repair
Aug 11, 2020
cab2245
repair
Aug 11, 2020
f7fb55d
repair
Aug 11, 2020
d9bd460
repair
Aug 11, 2020
d6fd24c
repair
Aug 11, 2020
4bf3706
repair
Aug 11, 2020
72edd6a
debug
Aug 11, 2020
d128cd5
repair
Aug 11, 2020
795de43
repair
Aug 11, 2020
4c0550a
repair
Aug 11, 2020
a2c47b1
repair
Aug 11, 2020
ae201e8
repair
Aug 11, 2020
ce90830
repair
Aug 11, 2020
99fd9f6
repair
Aug 11, 2020
e5ff21f
repair
Aug 11, 2020
9e6b892
repair
Aug 11, 2020
ab7ebdd
repair
Aug 11, 2020
47712a0
repair
Aug 11, 2020
68a2db6
repair
Aug 11, 2020
8f8c0fd
repair
Aug 11, 2020
159b4c8
repair
Aug 11, 2020
ce4ad1e
repair
Aug 11, 2020
ce8a93c
repair
Aug 11, 2020
25767df
repair
Aug 11, 2020
418fc90
repair
Aug 11, 2020
0495da8
repair
Aug 11, 2020
5b267ff
repair
Aug 11, 2020
18e75ca
repair
Aug 11, 2020
6d56a78
repair
Aug 11, 2020
8622c43
repair
Aug 11, 2020
6dfec2c
repair
Aug 11, 2020
b35679c
repair
Aug 11, 2020
d0e6f3b
repair
Aug 11, 2020
13e9236
repair
Aug 11, 2020
f684550
repair
Aug 11, 2020
b5f8978
repair
Aug 11, 2020
f9a7353
simple
Aug 15, 2020
68ec750
mem
Aug 15, 2020
11 changes: 10 additions & 1 deletion pl_examples/basic_examples/gpu_template.py
@@ -20,11 +20,20 @@ def main(args):
    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = Trainer.from_argparse_args(args)
    trainer = Trainer.from_argparse_args(
        args,
        distributed_backend='ddp',
        limit_train_batches=10,
        limit_val_batches=10,
        max_epochs=1,
    )

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.test(model)
    trainer.fit(model)
    trainer.test(model)
    trainer.fit(model)


66 changes: 66 additions & 0 deletions pl_examples/basic_examples/gpu_template2.py
@@ -0,0 +1,66 @@
"""
Runs a model on a single node across multiple gpus.
"""
import os
from argparse import ArgumentParser

from pytorch_lightning import Trainer, seed_everything, Callback
from pl_examples.models.lightning_template import LightningTemplateModel

seed_everything(234)


class DebugCallback(Callback):

def on_test_batch_end(self, trainer, pl_module):
print('test_batch', trainer.global_rank)


def main(args):
""" Main training routine specific for this project. """
# ------------------------
# 1 INIT LIGHTNING MODEL
# ------------------------
model = LightningTemplateModel(**vars(args))

# ------------------------
# 2 INIT TRAINER
# ------------------------
trainer = Trainer.from_argparse_args(
args,
distributed_backend='ddp',
limit_train_batches=10,
limit_val_batches=10,
max_epochs=1,
callbacks=[DebugCallback()],
)

# ------------------------
# 3 START TRAINING
# ------------------------
trainer.fit(model)
trainer.test(model)


def run_cli():
# ------------------------
# TRAINING ARGUMENTS
# ------------------------
# these are project-wide arguments
root_dir = os.path.dirname(os.path.realpath(__file__))
parent_parser = ArgumentParser(add_help=False)

# each LightningModule defines arguments relevant to it
parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
parser = Trainer.add_argparse_args(parser)
parser.set_defaults(gpus=2)
args = parser.parse_args()

# ---------------------
# RUN TRAINING
# ---------------------
main(args)


if __name__ == '__main__':
run_cli()
26 changes: 20 additions & 6 deletions pytorch_lightning/accelerator_backends/ddp_backend.py
@@ -56,19 +56,19 @@ def train(self, model):
        self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)

    def spawn_ddp_children(self, model):
        #
        assert self.trainer.global_rank == 0
        self.trainer.set_random_port(force=True)
        port = os.environ['MASTER_PORT']

        master_address = '127.0.0.1' if 'MASTER_ADDR' not in os.environ else os.environ['MASTER_ADDR']
        master_address = os.environ.get('MASTER_ADDR', '127.0.0.1')
        os.environ['MASTER_PORT'] = f'{port}'
        os.environ['MASTER_ADDR'] = f'{master_address}'

        # allow the user to pass the node rank
        node_rank = '0'
        if 'NODE_RANK' in os.environ:
            node_rank = os.environ['NODE_RANK']
        if 'GROUP_RANK' in os.environ:
            node_rank = os.environ['GROUP_RANK']

        node_rank = os.environ.get('NODE_RANK', node_rank)
        node_rank = os.environ.get('GROUP_RANK', node_rank)
        os.environ['NODE_RANK'] = node_rank
        os.environ['LOCAL_RANK'] = '0'

@@ -153,11 +153,18 @@ def ddp_train(self, process_idx, mp_queue, model, is_master=False, proc_offset=0
        # try to init for 20 times at max in case ports are taken
        # where to store ip_table
        model.trainer = self.trainer

        # from torch.distributed import is_initialized
        # if not is_master or not is_initialized():
        # assert not (is_master and self.trainer.global_rank > 0)
        # # on rank > 0, we always need to initialize, because these are new processes
        model.init_ddp_connection(
            self.trainer.global_rank,
            self.trainer.world_size,
            self.trainer.is_slurm_managing_tasks
        )
        # else:
        # print('already initialized', os.environ['MASTER_PORT'], os.getpid(), is_master)

        # call setup after the ddp process has connected
        self.trainer.call_setup_hook(model)
@@ -225,5 +232,12 @@ def ddp_train(self, process_idx, mp_queue, model, is_master=False, proc_offset=0
        # clean up memory
        torch.cuda.empty_cache()

        # clean up dist group
        #if self.use_ddp or self.use_ddp2:
        # import torch.distributed as torch_distrib
        # torch_distrib.destroy_process_group()

        # torch.distributed.destroy_process_group()

        if self.trainer.global_rank == 0 and self.trainer.distributed_backend not in ['ddp_spawn', 'ddp_cpu']:
            return results
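
Note (illustrative, not part of the diff): spawn_ddp_children relies on the usual "rank 0 launches the other ranks" pattern. A rough, self-contained sketch of that pattern with made-up names and simplified logic, assuming the training script can simply be re-invoked once per child process:

import os
import subprocess
import sys


def launch_ddp_children(num_gpus, master_port):
    # rank 0 (this process) fixes the rendezvous settings before spawning anyone
    os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', '127.0.0.1')
    os.environ['MASTER_PORT'] = str(master_port)
    os.environ['NODE_RANK'] = os.environ.get('NODE_RANK', '0')
    os.environ['LOCAL_RANK'] = '0'

    children = []
    for local_rank in range(1, num_gpus):
        env = os.environ.copy()
        env['LOCAL_RANK'] = str(local_rank)
        # each child re-runs the same training script with its own local rank
        children.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
    return children
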
22 changes: 20 additions & 2 deletions pytorch_lightning/core/decorators.py
@@ -1,8 +1,6 @@
from functools import wraps
from typing import Callable

from pytorch_lightning.core.lightning import LightningModule


def auto_move_data(fn: Callable) -> Callable:
"""
@@ -40,6 +38,9 @@ def forward(self, x):
"""
@wraps(fn)
def auto_transfer_args(self, *args, **kwargs):
# local import to prevent circular import issue
from pytorch_lightning.core.lightning import LightningModule

if not isinstance(self, LightningModule):
return fn(self, *args, **kwargs)

@@ -48,3 +49,20 @@ def auto_transfer_args(self, *args, **kwargs):
        return fn(self, *args, **kwargs)

    return auto_transfer_args


def run_once(fn):
    """
    Decorate a function or method to make it run only once.
    Subsequent calls will result in a no-operation.
    """
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            fn(*args, **kwargs)

    wrapper.has_run = False
    return wrapper
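
Note (illustrative, not part of the diff): a minimal standalone example of how the run_once decorator above behaves; init_connection is a made-up function for the demo.

from functools import wraps


def run_once(fn):
    """Decorate a function or method so the wrapped call runs only once."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            fn(*args, **kwargs)

    wrapper.has_run = False
    return wrapper


@run_once
def init_connection(port):
    print(f'initializing connection on port {port}')


init_connection(12910)  # prints once
init_connection(12910)  # second call is a no-op
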


4 changes: 3 additions & 1 deletion pytorch_lightning/core/lightning.py
@@ -16,6 +16,7 @@
from torch.utils.data import DataLoader

from pytorch_lightning import _logger as log
from pytorch_lightning.core.decorators import run_once
from pytorch_lightning.core.grads import GradInformation
from pytorch_lightning.core.hooks import ModelHooks
from pytorch_lightning.core.memory import ModelSummary
@@ -921,6 +922,7 @@ def _init_slurm_connection(self) -> None:
        root_node = self.trainer.resolve_root_node_address(root_node)
        os.environ['MASTER_ADDR'] = root_node

    #@run_once
    def init_ddp_connection(self, global_rank: int, world_size: int, is_slurm_managing_tasks: bool = True) -> None:
        """
        Override to define your custom way of setting up a distributed environment.
@@ -952,7 +954,7 @@ def init_ddp_connection(self, global_rank: int, world_size: int, is_slurm_managi
f"WORLD_SIZE environment variable ({os.environ['WORLD_SIZE']}) "
f"is not equal to the computed world size ({world_size}). Ignored."
)

print('master port init', os.environ['MASTER_PORT'], os.getpid())
torch_backend = "nccl" if self.trainer.on_gpu else "gloo"
log.info(f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank+1}/{world_size}")
torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
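
Note (illustrative, not part of the diff): torch.distributed's default env:// rendezvous reads MASTER_ADDR and MASTER_PORT from the environment, so every rank must see the same values; this is why the port chosen by rank 0 is reused by all processes. A minimal sketch with placeholder values:

import os

import torch.distributed as dist


def init_ddp_connection_sketch(global_rank, world_size, on_gpu):
    # all ranks must agree on the address and port of rank 0
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '12910')

    backend = 'nccl' if on_gpu else 'gloo'
    dist.init_process_group(backend, rank=global_rank, world_size=world_size)
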
20 changes: 12 additions & 8 deletions pytorch_lightning/trainer/distrib_data_parallel.py
@@ -171,9 +171,11 @@ def train_fx(trial_hparams, cluster_manager, _):
else:
    XLA_AVAILABLE = True

PID = os.getpid()
RNG1 = np.random.RandomState(PID)
RANDOM_PORTS = RNG1.randint(10000, 19999, 1000)

#PID = os.getpid()
#RNG1 = np.random.RandomState(PID)
#RANDOM_PORTS = RNG1.randint(10000, 19999, 1000)
RANDOM_PORTS = list(range(10000, 20000))


class TrainerDDPMixin(ABC):
@@ -411,13 +413,15 @@ def set_random_port(self, force=False):
"""
# pick a random port first
assert self.num_nodes == 1, 'random port can only be called from single node training'
global RANDOM_PORTS
default_port = RANDOM_PORTS[-1]
RANDOM_PORTS = RANDOM_PORTS[:-1]

print('setting port on rank', self.global_rank)
default_port = os.environ.get('MASTER_PORT')
Contributor:
Not sure if relevant, but when I pulled the current version of this PR down and ran gpu_template.py, I got an NCCL error. In an issue I found on GitHub, someone said this was because "master_port was used more than once at the same time".

How are we making sure this isn't happening? (or how can we?)

Contributor Author:
I don't know what this person means in the post. This is the port that the processes use to communicate through. It is expected that it gets used multiple times.


        # when not forced, use the user port
        if not force:
            default_port = os.environ.get('MASTER_PORT', default_port)
        if force or not default_port:
            global RANDOM_PORTS
            default_port = RANDOM_PORTS[-1]
            RANDOM_PORTS = RANDOM_PORTS[:-1]

        os.environ['MASTER_PORT'] = str(default_port)

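
Note (illustrative, not part of the diff): the hunk above changes set_random_port so that an already-set MASTER_PORT is reused unless force=True, and a fresh port is popped from the shared list otherwise. A simplified standalone sketch of that behaviour, without Trainer state:

import os

RANDOM_PORTS = list(range(10000, 20000))


def set_random_port(force=False):
    global RANDOM_PORTS
    default_port = os.environ.get('MASTER_PORT')

    # keep the port chosen by the user / parent process unless a new one is forced
    if force or not default_port:
        default_port = RANDOM_PORTS[-1]
        RANDOM_PORTS = RANDOM_PORTS[:-1]

    os.environ['MASTER_PORT'] = str(default_port)
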
9 changes: 8 additions & 1 deletion pytorch_lightning/trainer/evaluation_loop.py
@@ -291,6 +291,7 @@ def _evaluate(

        # run validation
        for dataloader_idx, dataloader in enumerate(dataloaders):
            print('here 1')
            dl_outputs = []

            # on TPU we have to wrap it under the ParallelLoader
@@ -303,6 +304,7 @@
            dl_max_batches = max_batches[dataloader_idx]

            for batch_idx, batch in enumerate(dataloader):
                print('here 2')
                if batch is None:
                    continue

@@ -600,16 +602,19 @@ def __log_evaluation_epoch_metrics(self, eval_results, test_mode):
    def evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode: bool = False):
        # make dataloader_idx arg in validation_step optional
        args = [batch, batch_idx]

        print('here 3')
        if (test_mode and len(self.test_dataloaders) > 1) \
                or (not test_mode and len(self.val_dataloaders) > 1):
            args.append(dataloader_idx)

        # handle DP, DDP forward
        if self.use_ddp or self.use_dp or self.use_ddp2:
            # SOMETHING GOES WRONG HERE, test loop is stuck
            output = model(*args)
            return output
Contributor Author:
In the test loop, at batch 9, it gets stuck here. CPU and GPU usage stays at 100% as if it were busy, but it just hangs at the same batch. Any ideas?
I also checked master, and the same problem occurs there, so it is not related to my changes in this PR.

Contributor:
Do you get any error messages that you can copy-paste here?

Contributor Author:
Nope, it just hangs; this happens on master as well. Here I just used print statements to figure out where it occurs.


        print('here 4')

        # Horovod
        if self.use_horovod and self.on_gpu:
            batch = self.transfer_batch_to_gpu(batch, hvd.local_rank())
@@ -635,4 +640,6 @@ def evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode:
        else:
            output = model.validation_step(*args)

        print('here 5')

        return output
11 changes: 8 additions & 3 deletions pytorch_lightning/trainer/trainer.py
@@ -1019,7 +1019,7 @@ def fit(

        # ddp
        elif self.distributed_backend == 'ddp':
            self.set_random_port()
            # self.set_random_port()
            self.accelerator_backend = DDPBackend(self)
            results = self.accelerator_backend.spawn_ddp_children(model)

@@ -1296,6 +1297,7 @@ def test(
        self.verbose_test = verbose

        if self.global_rank != 0:
            # do nothing, rank 0 process will launch new processes for testing
            return

        # If you supply a datamodule you can't supply train_dataloader or val_dataloaders
@@ -1314,6 +1315,10 @@

        self.teardown('test')

        if torch.distributed.is_initialized():
            print('destroy in test', self.global_rank, os.getpid())
            torch.distributed.destroy_process_group()

        return results

    def __test_using_best_weights(self, ckpt_path, test_dataloaders):
@@ -1347,7 +1352,7 @@ def __test_using_best_weights(self, ckpt_path, test_dataloaders):

        # run tests
        self.tested_ckpt_path = ckpt_path
        self.set_random_port(force=True)
        #self.set_random_port()
        self.testing = True
        os.environ['PL_TESTING_MODE'] = '1'
        self.model = model
@@ -1370,7 +1375,7 @@ def __test_given_model(self, model, test_dataloaders):

        # run test
        # sets up testing so we short circuit to eval
        self.set_random_port(force=True)
        #self.set_random_port()
        self.testing = True
        self.model = model
        results = self.fit(model)
3 changes: 2 additions & 1 deletion pytorch_lightning/trainer/training_loop.py
@@ -1021,7 +1021,8 @@ def run_training_teardown(self):
                subprocess.Popen.kill(proc)

        # clean up dist group
        if self.use_ddp or self.use_ddp2:
        if (self.use_ddp or self.use_ddp2):
            print('destroy on rank ', self.global_rank, os.getpid())
            torch_distrib.destroy_process_group()

        # clear mem
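
Note (illustrative, not part of the diff): both teardown sites above ultimately need the same guard, since destroying a process group that was never initialized raises an error. A minimal sketch of that guard:

import torch.distributed as dist


def teardown_ddp():
    # only destroy the group if distributed was actually set up in this process
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()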