
When using neptune logger and multi-gpu DDP, duplicate runs are logged, one for each gpu. #10604

Closed
m-bain opened this issue Nov 18, 2021 · 9 comments · Fixed by #11030
Labels: bug, logger: neptune, priority: 0

Comments

@m-bain

m-bain commented Nov 18, 2021

🐛 Bug

When using the Neptune logger and the Trainer with gpus > 1 and accelerator='ddp', duplicate experiments are created and logged, one for each individual GPU.

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer

from pytorch_lightning.loggers.neptune import NeptuneLogger

# Create NeptuneLogger
neptune_logger = NeptuneLogger(
    project="m-bain/LONGVID")


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        gpus=4,
        strategy='ddp',
        logger=neptune_logger
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Output:

https://app.neptune.ai/m-bain/LONGVID/e/LON-52
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
https://app.neptune.ai/m-bain/LONGVID/e/LON-53
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
https://app.neptune.ai/m-bain/LONGVID/e/LON-54
initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
https://app.neptune.ai/m-bain/LONGVID/e/LON-55
initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

Expected behavior

A single run should be logged (as happens with gpus=1 and/or when DDP is not used), not four of them.

Environment

  • PyTorch Lightning Version: 1.5.1
  • PyTorch Version: 1.10.0
  • Python version: 3.8.12
  • OS: Linux
  • CUDA/cuDNN version: CUDA 11.3
  • GPU models and configuration: 4 x NVIDIA GeForce GTX TITAN X
  • How you installed PyTorch: pip

cc @tchaton

@m-bain added the bug label Nov 18, 2021
@tchaton added the priority: 0 and logger: neptune labels Nov 18, 2021
@tchaton
Contributor

tchaton commented Nov 18, 2021

@kamil-kaczmarek
Contributor

let me check it

@ppaysan

ppaysan commented Nov 22, 2021

I have the same problem

@flukeskywalker

flukeskywalker commented Nov 23, 2021

Same issue here, blocking upgrade to latest PL version.

@kamil-kaczmarek I guess the problem is that _init_run_instance() is being called in __init__(), which is not correct according to the PL logger design. The run should be created on rank zero only.
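
For reference, other PL loggers already defer run creation to the first access of the experiment property and guard it with the rank_zero_experiment decorator, so non-zero ranks only ever see a no-op experiment. A minimal, self-contained sketch against PL 1.5 (the class name and the dict stand-in for the remote run are made up for illustration):

from pytorch_lightning.loggers.base import LightningLoggerBase, rank_zero_experiment
from pytorch_lightning.utilities import rank_zero_only


class LazyDemoLogger(LightningLoggerBase):
    def __init__(self, project):
        super().__init__()
        self._project = project  # picklable config only, no server calls here
        self._run = None         # the real run handle is created lazily

    @property
    @rank_zero_experiment        # non-zero ranks get a no-op dummy experiment
    def experiment(self):
        if self._run is None:
            self._run = {"project": self._project}  # stand-in for the real remote run
        return self._run

    @rank_zero_only
    def log_hyperparams(self, params):
        pass

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        pass

    @property
    def name(self):
        return "lazy-demo"

    @property
    def version(self):
        return "0"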

@kamil-kaczmarek
Contributor

@flukeskywalker

We are double-checking this, but it seems that you are exactly right 👍 Thanks

@ppaysan @m-bain we're working on the fix.

@ankitvad

ankitvad commented Dec 1, 2021

Whew! I was planning to open this issue a month or so ago! I was facing the same issue, along with a couple of other problems, on the same dataset. Thought I'd share my library versions and issues in case they provide extra information!

-Libraries:

torch==1.7.0
pytorch-lightning==1.2.1
neptune-client==0.10.0
fairscale==0.3.0

-Code:

import os

import pytorch_lightning as pl
from pytorch_lightning.loggers.neptune import NeptuneLogger
from pytorch_lightning.callbacks import ModelCheckpoint

# DataModule, PLModule, and Config are defined elsewhere in my project
dataModule = DataModule(data_config="data_config.json", model_config="model_config.json")
model = PLModule(model_config="model_config.json", data_config="data_config.json", bpeTok=dataModule.sp_gec)
model_config = Config("model_config.json")
fD = "model_saved/"
os.makedirs(fD, exist_ok=True)
checkpoint_callback = ModelCheckpoint(dirpath="model_saved/", save_top_k=1, verbose=True, monitor="Val_Loss", mode="min", save_last=True)

neptune_logger = NeptuneLogger(
    api_key=model_config.nept_key,
    project_name=model_config.nept_proj,
    experiment_name=model_config.nept_exp)

trainer = pl.Trainer(
    max_epochs=model_config.epochs,
    checkpoint_callback=checkpoint_callback,
    gradient_clip_val=model_config.clip_grad,
    deterministic=True,
    accelerator='ddp',
    plugins='ddp_sharded',
    amp_backend='native',
    gpus=-1,
    log_every_n_steps=100,
    precision=16,
    val_check_interval=1.0,
    logger=neptune_logger)

trainer.fit(model, dataModule)

Dataset size: ~8.5 million! Everything works here on a smaller dataset! With the newer, larger dataset the loss kept blowing up to a +/-inf/NaN error. Maybe because of ddp_sharded?!

-Library update discussed on Slack (Torch + Lightning):

torch==1.10.0
fairscale==0.3.0
pytorch-lightning==1.4.9

Issues:

pytorch_lightning.utilities.exceptions.MisconfigurationException: Invalid type provided for checkpoint_callback: Expected bool but received <class 'pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint'>. Pass callback instances to the `callbacks` argument in the Trainer constructor instead.

# Fix: keep checkpoint_callback as a bool and pass the ModelCheckpoint instance via callbacks
checkpoint_callback = True
mCallback = ModelCheckpoint(dirpath=fD, save_top_k=1, verbose=True, monitor="Val_Loss", mode="min", save_last=True)
# ...and in the Trainer constructor:
#     checkpoint_callback=checkpoint_callback,
#     callbacks=[mCallback],

Epoch 0:  60%|6| 39700/65987 [2:09:15<1:25:35,  5.12it/s, loss=nan, v_num=-209, Trai
WARNING:neptune.experiments:Invalid metric value: nan for channel Train_Loss_step. Metrics with nan or +/-inf values will not be sent to server
WARNING:neptune.experiments:Invalid metric value: nan for channel Train_Recon_step. Metrics with nan or +/-inf values will not be sent to server

Facing issues with ddp_sharded.

Updated the Neptune client: neptune-client==0.10.

Removed ddp_sharded.

The code works, but there is an issue with the Neptune client:

Epoch 4:  25%|2| 16188/65987 [1:04:27<3:18:17,  4.19it/s, loss=0.614, v_num=-212
Unexpected error in ping thread.
Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/in
ternal/threads/ping_thread.py", line 37, in run
    self.__backend.ping_experiment(self.__experiment)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/in
ternal/api_clients/hosted_api_clients/hosted_alpha_leaderboard_api_client.py", line 392, in ping_experiment
    self.leaderboard_swagger_client.api.ping(experimentId=str(experiment.internal_id)).response().result
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 239, in response
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise   
    raise value
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 303, in _get_swagger_result
    self.request_config.response_callbacks,
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 334, in unmarshal_response
    raise_on_unexpected(incoming_response)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 408, in raise_on_unexpected
    raise make_http_exception(response=http_response)
bravado.exception.HTTPBadGateway: 502 Bad Gateway: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

Epoch 4:  25%|2| 16189/65987 [1:04:27<3:18:16,  4.19it/s, loss=0.607, v_num=-212
Epoch 4: 100%|#| 65987/65987 [4:21:40<00:00,  4.20it/s, loss=0.545, v_num=-212, Epoch 4, global step 328279: Val_Loss reached 0.51469 (best 0.51469), saving model to "/data/uwstudent1/4M/Exp23/model_saved/epoch=4-step=328279.ckpt" as top 1
Epoch 5:  29%|2| 18938/65987 [1:15:04<3:06:29,  4.20it/s, loss=0.572, v_num=-212Unexpected error in ping thread.
Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
      httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1379, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 272, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1379, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 272, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 291, in _get_incoming_response
    inner_response = self.future.result(timeout=timeout)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/requests_client.py", line 270, in result
    **settings
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/internal/threads/ping_thread.py", line 37, in run
    self.__backend.ping_experiment(self.__experiment)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/internal/api_clients/hosted_api_clients/hosted_alpha_leaderboard_api_client.py", line 392, in ping_experiment
    self.leaderboard_swagger_client.api.ping(experimentId=str(experiment.internal_id)).response().result
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 239, in response
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise   
    raise value
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 197, in response
    incoming_response = self._get_incoming_response(timeout)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 126, in wrapper
    self.future._raise_connection_error(exception)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 91, in _raise_connection_error
    self._raise_error(BravadoConnectionError, 'ConnectionError', exception)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 82, in _raise_error
    sys.exc_info()[2],
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 702, in reraise   
    raise value.with_traceback(tb)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 291, in _get_incoming_response
    inner_response = self.future.result(timeout=timeout)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/requests_client.py", line 270, in result
    **settings
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
bravado.http_future.RequestsFutureAdapterConnectionError
Epoch 5:  80%|8| 53007/65987 [3:30:18<51:29,  4.20it/s, loss=0.542, v_num=-212, Epoch 5:  80%|8| 53008/65987 [3:31:11<51:42,  4.18it/s, loss=0.551, v_num=-212, Epoch 5:  80%|8| 53071/65987 [3:31:25<51:27,  4.18it/s, loss=0.585, v_num=-212

->Discussed on Slack and updated Lightning to latest version:

pip install pytorch-lightning==1.5.0.rc1
pip install 'neptune-client[pytorch-lightning]'

-Issues:

from neptune.new.integrations.pytorch_lightning import NeptuneLogger

/data/uwstudent1/4M/Exp23/p36_T15/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.

/data/uwstudent1/4M/Exp23/p36_T15/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:148: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=True)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=True)`.

/data/uwstudent1/4M/Exp23/p36_T15/lib/python3.6/site-packages/pytorch_lightning/trainer/configuration_validator.py:156: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
  rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")

All DDP processes registered. Starting ddp with 3 processes

https://ui.neptune.ai/ankitvad/project/e/SCRIB-213
Global seed set to 1234
https://ui.neptune.ai/ankitvad/project/e/SCRIB-215
Global seed set to 1234
https://ui.neptune.ai/ankitvad/project/e/SCRIB-214
Global seed set to 1234

And I can't kill the tasks with Ctrl+C.

Neptune has updated their library and now provides another API to use as well, so there are some inconsistencies between the Lightning docs for Neptune and the Neptune website's docs for the Lightning integration.

So, for the code I've successfully been working on for so long, the library versions that work well for me are:

torch==1.10.0
pytorch-lightning==1.4.9
neptune-client==0.13.0
neptune-pytorch-lightning==0.9.7

And using:

from pytorch_lightning.loggers.neptune import NeptuneLogger

not:

from neptune.new.integrations.pytorch_lightning import NeptuneLogger

and in the Trainer:

plugins=DDPPlugin(find_unused_parameters=False)

But plugins='ddp_sharded' does not work and explodes to a +/-inf/NaN loss very soon! I consulted the Lightning docs and they mention not to use ddp_sharded unless the parameter count is huge, but it worked perfectly with older code, an older fairscale version, and a smaller training dataset!
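
For completeness, the combination that works for me looks roughly like this. It is only a sketch: the API token, project, and experiment names are placeholders, and my model/datamodule setup is omitted.

import pytorch_lightning as pl
from pytorch_lightning.loggers.neptune import NeptuneLogger
from pytorch_lightning.plugins import DDPPlugin

# torch==1.10.0, pytorch-lightning==1.4.9, neptune-client==0.13.0, neptune-pytorch-lightning==0.9.7
neptune_logger = NeptuneLogger(
    api_key="<NEPTUNE_API_TOKEN>",         # placeholder
    project_name="<workspace>/<project>",  # placeholder
    experiment_name="<experiment>",        # placeholder
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="ddp",
    plugins=DDPPlugin(find_unused_parameters=False),  # instead of plugins='ddp_sharded'
    precision=16,
    logger=neptune_logger,
)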

@awaelchli
Contributor

awaelchli commented Dec 6, 2021

Yes, I can confirm: the run instance should be created lazily, the first time the logger.experiment property is accessed. This can be seen in other implementations of our loggers too. Roughly like this:

class AnyLogger(Logger):
    def __init__(self, ...):
        self._run_instance = None

    @property
    def experiment(self):
        if self._run_instance is None:
            # create the run lazily, on first access
            self._run_instance = self._init_run_instance(api_key, project, name, run, neptune_run_kwargs)
        return self._run_instance

This way we prevent experiments from being created on multiple processes when the user creates a logger:

logger = AnyLogger(...)
trainer = Trainer(logger=logger, ...)

Since the above code runs N times in a distributed setting with N devices, the logger instantiation should 1) not create any files or send messages to the server, 2) not create any unpicklable attributes, and 3) not print any messages. All of these actions need to be delayed, as shown above.
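
One way to satisfy point 2 in particular is to drop the live run handle whenever the logger is pickled for spawned workers, so only plain config crosses process boundaries. A minimal sketch (not the actual NeptuneLogger code; the class and attribute names are illustrative):

import pickle


class LazyRunLogger:
    def __init__(self, project):
        # only plain, picklable config; no files, no server calls, no printing
        self._project = project
        self._run = None

    @property
    def experiment(self):
        # created lazily on first access (on rank zero this would be the real run)
        if self._run is None:
            self._run = object()  # stand-in for the real, unpicklable run handle
        return self._run

    def __getstate__(self):
        # drop the live run handle so the logger can be pickled to DDP workers
        state = self.__dict__.copy()
        state["_run"] = None
        return state


logger = LazyRunLogger("my-workspace/my-project")
_ = logger.experiment                       # the run is created here, not in __init__
clone = pickle.loads(pickle.dumps(logger))  # what a spawned worker receives
assert clone._run is None                   # the worker starts without a live run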

@kamil-kaczmarek
Contributor

We are fixing it here: neptune-ai#8
cc: @Raalsky

@Blaizzy
Contributor

Blaizzy commented Dec 24, 2021

Hi @ankitvad and @ppaysan,

Prince Canuma here, a Data Scientist at Neptune.ai,

I want to personally inform you of the good news! This issue is now fixed in the latest release of PyTorch Lightning, v1.5.7 🎊 🥳

All you need to do is upgrade the library to the latest release 👍
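
For example, with pip (pinning at or above the release that contains the fix):

pip install -U "pytorch-lightning>=1.5.7"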

Happy Christmas and a prosperous New Year in advance!
