
When using neptune logger and multi-gpu DDP, duplicate runs are logged, one for each gpu. #10604

Closed
m-bain opened this issue Nov 18, 2021 · 9 comments · Fixed by #11030
Labels: bug, logger: neptune, priority: 0

Comments

@m-bain

m-bain commented Nov 18, 2021

🐛 Bug

When using the Neptune logger and the Trainer with gpus > 1 and accelerator='ddp', duplicate experiments are created and logged, one for each individual GPU.

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer

from pytorch_lightning.loggers.neptune import NeptuneLogger

# Create NeptuneLogger
neptune_logger = NeptuneLogger(
    project="m-bain/LONGVID")


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        gpus=4,
        strategy='ddp',
        logger=neptune_logger
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Output:

https://app.neptune.ai/m-bain/LONGVID/e/LON-52
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
https://app.neptune.ai/m-bain/LONGVID/e/LON-53
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
https://app.neptune.ai/m-bain/LONGVID/e/LON-54
initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
https://app.neptune.ai/m-bain/LONGVID/e/LON-55
initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

Expected behavior

A single run should be logged (as happens with gpus=1 and/or when DDP is not used), not four of them.

Environment

  • PyTorch Lightning Version: 1.5.1
  • PyTorch Version: 1.10.0
  • Python version: 3.8.12
  • OS: Linux
  • CUDA/cuDNN version: CUDA 11.3
  • GPU models and configuration: 4 x NVIDIA GeForce GTX TITAN X
  • How you installed PyTorch: pip

cc @tchaton

@m-bain added the bug label Nov 18, 2021
@tchaton added the priority: 0 and logger: neptune labels Nov 18, 2021
@tchaton
Contributor

tchaton commented Nov 18, 2021

@kamil-kaczmarek
Contributor

let me check it

@ppaysan

ppaysan commented Nov 22, 2021

I have the same problem

@flukeskywalker

flukeskywalker commented Nov 23, 2021

Same issue here, blocking upgrade to latest PL version.

@kamil-kaczmarek I guess the problem is that _init_run_instance() is being called in __init__(), which is not correct according to the PL logger design. The run should be created on rank zero only.
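
For reference, other PL loggers already defer run creation to the first access of the experiment property and guard it with the rank_zero_experiment decorator, so non-zero ranks only ever see a no-op experiment. A minimal, self-contained sketch against PL 1.5 (the class name and the dict stand-in for the remote run are made up for illustration):

from pytorch_lightning.loggers.base import LightningLoggerBase, rank_zero_experiment
from pytorch_lightning.utilities import rank_zero_only


class LazyDemoLogger(LightningLoggerBase):
    def __init__(self, project):
        super().__init__()
        self._project = project  # picklable config only, no server calls here
        self._run = None         # the real run handle is created lazily

    @property
    @rank_zero_experiment        # non-zero ranks get a no-op dummy experiment
    def experiment(self):
        if self._run is None:
            self._run = {"project": self._project}  # stand-in for the real remote run
        return self._run

    @rank_zero_only
    def log_hyperparams(self, params):
        pass

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        pass

    @property
    def name(self):
        return "lazy-demo"

    @property
    def version(self):
        return "0"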

@kamil-kaczmarek
Contributor

@flukeskywalker

We are double-checking this, but it seems that you are exactly right 👍 Thanks

@ppaysan @m-bain we're working on the fix.

@ankitvad

ankitvad commented Dec 1, 2021

Whew! I was planning to open this issue a month or so ago! I was facing the same issue, along with a couple of other problems, on the same dataset. Thought I'd share my library versions and issues in case they provide extra information!

-Libraries:

torch==1.7.0
pytorch-lightning==1.2.1
neptune-client==0.10.0
fairscale==0.3.0

-Code:

import os

import pytorch_lightning as pl
from pytorch_lightning.loggers.neptune import NeptuneLogger
from pytorch_lightning.callbacks import ModelCheckpoint

# DataModule, PLModule, and Config are defined elsewhere in my project
dataModule = DataModule(data_config="data_config.json", model_config="model_config.json")
model = PLModule(model_config="model_config.json", data_config="data_config.json", bpeTok=dataModule.sp_gec)
model_config = Config("model_config.json")
fD = "model_saved/"
os.makedirs(fD, exist_ok=True)
checkpoint_callback = ModelCheckpoint(dirpath="model_saved/", save_top_k=1, verbose=True, monitor="Val_Loss", mode="min", save_last=True)

neptune_logger = NeptuneLogger(
    api_key=model_config.nept_key,
    project_name=model_config.nept_proj,
    experiment_name=model_config.nept_exp)

trainer = pl.Trainer(
    max_epochs=model_config.epochs,
    checkpoint_callback=checkpoint_callback,
    gradient_clip_val=model_config.clip_grad,
    deterministic=True,
    accelerator='ddp',
    plugins='ddp_sharded',
    amp_backend='native',
    gpus=-1,
    log_every_n_steps=100,
    precision=16,
    val_check_interval=1.0,
    logger=neptune_logger)

trainer.fit(model, dataModule)

Dataset size: ~8.5 million! Everything works here on a smaller dataset! With the newer, larger dataset the loss kept blowing up to a +/-inf/NaN error. Maybe because of ddp_sharded?!

-Library update discussed on Slack (Torch + Lightning):

torch==1.10.0
fairscale==0.3.0
pytorch-lightning==1.4.9

Issues:

pytorch_lightning.utilities.exceptions.MisconfigurationException: Invalid type provided for checkpoint_callback: Expected bool but received <class 'pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint'>. Pass callback instances to the `callbacks` argument in the Trainer constructor instead.

# Fix: keep checkpoint_callback as a bool and pass the ModelCheckpoint instance via callbacks
checkpoint_callback = True
mCallback = ModelCheckpoint(dirpath=fD, save_top_k=1, verbose=True, monitor="Val_Loss", mode="min", save_last=True)
# ...and in the Trainer constructor:
#     checkpoint_callback=checkpoint_callback,
#     callbacks=[mCallback],

Epoch 0:  60%|6| 39700/65987 [2:09:15<1:25:35,  5.12it/s, loss=nan, v_num=-209, Trai
WARNING:neptune.experiments:Invalid metric value: nan for channel Train_Loss_step. Metrics with nan or +/-inf values will not be sent to server
WARNING:neptune.experiments:Invalid metric value: nan for channel Train_Recon_step. Metrics with nan or +/-inf values will not be sent to server

Facing issues with ddp_sharded.

Updated the Neptune client: neptune-client==0.10.

Removed ddp_sharded.

The code works, but there is an issue with the Neptune client:

Epoch 4:  25%|2| 16188/65987 [1:04:27<3:18:17,  4.19it/s, loss=0.614, v_num=-212
Unexpected error in ping thread.
Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/in
ternal/threads/ping_thread.py", line 37, in run
    self.__backend.ping_experiment(self.__experiment)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/in
ternal/api_clients/hosted_api_clients/hosted_alpha_leaderboard_api_client.py", line 392, in ping_experiment
    self.leaderboard_swagger_client.api.ping(experimentId=str(experiment.internal_id)).response().result
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 239, in response
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise   
    raise value
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 303, in _get_swagger_result
    self.request_config.response_callbacks,
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 334, in unmarshal_response
    raise_on_unexpected(incoming_response)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 408, in raise_on_unexpected
    raise make_http_exception(response=http_response)
bravado.exception.HTTPBadGateway: 502 Bad Gateway: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

Epoch 4:  25%|2| 16189/65987 [1:04:27<3:18:16,  4.19it/s, loss=0.607, v_num=-212
Epoch 4: 100%|#| 65987/65987 [4:21:40<00:00,  4.20it/s, loss=0.545, v_num=-212, Epoch 4, global step 328279: Val_Loss reached 0.51469 (best 0.51469), saving model to "/data/uwstudent1/4M/Exp23/model_saved/epoch=4-step=328279.ckpt" as top 1
Epoch 5:  29%|2| 18938/65987 [1:15:04<3:06:29,  4.20it/s, loss=0.572, v_num=-212Unexpected error in ping thread.
Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
      httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1379, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 272, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1379, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 272, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 291, in _get_incoming_response
    inner_response = self.future.result(timeout=timeout)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/requests_client.py", line 270, in result
    **settings
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/internal/threads/ping_thread.py", line 37, in run
    self.__backend.ping_experiment(self.__experiment)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/neptune/internal/api_clients/hosted_api_clients/hosted_alpha_leaderboard_api_client.py", line 392, in ping_experiment
    self.leaderboard_swagger_client.api.ping(experimentId=str(experiment.internal_id)).response().result
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 239, in response
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise   
    raise value
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 197, in response
    incoming_response = self._get_incoming_response(timeout)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 126, in wrapper
    self.future._raise_connection_error(exception)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 91, in _raise_connection_error
    self._raise_error(BravadoConnectionError, 'ConnectionError', exception)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 82, in _raise_error
    sys.exc_info()[2],
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 702, in reraise   
    raise value.with_traceback(tb)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/http_future.py", line 291, in _get_incoming_response
    inner_response = self.future.result(timeout=timeout)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/bravado/requests_client.py", line 270, in result
    **settings
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/data/uwstudent1/4M/Exp23/p36_L12/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
bravado.http_future.RequestsFutureAdapterConnectionError
Epoch 5:  80%|8| 53007/65987 [3:30:18<51:29,  4.20it/s, loss=0.542, v_num=-212, Epoch 5:  80%|8| 53008/65987 [3:31:11<51:42,  4.18it/s, loss=0.551, v_num=-212, Epoch 5:  80%|8| 53071/65987 [3:31:25<51:27,  4.18it/s, loss=0.585, v_num=-212

->Discussed on Slack and updated Lightning to latest version:

pip install pytorch-lightning==1.5.0.rc1
pip install 'neptune-client[pytorch-lightning]'

-Issues:

from neptune.new.integrations.pytorch_lightning import NeptuneLogger

/data/uwstudent1/4M/Exp23/p36_T15/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.

/data/uwstudent1/4M/Exp23/p36_T15/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:148: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=True)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=True)`.

/data/uwstudent1/4M/Exp23/p36_T15/lib/python3.6/site-packages/pytorch_lightning/trainer/configuration_validator.py:156: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
  rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")

All DDP processes registered. Starting ddp with 3 processes

https://ui.neptune.ai/ankitvad/project/e/SCRIB-213
Global seed set to 1234
https://ui.neptune.ai/ankitvad/project/e/SCRIB-215
Global seed set to 1234
https://ui.neptune.ai/ankitvad/project/e/SCRIB-214
Global seed set to 1234

And I can't kill the tasks with Ctrl+C.

Neptune has updated their library and now provides another API to use as well, so there are some inconsistencies between the Lightning docs for Neptune and the Neptune website's docs for the Lightning integration.

So, for the code I've successfully been working on for so long, the library versions that work well for me are:

torch==1.10.0
pytorch-lightning==1.4.9
neptune-client==0.13.0
neptune-pytorch-lightning==0.9.7

And using:

from pytorch_lightning.loggers.neptune import NeptuneLogger

not:

from neptune.new.integrations.pytorch_lightning import NeptuneLogger

and in the Trainer:

plugins=DDPPlugin(find_unused_parameters=False)

But plugins='ddp_sharded' does not work and explodes to a +/-inf/NaN loss very soon! I consulted the Lightning docs and they mention not to use ddp_sharded unless the parameter count is huge, but it worked perfectly with older code, an older fairscale version, and a smaller training dataset!
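
For completeness, the combination that works for me looks roughly like this. It is only a sketch: the API token, project, and experiment names are placeholders, and my model/datamodule setup is omitted.

import pytorch_lightning as pl
from pytorch_lightning.loggers.neptune import NeptuneLogger
from pytorch_lightning.plugins import DDPPlugin

# torch==1.10.0, pytorch-lightning==1.4.9, neptune-client==0.13.0, neptune-pytorch-lightning==0.9.7
neptune_logger = NeptuneLogger(
    api_key="<NEPTUNE_API_TOKEN>",         # placeholder
    project_name="<workspace>/<project>",  # placeholder
    experiment_name="<experiment>",        # placeholder
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="ddp",
    plugins=DDPPlugin(find_unused_parameters=False),  # instead of plugins='ddp_sharded'
    precision=16,
    logger=neptune_logger,
)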

@awaelchli
Contributor

awaelchli commented Dec 6, 2021

Yes, I can confirm: the run instance should be created lazily, the first time the logger.experiment property is accessed. This can be seen in other implementations of our loggers too. Roughly like this:

class AnyLogger(Logger):
    def __init__(self, ...):
        self._run_instance = None

    @property
    def experiment(self):
        if self._run_instance is None:
            # create the run lazily, on first access
            self._run_instance = self._init_run_instance(api_key, project, name, run, neptune_run_kwargs)
        return self._run_instance

This way we prevent experiments from being created on multiple processes when the user creates a logger:

logger = AnyLogger(...)
trainer = Trainer(logger=logger, ...)

Since the above code runs N times in a distributed setting with N devices, the logger instantiation should 1) not create any files or send messages to the server, 2) not create any unpicklable attributes, and 3) not print any messages. All of these actions need to be delayed, as shown above.
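
One way to satisfy point 2 in particular is to drop the live run handle whenever the logger is pickled for spawned workers, so only plain config crosses process boundaries. A minimal sketch (not the actual NeptuneLogger code; the class and attribute names are illustrative):

import pickle


class LazyRunLogger:
    def __init__(self, project):
        # only plain, picklable config; no files, no server calls, no printing
        self._project = project
        self._run = None

    @property
    def experiment(self):
        # created lazily on first access (on rank zero this would be the real run)
        if self._run is None:
            self._run = object()  # stand-in for the real, unpicklable run handle
        return self._run

    def __getstate__(self):
        # drop the live run handle so the logger can be pickled to DDP workers
        state = self.__dict__.copy()
        state["_run"] = None
        return state


logger = LazyRunLogger("my-workspace/my-project")
_ = logger.experiment                       # the run is created here, not in __init__
clone = pickle.loads(pickle.dumps(logger))  # what a spawned worker receives
assert clone._run is None                   # the worker starts without a live run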

@kamil-kaczmarek
Contributor

We are fixing it here: neptune-ai#8
cc: @Raalsky

@Blaizzy
Contributor

Blaizzy commented Dec 24, 2021

Hi @ankitvad and @ppaysan,

Prince Canuma here, a Data Scientist at Neptune.ai,

I want to personally inform you of the good news! This issue is now fixed in the latest release of PyTorch Lightning, v1.5.7 🎊 🥳

All you need to do is upgrade the library to the latest release 👍
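
For example, with pip (pinning at or above the release that contains the fix):

pip install -U "pytorch-lightning>=1.5.7"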

Happy Christmas and a prosperous New Year in advance!
