Hivemind strategy fails #15

Open
blurry-mood opened this issue Dec 22, 2022 · 0 comments

Bug description

I'm trying to use HivemindStrategy to train a ResNet model on CIFAR-10 across two machines, one with a GPU and one without.
I start the CPU machine first, and training begins without a problem. I then copy the initial_peers value it prints to the GPU machine and start training there, but that run fails while constructing the strategy.

How to reproduce the bug

import os

import pandas as pd
import seaborn as sn
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.callbacks.progress import TQDMProgressBar
from pytorch_lightning.loggers import CSVLogger
from torch.optim.lr_scheduler import OneCycleLR
from torch.optim.swa_utils import AveragedModel, update_bn
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 256 if torch.cuda.is_available() else 16
NUM_WORKERS = int(os.cpu_count() / 2)


train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


model = LitResnet(lr=0.05)


from pytorch_lightning.strategies import HivemindStrategy

# Machine without a GPU (started first): no initial peers are needed.
strategy = HivemindStrategy(target_batch_size=2048)

# Machine with a GPU (started second): use this instead, with the peers printed by the first machine.
# strategy = HivemindStrategy(
#     target_batch_size=2048,
#     initial_peers='/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ',
# )

trainer = Trainer(
    max_epochs=30,
    accelerator="auto",
    devices=1 if torch.cuda.is_available() else None,
    strategy=strategy,
)

trainer.fit(model, cifar10_dm)

Error messages and logs

The first machine (without a GPU) trains normally. Here is a sample of its output:

Other machines can connect running the same command:
INITIAL_PEERS=/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ python ...
or passing the peers to the strategy:
HivemindStrategy(initial_peers='/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified

  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.696    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                                                            | 0/3125 [00:00<?, ?it/s]Found per machine batch size automatically from the batch: 16
Epoch 0:   2%|██▏                                                                                                                                            | 48/3125 [00:11<12:03,  4.25it/s, loss=2.37, v_num=7]

The other machine, however, fails:

/opt/conda/lib/python3.10/site-packages/pl_bolts/callbacks/data_monitor.py:20: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("wandb")
/opt/conda/lib/python3.10/site-packages/pl_bolts/utils/semi_supervised.py:15: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("sklearn", pypi_name="scikit-learn")
/opt/conda/lib/python3.10/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:35: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  "lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/opt/conda/lib/python3.10/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:93: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/opt/conda/lib/python3.10/site-packages/pl_bolts/losses/self_supervised_learning.py:234: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  self.nce_loss = AmdimNCELoss(tclip)
/opt/conda/lib/python3.10/site-packages/pl_bolts/datamodules/experience_source.py:18: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("gym")
/opt/conda/lib/python3.10/site-packages/pl_bolts/datamodules/sklearn_datamodule.py:15: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("sklearn")
Global seed set to 7
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/workspace/cifar10.py", line 122, in <module>
    strategy=HivemindStrategy(target_batch_size=2048,
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/hivemind.py", line 142, in __init__
    self.dht = hivemind.DHT(
  File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 88, in __init__
    self.run_in_background(await_ready=await_ready)
  File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 148, in run_in_background
    self.wait_until_ready(timeout)
  File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 151, in wait_until_ready
    self._ready.result(timeout=timeout)
  File "/opt/conda/lib/python3.10/site-packages/hivemind/utils/mpfuture.py", line 258, in result
    return super().result(timeout)
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2022/12/22 20:13:57 failed to parse multiaddr "": empty multiaddr
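
The daemon reports an empty multiaddr, so one thing worth checking is what the comma-separated initial_peers string looks like once it is split into individual addresses. The following is a minimal diagnostic sketch, not a confirmed fix: `split_initial_peers` is a hypothetical helper (it is not part of Lightning or hivemind), and the check that every entry starts with `/` is only a rough sanity test, not a full multiaddr parser.

```python
from typing import List


def split_initial_peers(raw: str) -> List[str]:
    """Split a comma-separated peer string into individual multiaddrs.

    Raises ValueError if an entry is empty or does not start with "/",
    which is the prefix every multiaddr string begins with.
    """
    peers = [part.strip() for part in raw.split(",")]
    for index, peer in enumerate(peers):
        if not peer:
            raise ValueError(f"entry {index} of initial_peers is empty")
        if not peer.startswith("/"):
            raise ValueError(f"entry {index} does not look like a multiaddr: {peer!r}")
    return peers


# The value printed by the first machine in this report.
RAW_PEERS = (
    "/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,"
    "/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ"
)

peers = split_initial_peers(RAW_PEERS)
for peer in peers:
    print(peer)
```

The string from this report splits cleanly into one TCP and one QUIC address, so the empty value seen by the daemon does not appear to come from a stray comma in the copied text. Passing the resulting list as initial_peers instead of the single string is an untested assumption here.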

Environment

CPU machine:

* CUDA:
	- GPU:               None
	- available:         False
	- version:           11.7
* Lightning:
	- lightning-bolts:   0.6.0.post1
	- lightning-lite:    1.8.0
	- lightning-utilities: 0.3.0
	- pytorch-lightning: 1.8.0
	- torch:             1.13.1
	- torchmetrics:      0.10.0
	- torchvision:       0.14.1
* Packages:
	- absl-py:           1.3.0
	- accelerate:        0.15.0
	- aiohttp:           3.8.3
	- aiosignal:         1.3.1
	- asttokens:         2.2.1
	- async-timeout:     4.0.2
	- attrs:             21.2.0
	- automat:           20.2.0
	- babel:             2.8.0
	- backcall:          0.2.0
	- base58:            2.1.1
	- bcrypt:            3.2.0
	- blinker:           1.4
	- cachetools:        5.2.0
	- certifi:           2020.6.20
	- chardet:           4.0.0
	- charset-normalizer: 2.1.1
	- click:             8.0.3
	- cloud-init:        22.4.2
	- colorama:          0.4.4
	- command-not-found: 0.3
	- configargparse:    1.5.3
	- configobj:         5.0.6
	- constantly:        15.1.0
	- contourpy:         1.0.6
	- cryptography:      3.4.8
	- cycler:            0.11.0
	- datasets:          2.8.0
	- dbus-python:       1.2.18
	- decorator:         5.1.1
	- diffusers:         0.11.1
	- dill:              0.3.6
	- distro:            1.7.0
	- distro-info:       1.1build1
	- executing:         1.2.0
	- filelock:          3.8.2
	- fire:              0.5.0
	- fonttools:         4.38.0
	- frozenlist:        1.3.3
	- fsspec:            2022.11.0
	- ftfy:              6.1.1
	- google-auth:       2.15.0
	- google-auth-oauthlib: 0.4.6
	- grpcio:            1.51.1
	- grpcio-tools:      1.48.2
	- hivemind:          1.1.4
	- httplib2:          0.20.2
	- huggingface-hub:   0.11.1
	- hyperlink:         21.0.0
	- idna:              3.3
	- importlib-metadata: 4.6.4
	- incremental:       21.3.0
	- ipython:           8.7.0
	- jedi:              0.18.2
	- jeepney:           0.7.1
	- jinja2:            3.0.3
	- jsonpatch:         1.32
	- jsonpointer:       2.0
	- jsonschema:        3.2.0
	- keyring:           23.5.0
	- kiwisolver:        1.4.4
	- launchpadlib:      1.10.16
	- lazr.restfulclient: 0.14.4
	- lazr.uri:          1.0.6
	- lgg:               0.2.4
	- lightning-bolts:   0.6.0.post1
	- lightning-lite:    1.8.0
	- lightning-utilities: 0.3.0
	- markdown:          3.4.1
	- markupsafe:        2.1.1
	- matplotlib:        3.6.2
	- matplotlib-inline: 0.1.6
	- more-itertools:    8.10.0
	- msgpack:           1.0.4
	- multiaddr:         0.0.9
	- multidict:         6.0.3
	- multiprocess:      0.70.14
	- netaddr:           0.8.0
	- netifaces:         0.11.0
	- numpy:             1.24.0
	- nvidia-cublas-cu11: 11.10.3.66
	- nvidia-cuda-nvrtc-cu11: 11.7.99
	- nvidia-cuda-runtime-cu11: 11.7.99
	- nvidia-cudnn-cu11: 8.5.0.96
	- oauthlib:          3.2.0
	- packaging:         22.0
	- pandas:            1.5.2
	- parso:             0.8.3
	- pexpect:           4.8.0
	- pickleshare:       0.7.5
	- pillow:            9.3.0
	- pip:               22.0.2
	- prefetch-generator: 1.0.3
	- prompt-toolkit:    3.0.36
	- protobuf:          3.20.1
	- psutil:            5.9.4
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyarrow:           10.0.1
	- pyasn1:            0.4.8
	- pyasn1-modules:    0.2.1
	- pydantic:          1.10.2
	- pygments:          2.13.0
	- pygobject:         3.42.1
	- pyhamcrest:        2.0.2
	- pyjwt:             2.3.0
	- pymultihash:       0.8.2
	- pyopenssl:         21.0.0
	- pyparsing:         2.4.7
	- pyrsistent:        0.18.1
	- pyserial:          3.5
	- python-apt:        2.3.0+ubuntu2.1
	- python-dateutil:   2.8.2
	- python-debian:     0.1.43ubuntu1
	- python-magic:      0.4.24
	- pytorch-lightning: 1.8.0
	- pytz:              2022.1
	- pyyaml:            5.4.1
	- regex:             2022.10.31
	- requests:          2.25.1
	- requests-oauthlib: 1.3.1
	- responses:         0.18.0
	- rsa:               4.9
	- scipy:             1.9.3
	- seaborn:           0.12.1
	- secretstorage:     3.3.1
	- service-identity:  18.1.0
	- setuptools:        59.6.0
	- six:               1.16.0
	- sortedcontainers:  2.4.0
	- sos:               4.4
	- ssh-import-id:     5.11
	- stack-data:        0.6.2
	- systemd-python:    234
	- tensorboard:       2.11.0
	- tensorboard-data-server: 0.6.1
	- tensorboard-plugin-wit: 1.8.1
	- tensorboardx:      2.5.1
	- termcolor:         2.1.1
	- tokenizers:        0.13.2
	- torch:             1.13.1
	- torchmetrics:      0.10.0
	- torchvision:       0.14.1
	- tqdm:              4.64.1
	- traitlets:         5.8.0
	- transformers:      4.25.1
	- twisted:           22.1.0
	- typing-extensions: 4.4.0
	- ubuntu-advantage-tools: 27.12
	- ubuntu-drivers-common: 0.0.0
	- ufw:               0.36.1
	- unattended-upgrades: 0.1
	- urllib3:           1.26.5
	- uvloop:            0.17.0
	- varint:            1.0.2
	- wadllib:           1.3.6
	- wcwidth:           0.2.5
	- werkzeug:          2.2.2
	- wheel:             0.37.1
	- xkit:              0.0.0
	- xxhash:            3.1.0
	- yarl:              1.8.2
	- zipp:              1.0.0
	- zope.interface:    5.4.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.10.6
	- version:           #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022

GPU machine:

* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 3090
        - available:         True
        - version:           11.6
* Lightning:
        - lightning-bolts:   0.6.0.post1
        - lightning-utilities: 0.5.0
        - pytorch-lightning: 1.8.6
        - torch:             1.13.1
        - torchelastic:      0.2.2
        - torchmetrics:      0.10.0
        - torchtext:         0.14.1
        - torchvision:       0.14.1
* Packages:
        - absl-py:           1.3.0
        - accelerate:        0.15.0
        - aiohttp:           3.8.3
        - aiosignal:         1.3.1
        - anyio:             3.6.2
        - argon2-cffi:       21.3.0
        - argon2-cffi-bindings: 21.2.0
        - arrow:             1.2.3
        - asttokens:         2.0.5
        - astunparse:        1.6.3
        - async-timeout:     4.0.2
        - attrs:             22.1.0
        - babel:             2.11.0
        - backcall:          0.2.0
        - base58:            2.1.1
        - bash-kernel:       0.9.0
        - beautifulsoup4:    4.11.1
        - bleach:            5.0.1
        - brotlipy:          0.7.0
        - cachetools:        5.2.0
        - certifi:           2022.9.24
        - cffi:              1.15.1
        - chardet:           4.0.0
        - charset-normalizer: 2.0.4
        - comm:              0.1.2
        - conda:             22.11.1
        - conda-build:       3.23.3
        - conda-package-handling: 1.9.0
        - configargparse:    1.5.3
        - contourpy:         1.0.6
        - cryptography:      38.0.1
        - cycler:            0.11.0
        - datasets:          2.8.0
        - debugpy:           1.6.4
        - decorator:         5.1.1
        - defusedxml:        0.7.1
        - diffusers:         0.11.1
        - dill:              0.3.6
        - dnspython:         2.2.1
        - entrypoints:       0.4
        - exceptiongroup:    1.0.4
        - executing:         0.8.3
        - expecttest:        0.1.4
        - fastjsonschema:    2.16.2
        - filelock:          3.6.0
        - flit-core:         3.6.0
        - fonttools:         4.38.0
        - fqdn:              1.5.1
        - frozenlist:        1.3.3
        - fsspec:            2022.11.0
        - ftfy:              6.1.1
        - future:            0.18.2
        - glob2:             0.7
        - google-auth:       2.15.0
        - google-auth-oauthlib: 0.4.6
        - grpcio:            1.51.1
        - grpcio-tools:      1.48.2
        - hivemind:          1.1.4
        - huggingface-hub:   0.11.1
        - hypothesis:        6.61.0
        - idna:              3.4
        - importlib-metadata: 5.2.0
        - iniconfig:         1.1.1
        - ipykernel:         6.19.4
        - ipython:           8.7.0
        - ipython-genutils:  0.2.0
        - ipywidgets:        8.0.3
        - isoduration:       20.11.0
        - jedi:              0.18.1
        - jinja2:            3.1.2
        - json5:             0.9.10
        - jsonpointer:       2.3
        - jsonschema:        4.17.3
        - jupyter:           1.0.0
        - jupyter-archive:   3.3.3
        - jupyter-client:    7.4.8
        - jupyter-console:   6.4.4
        - jupyter-core:      5.1.0
        - jupyter-events:    0.5.0
        - jupyter-http-over-ws: 0.0.8
        - jupyter-server:    1.23.4
        - jupyter-server-terminals: 0.4.3
        - jupyterlab:        3.5.2
        - jupyterlab-pygments: 0.2.2
        - jupyterlab-server: 2.16.5
        - jupyterlab-widgets: 3.0.4
        - kiwisolver:        1.4.4
        - lgg:               0.2.4
        - libarchive-c:      2.9
        - lightning-bolts:   0.6.0.post1
        - lightning-utilities: 0.5.0
        - markdown:          3.4.1
        - markupsafe:        2.1.1
        - matplotlib:        3.6.2
        - matplotlib-inline: 0.1.6
        - mistune:           2.0.4
        - mkl-fft:           1.3.1
        - mkl-random:        1.2.2
        - mkl-service:       2.4.0
        - mpmath:            1.2.1
        - msgpack:           1.0.4
        - multiaddr:         0.0.9
        - multidict:         6.0.3
        - multiprocess:      0.70.14
        - nbclassic:         0.4.8
        - nbclient:          0.7.2
        - nbconvert:         7.2.7
        - nbformat:          5.7.1
        - nbzip:             0.1.0
        - nest-asyncio:      1.5.6
        - netaddr:           0.8.0
        - notebook:          6.5.2
        - notebook-shim:     0.2.2
        - numpy:             1.22.3
        - oauthlib:          3.2.2
        - packaging:         22.0
        - pandas:            1.5.2
        - pandocfilters:     1.5.0
        - parso:             0.8.3
        - pexpect:           4.8.0
        - pickleshare:       0.7.5
        - pillow:            9.3.0
        - pip:               22.3.1
        - pkginfo:           1.8.3
        - platformdirs:      2.6.0
        - pluggy:            1.0.0
        - prefetch-generator: 1.0.3
        - prometheus-client: 0.15.0
        - prompt-toolkit:    3.0.20
        - protobuf:          3.20.1
        - psutil:            5.9.0
        - ptyprocess:        0.7.0
        - pure-eval:         0.2.2
        - pyarrow:           10.0.1
        - pyasn1:            0.4.8
        - pyasn1-modules:    0.2.8
        - pycosat:           0.6.4
        - pycparser:         2.21
        - pydantic:          1.10.2
        - pygments:          2.11.2
        - pymultihash:       0.8.2
        - pyopenssl:         22.0.0
        - pyparsing:         3.0.9
        - pyrsistent:        0.19.2
        - pysocks:           1.7.1
        - pytest:            7.2.0
        - python-dateutil:   2.8.2
        - python-etcd:       0.4.5
        - python-json-logger: 2.0.4
        - pytorch-lightning: 1.8.6
        - pytz:              2022.1
        - pyyaml:            6.0
        - pyzmq:             24.0.1
        - qtconsole:         5.4.0
        - qtpy:              2.3.0
        - regex:             2022.10.31
        - requests:          2.28.1
        - requests-oauthlib: 1.3.1
        - responses:         0.18.0
        - rfc3339-validator: 0.1.4
        - rfc3986-validator: 0.1.1
        - rsa:               4.9
        - ruamel.yaml:       0.17.21
        - ruamel.yaml.clib:  0.2.6
        - scipy:             1.9.3
        - seaborn:           0.12.1
        - send2trash:        1.8.0
        - setuptools:        65.5.0
        - six:               1.16.0
        - sniffio:           1.3.0
        - sortedcontainers:  2.4.0
        - soupsieve:         2.3.2.post1
        - stack-data:        0.2.0
        - sympy:             1.11.1
        - tensorboard:       2.11.0
        - tensorboard-data-server: 0.6.1
        - tensorboard-plugin-wit: 1.8.1
        - tensorboardx:      2.5.1
        - terminado:         0.17.1
        - tinycss2:          1.2.1
        - tokenizers:        0.13.2
        - toml:              0.10.2
        - tomli:             2.0.1
        - toolz:             0.12.0
        - torch:             1.13.1
        - torchelastic:      0.2.2
        - torchmetrics:      0.10.0
        - torchtext:         0.14.1
        - torchvision:       0.14.1
        - tornado:           6.2
        - tqdm:              4.64.1
        - traitlets:         5.7.1
        - transformers:      4.25.1
        - types-dataclasses: 0.6.6
        - typing-extensions: 4.4.0
        - uri-template:      1.2.0
        - urllib3:           1.26.13
        - uvloop:            0.17.0
        - varint:            1.0.2
        - wcwidth:           0.2.5
        - webcolors:         1.12
        - webencodings:      0.5.1
        - websocket-client:  1.4.2
        - werkzeug:          2.2.2
        - wheel:             0.37.1
        - widgetsnbextension: 4.0.4
        - xxhash:            3.1.0
        - yarl:              1.8.2
        - zipp:              3.11.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.10.8
        - version:           #142~18.04.1-Ubuntu SMP Thu Sep 1 16:25:16 UTC 2022
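
The two listings above are long, and the machines do not run identical package versions. As a rough aid for spotting the differences, the sketch below compares a handful of versions copied from the listings. The choice of packages is an assumption about which ones are likely to matter, `version_mismatches` is a hypothetical helper, and nothing here establishes that a version difference causes the failure.

```python
from typing import Dict, Tuple

# Versions copied from the two environment listings in this report.
CPU_MACHINE = {
    "pytorch-lightning": "1.8.0",
    "lightning-utilities": "0.3.0",
    "hivemind": "1.1.4",
    "torch": "1.13.1",
    "numpy": "1.24.0",
    "protobuf": "3.20.1",
}
GPU_MACHINE = {
    "pytorch-lightning": "1.8.6",
    "lightning-utilities": "0.5.0",
    "hivemind": "1.1.4",
    "torch": "1.13.1",
    "numpy": "1.22.3",
    "protobuf": "3.20.1",
}


def version_mismatches(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {package: (version_in_a, version_in_b)} where the two differ.

    A package present on only one side is reported with "missing" on the other.
    """
    names = sorted(set(a) | set(b))
    return {
        name: (a.get(name, "missing"), b.get(name, "missing"))
        for name in names
        if a.get(name) != b.get(name)
    }


for name, (cpu, gpu) in version_mismatches(CPU_MACHINE, GPU_MACHINE).items():
    print(f"{name}: CPU machine {cpu}, GPU machine {gpu}")
```

Among the packages compared, hivemind, torch, and protobuf match on both machines, while pytorch-lightning, lightning-utilities, and numpy differ.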

More info

  • The code I used for training is here.
  • This CIFAR-10 example worked perfectly fine.
@carmocca carmocca added the bug Something isn't working label Dec 23, 2022
@Borda Borda transferred this issue from Lightning-AI/pytorch-lightning May 3, 2023
@Lightning-Universe Lightning-Universe deleted a comment from github-actions bot Jul 3, 2024