Not able to use more than 1 GPU #20520

Open
debashis-tech opened this issue Dec 26, 2024 · 1 comment
Labels
ver: 2.4.x, waiting on author (Waiting on user action, correction, or update)

Comments

@debashis-tech

Bug description

I have an ml.p4d.24xlarge instance on AWS and am trying to train a Temporal Fusion Transformer model, but I am not able to use more than 1 GPU at a time. Anything other than devices=1 does not work.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

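import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# NOTE: `training` (used below) is a TimeSeriesDataSet built earlier; its construction is not shown here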

# configure network and trainer
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard

trainer = pl.Trainer(
    max_epochs=5,
    accelerator="gpu",
    devices=2,
    enable_model_summary=True,
    gradient_clip_val=0.1,  # 0.1321938983226982
    # limit_train_batches=50,  # comment in for training, running validation every 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
quantile_loss = QuantileLoss(quantiles=[0.5])
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.0004,  # 0.00037883813052639795, 0.0001
    hidden_size=128,
    attention_head_size=4,
    dropout=0.1,  # 0.27511071120990627
    hidden_continuous_size=32,
    # loss=QuantileLoss(),
    # output_size=[1],  # there are 7 quantiles by default: [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]
    # 5, [1,1,1,1,1]
    # loss=MAE(),
    loss=quantile_loss,
    # log_interval=10,  # uncomment for learning rate finder and otherwise, e.g. to 10 for logging every 10 batches
    optimizer="AdamW",
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")


### Error messages and logs

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

### Environment

Platform            AWS
GPU                ml.p4d.24xlarge
Python              3.11.10
pytorch-forecasting                1.2.0
lightning                          2.4.0
lightning-utilities                0.11.9
pytorch-lightning                  2.4.0
pytorch_optimizer                  3.3.0
pytorch-ranger                     0.1.1
s3torchconnector                   1.2.6
s3torchconnectorclient             1.2.7
sagemaker_pytorch_training         2.8.1
tft-torch                          0.0.6
torch                              2.5.1+cu124
torchaudio                         2.5.1+cu124
torchmetrics                       1.6.0
torchtext                          0.18.0+cu124
torchtnt                           0.2.4
torchvision                        0.20.1+cu124

### More info

_No response_
@debashis-tech added the "bug" and "needs triage" labels on Dec 26, 2024
@lantiga
Collaborator

lantiga commented Jan 6, 2025

@debashis-tech from the error

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

it looks like you are calling torch.cuda.is_available() or another CUDA-related function in the main process, outside of your LightningModule. That initializes CUDA before PyTorch Lightning has a chance to spawn the processes for DDP. This is an inherent limitation of CUDA, so there is not much we can do about it on our side.

Please make sure you don't call into CUDA before Trainer.fit is called.

@lantiga added the "waiting on author" label and removed the "needs triage" and "bug" labels on Jan 6, 2025