I have an ml.p4d.24xlarge machine in AWS. I am trying to run a Temporal Fusion Transformer model, but I am not able to use more than 1 GPU at a time. Anything other than devices=1 does not work.
What version are you seeing the problem on?
v2.4
How to reproduce the bug

    import lightning.pytorch as pl
    from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
    from lightning.pytorch.loggers import TensorBoardLogger
    from pytorch_forecasting import TemporalFusionTransformer
    from pytorch_forecasting.metrics import QuantileLoss

    # configure network and trainer
    early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
    lr_logger = LearningRateMonitor()  # log the learning rate
    logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard
    trainer = pl.Trainer(
        max_epochs=5,
        accelerator="gpu",
        devices=2,
        enable_model_summary=True,
        gradient_clip_val=0.1,  # 0.1321938983226982, #0.1,
        # limit_train_batches=50,  # comment in for training, running validation every 30 batches
        # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
        callbacks=[lr_logger, early_stop_callback],
        logger=logger,
    )
    quantile_loss = QuantileLoss(quantiles=[0.5])
    # `training` is the TimeSeriesDataSet built earlier (not shown in this snippet)
    tft = TemporalFusionTransformer.from_dataset(
        training,
        learning_rate=0.0004,  # 0.00037883813052639795, 0.0001,
        hidden_size=128,
        attention_head_size=4,
        dropout=0.1,  # 0.27511071120990627, #,
        hidden_continuous_size=32,  # 32,
        # loss=QuantileLoss(),
        # output_size=[1],  # there are 7 quantiles by default: [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]
        # 5, [1,1,1,1,1]
        # loss=MAE(),
        loss=quantile_loss,
        # log_interval=10,  # uncomment for learning rate finder and otherwise, e.g. to 10 for logging every 10 batches
        optimizer="AdamW",
        reduce_on_plateau_patience=4,
    )
    print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")

### Error messages and logs
Error messages and logs here please
RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
### Environment
Platform AWS
GPU ml.p4d.24xlarge
Python 3.11.10
pytorch-forecasting 1.2.0
lightning 2.4.0
lightning-utilities 0.11.9
pytorch-lightning 2.4.0
pytorch_optimizer 3.3.0
pytorch-ranger 0.1.1
s3torchconnector 1.2.6
s3torchconnectorclient 1.2.7
sagemaker_pytorch_training 2.8.1
tft-torch 0.0.6
torch 2.5.1+cu124
torchaudio 2.5.1+cu124
torchmetrics 1.6.0
torchtext 0.18.0+cu124
torchtnt 0.2.4
torchvision 0.20.1+cu124
### More info
_No response_
RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
It looks like you are calling `torch.cuda.is_available()` or other CUDA-related functions in the main process, outside of your LightningModule. This initializes CUDA before PyTorch Lightning has a chance to spawn the processes for DDP. This is an inherent limitation of CUDA, so we can't do much about it.
Please make sure you don't call into CUDA before Trainer.fit is called.
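
One way to find out whether something has already touched CUDA is to check `torch.cuda.is_initialized()` right before creating the trainer. The sketch below is only illustrative: the helper names `cuda_already_initialized` and `check_before_fit` are made up for this example and are not part of Lightning, and the check only sees CUDA state that PyTorch itself tracks.

```python
import importlib.util


def cuda_already_initialized() -> bool:
    """Return True if PyTorch has already initialized CUDA in this process."""
    # Hypothetical helper, not a Lightning API. Skip the check if torch is absent.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    # is_initialized() only reports state; it does not initialize CUDA itself.
    return torch.cuda.is_initialized()


def check_before_fit() -> None:
    """Fail early, with a hint, instead of failing inside Trainer.fit."""
    if cuda_already_initialized():
        raise RuntimeError(
            "CUDA is already initialized in this process. Restart the kernel and "
            "remove torch.cuda.* calls, .cuda()/.to('cuda') moves, and GPU tensor "
            "allocations that run before Trainer.fit."
        )


if __name__ == "__main__":
    check_before_fit()
    print("CUDA not initialized yet; safe to create the Trainer.")
```

Calling `check_before_fit()` at the end of successive notebook cells, starting from a fresh kernel, should help narrow down which cell initializes CUDA.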