can we use gpu when run demo fin_model? #445

YZH0216 · 2024-10-21T07:25:17Z

When I run "rdagent fin_model", it trains a GRU on my CPU without problems. How can I make this demo use a GPU device such as "cuda:0"?
Part of the terminal output from this run is shown below:

[1:MainThread](2024-10-21 03:13:05,144) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:74] - GeneralPTNN pytorch version...
[1:MainThread](2024-10-21 03:13:05,157) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:92] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cpu
n_jobs : 20
use_GPU : False
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20, 'num_timesteps': 20}
[1:MainThread](2024-10-21 03:13:05,158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
EnhancedDeepGRUModel(
(gru): GRU(20, 256, num_layers=5, batch_first=True, dropout=0.4)
(fc): Linear(in_features=256, out_features=1, bias=True)
)

TPLin22 · 2024-10-21T07:52:31Z

Hi,

First, check that the base image chosen in your Dockerfile supports GPU functionality.
The Dockerfile can be found at rdagent/scenarios/qlib/docker.
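
One way to check this from inside the container is to ask PyTorch what it can see. The following is only a minimal sketch and is not part of RD-Agent or qlib: the helper name `describe_cuda` is hypothetical, and it relies only on `torch.cuda.is_available()`, `torch.cuda.device_count()` and `torch.version.cuda`.

```python
import os


def describe_cuda() -> dict:
    """Collect what PyTorch reports about CUDA in the current environment."""
    info = {
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "torch_cuda_version": None,
        # If set, this variable restricts which GPUs the process can see.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so report the defaults above.
        return info

    info["torch_installed"] = True
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    info["torch_cuda_version"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False or `device_count` is 0 inside the container, the base image or the way the container was started is the likely place to look.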

YZH0216 · 2024-10-21T12:35:26Z

I think my Dockerfile is correct; its contents are listed below.
```
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime

# For GPU support, please choose the proper tag from https://hub.docker.com/r/pytorch/pytorch/tags

RUN apt-get clean && apt-get update && apt-get install -y \
    curl \
    vim \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/microsoft/qlib.git

WORKDIR /workspace/qlib

RUN git reset c9ed050ef034fe6519c14b59f3d207abcb693282 --hard

RUN python -m pip install --upgrade cython -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN python -m pip install -e . -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

RUN pip install catboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install xgboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install scipy==1.11.4 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

I also successfully built a Docker image called "local_qlib". If I start it with "docker run --rm -ti --gpus all local_qlib /bin/bash", running "nvidia-smi" inside the container gives normal output:
```
(rdagent) youme@youme-System-Product-Name:~/Documents/PythonProjects/RD-Agent$ docker run --rm -ti --gpus all local_qlib /bin/bash
root@8fa2d3b4c6eb:/workspace/qlib# nvidia-smi
Mon Oct 21 12:30:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off | 00000000:0A:00.0  On |                  N/A |
| 44%   55C    P2             111W / 350W |   2724MiB / 12288MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@8fa2d3b4c6eb:/workspace/qlib# ^C
root@8fa2d3b4c6eb:/workspace/qlib# exit
```

However, when I run "rdagent fin_model", I get the following error:

[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
DeepGRUModel(
(gru): GRU(20, 128, num_layers=3, batch_first=True, dropout=0.2)
(fc): Linear(in_features=128, out_features=1, bias=True)
)
[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model size: 0.2440 MB
[1:MainThread](2024-10-21 12:20:21,520) INFO - qlib.timer - [log.py:127] - Time cost: 0.000s | waiting `async_log` Done
[1:MainThread](2024-10-21 12:20:21,522) ERROR - qlib.workflow - [utils.py:41] - An exception has been raised[RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
].
File "/opt/conda/bin/qrun", line 8, in <module>
sys.exit(run())
File "/workspace/qlib/qlib/workflow/cli.py", line 151, in run
fire.Fire(workflow)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/workspace/qlib/qlib/workflow/cli.py", line 145, in workflow
recorder = task_train(config.get("task"), experiment_name=experiment_name)
File "/workspace/qlib/qlib/model/trainer.py", line 127, in task_train
_exe_task(task_config)
File "/workspace/qlib/qlib/model/trainer.py", line 45, in _exe_task
model: Model = init_instance_by_config(task_config["model"], accept_types=Model)
File "/workspace/qlib/qlib/utils/mod.py", line 180, in init_instance_by_config
return klass(**cls_kwargs, **try_kwargs, **kwargs)
File "/workspace/qlib/qlib/contrib/model/pytorch_general_nn.py", line 140, in __init__
self.dnn_model.to(self.device)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 216, in _apply
ret = super()._apply(fn, recurse)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
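
For context, "invalid device ordinal" generally means that the requested CUDA index does not correspond to a device visible to the process, for example "cuda:1" on a single-GPU machine, or an index that `CUDA_VISIBLE_DEVICES` has hidden. Below is a minimal sketch of a guard that checks the index before the model is moved. It assumes the index comes from the model config; `pick_device` is a hypothetical helper, not a qlib or RD-Agent function.

```python
def pick_device(requested_index: int, visible_count: int) -> str:
    """Return a torch-style device string for the requested GPU index.

    requested_index: GPU index from the config (negative means "use CPU").
    visible_count:   number of CUDA devices the process can see,
                     e.g. the value of torch.cuda.device_count().
    """
    if requested_index < 0 or visible_count <= 0:
        # No GPU requested, or none visible: fall back to the CPU.
        return "cpu"
    if requested_index >= visible_count:
        # This is the situation that surfaces as "invalid device ordinal".
        raise ValueError(
            f"GPU index {requested_index} requested, but only "
            f"{visible_count} CUDA device(s) are visible"
        )
    return f"cuda:{requested_index}"


if __name__ == "__main__":
    try:
        import torch
        count = torch.cuda.device_count()
    except ImportError:
        count = 0  # PyTorch not installed; treat as no visible GPU.
    print(pick_device(0, count))
```

Raising early with the visible device count in the message makes it easier to tell a wrong index in the config apart from a container that was started without GPU access.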

YZH0216 · 2024-10-21T12:41:42Z

Besides, the Docker container seems to detect the GPU device correctly; the relevant log line is shown below.

2024-10-21 20:20:18.348 | INFO | rdagent.utils.env:_gpu_kwargs:269 - GPU Devices are available.

YZH0216 added the question Further information is requested label Oct 21, 2024
