can we use gpu when run demo fin_model? #445

Open
YZH0216 opened this issue Oct 21, 2024 · 3 comments
Labels
question Further information is requested

Comments

@YZH0216

YZH0216 commented Oct 21, 2024

When I run "rdagent fin_model", it trains a GRU on my CPU without problems. How can I use a GPU device such as "cuda:0" to run this demo?
Some of the terminal output from this run:

[1:MainThread](2024-10-21 03:13:05,144) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:74] - GeneralPTNN pytorch version...
[1:MainThread](2024-10-21 03:13:05,157) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:92] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cpu
n_jobs : 20
use_GPU : False
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20, 'num_timesteps': 20}
[1:MainThread](2024-10-21 03:13:05,158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
EnhancedDeepGRUModel(
(gru): GRU(20, 256, num_layers=5, batch_first=True, dropout=0.4)
(fc): Linear(in_features=256, out_features=1, bias=True)
)

@YZH0216 YZH0216 added the question Further information is requested label Oct 21, 2024
@TPLin22
Collaborator

TPLin22 commented Oct 21, 2024

Hi,

First, check that the base image in your Dockerfile supports GPUs.
The Dockerfile can be found at rdagent/scenarios/qlib/docker.
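
One way to check this from inside the container is to ask PyTorch what it can see. The sketch below is a minimal, hypothetical helper (`cuda_report` is not part of RD-Agent or qlib). It relies only on the standard `torch.cuda.is_available()` and `torch.cuda.device_count()` calls, and it returns a CPU-only report if PyTorch cannot be found.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarize what PyTorch can see in the current environment.

    Hypothetical helper for illustration; not part of RD-Agent or qlib.
    """
    report = {"torch_installed": False, "cuda_available": False, "device_count": 0}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is false inside a container started with `--gpus all`, the base image or the host driver setup is the likely place to look.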

@YZH0216
Author

YZH0216 commented Oct 21, 2024

I think I have the right Dockerfile; its contents are listed below.
`
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
# For GPU support, please choose the proper tag from https://hub.docker.com/r/pytorch/pytorch/tags

RUN apt-get clean && apt-get update && apt-get install -y \
    curl \
    vim \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/microsoft/qlib.git

WORKDIR /workspace/qlib

RUN git reset c9ed050ef034fe6519c14b59f3d207abcb693282 --hard

RUN python -m pip install --upgrade cython -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN python -m pip install -e . -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

RUN pip install catboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install xgboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install scipy==1.11.4 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
`

I also successfully built a Docker image called "local_qlib". If I start it with "docker run --rm -ti --gpus all local_qlib /bin/bash", running "nvidia-smi" inside the container gives normal output.
`
(rdagent) youme@youme-System-Product-Name:~/Documents/PythonProjects/RD-Agent$ docker run --rm -ti --gpus all local_qlib /bin/bash
root@8fa2d3b4c6eb:/workspace/qlib# nvidia-smi
Mon Oct 21 12:30:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 Ti Off | 00000000:0A:00.0 On | N/A |
| 44% 55C P2 111W / 350W | 2724MiB / 12288MiB | 16% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@8fa2d3b4c6eb:/workspace/qlib# ^C
root@8fa2d3b4c6eb:/workspace/qlib# exit
`

However, when I run "rdagent fin_model", I get the error below.

[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
DeepGRUModel(
(gru): GRU(20, 128, num_layers=3, batch_first=True, dropout=0.2)
(fc): Linear(in_features=128, out_features=1, bias=True)
)
[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model size: 0.2440 MB
[1:MainThread](2024-10-21 12:20:21,520) INFO - qlib.timer - [log.py:127] - Time cost: 0.000s | waiting `async_log` Done
[1:MainThread](2024-10-21 12:20:21,522) ERROR - qlib.workflow - [utils.py:41] - An exception has been raised[RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
].
  File "/opt/conda/bin/qrun", line 8, in <module>
    sys.exit(run())
  File "/workspace/qlib/qlib/workflow/cli.py", line 151, in run
    fire.Fire(workflow)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/qlib/qlib/workflow/cli.py", line 145, in workflow
    recorder = task_train(config.get("task"), experiment_name=experiment_name)
  File "/workspace/qlib/qlib/model/trainer.py", line 127, in task_train
    _exe_task(task_config)
  File "/workspace/qlib/qlib/model/trainer.py", line 45, in _exe_task
    model: Model = init_instance_by_config(task_config["model"], accept_types=Model)
  File "/workspace/qlib/qlib/utils/mod.py", line 180, in init_instance_by_config
    return klass(**cls_kwargs, **try_kwargs, **kwargs)
  File "/workspace/qlib/qlib/contrib/model/pytorch_general_nn.py", line 140, in __init__
    self.dnn_model.to(self.device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 216, in _apply
    ret = super()._apply(fn, recurse)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
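
For context, CUDA reports `invalid device ordinal` when a process asks for a device index that is not among the devices visible to it, for example `cuda:1` when only one GPU is exposed. The sketch below is a hypothetical guard that checks a requested index against the visible device count before building the device string. The `pick_device` name and its fall-back-to-CPU policy are assumptions for illustration, not qlib's or RD-Agent's actual code.

```python
def pick_device(requested_index: int, visible_count: int) -> str:
    """Return a device string that is safe to pass to torch.device().

    Hypothetical helper: a negative index or an environment with no
    visible GPUs falls back to "cpu" (an assumed policy); an index beyond
    the visible range raises early with a readable message instead of
    failing later inside CUDA.
    """
    if requested_index < 0 or visible_count <= 0:
        return "cpu"
    if requested_index >= visible_count:
        raise ValueError(
            f"requested cuda:{requested_index}, but only "
            f"{visible_count} device(s) are visible (valid: 0..{visible_count - 1})"
        )
    return f"cuda:{requested_index}"
```

In practice `visible_count` would come from `torch.cuda.device_count()`, which reflects any `CUDA_VISIBLE_DEVICES` restriction, so comparing it with the index in the model configuration shows whether the two disagree.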

@YZH0216
Author

YZH0216 commented Oct 21, 2024

Also, the Docker container seems to detect the GPU correctly; the relevant log line is below.

2024-10-21 20:20:18.348 | INFO | rdagent.utils.env:_gpu_kwargs:269 - GPU Devices are available.
