ARM (and YARR) conflicts with current RLBench (1.2.0) #8

Open
alexanderdurr opened this issue May 27, 2022 · 9 comments

alexanderdurr commented May 27, 2022

Hi,
Can you tell me which RLBench and YARR versions/tags are compatible with each other?
I suspect that PyTorch is behind most of the problems below, but none of the requirements.txt files states which version you use.

I observe this error:

Process train_env0:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/_env_runner.py", line 169, in _run_env
    raise e
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/_env_runner.py", line 143, in _run_env
    for replay_transition in generator:
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/utils/rollout_generator.py", line 35, in generator
    transition = env.step(act_result)
  File "/home/user/ARM/arm/custom_rlbench_env.py", line 128, in step
    obs, reward, terminal = self._task.step(action)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/rlbench/task_environment.py", line 99, in step
    self._action_mode.action(self._scene, action)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/rlbench/action_modes/action_mode.py", line 32, in action
    arm_action = np.array(action[:arm_act_size])
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/_tensor.py", line 732, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
[2022-05-27 10:10:31,983][root][ERROR] - Env train_env0 failed too many times (11 times > 10)
Exception in thread EnvRunnerThread:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 134, in _run
    raise RuntimeError('Too many process failures.')
RuntimeError: Too many process failures.

I pulled the current versions of RLBench and YARR and reinstalled all packages in a new conda environment.

I wonder whether you use a different torch version that handles the tensor-to-NumPy conversion automatically.
For now I worked around the error by adding .cpu() in a few files:

YARR

 git diff main

diff --git a/yarr/envs/rlbench_env.py b/yarr/envs/rlbench_env.py
index 6aad118..6460fb1 100644
--- a/yarr/envs/rlbench_env.py
+++ b/yarr/envs/rlbench_env.py
@@ -6,7 +6,7 @@ try:
 except (ModuleNotFoundError, ImportError) as e:
     print("You need to install RLBench: 'https://github.com/stepjam/RLBench'")
     raise e
-from rlbench.action_modes import ActionMode
+from rlbench.action_modes.action_mode import ActionMode
 from rlbench.backend.observation import Observation
 from rlbench.backend.task import Task
 
diff --git a/yarr/utils/rollout_generator.py b/yarr/utils/rollout_generator.py
index d4d2973..a3f12ee 100644
--- a/yarr/utils/rollout_generator.py
+++ b/yarr/utils/rollout_generator.py
@@ -27,7 +27,7 @@ class RolloutGenerator(object):
                                    deterministic=eval)
 
             # Convert to np if not already
-            agent_obs_elems = {k: np.array(v) for k, v in
+            agent_obs_elems = {k: np.array(v.cpu()) for k, v in
                                act_result.observation_elements.items()}
             extra_replay_elements = {k: np.array(v) for k, v in
                                      act_result.replay_elements.items()}
@@ -66,7 +66,7 @@ class RolloutGenerator(object):
                     prepped_data = {k: torch.tensor([v], device=self._env_device) for k, v in obs_history.items()}
                     act_result = agent.act(step_signal.value, prepped_data,
                                            deterministic=eval)
-                    agent_obs_elems_tp1 = {k: np.array(v) for k, v in
+                    agent_obs_elems_tp1 = {k: np.array(v.cpu()) for k, v in
                                            act_result.observation_elements.items()}
                     obs_tp1.update(agent_obs_elems_tp1)
                 replay_transition.final_observation = obs_tp1

(Side note: because of the recent changes to the folder structure in RLBench, I also changed the import for ActionMode.)

RLBench

git diff master

diff --git a/rlbench/action_modes/action_mode.py b/rlbench/action_modes/action_mode.py
index 68171a37..a2c264ef 100644
--- a/rlbench/action_modes/action_mode.py
+++ b/rlbench/action_modes/action_mode.py
@@ -29,8 +29,8 @@ class MoveArmThenGripper(ActionMode):
 
     def action(self, scene: Scene, action: np.ndarray):
         arm_act_size = np.prod(self.arm_action_mode.action_shape(scene))
-        arm_action = np.array(action[:arm_act_size])
-        ee_action = np.array(action[arm_act_size:])
+        arm_action = np.array(action[:arm_act_size].cpu())
+        ee_action = np.array(action[arm_act_size:].cpu())
         self.arm_action_mode.action(scene, arm_action)
         self.gripper_action_mode.action(scene, ee_action)
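
Both patches above call .cpu() unconditionally, which would fail for any value that is already a NumPy array or a plain Python number, since those have no .cpu() method. A more tolerant conversion could look like the sketch below. The helper names to_numpy and dict_to_numpy are hypothetical and not part of YARR or RLBench, and tensors are detected by duck typing (an assumption made here so the snippet runs without importing torch).

```python
import numpy as np


def to_numpy(value):
    """Return `value` as a NumPy array, copying tensor-like inputs to host memory first.

    Tensor-like objects are detected by duck typing: they expose `detach` and
    `cpu`, as torch.Tensor does. Everything else goes straight to np.asarray.
    """
    if hasattr(value, "detach") and hasattr(value, "cpu"):
        # For a torch.Tensor: drop the autograd graph, then copy to the CPU.
        value = value.detach().cpu()
    return np.asarray(value)


def dict_to_numpy(elements):
    """Convert every value of a dict, e.g. act_result.observation_elements."""
    return {key: to_numpy(val) for key, val in elements.items()}
```

With such a helper, the dict comprehensions in rollout_generator.py could call to_numpy(v) instead of np.array(v.cpu()), so the same line would work whether the agent returns tensors or arrays.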

I believe, though, that the error comes from a change somewhere else, or that you use a torch version that can deal with this. Can you please help me? I don't know which PyTorch version you are using; it is missing from the requirements.txt. I installed PyTorch with conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
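
Since the question is about version compatibility, it may help to attach the exact installed versions to the report. The following is a minimal sketch using only the standard library. The distribution names passed in ("torch", "numpy", "rlbench", "yarr") are assumptions about how the packages are registered in the environment, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust them if pip lists the packages differently.
    names = ["torch", "numpy", "rlbench", "yarr"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```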

One error that I am unable to fix is this one:

Exception in thread EnvRunnerThread:
Traceback (most recent call last):
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 141, in _run
    new_transitions = self._update()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 86, in _update
    self._agent_summaries = list(
  File "<string>", line 2, in __getitem__
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/managers.py", line 825, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
    send(msg)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 249, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/storage.py", line 623, in _share_cuda_
    return self._storage._share_cuda_(*args, **kwargs)
RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending.

---------------------------------------------------------------------------
[W CudaIPCTypes.cpp:92] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch

Do you have any advice? It seems to me that PyTorch is behind most of the problems I mentioned.

using: Python 3.9.12

@diegomaureira

Hi Alexander! I have the same problem. Were you able to fix it? I also get a "CUDA out of memory" error. Does anyone know what the hardware requirements are?

@jianingq

I ran into the "RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending." error as well. I think it occurs because some of the items in the summaries are CUDA tensors. I solved it by converting them to NumPy arrays.

@weixiang-smart

@jianingq Hi, jianingq. I have run into this problem too. I would be grateful if you could provide details of your solution.

@jianingq

I think the error is raised while new transitions are being collected, so double-check the elements in the summaries and make sure they are on the CPU rather than the GPU. For example, I changed ARM/arm/custom_rlbench_env.py line 123 from action = act_result.action to action = act_result.action.cpu().numpy()
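
One way to double-check which elements are still on the GPU is to walk the object before it is sent to another process and list every tensor whose device is not the CPU. The sketch below is only an illustration: find_device_tensors is a hypothetical helper, not part of YARR or ARM, and it identifies tensors by duck typing on a device.type attribute (as torch.Tensor provides) so that it runs without torch installed.

```python
def find_device_tensors(obj, path="root"):
    """Return the paths of tensor-like leaves that are not in host (CPU) memory.

    A leaf counts as tensor-like if it has a `device` attribute with a `type`
    field, as torch.Tensor does. Containers and plain objects are searched
    recursively; cyclic structures are not handled in this sketch.
    """
    device_type = getattr(getattr(obj, "device", None), "type", None)
    if device_type is not None:
        return [path] if device_type != "cpu" else []

    found = []
    if isinstance(obj, dict):
        for key, val in obj.items():
            found += find_device_tensors(val, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, val in enumerate(obj):
            found += find_device_tensors(val, f"{path}[{index}]")
    else:
        # Plain objects such as a result container: look through their attributes.
        for name, val in getattr(obj, "__dict__", {}).items():
            found += find_device_tensors(val, f"{path}.{name}")
    return found
```

Calling it on an act_result or on the list of summaries just before they are returned would print paths such as root.action for anything that still needs a .cpu() call.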

@weixiang-smart

Thank you so much! It helped me solve the problem.

@kevin-xuan

@weixiang-smart @jianingq, I changed action = act_result.action to action = act_result.action.cpu().numpy(), but I still get "RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending." when the eval_envs process runs to collect new transitions. Does any other code need to be modified? Or is something wrong with my PyTorch version (pytorch=1.13.1)? I would really appreciate any help.

@kevin-xuan

kevin-xuan commented Jan 3, 2023

The error happens while new transitions are being collected, so it is not related to the training process. Taking BCAgent as an example, I checked the code again and found that the error seems to be related to this line. Instead of this solution, I changed that line to return ActResult(mu[0].cpu().detach().numpy()). If you want to run the QAttentionAgent of the C2FARM method, you could try modifying that line.

@yananliusdu

I ran into this kind of issue too, even with '.cpu().numpy()'. Might there be somewhere else that I forgot to change?

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

@kevin-xuan

@yananliusdu, you can try the method above.
