I am trying to measure how training time scales with training dataset size in the SAM repository. I set the train_queries variable in sam_multi/experiments.py to 1000 and started a training run, which failed with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::NeuroCard.train() (pid=81614, ip=172.17.0.5)
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trainable.py", line 332, in train
    result = self.step()
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trainable.py", line 636, in step
    result = self._train()
  File "run_uae.py", line 1264, in _train
    q_weight=self.q_weight if self.semi_train else 0
  File "run_uae.py", line 542, in run_epoch_query_only
    all_loss.backward(retain_graph=True)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'MmBackward' returned nan values in its 0th output.
In the job-light-ranges-mscn-workload configuration within sam_multi/experiments.py, are there any additional parameters or settings that need to be adjusted to properly test the relationship between training dataset size and training time?
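To make the intended measurement concrete, here is a minimal sketch of timing one training run per dataset size. The names `time_training` and `fake_train` are hypothetical and not part of the SAM repository; `fake_train` is only a stand-in for whatever launches a run with a given train_queries value.

```python
import time
from typing import Callable, Dict, Iterable


def time_training(
    train_fn: Callable[[int], None],
    sizes: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the best wall-clock time in seconds for each dataset size."""
    results: Dict[int, float] = {}
    for n in sizes:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            train_fn(n)  # one full training run on n queries
            runs.append(time.perf_counter() - start)
        # The minimum over repeats is the least affected by background noise.
        results[n] = min(runs)
    return results


def fake_train(num_queries: int) -> None:
    """Hypothetical stand-in: work grows linearly with num_queries."""
    total = 0.0
    for i in range(num_queries * 10):
        total += i * 0.5


if __name__ == "__main__":
    timings = time_training(fake_train, [1000, 5000, 10000])
    for n, seconds in timings.items():
        print(f"train_queries={n}: {seconds:.4f} s")
```

Using the same number of epochs and the same seed for every size would keep the runs comparable, so that only the number of training queries changes between them.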
I appreciate your time and assistance. Looking forward to your guidance on resolving this issue. Thank you!