Testing Training Dataset Size vs Training Time Relationship #6

xuhuahuang813 · 2024-01-09T11:03:13Z

I am trying to measure how training time scales with training dataset size in the SAM repository. I set the train_queries variable in sam_multi/experiments.py to 1000 and ran the following command:

python run_uae.py --run job-light-ranges-mscn-workload

However, I encountered the following error:

Traceback (most recent call last):
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::NeuroCard.train() (pid=81614, ip=172.17.0.5)
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trainable.py", line 332, in train
    result = self.step()
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trainable.py", line 636, in step
    result = self._train()
  File "run_uae.py", line 1264, in _train
    q_weight=self.q_weight if self.semi_train else 0
  File "run_uae.py", line 542, in run_epoch_query_only
    all_loss.backward(retain_graph=True)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'MmBackward' returned nan values in its 0th output.

In the job-light-ranges-mscn-workload configuration within sam_multi/experiments.py, are there any additional parameters or settings that need to be adjusted to properly test the relationship between training dataset size and training time?
I appreciate your time and assistance. Looking forward to your guidance on resolving this issue. Thank you!
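
For reference, the kind of measurement described above could be sketched as follows. This is only a minimal illustration, not code from the SAM repository: `time_training`, `dummy_train`, and the list of sizes are hypothetical names introduced here, and `dummy_train` is a placeholder for whatever actually launches a training run with a given number of training queries.

```python
import time
from typing import Callable, Dict, Iterable


def time_training(train_fn: Callable[[int], None],
                  sizes: Iterable[int]) -> Dict[int, float]:
    """Return wall-clock seconds spent in train_fn for each dataset size."""
    timings = {}
    for n in sizes:
        start = time.perf_counter()
        train_fn(n)  # one full training run with n training queries
        timings[n] = time.perf_counter() - start
    return timings


def dummy_train(num_queries: int) -> None:
    # Hypothetical stand-in for a real training run; it only burns CPU time
    # roughly proportional to num_queries so the harness has work to time.
    total = 0
    for i in range(num_queries * 100):
        total += i * i


if __name__ == "__main__":
    for n, secs in time_training(dummy_train, [1000, 5000, 10000]).items():
        print(f"train_queries={n}: {secs:.4f}s")
```

Timing each size in a separate, complete run keeps the measurements independent, but a run that aborts with an error such as the NaN above would have to be excluded rather than counted as a valid timing.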
