Running the example command provided in the readme here.
python3 submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=my_first_experiment \
    --submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py \
    --tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json
(after switching ...adamw/jax/submission.py to ...adamw/pytorch/submission.py)
Fails at torch.compile
To reproduce
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN git clone https://github.com/mlcommons/algorithmic-efficiency/ && cd algorithmic-efficiency/ && git checkout 5b4914ff18f2bb28a01c5669285b6a001ea84111
RUN cd algorithmic-efficiency/ && python3 -m pip install -e '.[jax_cpu]' && python3 -m pip install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/cu121' && python3 -m pip install -e '.[full]'
docker run --rm --net host --ipc host --gpus all -v /home/lucas/algorithmic-efficiency:/opt/project -it 11ad40ed5330 bash -c 'export PYTHONPATH="/opt/project/:$PYTHONPATH" ; cd /opt/project/ ; python3 submission_runner.py --framework=pytorch --workload=mnist --experiment_dir=$HOME/experiments --experiment_name=my_first_experiment --submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py --tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json'
=============
== PyTorch ==
=============
NVIDIA Release 24.03 (build 85286408)
PyTorch Version 2.3.0a0+40ec155e58
Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.4 driver version 550.54.14 with kernel driver version 535.104.12.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
ERROR:root:Unable to import wandb.
Traceback (most recent call last):
  File "/opt/project/algorithmic_efficiency/logger_utils.py", line 26, in <module>
    import wandb # pylint: disable=g-import-not-at-top
ModuleNotFoundError: No module named 'wandb'
I0405 11:10:55.978553 140153789482816 logger_utils.py:76] Creating experiment directory at /root/experiments/my_first_experiment/mnist_pytorch.
I0405 11:10:56.270130 140153789482816 submission_runner.py:561] Using RNG seed 2489964499
I0405 11:10:56.270576 140153789482816 submission_runner.py:570] --- Tuning run 1/1 ---
I0405 11:10:56.270626 140153789482816 submission_runner.py:575] Creating tuning directory at /root/experiments/my_first_experiment/mnist_pytorch/trial_1.
I0405 11:10:56.270741 140153789482816 logger_utils.py:92] Saving hparams to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/hparams.json.
I0405 11:10:56.423277 140153789482816 submission_runner.py:215] Initializing dataset.
I0405 11:10:56.423400 140153789482816 submission_runner.py:226] Initializing model.
I0405 11:10:56.609794 140153789482816 submission_runner.py:264] Performing `torch.compile`.
I0405 11:10:57.714589 140153789482816 submission_runner.py:268] Initializing optimizer.
I0405 11:10:57.715128 140153789482816 submission_runner.py:275] Initializing metrics bundle.
I0405 11:10:57.715188 140153789482816 submission_runner.py:293] Initializing checkpoint and logger.
I0405 11:10:57.715620 140153789482816 submission_runner.py:313] Saving meta data to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/meta_data_0.json.
fatal: detected dubious ownership in repository at '/opt/project'
To add an exception for this directory, call:
    git config --global --add safe.directory /opt/project
I0405 11:10:57.950806 140153789482816 logger_utils.py:220] Unable to record git information. Continuing without it.
I0405 11:10:58.229494 140153789482816 submission_runner.py:317] Saving flags to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/flags_0.json.
I0405 11:10:58.273115 140153789482816 submission_runner.py:327] Starting training loop.
I0405 11:10:58.482898 140153789482816 dataset_info.py:736] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
I0405 11:10:58.719100 140153789482816 dataset_info.py:578] Load dataset info from /tmp/tmpco1rexddtfds
I0405 11:10:58.723723 140153789482816 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I0405 11:10:58.724064 140153789482816 dataset_builder.py:593] Generating dataset mnist (/root/data/mnist/3.0.1)
Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/data/mnist/3.0.1...
I0405 11:10:58.867788 140153789482816 dataset_builder.py:640] Dataset mnist is hosted on GCS. It will automatically be downloaded to your local data directory. If you'd instead prefer to read directly from our public GCS bucket (recommended if you're running on GCP), you can instead pass `try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.
Dl Completed...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20.77 file/s]
I0405 11:10:59.174530 140153789482816 dataset_info.py:578] Load dataset info from /root/data/mnist/3.0.1.incompleteX9SJH5
I0405 11:10:59.176377 140153789482816 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name, file_format] from disk and from code do not match. Keeping the one from code.
Dataset mnist downloaded and prepared to /root/data/mnist/3.0.1. Subsequent calls will reuse this data.
I0405 11:10:59.241586 140153789482816 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[:50000], from /root/data/mnist/3.0.1
Traceback (most recent call last):
  File "/opt/project/submission_runner.py", line 712, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/opt/project/submission_runner.py", line 680, in main
    score = score_submission_on_workload(
  File "/opt/project/submission_runner.py", line 585, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "/opt/project/submission_runner.py", line 349, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/opt/project/reference_algorithms/paper_baselines/adamw/pytorch/submission.py", line 74, in update_params
    logits_batch, new_model_state = workload.model_fn(params=current_model,
  File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 170, in model_fn
    logits_batch = model(augmented_and_preprocessed_input_batch['inputs'])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 1155, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars.items)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/nn_module.py", line 302, in call_function
    return wrap_fx_proxy(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1187, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1274, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1376, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1337, in get_fake_value
    return wrap_fake_exception(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1338, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1410, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1402, in run_node
    return nnmodule(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 110, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
    raise exception
torch._dynamo.exc.TorchRuntimeError: Failed running call_module fn(*(FakeTensor(..., device='cuda:0', size=(16, 1, 28, 28)),), **{}):
Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 43, in forward
    return self.net(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1113, in __torch_dispatch__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 448, in __call__
    return self._op(*args, **kwargs or {})
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1450, in dispatch
    return decomposition_table[func](*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 229, in _fn
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 70, in inner
    r = f(*tree_map(increase_prec, args), **tree_map(increase_prec, kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 1229, in addmm
    out = alpha * torch.mm(mat1, mat2)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1540, in dispatch
    with in_kernel_invocation_manager(self):
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 914, in in_kernel_invocation_manager
    assert meta_in_tls == prev_in_kernel, f"{meta_in_tls}, {prev_in_kernel}"
AssertionError: False, True
from user code:
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
Using --notorch_compile works as expected.
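As a possible stopgap besides --notorch_compile, the "fall back to eager" idea from the error message can be sketched in plain Python. In the log above, the `torch.compile` step itself passes and the error only surfaces on the first forward call, so the fallback has to wrap the call, not the compile step. `EagerFallback` and the two stand-in functions below are hypothetical names for illustration; this is not how submission_runner.py handles compilation, and it has not been tested against this repository.

```python
from typing import Any, Callable


class EagerFallback:
    """Try the compiled callable; after its first failure, use the eager one."""

    def __init__(self, compiled: Callable[..., Any], eager: Callable[..., Any]):
        self._compiled = compiled
        self._eager = eager
        self.using_eager = False  # flips to True after the first failure

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if not self.using_eager:
            try:
                return self._compiled(*args, **kwargs)
            except Exception as err:
                # Compilation is lazy, so tracing errors show up here.
                print(f"compiled call failed ({type(err).__name__}); using eager")
                self.using_eager = True
        return self._eager(*args, **kwargs)


# Stand-ins for a compiled model that fails while tracing and its eager twin.
def broken_compiled(x):
    raise AssertionError("False, True")


def eager(x):
    return x * 2


forward = EagerFallback(broken_compiled, eager)
print(forward(3))
```

With PyTorch this would be used roughly as `EagerFallback(torch.compile(model), model)`. Swallowing every exception is a blunt choice that can hide unrelated bugs, so it is only meant for getting a first run through while the underlying failure is investigated.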