Raise TritonModelException if the Triton model has an error #333

Merged

@oliverholworthy oliverholworthy commented Apr 18, 2023

Raises the error from PredictPyTorchTriton and in the executor model, so that any error in the PyTorch model is surfaced.

Currently, you might see a cryptic error about missing keys in a response, while the actual error that occurred in one of the backends (e.g. PyTorch) is not reported or printed anywhere.
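As a rough illustration of the idea, the sketch below checks a response for a backend error before reading any output columns from it. It is not the code from this PR: FakeTritonError, FakeInferenceResponse and outputs_or_raise are hypothetical stand-ins written for this example, assuming only that a response exposes has_error() and error().message(), as the "After" traceback suggests.

```python
class FakeTritonError:
    """Hypothetical stand-in for the error object attached to a response."""

    def __init__(self, message):
        self._message = message

    def message(self):
        return self._message


class FakeInferenceResponse:
    """Hypothetical stand-in for a backend inference response."""

    def __init__(self, outputs=None, error=None):
        self.outputs = outputs or {}
        self._error = error

    def has_error(self):
        return self._error is not None

    def error(self):
        return self._error


def outputs_or_raise(response, output_names):
    """Return the requested outputs, surfacing any backend error first."""
    # Check for a backend failure before looking up columns. A failed
    # response carries no outputs, so a lookup would otherwise fail with
    # an unrelated "column not found" error that hides the real cause.
    if response.has_error():
        raise RuntimeError(str(response.error().message()))

    missing = [name for name in output_names if name not in response.outputs]
    if missing:
        raise ValueError(f"Columns {missing} not found in response")

    return {name: response.outputs[name] for name in output_names}
```

With this ordering, a failed response raises the backend's own message (for example the TorchScript KeyError) instead of a ValueError about a missing output column.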

Example

Using the test added in this PR as an example: tests/unit/systems/ops/torch/test_ensemble.py::test_model_error

Before

Failed to transform operator <merlin.systems.dag.runtimes.triton.ops.pytorch.PredictPyTorchTriton object at 0x7f017f6129d0>
Traceback (most recent call last):
  File "/workspace/merlin/systems/merlin/systems/triton/conversions.py", line 160, in triton_response_to_tensor_table
    values = _array_from_triton_tensor(response, f"{out_col_name}__values")
  File "/workspace/merlin/systems/merlin/systems/triton/conversions.py", line 197, in _array_from_triton_tensor
    raise ValueError(f"Column {name} not found in {type(triton_obj)}")
ValueError: Column output__values not found in <class 'c_python_backend_utils.InferenceResponse'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/merlin/dag/executors.py", line 175, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/workspace/merlin/systems/merlin/systems/dag/runtimes/triton/ops/pytorch.py", line 89, in transform
    return triton_response_to_tensor_table(
  File "/workspace/merlin/systems/merlin/systems/triton/conversions.py", line 164, in triton_response_to_tensor_table
    outputs_dict[out_col_name] = _array_from_triton_tensor(response, out_col_name)
  File "/workspace/merlin/systems/merlin/systems/triton/conversions.py", line 197, in _array_from_triton_tensor
    raise ValueError(f"Column {name} not found in {type(triton_obj)}")
ValueError: Column output not found in <class 'c_python_backend_utils.InferenceResponse'>

After

Failed to transform operator <merlin.systems.dag.runtimes.triton.ops.pytorch.PredictPyTorchTriton object at 0x7fa4860699d0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/merlin/dag/executors.py", line 175, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/workspace/merlin/systems/merlin/systems/dag/runtimes/triton/ops/pytorch.py", line 87, in transform
    raise RuntimeError(str(inference_response.error().message()))
RuntimeError: PyTorch execute failure: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/tests/unit/systems/ops/torch/test_ensemble.py", line 8, in forward
  def forward(self: __torch__.tests.unit.systems.ops.torch.test_ensemble.MyModel,
    x: Dict[str, Tensor]) -> Tensor:
    _0 = torch.sum(torch.stack([x["a"], x["b"]]))
                                        ~~~~~~ <--- HERE
    return _0

Traceback of TorchScript, original code (most recent call last):
/usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(967): trace_module
/usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(750): trace
/workspace/merlin/systems/tests/unit/systems/ops/torch/test_ensemble.py(75): test_model_error
/usr/local/lib/python3.8/dist-packages/_pytest/python.py(192): pytest_pyfunc_call
/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py(39): _multicall
/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py(80): _hookexec
/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py(265): __call__
/usr/local/lib/python3.8/dist-packages/_pytest/python.py(1761): runtest
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(166): pytest_runtest_call
/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py(39): _multicall
/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py(80): _hookexec
/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py(265): __call__
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(259): <lambda>
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(338): from_call
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(258): call_runtest_hook
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(219): call_and_report
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(130): runtestprotocol
/usr/local/lib/python3.8/dist-packages/_pytest/runner.py(111): pytest_runtest_protocol
/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py(39): _multicall
/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py(80): _hookexec
/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py(265): __call__
/usr/local/lib/python3.8/dist-packages/_pytest/main.py(347): pytest_runtestloop
/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py(39): _multicall
/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py(80): _hookexec
/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py(265): __call__
/usr/local/lib/python3.8/dist-packages/_pytest/main.py(322): _main
/usr/local/lib/python3.8/dist-packages/_pytest/main.py(268): wrap_session
/usr/local/lib/python3.8/dist-packages/_pytest/main.py(315): pytest_cmdline_main
/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py(39): _multicall
/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py(80): _hookexec
/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py(265): __call__
/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py(164): main
/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py(187): console_main
/usr/local/lib/python3.8/dist-packages/pytest/__main__.py(5): <module>
/usr/lib/python3.8/runpy.py(87): _run_code
/usr/lib/python3.8/runpy.py(87): _run_code
/usr/lib/python3.8/runpy.py(194): _run_module_as_main
RuntimeError: KeyError: b

@oliverholworthy oliverholworthy added the chore Maintenance for the repository label Apr 18, 2023
@oliverholworthy oliverholworthy self-assigned this Apr 18, 2023
@github-actions
Documentation preview

https://nvidia-merlin.github.io/systems/review/pr-333

@oliverholworthy oliverholworthy marked this pull request as ready for review April 19, 2023 14:09
@karlhigley karlhigley merged commit e94d2a9 into NVIDIA-Merlin:main Apr 21, 2023