
1.12.1 incompatible with c++ built for 1.12.0 and vice versa #88301

Open
charliebudd opened this issue Nov 2, 2022 · 9 comments
Labels
module: cpp (Related to C++ API), oncall: jit (Add this issue/PR to JIT oncall triage queue)

Comments

@charliebudd

charliebudd commented Nov 2, 2022

🐛 Describe the bug

I have a C++/CUDA extension which I compile and distribute via PyPI. My build matrix covers all "recent" PyTorch major and minor versions (as well as Python and CUDA versions; if anybody is interested, I host my build system as a GitHub workflow here). So far this has been enough to ensure my code runs, and I have not had to take the patch version into account. However, I have recently hit an issue (described below) with 1.12.1. My understanding is that only critical bug fixes should be released in patches. If this is not the case, then distributing built C++ PyTorch extensions will be even harder than it currently is.

The error occurs when passing a JIT model down to C++. Before 1.12.1, the _c attribute on a JIT-scripted model returned a torch.jit.ScriptModule, which would cast to a torch::jit::Module when passed to a C++ function. However, in 1.12.1, the _c attribute returns a torch.ScriptModule, which no longer casts in the same way, causing an exception to be thrown.

Minimal code example below. To reproduce, compile the C++ for torch 1.12.0 and run the Python script with torch 1.12.1, or vice versa. When the versions match the code runs fine; otherwise an error is raised.

```python
import torch
import myextension

model = torch.jit.load("model.pt")
myextension.test(model._c)
```
```cpp
#include <torch/extension.h>

void test_function(torch::jit::Module model)
{
    py::print("Test function called");
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
    m.def("test", &test_function);
}
```

See forum discussion on this issue here.

Versions

torch 1.12.0 and 1.12.1

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @jbschlosser

@malfet malfet added triage review module: cpp Related to C++ API oncall: jit Add this issue/PR to JIT oncall triage queue labels Nov 2, 2022
@malfet
Contributor

malfet commented Nov 2, 2022

In general, there is no guarantee of binary compatibility even between patch versions (adding a new virtual function to a base class will offset the vptr of all inherited classes).
Also, even between patch revisions one is allowed to change internal APIs (and _c clearly indicates that this attribute is internal).

But the change as you are describing it is a bit unexpected.

@malfet
Contributor

malfet commented Nov 2, 2022

Hmm, is this OS specific, as I see torch.ScriptModule in both 1.12.0 and 1.12.1 on my MacOS:

```shell
% conda run -n torch-1.12.0 python -c "import torch; m=torch.nn.Linear(2, 3); n=torch.jit.script(m); print(torch.__version__, type(n._c))"
1.12.0 <class 'torch.ScriptModule'>

% conda run -n torch-1.12.1 python -c "import torch; m=torch.nn.Linear(2, 3); n=torch.jit.script(m); print(torch.__version__, type(n._c))"
1.12.1 <class 'torch.ScriptModule'>
```

@charliebudd
Author

charliebudd commented Nov 2, 2022

> In general, there is no guarantee of binary compatibility even between patch versions (adding a new virtual function to a base class will offset the vptr of all inherited classes).

The interface between python and the compiled c++ extension module is ultimately an interpreted interface, if I'm not mistaken.

> Also, even between patch revisions one is allowed to change internal APIs (and _c clearly indicates that this attribute is internal)

The use of _c is a bit of a hack to have a way of passing a model down to C++ in the first place. There should really be a way of achieving this with the public API. There is a forum post bemoaning this issue which is reasonably well trafficked (considering its obscure nature).
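For what it's worth, one way to sidestep _c entirely (a hedged sketch; the bytes-based myextension.test call is hypothetical, only torch.jit.save and torch::jit::load are real public APIs) is to serialize the scripted module in Python and rebuild it on the C++ side:

```python
# Serialize a scripted module with the public torch.jit.save API; the
# resulting bytes can be handed to an extension, which can rebuild the
# module with torch::jit::load from an std::istringstream.
import io

import torch

model = torch.jit.script(torch.nn.Linear(2, 3))

buffer = io.BytesIO()
torch.jit.save(model, buffer)  # public serialization API
data = buffer.getvalue()       # raw bytes to hand to the extension

# myextension.test(data)  # hypothetical: the C++ side would call
#                         # torch::jit::load on a stream over `data`
```

The serialized format, rather than the in-memory object layout, then becomes the interface between the two builds.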

> Hmm, is this OS specific, as I see torch.ScriptModule in both 1.12.0 and 1.12.1 on my MacOS:

Yes, this does appear to be specific to Linux. I just tried on Windows with the same result as you found on MacOS. But on Linux your test command gives <class 'torch._C.ScriptModule'> for pre-1.12.1 and <class 'torch.ScriptModule'> for 1.12.1.

@zou3519
Contributor

zou3519 commented Nov 2, 2022

Possibly related #80489?

@charliebudd
Author

charliebudd commented Nov 2, 2022

> But for Linux your test command gives <class 'torch._C.ScriptModule'> for pre 1.12.1 and <class 'torch.ScriptModule'> for 1.12.1.

Sorry, this is incorrect: I get <class 'torch._C.ScriptModule'> for pre-1.12 and <class 'torch.ScriptModule'> for 1.12.0 AND 1.12.1. I still get the error as described in the original post, but your command is not showing any difference. I am currently unable to reproduce the torch.jit.ScriptModule type from _c that I originally reported.

@malfet
Contributor

malfet commented Nov 2, 2022

> Possibly related #80489?

I thought about that, but C++ ABI changes should not result in different namespaces.

@malfet
Contributor

malfet commented Nov 2, 2022

@charliebudd not sure what original post you are referring to, but can you please post somewhere what issue you are seeing.

In the 1.12 builds we migrated to a newer compiler, which resulted in binary incompatibility between the CUDA and CPU wheels, which I fixed by explicitly specifying the C++ ABI version. (One can check the ABI version using the torch._C._PYBIND11_BUILD_ABI property.)

@charliebudd
Author

Sorry, the conditions appear harder to replicate than I thought; I'll try and put together a reproducible example.

@charliebudd
Author

Okay, I've made a gist demonstrating the issue:
https://gist.github.com/charliebudd/fef898d9bf6505ea455a3beed29880d0

Just create a virtual environment and run the run.sh script to see the error. It takes some time as it has to install different torch versions a few times.

I struggled to recreate the issue, as it seems to depend on exactly which wheel you install. As my build system needs torch versions with specific backends, I'm installing torch in the following way:

```shell
pip install -f https://download.pytorch.org/whl/torch_stable.html torch==1.12.1+cu113
```

If you simply pip install torch==1.12.1, it seems to work fine.
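For a build matrix like the one described above, the wheel variant can be tracked by parsing the PEP 440 local version tag (a stdlib-only sketch; parse_torch_version is a hypothetical helper, not part of torch):

```python
import re

def parse_torch_version(version: str):
    """Split a PEP 440-style version like '1.12.1+cu113' into its
    (major, minor, patch) tuple and optional local build tag."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\+(.+))?", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version}")
    major, minor, patch, local = m.groups()
    return (int(major), int(minor), int(patch)), local

# Unlike keying a matrix on major.minor alone, this keeps the patch
# level and backend tag distinct, so +cu113 and plain wheels differ:
print(parse_torch_version("1.12.1+cu113"))  # ((1, 12, 1), 'cu113')
print(parse_torch_version("1.12.0"))        # ((1, 12, 0), None)
```

A matrix keyed on the full tuple would have caught the wheel-specific behaviour reported above.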
