ARROW-2657: [C++] Undo export pthread_once related symbols. #2096
Conversation
As you can see on Travis-CI (look at the manylinux1 build).
If, as @wesm suggests, "TensorFlow is not respecting the manylinux1 standard and is using newer compilers", I'm not sure how we can reconcile TensorFlow and Arrow :-/
The original problem is that […]. The fix for that was to ensure that those symbols stay process-global, so that they all resolve to a single location.
This is a bit of a mess -- I'm reading through the original comments to understand what's happening better. Would this not be a problem in any setting where devtoolset was used to build multiple .so files?
Where is the […]?
I think it's ok as long as you don't try to hide the "once" symbols using a visibility script.
@xhochy is there a problem if we allow these pthread symbols to be visible in the .so? We might have to set up some kind of exclusions file for the visibility check script.
We actually hide these symbols to be compatible with other libs. In this case it sadly produces the opposite, so it's fine to exclude them.
@xhochy, where does […] EDIT: Ok, I think I found it: https://github.com/apache/arrow/blob/master/python/manylinux1/scripts/check_arrow_visibility.sh
That's the one, let me know if you need help. I think that may need to be turned into a Python script that can read from an exclusions file.
exit 0
fi

exceptions = [
    '__once_proxy',
If we remove this line, the test fails in Travis with
Step 9/14 : RUN /check_arrow_visibility.py
---> Running in 3dc1e6cc807c
Traceback (most recent call last):
File "/check_arrow_visibility.py", line 36, in ?
raise Exception(symbols)
Exception: [u'_fini', u'_init', u'__once_proxy']
The command '/bin/sh -c /check_arrow_visibility.py' returned a non-zero code: 1
as desired.
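For reference, here is a minimal sketch of how such a visibility check can work (hypothetical code, not the actual check_arrow_visibility.py; it assumes binutils' nm is available and that libarrow.so should only export arrow:: symbols plus a small exceptions list):

```python
#!/usr/bin/env python
# Hypothetical sketch of a visibility check: list the dynamic symbols of
# libarrow.so with nm and fail if anything outside the arrow:: namespace
# (or the exceptions list) is exported.
import subprocess
import sys

exceptions = [
    '_init',
    '_fini',
    # Must stay exported so std::call_once resolves to a single definition
    # across all shared libraries in the process.
    '__once_proxy',
]


def check_visibility(library_path):
    output = subprocess.check_output(
        ['nm', '--demangle', '--dynamic', '--defined-only', library_path])
    offending = []
    for line in output.decode().splitlines():
        # nm lines look like "<address> <type> <demangled name>"
        parts = line.split(' ', 2)
        if len(parts) < 3:
            continue
        symbol = parts[2]
        if 'arrow::' in symbol or symbol in exceptions:
            continue
        offending.append(symbol)
    if offending:
        raise Exception(offending)


if __name__ == '__main__':
    check_visibility(sys.argv[1])
```

With '__once_proxy' in the exceptions list a check along these lines passes; dropping it reproduces the kind of failure shown in the Travis log above.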
Mostly working, although unfortunately there is still some issue:

../../venv-test-2.7-16/lib/python2.7/site-packages/pyarrow/tests/test_convert_pandas.py::TestConvertMisc::test_threaded_conversion
/io/build_arrow.sh: line 50: 4909 Segmentation fault (core dumped) py.test -v -r sxX --durations=15 --parquet ${VIRTUAL_ENV}/lib//site-packages/pyarrow*
Is […]? This gcc thread suggests that […] may fix the problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58104. Try adding […]
Although I'm not sure why that particular test would fail =( I can try to tinker with this tomorrow, possibly; I don't know where to turn for help.
Another thing we could try is to not use the global thread pool in Plasma, in case there is some issue with using the global thread pool in […]
We could also try replacing the global thread pool initialization with an explicit use of […]
I also think we should get the TensorFlow team involved in this. They bear some responsibility for their choice of a non-conforming compiler toolchain when building their wheels.
Rebased. Taking a look at this; I am curious whether replacing the atomic initialization of the thread pool with a mutex will make the problem go away.
Looks like PyTorch and TensorFlow trampled each other over this issue also: pytorch/pytorch#3059 (comment). Digging some more into this.
Been banging my head against this for a few hours now. I've been looking at PyTorch's equivalent symbols.map file (https://github.com/pytorch/pytorch/blob/master/tools/pytorch.version) and I tried to make it look more like theirs:
But then this causes errors like […]
Intuitively, all of the symbols except for the […]
@wesm thanks for looking into this. I'm out of ideas at the moment about the "right" way to solve this. If we can't figure something out then it will likely be possible to work around it by not using the global thread pool in Plasma.
After looking at this, I'm not sure that will help; we would have to find some other way to have a global thread pool without having a static global in libarrow.so. We should try to better understand the conflict with TensorFlow -- the fact that PyTorch was able to solve the problem by hiding the symbol is promising. I'm concerned that devtoolset-2 (which is based on gcc 4.8.x) is on the very edge of C++11 support and so there may be something idiosyncratic that's going wrong. I wonder how much trouble we'd be getting ourselves into if we built our wheels with devtoolset-3 (gcc 4.9.x-based).
Do we already have other static singletons in Arrow? I suppose it could be possible to use our own mutex-backed implementation if that helps (I don't think the […])
I tried using mutexes to synchronize -- the issue seems to be having any global variable that is a C++ object -- this causes the compiler to emit the […]
Answering myself: yes, we already have static singletons in the Arrow codebase, for example in the type factories in […]
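To make the idiom concrete, here is a self-contained sketch of the kind of function-local static singleton meant here (illustrative names only, not Arrow's actual type-factory code):

```cpp
// Illustrative sketch of a function-local static singleton; the names are
// made up and do not correspond to Arrow's real declarations.
#include <memory>

struct DataType {
  virtual ~DataType() = default;
};

struct Int32Type : DataType {};

// In C++11 the local static below is initialized exactly once, even with
// concurrent callers, so the compiler emits thread-safe initialization
// machinery for it.
const std::shared_ptr<DataType>& GetInt32Singleton() {
  static std::shared_ptr<DataType> instance = std::make_shared<Int32Type>();
  return instance;
}
```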
You're right; I'm pretty mystified why this problem only started coming up now.
TensorFlow needs a newer version of clang for all the LLVM stuff they are using, I guess.
Which idiom did you try exactly? I'm thinking something like:

```cpp
static std::mutex cputhreadpool_mutex;
static CPUThreadPool* cputhreadpool = nullptr;

CPUThreadPool* GetCPUThreadPool() {
  std::lock_guard<std::mutex> lock(cputhreadpool_mutex);
  if (cputhreadpool == nullptr) {
    cputhreadpool = ...;  // construct the pool lazily, under the lock
  }
  return cputhreadpool;
}
```

By putting all […]
That's what I tried. I suspect the call_once symbols are getting pulled in from something else (maybe from the condition_variable use?) -- we may be looking at a red herring.
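For reference, this is a minimal standalone example (not Arrow code) of the pattern that typically makes an object reference libstdc++'s __once_proxy symbol, namely std::call_once on a std::once_flag:

```cpp
// Standalone example: std::call_once / std::once_flag is what usually drags
// in the __once_proxy symbol from libstdc++.
#include <iostream>
#include <mutex>

static std::once_flag g_init_flag;

void InitializeOnce() {
  std::call_once(g_init_flag, [] { std::cout << "initialized exactly once\n"; });
}

int main() {
  InitializeOnce();
  InitializeOnce();  // no-op: the flag is already set
  return 0;
}
```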
Disassembling the […]
OK, so I found out the mystery of the threads being spawned. They're being spawned by openblas when NumPy imports:
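A quick way to double-check that (a minimal Linux-only diagnostic sketch, not part of this PR, which counts OS threads via /proc):

```python
# Linux-only diagnostic: count OS threads before and after importing NumPy
# to see whether the BLAS backend spawns a worker pool at import time.
import os


def thread_count():
    # Each entry under /proc/<pid>/task is one thread of this process.
    return len(os.listdir('/proc/{}/task'.format(os.getpid())))


print('threads before importing numpy:', thread_count())
import numpy  # noqa: E402,F401  (openblas may start its pool here)
print('threads after importing numpy:', thread_count())
```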
@pitrou I tried your workaround and it didn't work, unfortunately. I'm going to take a crack at adding options to pass in a thread pool in the various places where we use it, so we can unblock ourselves on this; not sure what else to do.
Changing our API just so that TensorFlow's non-compliant wheels don't crash sounds a bit over the top to me. I'd rather redirect user complaints to the TensorFlow project. We shouldn't feel responsible for this issue.
AFAIK this is happening with wheels, but how about using conda packages instead?
Conda packages for Arrow or for TF?
For both.
@pitrou I agree with you, for the record. Currently, the GTP (global thread pool) is being used in two places: parallel memcpy and parallel pandas conversions. The change I'm proposing is to:
And then:
Another "nuke it from orbit" option is to import TF if it is available prior to loading pyarrow's shared libraries. @robertnishihara is right that having a conflict with TF at this point in history could make Arrow toxic to a large number of Python programmers. |
I'm really curious: do you have a backtrace of where it fails? Removing the |
Let me do a from scratch build of everything to prevent any contamination and get the backtrace |
Here's the backtrace
I'm pushing the branch as it stands now. I'll try a few more things |
I agree that fixing this in some way is extremely important. I can investigate the solution to try to import tensorflow before the pyarrow libraries (if it is available). Any thoughts? |
This is the cheapest thing I can think of:
Loading the tensorflow shared library is fast, whereas […]
Yep, that's the way to do it. Let me see if I can get the path without importing tensorflow ;)
Probably have to look in site-packages
Suffice it to say, this workaround should only be needed on Linux.
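A rough sketch of that workaround (hypothetical code, Linux-only; the library name libtensorflow_framework.so and the search over sys.path are assumptions, and the real workaround is the one referenced as #2210 below): find TensorFlow's shared library under site-packages and load it with RTLD_GLOBAL before pyarrow's own libraries.

```python
# Hypothetical sketch: preload tensorflow's shared library (if installed)
# with RTLD_GLOBAL so that its std::call_once machinery and pyarrow's
# resolve to the same definitions. Linux-only.
import ctypes
import glob
import os
import sys


def maybe_preload_tensorflow():
    if not sys.platform.startswith('linux'):
        return
    for entry in sys.path:
        package_dir = os.path.join(entry, 'tensorflow')
        if not os.path.isdir(package_dir):
            continue
        for lib in glob.glob(os.path.join(package_dir, 'libtensorflow_framework*.so*')):
            # RTLD_GLOBAL makes the symbols visible to libraries loaded later.
            ctypes.CDLL(lib, mode=ctypes.RTLD_GLOBAL)
            return


maybe_preload_tensorflow()
import pyarrow  # noqa: E402,F401
```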
I think it's more user-friendly to have […]
Roger, reverted.
Closing because hopefully this is worked around by #2210.
This seems to fix https://issues.apache.org/jira/browse/ARROW-2657, but I don't fully understand this code so it may introduce other issues.
This code was introduced in #1953.
cc @wesm @pitrou