Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2657: [C++] Undo export pthread_once related symbols. #2096

Closed
wants to merge 12 commits into from
Closed

ARROW-2657: [C++] Undo export pthread_once related symbols. #2096

wants to merge 12 commits into from

Conversation

robertnishihara
Copy link
Contributor

This seems to fix https://issues.apache.org/jira/browse/ARROW-2657, but I don't fully understand this code so it may introduce other issues.

This code was introduced in #1953.

cc @wesm @pitrou

@pitrou
Copy link
Member

pitrou commented Jun 1, 2018

I don't fully understand this code so it may introduce other issues.

As you can see on Travis-CI (look at the manylinux1 build).

@pitrou
Copy link
Member

pitrou commented Jun 1, 2018

If, as @wesm suggests, "TensorFlow is not respecting the manylinux1 standard and is using newer compilers", I'm not sure how we can reconcile TensorFlow and Arrow :-/

@pitrou
Copy link
Member

pitrou commented Jun 1, 2018

The original problem is that libarrow.so and libplasma.so both link with libstdc++ and, when the RH devtoolset is used (i.e. when building wheels), some parts of the libstdc++ are linked statically (the parts that aren't available at runtime with an old libstdc++). Then, when loading libplasma.so which also loads libarrow.so, those static symbols end up instantiated twice in the process (because they are made library-local by the visibility script).

The fix for that was to ensure that those symbols stay process-global, so they that all resolve to a single location.

@wesm
Copy link
Member

wesm commented Jun 1, 2018

This is a bit of a mess -- I'm reading through the original comments to understand what's happening better. Would this not be a problem in any setting where devtoolset was used to build multiple .so files?

@wesm
Copy link
Member

wesm commented Jun 1, 2018

Where is the call_once call happening? I searched in the codebase and didn't find that anywhere

@pitrou
Copy link
Member

pitrou commented Jun 1, 2018

call_once is apparently used under the hood for lazily-initialized static singletons (i.e. when a static object is defined inside a function).

@pitrou
Copy link
Member

pitrou commented Jun 1, 2018

Would this not be a problem in any setting where devtoolset was used to build multiple .so files?

I think it's ok as long as you don't try to hide the "once" symbols using a visibility script.

@wesm
Copy link
Member

wesm commented Jun 1, 2018

@xhochy is there a problem if we allow these pthread symbols to be visible in the .so? We might have to set up some kind of exclusions file for the visibility check script

@xhochy
Copy link
Member

xhochy commented Jun 1, 2018

We actually hide these symbols to be compatible with other libs. In this case it sadly produces the opposite, so it's fine to exclude them.

@robertnishihara
Copy link
Contributor Author

robertnishihara commented Jun 2, 2018

@xhochy, where does check_arrow_visibility.sh live?

EDIT: Ok, I think I found it https://github.com/apache/arrow/blob/master/python/manylinux1/scripts/check_arrow_visibility.sh.

@wesm
Copy link
Member

wesm commented Jun 2, 2018

That's the one, let me know if you need help. I think that may need to be turned into a Python script that can read from an exclusions file

exit 0
fi
exceptions = [
'__once_proxy',
Copy link
Contributor Author

@robertnishihara robertnishihara Jun 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove this line, the test fails in Travis with

Step 9/14 : RUN /check_arrow_visibility.py
 ---> Running in 3dc1e6cc807c
Traceback (most recent call last):
  File "/check_arrow_visibility.py", line 36, in ?
    raise Exception(symbols)
Exception: [u'_fini', u'_init', u'__once_proxy']
The command '/bin/sh -c /check_arrow_visibility.py' returned a non-zero code: 1

as desired.

@robertnishihara
Copy link
Contributor Author

Mostly working, although unfortunately there is still some issue

../../venv-test-2.7-16/lib/python2.7/site-packages/pyarrow/tests/test_convert_pandas.py::TestConvertMisc::test_threaded_conversion /io/build_arrow.sh: line 50: 4909 Segmentation fault (core dumped) py.test -v -r sxX --durations=15 --parquet ${VIRTUAL_ENV}/lib//site-packages/pyarrow*

@wesm
Copy link
Member

wesm commented Jun 3, 2018

Is -pthread being passed when linking libplasma?

This gcc thread suggests that may fix the problem https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58104. Try adding ${PTHREAD_LIBRARY} to https://github.com/apache/arrow/blob/master/cpp/src/plasma/CMakeLists.txt#L66?

@wesm
Copy link
Member

wesm commented Jun 3, 2018

Although I'm not sure why that particular test would fail =( I can try to tinker with this tomorrow possibly, I don't know where to turn for help

@wesm
Copy link
Member

wesm commented Jun 3, 2018

Another thing we could try is to not use the global thread pool in Plasma, in case there is some issue with using the global thread pool in libarrow in libplasma (and we can resolve the shared-library-crossing issue at some later time)

@wesm
Copy link
Member

wesm commented Jun 3, 2018

We could also try replacing the global thread pool initialization with an explicit use of std::call_once in case the issue relates to something having to do with gcc-4.8's implementation of threadsafe static initialization

@wesm
Copy link
Member

wesm commented Jun 3, 2018

I also think we should get the TensorFlow team involved in this. They bear some responsibility for their choice of a non-conforming compiler toolchain when building their wheels

@wesm
Copy link
Member

wesm commented Jun 4, 2018

Rebased. Taking a look at this; I am curious if replacing the atomic initialization of the threadpool with a mutex will make the problem go away

@wesm
Copy link
Member

wesm commented Jun 4, 2018

Looks like PyTorch and TensorFlow trampled each other over this issue also pytorch/pytorch#3059 (comment). Digging some more into this

@wesm
Copy link
Member

wesm commented Jun 4, 2018

Been banging my head against this for a few hours now. I've been looking at PyTorch's equivalent symbols.map file

https://github.com/pytorch/pytorch/blob/master/tools/pytorch.version

I tried to make it look more like theirs:

{
    global:
        extern "C++" {
          *arrow::*;
        };

    local:
        *;
  };

But then this causes errors like

$ python -c "import pyarrow"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/wesm/code/arrow/python/pyarrow/__init__.py", line 47, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: /home/wesm/local/lib/libarrow.so.10: undefined symbol: _ZNK6google8protobuf7Message11GetTypeNameEv

Intuitively, all of the symbols except for the arrow:: symbols (and other C++-related symbols like typeinfo, etc.) should be hidden. I'm going to have to take a break from this for a day or two before I'll be able to work on it again as it's too frustrating

@robertnishihara
Copy link
Contributor Author

@wesm thanks for looking into this. I'm out of ideas at the moment about the "right" way to solve this. If we can't figure something out then it will likely be possible to workaround by not using the global thread pool in plasma.

@wesm
Copy link
Member

wesm commented Jun 4, 2018

After looking at this, I'm not sure that will help; we would have to find some other way to have a global thread pool without having a static global in libarrow.so. We should try to better understand the conflict with TensorFlow -- the fact that PyTorch was able to solve the problem by hiding the symbol is promising.

I'm concerned that devtoolset-2 (which is based on gcc 4.8.x) is on the very edge of C++11 support and so there may be something idiosyncratic that's going wrong. I wonder how much trouble we'd be getting ourselves into if we built our wheels with devtoolset-3 (gcc 4.9.x-based)

@pitrou
Copy link
Member

pitrou commented Jun 4, 2018

Do we already have other static singletons in Arrow?

I suppose it could be possible to use our own mutex-backed implementation if that helps (I don't think the GetCpuThreadPool performance is that critical - if you're planning to offload tasks to a thread pool, you're obviously ok with enduring the cost of thread synchronization). But I'm curious why TensorFlow is using a different toolchain - obviously that should conflict with the manylinux1 requirements?

@wesm
Copy link
Member

wesm commented Jun 4, 2018

I tried using mutexes to sychronize -- the issue is having any global variable that is a C++ object it seems -- this causes the compiler to emit the std::once_call instructions resulting in the pthread symbols being statically linked. I don't yet understand why these symbols are a conflict. From the PyTorch solution, it seems that getting our symbol exports right should solve the problem

@pitrou
Copy link
Member

pitrou commented Jun 4, 2018

Answering to myself: yes, we already have static singletons in the Arrow codebase. For example in the type factories in type.cc.

@wesm
Copy link
Member

wesm commented Jun 4, 2018

You're right; I'm pretty mystified why this problem only started coming up now

TensorFlow needs a newer version of clang for all the LLVM-stuff they are using I guess

@pitrou
Copy link
Member

pitrou commented Jun 4, 2018

I tried using mutexes to sychronize

Which idiom did you try exactly?

I'm thinking something like:

static std::mutex cputhreadpool_mutex;
static CPUThreadPool* cputhreadpool = nullptr;

GetCPUThreadPool() {
  lock_guard lock(cputhreadpool_mutex);
  if (cputhreadpool == nullptr) {
    cputhreadpool = ...;
  }
  return cputhreadpool;
}

By putting all static variables outside of the function we would ideally remove the call_once usage.

@wesm
Copy link
Member

wesm commented Jun 4, 2018

That's what I tried. I suspect the call_once symbols are getting pulled in from something else (maybe from the condition_variable use?) -- we may be looking at a red herring.

@pitrou
Copy link
Member

pitrou commented Jun 4, 2018

Disassembling the libarrow.so that's part of the generated wheel would probably tell us where call_once is used exactly.

@wesm
Copy link
Member

wesm commented Jul 2, 2018

OK, so I found out the mystery of the threads being spawned. They're being spawned by openblas when NumPy imports:

(gdb) break pthread_create
Function "pthread_create" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (pthread_create) pending.
(gdb) r
...
>>> import pyarrow
#0  __pthread_create_2_1 (newthread=0x7fffeedcd280 <blas_threads>, attr=0x0, 
    start_routine=0x7fffecbc08e0 <blas_thread_server>, arg=0x0) at pthread_create.c:610
#1  0x00007fffecbc0c08 in blas_thread_init ()
   from /home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#2  0x00007fffec9a0085 in gotoblas_init ()
   from /home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#3  0x00007ffff7de5733 in call_init (env=0x962b20, argv=0x7fffffffc4b8, argc=1, l=<optimized out>) at dl-init.c:72
#4  _dl_init (main_map=main_map@entry=0x963080, argc=1, argv=0x7fffffffc4b8, env=0x962b20) at dl-init.c:119
#5  0x00007ffff7dea1ff in dl_open_worker (a=a@entry=0x7fffffff1310) at dl-open.c:522
#6  0x00007ffff6a422df in __GI__dl_catch_exception (exception=0x7fffffff12f0, 
    operate=0x7ffff7de9dc0 <dl_open_worker>, args=0x7fffffff1310) at dl-error-skeleton.c:196
#7  0x00007ffff7de97ca in _dl_open (
    file=0x7fffef759b90 "/home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so", mode=-2147483646, caller_dlopen=0x7ffff7a51161 <_PyImport_FindSharedFuncptr+417>, 
    nsid=<optimized out>, argc=1, argv=<optimized out>, env=0x962b20) at dl-open.c:605
#8  0x00007ffff7475f96 in dlopen_doit (a=a@entry=0x7fffffff1540) at dlopen.c:66
#9  0x00007ffff6a422df in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff14e0, 
    operate=0x7ffff7475f40 <dlopen_doit>, args=0x7fffffff1540) at dl-error-skeleton.c:196
#10 0x00007ffff6a4236f in __GI__dl_catch_error (objname=0x69a350, errstring=0x69a358, mallocedp=0x69a348, 
    operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#11 0x00007ffff7476735 in _dlerror_run (operate=operate@entry=0x7ffff7475f40 <dlopen_doit>, 
    args=args@entry=0x7fffffff1540) at dlerror.c:162
#12 0x00007ffff7476051 in __dlopen (
<SNIP>

@pitrou I tried your workaround and it didn't work unfortunately. I'm going to take a crack at adding options to pass in a thread pool in the various places where we use it so we can unblock ourselves on this; not sure what else to do

@pitrou
Copy link
Member

pitrou commented Jul 2, 2018

Changing our API just so that TensorFlow's non-compliant wheels don't crash sounds a bit over the board to me. I'd rather redirect user complaints to the TensorFlow project. We shouldn't feel responsible for this issue.

@robertnishihara
Copy link
Contributor Author

Thanks @wesm! @pitrou I agree it'd be best to fix the issue on the TF side, but that's most likely a long-termer item. Not supporting TF at this point would be like being incompatible with numpy or pandas.

@pitrou
Copy link
Member

pitrou commented Jul 2, 2018

AFAIK this is happening with wheels, but how about using conda packages instead?

@robertnishihara
Copy link
Contributor Author

Conda packages for Arrow or for TF?

@pitrou
Copy link
Member

pitrou commented Jul 2, 2018

For both.

@wesm
Copy link
Member

wesm commented Jul 2, 2018

@pitrou I agree with you, for the record. Currently, the GTP is being used in two places: parallel memcpy and parallel pandas conversions.

The change I'm proposing is to:

  • Add a ThreadPool* member to py::PandasOptions
  • Add an analogous parameter to internal::parallel_memcopy

And then:

  • Disable the GTP temporarily (a "global" one to be spawned when import pyarrow happens)
  • Address the problem directly with TensorFlow

Another "nuke it from orbit" option is to import TF if it is available prior to loading pyarrow's shared libraries.

@robertnishihara is right that having a conflict with TF at this point in history could make Arrow toxic to a large number of Python programmers.

@pitrou
Copy link
Member

pitrou commented Jul 2, 2018

@pitrou I tried your workaround and it didn't work unfortunately.

I'm really curious: do you have a backtrace of where it fails? Removing the static singleton should remove implicit use of pthread_once, AFAICT...

@wesm
Copy link
Member

wesm commented Jul 2, 2018

Let me do a from scratch build of everything to prevent any contamination and get the backtrace

@wesm
Copy link
Member

wesm commented Jul 2, 2018

Here's the backtrace

#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7688827 in __pthread_once_slow (
    once_control=0x7fffd4cd51c8 <tensorflow::port::(anonymous namespace)::cpuid_once_flag>, 
    init_routine=0x7ffff088bfe1 <std::__once_proxy()>) at pthread_once.c:116
#2  0x00007fffd4575faa in void std::call_once<void (&)()>(std::once_flag&, void (&)()) ()
   from /home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#3  0x00007fffd4575fde in tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
   from /home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#4  0x00007fffd4575f11 in tensorflow::port::(anonymous namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string const&) ()
   from /home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#5  0x00007fffd43b7394 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
   from /home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#6  0x00007ffff7de5733 in call_init (env=0x944d20, argv=0x7fffffffc4a8, argc=1, l=<optimized out>) at dl-init.c:72
#7  _dl_init (main_map=main_map@entry=0xc29300, argc=1, argv=0x7fffffffc4a8, env=0x944d20) at dl-init.c:119
#8  0x00007ffff7dea1ff in dl_open_worker (a=a@entry=0x7fffffff57b0) at dl-open.c:522
#9  0x00007ffff6a422df in __GI__dl_catch_exception (exception=0x7fffffff5790, 
    operate=0x7ffff7de9dc0 <dl_open_worker>, args=0x7fffffff57b0) at dl-error-skeleton.c:196
#10 0x00007ffff7de97ca in _dl_open (
    file=0x7fffd9bcf640 "/home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so", mode=-2147483646, caller_dlopen=0x7ffff7a51161 <_PyImport_FindSharedFuncptr+417>, 
    nsid=<optimized out>, argc=1, argv=<optimized out>, env=0x944d20) at dl-open.c:605

I'm pushing the branch as it stands now. I'll try a few more things

@pcmoritz
Copy link
Contributor

pcmoritz commented Jul 2, 2018

I agree that fixing this in some way is extremely important. I can investigate the solution to try to import tensorflow before the pyarrow libraries (if it is available). Any thoughts?

@wesm
Copy link
Member

wesm commented Jul 2, 2018

This is the cheapest thing I can think of:

$ python
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ctypes
>>> ctypes.CDLL('/home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so')
<CDLL '/home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so', handle ed4b70 at 0x7f1653aa0048>
>>> import pyarrow
>>> import tensorflow
WORKS

Loading the tensorflow shared library is fast, whereas import tensorflow takes about as long as importing numpy

@pcmoritz
Copy link
Contributor

pcmoritz commented Jul 2, 2018

Yep, that's the way to do it. Let me see if I can get the path without importing tensorflow ;)

@wesm
Copy link
Member

wesm commented Jul 2, 2018

Probably have to look in site-packages

>>> import site
>>> site.getsitepackages()
['/home/wesm/miniconda/envs/wheel-test-3/lib/python3.6/site-packages']

suffice to say this workaround should only be needed on Linux

Change-Id: I99c29a98e248e31ab2b1356ba191715c96117161
@wesm
Copy link
Member

wesm commented Jul 2, 2018

OK @pcmoritz I will leave this branch to you.

@pitrou I changed the various ThreadPool factory functions to use unique_ptr instead of shared_ptr; I can change it back, but I wonder if you feel strongly about one vs. the other?

@pitrou
Copy link
Member

pitrou commented Jul 2, 2018

I think it's more user-friendly to have ThreadPool::Make return a shared_ptr (as we do for most of our static constructors). However the private singleton can of course be a unique_ptr.

Change-Id: Iad094791792b5904214b9145c9e0e5acc4474c65
@wesm
Copy link
Member

wesm commented Jul 2, 2018

Roger, reverted

@robertnishihara
Copy link
Contributor Author

Closing because hopefully this is worked around by #2210.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants