
Our numpy uses the default channel gcc via openblas #15

Closed
ocefpaf opened this issue Jun 10, 2016 · 27 comments

Comments

@ocefpaf
Member

ocefpaf commented Jun 10, 2016

Even though we don't recommend the use of the default channel gcc, and some people here have advocated strongly against it in the past BTW, we added a recipe for openblas using gcc, and that made it into conda-forge's numpy recipe.

Unfortunately that became a problem for some feedstocks, like rios (see conda-forge/gdal-feedstock#72 (comment)), and I am getting reports of failing environments from many users. Not sure what the best approach is here as I lost most of the blas.

Right now I am recommending that people force install numpy from the default channel.

PS: I also see strange behavior where mkl is installed anyway at run-time alongside numpy with openblas.

@jakirkham
Member

> Even though we don't recommend the use of the default channel gcc, and some people here have advocated strongly against it in the past BTW, we added a recipe for openblas using gcc, and that made it into conda-forge's numpy recipe.

The problem is just about every BLAS I know has Fortran code. Also, the standard LAPACK distribution (included with OpenBLAS) is Fortran. The one exception would be BLIS. However, BLIS is too immature for us to rely upon at this point. Hence we had to use gcc to build it as gfortran is really the only game in town. That may change in the future, but not on the time scales we are thinking about.

> Unfortunately that became a problem for some feedstocks, like rios (see conda-forge/gdal-feedstock#72 (comment)), and I am getting reports of failing environments from many users. Not sure what the best approach is here as I lost most of the blas.

I can try to take a look at the recipe, but I don't think I will have time to do it today.

> Right now I am recommending that people force install numpy from the default channel.

Alright, as long as you have a way to get things to work that is your call.

> PS: I also see strange behavior where mkl is installed anyway at run-time alongside numpy with openblas.

This is probably more concerning and may mean that we are running into a different problem and the above was merely a symptom of this.

Could you please include a reference to relevant logs that show these problems so that we can get some idea of what is happening?

@gillins

gillins commented Jun 10, 2016

Just to clarify, what I'm seeing on OSX is that C++ packages built against numpy (or anything else pulling in libgcc) link against libstdc++.dylib in the Conda directory thanks to rpath. However anything else links against /usr/lib/libstdc++.dylib.

Packages like gdal end up pulling in both these varieties. This is reminiscent of our struggles on Windows 😦

What makes this hard to pin down is that programs link happily and even load OK, but it is when you start doing some processing that things crash, no doubt due to incompatibilities between these library versions.

Would it be possible to split just the Fortran libs out of libgcc - say into a libgfortran and have that as a dependency of numpy. Or even better statically link against libgfortran?

@gillins

gillins commented Jun 10, 2016

Or we add libgcc to absolutely everything on OSX, but I think that would be less ideal...

@jakirkham
Member

So, this is a tricky issue. It is definitely worthy of investigation, but there are a number of things that need to be considered.

For instance, static linking to any libgcc library results in the binary being GPL'd unless it qualifies for the runtime exception. This can result in weird things for one's stack legally. If we get past the legal hurdle, static linking libgfortran requires static linking to libquadmath too. Despite lots of discussion on this last point on various gcc bug reports and mailing list threads, there is still no option to do the latter with gcc. So, beyond potentially putting users in a legally challenging situation, we lack the functionality to do it. This has actually been discussed to some extent in other threads (primarily this one: conda-forge/conda-forge.github.io#29). So, I won't go into these points too much.

However, it sounds like we are going to change how we deal with compilers again following a meeting yesterday. Basically, Continuum is going to start building gcc again and we are going to use it everywhere for everything UNIX. Before that can happen though, they need to get an idea of what OSes people want to support (particularly for Linux) and they need to get an idea of what features the compiler must support (e.g. C++11). Relatedly, there is the question of breaks from gcc 6.0 and how we want to deal with these. Fortunately, Linux distros dived head first into this mess near the beginning of this year. So, we hopefully won't have to reinvent solutions to these problems.

There has been a separate discussion, which has come up over time and again at yesterday's meeting, about breaking up libgcc into smaller libraries and keeping libgcc as a metapackage. Presumably, this would be split into things like libstdc++, libgfortran, and libgomp. Though smaller divisions are certainly possible, these are the main divisions that we definitely need.

In the near term, I can certainly re-explore how we are building OpenBLAS. When I had been building it pre-conda-forge, I built all of the C/C++ portions with clang and used gfortran only for the Fortran parts. This seemed to work fine and I used it for heavy duty image analysis. However, I did not use Continuum's gcc to do this. So, there might be some ugly hacks to get this to work (like deleting gcc and g++ binaries from the path) amongst other potentially ugly solutions. I had tried to separate gcc into a gfortran only package before, but the libraries that gcc ships with are hopelessly intertwined. (I can't wait for LLVM to have a Fortran compiler.) This is unfortunately not going to be a fast process because it is tricky, subtle, and will likely require other people to help test the package for breaks.

@jakirkham
Member

jakirkham commented Jun 11, 2016

Sorry, guys, I have tried everything I can do with OpenBLAS and there is no way forward at present. We cannot remove the C/C++ compiler portions and it doesn't matter even if we could. On Mac, OpenBLAS is only linked to libgfortran. I tried to use libgfortran as a dependency, but we still don't have a package for this on Mac. So, we are stuck with libgcc.

However, I doubt any of this is the real problem for you. Having a newer version of libgcc shouldn't cause you any problems on Mac just as it doesn't cause any problems on Linux. The only problem would be if a version of libgcc older than the system's copy were distributed. As Mac's system copy is 4.2.1 (and will never be upgraded) and we are shipping libgcc 4.8.5 on Mac, this is not the cause of the problem.

While I am concerned that you find yourselves linked to the system libstdc++, neither OpenBLAS nor NumPy has any C++ code. Inspection of them demonstrates that they are clean of linkages to libstdc++. As far as building in the face of the libgcc package goes, I have had no problem building vigra with C++11 support using clang when a copy of libgcc was in the path because of openblas, either pre-conda-forge or now. See this build for an example.

I'm not sure what the issue is here, but if it really relates to libstdc++, neither NumPy nor OpenBLAS is the cause. While I would like to help, there is simply not much information to go on here. If we could know the libraries involved and have a reproducible example, I think we would have a better chance of figuring this out.

@gillins

gillins commented Jun 11, 2016

The problem (I think) is that when libgcc is available (because it is one of the dependencies of e.g. numpy) and the code is C++, the rpath gets set for that package to use Conda's libstdc++. When libgcc isn't available (not one of the build dependencies), the system libstdc++ gets linked in. At runtime you end up with both Conda's libstdc++ and the system libstdc++ linked in at the same time when you have lots of libs.

So you can end up with one library calling another, passing C++ objects but using different versions of the runtime library. Invariably there will be incompatibilities (and crashes). If there was a way of never having libgcc on the system at build time I think that would fix it, but I don't think this is possible currently as building against numpy ends up with libgcc installed. Does this make sense?

It's not that we are using a newer libstdc++ or anything - as you point out that should be fine. We are using both the older and newer one at the same time. I think this is always going to end in tears...
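
To make the dependency argument concrete, here is a minimal Python sketch of how libgcc ends up in a build environment indirectly. The dependency map is a simplified, hypothetical stand-in for the real recipes (only the numpy, openblas and libgcc relationships come from this thread), and the function names are made up for illustration.

```python
def transitive_deps(package, deps):
    """Return every package reachable from `package` in the dependency map."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for dep in deps.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Simplified, hypothetical dependency map; real recipes list many more packages.
DEPS = {
    "numpy": ["openblas"],
    "openblas": ["libgcc"],
    "gdal": ["numpy", "hdf5", "geos"],
    "hdf5": [],
    "geos": [],
}


def build_env_has_libgcc(package, deps=DEPS):
    """True if installing `package` would also bring libgcc into the environment."""
    return "libgcc" in transitive_deps(package, deps)
```

Under these assumptions, `build_env_has_libgcc("gdal")` is true while `build_env_has_libgcc("geos")` is false, which is the situation where one package may pick up Conda's libstdc++ and the other the system copy.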

@gillins

gillins commented Jun 11, 2016

If you have access to a Mac, install gdal and then run otool -L on e.g. libhdf5_cpp.dylib and look for libstdc++.dylib. Then do the same on libgdal.dylib and I think you will see what I mean.
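
The check described above can be scripted. The following is a rough Python sketch that scans text in the usual `otool -L` layout (a header line, then one indented dependency per line) and returns any C++ runtime entries. The function name and the sample text are hypothetical; the sample is hand-written for illustration and was not captured from a real machine.

```python
import re

# Matches one dependency line of `otool -L` output, e.g.
#   "\t/usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 60.0.0)"
_DEP_LINE = re.compile(r"^\s+(?P<name>\S+)\s+\(compatibility version")


def cxx_runtimes(otool_output):
    """Return install names of C++ runtime libraries found in `otool -L` output."""
    found = []
    for line in otool_output.splitlines():
        match = _DEP_LINE.match(line)
        if not match:
            continue  # header line ("<file>:") or anything unrecognised
        name = match.group("name")
        basename = name.rsplit("/", 1)[-1]
        # libstdc++ is the GNU runtime, libc++ the LLVM one.
        if basename.startswith(("libstdc++", "libc++.")):
            found.append(name)
    return found


# Hypothetical sample, hand-written for illustration only.
SAMPLE = (
    "libexample.dylib:\n"
    "\t@rpath/./libstdc++.6.dylib (compatibility version 7.0.0, current version 7.19.0)\n"
    "\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1.0.0)\n"
)
```

Feeding it the real output for each library in an environment (for example by capturing `otool -L` with `subprocess`) would show which libraries point at `/usr/lib` and which at an `@rpath` copy.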

@jakirkham
Member

jakirkham commented Jun 11, 2016

Based on what you are now saying I don't think the problem is which libstdc++ is used, but it may be libc++ and libstdc++ interacting in undefined ways (e.g. passing STL objects around). So, I can try to switch HDF5 over to the toolchain so it will be linked with libc++. It came in long before we really settled most of the compiler stuff so it is due for a change. We can't seem to rebuild HDF5 1.8.15.1 though so I hope using 1.8.17 will work for you. Though I would recommend looking carefully through your dependencies and making sure they are all using libc++ as well if they have C++ code.

Also, side note, it appears on Mac gdal is using the system expat. Might want to take a look at that.

@msarahan
Member

@jakirkham

> Basically, Continuum is going to start building gcc again and we are going to use it everywhere for everything UNIX. Before that can happen though, they need to get an idea of what OSes people want to support (particularly for Linux) and they need to get an idea of what features the compiler must support (e.g. C++11).

I think you are jumping the gun here. The survey results are not final until Thursday. Depending on how they turn out, Continuum may need to have a separate compiler to fulfill our needs. It will be a happy coincidence if Continuum and conda-forge can use the same compiler and general compiler strategy, and thus pool effort, but it is not a foregone conclusion. Also, even if Continuum becomes the general source for a compiler, I propose that we take steps to make the package reproducible, using Debian's guidelines and whatever we can come up with on Mac. I hope that this compiler can be something not so much "from Continuum for the community", but rather something that Continuum would like to share with the community - the responsibility of building and providing, as well as decisions around the feature set and compile options.

I encourage everyone interested in this issue to vote at http://goo.gl/forms/FMCT1bCNVg6ywpKD3

@jakirkham
Member

jakirkham commented Jun 11, 2016

I'm basing that statement on the fact that @pelson wants to use the same strategy on both Mac and Linux platforms (which we seem to agree on) and the fact that we have also discussed the fact the gcc package needs to be built. The organizational structure of how this will be done is certainly undetermined (though this is not really what I'm interested in discussing here). However, I don't see how using the gcc package is not what we have decided based on our discussion in the meeting and after.

@msarahan
Member

Yes, having a package-based gcc (as opposed to strictly docker with system compilers) is what we decided, but we have not decided whether that package uses the devtoolset patch, or whether we ship runtimes. We also have not decided which version of GCC to commit to. Please vote.

@jakirkham
Member

> PS: I also see strange behavior where mkl is installed anyway at run-time alongside numpy with openblas.

As far as this is concerned, I think we may be hitting point 2 in this PR ( conda/conda#2036 ). I don't see much way out of this at present. Interestingly running conda update --all pulls in our NumPy after. Please feel free to recheck with other versions of conda to see if this is not always the case, @ocefpaf.

@jakirkham
Member

jakirkham commented Jun 11, 2016

> Yes, having a package-based gcc (as opposed to strictly docker with system compilers) is what we decided...

👍

> ...we have not decided whether that package uses the devtoolset patch, or whether we ship runtimes.

I didn't try to get into the nitty gritty details. My main point with the issue they raised is we may have a mixture of libstdc++ and libc++ where STL symbols are shared that is causing problems. However, if we just use gcc on Mac, that won't be an issue.

Though you are of course correct there are details about the compiler that we are very interested in getting feedback on.

@gillins

gillins commented Jun 11, 2016

The old libhdf5_cpp was in fact linked to /usr/lib/libstdc++.6.dylib, not libc++. I don't quite understand why, as I thought we were using clang, but the same thing still stands - two versions of the C++ runtime library were linked in - both libstdc++, but one the system version and one the Conda libgcc version.

GDAL: @rpath/./libstdc++.6.dylib (compatibility version 7.0.0, current version 7.19.0)
libkea, geos, previous hdf5 etc: /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 60.0.0)

GDAL pulls these other libs in.
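
The two groups of install names above can be compared mechanically. This is a small Python sketch, not an established tool: the install names are the ones quoted above, while the classification rules and the function names are assumptions made for illustration.

```python
def runtime_origin(install_name):
    """Classify a libstdc++ install name as the OS copy or a bundled copy."""
    if install_name.startswith("/usr/lib/"):
        return "system"
    if install_name.startswith(("@rpath/", "@loader_path/", "@executable_path/")):
        return "bundled"
    return "other"


def group_by_origin(linked):
    """Group libraries by where their libstdc++ comes from."""
    groups = {}
    for library, install_name in linked.items():
        groups.setdefault(runtime_origin(install_name), []).append(library)
    return groups


# Install names as quoted in this comment; the grouping rule is an assumption.
linked = {
    "gdal": "@rpath/./libstdc++.6.dylib",
    "libkea": "/usr/lib/libstdc++.6.dylib",
    "geos": "/usr/lib/libstdc++.6.dylib",
    "hdf5 (previous build)": "/usr/lib/libstdc++.6.dylib",
}
groups = group_by_origin(linked)
```

More than one key in `groups` means a single process could load two different copies of libstdc++, which is the condition suspected here.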

OK so the solution is to rebuild (with toolset) all C++ packages pulled in by gdal that aren't already linked against libgcc? I just had a look and libkea and geos are other examples, there may be more.

If that is correct, I'll look at doing this later.

@gillins

gillins commented Jun 12, 2016

Just to recap: conda-forge on OSX is broken. I think we need to come up with a plan to fix this ASAP.

Since libgcc got added to dependencies of numpy with the openblas change we have some packages using system libstdc++ and others using the one that comes with libgcc leading to weird runtime behaviour.

I think there are 2 options:

  1. Remove libgcc from numpy (somehow) and rebuild anything recently linked against it
  2. Add libgcc to all packages (especially C++ ones, but maybe all to be safe).

Option 2 seems like it would be a very big job and would impact defaults. Thoughts?

cc @pelson @danclewley @ocefpaf

@ocefpaf
Member Author

ocefpaf commented Jun 12, 2016

@gillins I prefer 1. I do look forward to using numpy with openblas in the future, but we need a proper solution for Fortran first.

What do you think @msarahan?

@jakirkham
Member

> Just to recap: conda-forge on OSX is broken. I think we need to come up with a plan to fix this ASAP.

So, I am able to use this NumPy with a plethora of things without problems including packages that use C++11 from conda-forge. I think there is a conflation of problems with gdal and all of conda-forge, which is inaccurate and misleading.

As I still haven't seen clear reproducible problems, it is difficult to actually pin-point what they are. It would be nice if either @gillins or @ocefpaf could help me by doing this. Let's pin-point the problem before we start coming up with such proposals.

It also seems there is some confusion about NumPy and OpenBLAS so I will try to simply explain this below. Please ask questions.

First, if I inspect NumPy's linkages, this is what I find. Basically, it is not linked to libgcc whatsoever.

$ conda inspect linkages numpy
numpy
-----

openblas-0.2.18-1:
    libopenblas-r0.2.18.dylib (lib/libopenblas-r0.2.18.dylib)

system:
    libSystem.B.dylib (/usr/lib/libSystem.B.dylib)

not found:

Second, numpy is not dependent on libgcc. It is dependent on openblas, which is dependent on libgcc. There is no way to remove the latter as it needs to link against libgfortran.

Third, OpenBLAS's linkages are shown below. As we are not guaranteed to have libgfortran or libquadmath on the user's system, we ship them. It is not linked in any other way to libgcc.

$ conda inspect linkages openblas
openblas
--------

libgcc-4.8.5-1:
    libgfortran.3.dylib (lib/libgfortran.3.dylib)
    libquadmath.0.dylib (lib/libquadmath.0.dylib)

openblas-0.2.18-1:
    libopenblas-r0.2.18.dylib (lib/libopenblas-r0.2.18.dylib)

system:
    libSystem.B.dylib (/usr/lib/libSystem.B.dylib)

not found:

Now if we had a libgfortran package for Mac (as exists on Linux), I would happily switch OpenBLAS over to it, but that doesn't exist so we have no option other than to ship all of libgcc. If being dependent on libgcc were the problem, which I have yet to see convincing evidence of in reproducible form, we could solve the problem with this simple change.
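
For anyone who wants to repeat this inspection across several packages, here is a hedged Python sketch that turns the report text into a dictionary. The layout it expects is inferred only from the two transcripts above (unindented `name:` section headers followed by indented entries), so treat the parser as an assumption rather than a documented format; the function names are made up.

```python
def parse_linkages(text):
    """Map each provider section of the report to the libraries listed under it."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
            sections[current] = []
        elif line[0].isspace() and current is not None:
            # Entry lines look like "libname (path)"; keep only the name.
            sections[current].append(line.split()[0])
    return sections


def libs_from_libgcc(sections):
    """Collect libraries supplied by any libgcc-* provider section."""
    return [
        lib
        for provider, libs in sections.items()
        if provider.startswith("libgcc-")
        for lib in libs
    ]


# The openblas report as quoted in this comment.
OPENBLAS_OUTPUT = """\
openblas
--------

libgcc-4.8.5-1:
    libgfortran.3.dylib (lib/libgfortran.3.dylib)
    libquadmath.0.dylib (lib/libquadmath.0.dylib)

openblas-0.2.18-1:
    libopenblas-r0.2.18.dylib (lib/libopenblas-r0.2.18.dylib)

system:
    libSystem.B.dylib (/usr/lib/libSystem.B.dylib)

not found:
"""
```

Applied to the report above, `libs_from_libgcc(parse_linkages(OPENBLAS_OUTPUT))` lists only libgfortran and libquadmath, matching the point that OpenBLAS takes nothing else from libgcc.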

@jakirkham
Member

> The old libhdf5_cpp was in fact linked to /usr/lib/libstdc++.6.dylib, not libc++. I don't quite understand why, as I thought we were using clang, but the same thing still stands - two versions of the C++ runtime library were linked in - both libstdc++, but one the system version and one the Conda libgcc version.

Unfortunately, some things pick up gcc by default. So, it requires a fair bit of coercion to fix them. The purpose of toolchain is to do this in a systematic way instead of doing it via copy-pasta of a large block of variable export statements. The hdf5 package was added long before this package existed and possibly IIRC before we even had best practices with compilers.

@msarahan
Member

> Now if we had a libgfortran package for Mac (as exists on Linux), I would happily switch OpenBLAS over to it, but that doesn't exist so we have no option other than to ship all of libgcc

This should be easy to create by having a libgfortran package that is essentially a metapackage.

requirements:
    build:
        - libgcc

build:
    always_include_files:
        - path(s) to fortran libs

@jakirkham
Member

Certainly a possibility. Might want to do some testing to make sure this actually behaves ok on Mac.

In any event, I'm not really interested in going on a wild goose chase here. If we can come up with a solid example that repeatedly demonstrates the problem, then we would be in much better shape for determining how to go about fixing it. Right now, I remain unclear on what that problem is.

@gillins

gillins commented Jun 12, 2016

OK my first point is that we shouldn't have more than one libstdc++ linked in at one time. Are we agreed on this?

I think what is happening is that when libgcc is available at build time (eg a package that relies on numpy), then the libstdc++ is pulled in from there. When it isn't, then the system one is linked in.

Does this sound as bad to you as it does to me??

This isn't to do with whether openblas actually uses libstdc++ - the problem is that having it there makes it available for the linker to grab.

@jakirkham I suspect the reason you haven't seen a problem is that all your packages are built with numpy present. Things go wrong when it isn't (and then you mix packages built with numpy). gdal is loading a whole bunch of libs, some built with numpy present, others without.

Does this make sense?

@jakirkham
Member

jakirkham commented Jun 12, 2016

While there are non-compliant packages that need to be fixed, my point remains that NumPy and OpenBLAS are not among them. We should fix the non-compliant packages to be compliant if that is your question. Compliance has meant using clang. Given that we have a toolchain package that can be used to do this, it should be significantly simpler than it used to be.

Now, whether this has anything to do with the problem that you are experiencing is another question. This has really all been guess work up to this point. While I would like to help address it if I can, right now I still don't know of a reproducer. Would it be possible to come up with one? If not, I'm afraid that we still have no better idea of how or whether this will fix the issues you are running into.

@gillins

gillins commented Jun 12, 2016

Two things you can do to reproduce this:

  1. Use otool -L to compare the linking between a numpy dependent package (eg gdal) and a non-numpy package that it pulls in (eg hdf5_cpp, geos etc).
  2. Install RIOS (http://rioshome.org/) and run the test suite (testrios.py). It will fail with:
python(743,0x7fff78b49310) malloc: *** error for object 0x7fff77b6f330: pointer being freed was not allocated

*** set a breakpoint in malloc_error_break to debug

See conda-forge/rios-feedstock#4
RIOS is running fine on other platforms, and was fine before the openblas change on OSX...

@gillins

gillins commented Jun 12, 2016

@ocefpaf do you have other examples from the users that have contacted you?

@jakirkham
Member

Thanks for the example, @gillins.

@jakirkham
Member

So, we have rebuilt HDF5 to use libc++. However, this may expose other issues in packages that do not use the toolchain to build.

@ocefpaf
Member Author

ocefpaf commented Jul 25, 2016

After months of broken packages this is finally fixed!
