Catastrophic 2.8 and 2.10 x.plot failure on our new CentOS 7 server #230
Comments
@jypeter Indeed, the problem is OpenGL through vnc. Note it is the same vnc client but not the same vnc server (as the vnc server runs on the newer machine). Here we are running vcs remotely through Virtual GL + TurboVNC. See the doc that @sankhesh put together. |
@danlipsa @doutriaux1 @aashish24 it would be great to augment the testing suite to include redhat7/CENTOS7 in addition to the redhat6 machines that we're currently using |
@durack1 I think it will be a hard thing since for testing we switched to circleci and gave up on buildbot, which would have allowed us to set up our own testing infrastructure. I like the idea/suggestions though, so if there is any other easy way, I am all for it. |
@aashish24 @doutriaux1 @durack1 @williams13 We'll have to do something (bring back buildbot?) to be able to test on-screen vcs on both linux and mac. Right now we have regressions for window close and window resize that we fixed a year ago. We have a test that passes for mesa but fails for on-screen driver. |
@aashish24 @durack1 we could use ci-bot https://github.com/UV-CDAT/ci-bots |
@danlipsa I have started reading your VNC wiki page. I can check with our sysadmins if they have a test server and time to do all the things you mention. The trouble is that, as far as I know, I'm not using VNC! I'm running Fedora Core 20 (Fedora release 20 (Heisenbug)) in my VirtualBox VM, on a Windows 7 machine with a basic NVIDIA NVS 310 card.
I think I understand that using TurboVNC might allow me to squeeze some extra performance when running vcs on a remote server (which would be nice, because it is often painfully slow). But wouldn't using it duplicate my VM+ssh setup, or an X server (Cygwin or VcXsrv) + ssh? I would first like to be able to run vcs out-of-the-box on the new server, or understand why it is not running.
glxgears runs fine when started locally on my VM, or when started on the old and the new server. I have added below the output of glxinfo in my VM, and on the old and the new servers (when connected from my VM). I have also added what GL stuff we have on the remote servers.
glxinfo in the VM
glxinfo on the old servers (RH 6)
glxinfo on the new servers (CentOS 7)
|
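When comparing glxinfo output between the VM, the old servers and the new servers, it can help to reduce each dump to the few lines that identify the OpenGL stack. The following is a minimal sketch, not part of vcs: the function names are hypothetical, and the software-rendering check is only a heuristic based on the renderer names Mesa's software rasterizers commonly report.

```python
import shutil
import subprocess


def summarize_glxinfo(text):
    """Pick out the glxinfo lines that say which OpenGL stack is in use."""
    wanted = (
        "direct rendering",
        "OpenGL vendor string",
        "OpenGL renderer string",
        "OpenGL version string",
    )
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence of each key we care about.
        if sep and key in wanted and key not in summary:
            summary[key] = value.strip()
    return summary


def looks_like_software_rendering(summary):
    """Heuristic (assumption): software rasterizers usually name themselves
    llvmpipe, softpipe or 'Software Rasterizer' in the renderer string."""
    renderer = summary.get("OpenGL renderer string", "").lower()
    names = ("llvmpipe", "softpipe", "software rasterizer")
    return any(name in renderer for name in names)


if __name__ == "__main__":
    if shutil.which("glxinfo") is None:
        print("glxinfo not found on PATH")
    else:
        result = subprocess.run(["glxinfo"], capture_output=True, text=True)
        info = summarize_glxinfo(result.stdout)
        for key, value in info.items():
            print(f"{key}: {value}")
        print("software rendering (guess):", looks_like_software_rendering(info))
```

Running it once locally and once over each ssh connection gives three short summaries that are easier to compare side by side than the full dumps.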
@doutriaux1 and @danlipsa Has anybody tried vcs (either 2.8 or 2.10) on a Centos 7 or RH7 machine? I have just run some additional tests, and got the usual catastrophic final crash. Note that glxgears worked in all cases
|
@aashish24 we really need to address this in a straightforward way. It is making us look really bad. |
@jypeter Do you always run inside a VM? If so, you depend on the OpenGL implementation in the VM, which might not work correctly. In this case I would recommend using OSMesa. |
@danlipsa If you look at my previous comment, you'll see in the additional tests I made that 2.8.0 did not work either a) on a linux desktop connecting to a remote server, or b) directly on the server. I got similar results with 2.10. Note also that 2.8 and 2.10 are not installed in the VM. I only use my VM to connect to the remote servers (basically instead of running a local X server). Just to be sure, I have run my test again from an Ubuntu desktop. You will find the results in the test_170811 folder of the folder where I have already put log files. Note that, again, glxgears ran fine on the local desktop, and on the remote server where I connected from the local desktop
|
@danlipsa I have also just installed mesalib the following way
So, the result is... No crash, I get the simple plot I was expecting in the pdf and png files BUT I don't get interactive plots. Is this what I should be expecting? Is this the solution for generating plots on a headless display, and you don't even need to use bg=1? That's useful to know, but I'm looking for interactive graphics. At least a graphical window I can look at, even if I don't do an x.interact() (that can sometimes be painfully slow). An extra question: once I have installed mesalib, is there a way I can go back to interactive graphics, or do I have to uninstall mesalib (and other packages?)? |
@jypeter Indeed, this is it.
I usually just install another conda env that has interactive graphics (direct hardware access). You could probably replace the vtk package and remove mesa, but I have never done it. |
@aashish24 @doutriaux1 @jypeter This does seem to be an OpenGL installation problem. A solution to this kind of problem is to ship mesa (not osmesa) with our packages and switch at runtime which OpenGL implementation we use. Maybe by default we use the driver installed on your system. If that does not work, the user can switch to the software renderer (mesa). ParaView does that https://blog.kitware.com/messing-with-mesa-for-paraview-5-0vtk-7-0/ |
@danlipsa you mean like using a conda-based mesa? |
I'm on holidays this week, so I won't be able to test things. I will try to remember the mesalib trick in a cloned environment for offline graphics (this should probably go to a uvcdat cookbook). But in order to have offline graphics, you need to be able to test things interactively |
@jypeter The idea is to be able to switch at runtime between the system-installed OpenGL driver (which is usually hardware accelerated and might not be installed correctly, especially on Linux) and a software OpenGL driver that will be installed by us in conda and will always work (but might be slower than the hardware driver). |
@danlipsa do you know what the magic is for the |
@doutriaux1 I think we only need to pass LD_LIBRARY_PATH pointing to the libGL.* from mesa (installed by conda). I wanted to try this but you are welcome to try it. |
@doutriaux1 Or python script. |
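A Python launcher along those lines could look like the sketch below. It is untested against vcs and rests on one assumption from this thread: that pointing LD_LIBRARY_PATH at a directory holding mesa's libGL.* is enough to select the software renderer. The function names and the directory passed on the command line are hypothetical. Because the dynamic linker reads LD_LIBRARY_PATH when a process starts, the variable is set for a child interpreter rather than changed inside the running one.

```python
import os
import subprocess
import sys


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + current if current else "")
    return env


def run_with_library_dir(script, lib_dir):
    """Run a Python script in a child process that searches lib_dir first.

    The child is a fresh interpreter, so the libraries it loads (for
    example through an import of vcs) are resolved with the modified path.
    """
    return subprocess.call([sys.executable, script], env=env_with_library_dir(lib_dir))


if __name__ == "__main__":
    # Hypothetical usage: python run_sw.py MESA_LIB_DIR SCRIPT
    if len(sys.argv) == 3:
        sys.exit(run_with_library_dir(sys.argv[2], sys.argv[1]))
    print("usage: run_sw.py MESA_LIB_DIR SCRIPT")
```

Keeping the environment change in a wrapper leaves the user's shell untouched, so the hardware driver stays the default for everything else.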
@danlipsa but when we use the offscreen mesa we send a special build keyword to the VTK build. Are you saying I could use the regular VTK build and simply point my LD_LIBRARY_PATH to the mesalib and it would get me the offscreen as well? |
@doutriaux1 Our special VTK build is 'headless' which means it does not use X. If we point LD_LIBRARY_PATH to mesa's libGL.* we'll still get onscreen (but software) rendering. I think the headless VTK uses different libraries than libGL.*. |
@danlipsa ok, thanks for the clarification. I read the article too fast; it does say:
I wrongfully assumed at first that |
@jypeter I just learned a trick that you can try in your VirtualBox. |
@danlipsa 3D acceleration has been enabled in my VM for a long time, and I get the following if I run glxgears in local terminal in the Fedora 20 running in the VM. I see the cogs turning the way they should
glxgears also runs fine when I connect to our CentOS server, apparently in software rendering
I have even checked that glxgears keeps on working after I activate cdat
And then I get the usual very verbose crash, that mentions lots of libraries, both in /usr/lib64 and minconda2/envs. The giant backtrace always consistently starts with a memory related error
By the way, why am I getting such a looong error output, instead of just the usual |
@danlipsa that sounds awfully similar to what I experience on my ubuntu since 2.10. |
Note that on the CentOS 7 server, I get the same failure with both 2.8 and 2.10 |
@jypeter @danlipsa @durack1 @dnadeau4 I just randomly found a solution for my Ubuntu. It appears to be a mismatch between Anaconda's stdc++ and the system's; see: https://stackoverflow.com/questions/35911302/cannot-launch-emulator-on-linux-ubuntu-15-10 In my case I ran:
And I am up and running on my Ubuntu. locate shows on my system:
So the system one is |
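To check for this kind of mismatch on another machine, one option is to compare the numeric suffixes of the libstdc++ files in the system library directory and in the conda environment. This is a rough sketch only: the function names are hypothetical, `/usr/lib64` and `$CONDA_PREFIX/lib` are assumed locations that differ between distributions, and a version difference is a hint to investigate rather than proof of the cause.

```python
import glob
import os
import re

# Matches names such as libstdc++.so.6 or libstdc++.so.6.0.19
_VERSION_RE = re.compile(r"libstdc\+\+\.so\.(\d+(?:\.\d+)*)$")


def libstdcxx_version(path):
    """Return the numeric suffix of a libstdc++ file name as a tuple, or None."""
    match = _VERSION_RE.search(os.path.basename(path))
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def newest_libstdcxx(directory):
    """Return (version, path) for the highest-numbered libstdc++ in directory."""
    found = []
    for path in glob.glob(os.path.join(directory, "libstdc++.so.*")):
        version = libstdcxx_version(path)
        if version is not None:
            found.append((version, path))
    return max(found) if found else None


if __name__ == "__main__":
    # Both directories are assumptions; adjust them to the machine at hand.
    system_dir = "/usr/lib64"
    conda_dir = os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib")
    for label, directory in (("system", system_dir), ("conda env", conda_dir)):
        print(label, directory, newest_libstdcxx(directory))
```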
@doutriaux1 I think the reason you added gcc was to be able to build on redhat 6, which has an older gcc than what VTK requires. |
@danlipsa yes, we wanted a consistent gcc to build everything. After some investigation with @dnadeau4 it is definitely related to the graphics card drivers. Logging in from a mac and using X forwarding works fine, but on the machine itself (using the local drivers) it seg faults. Will try to build all of our packages w/o runtime gcc or libgcc to see if it helps.... |
I have the following on our CentOS 7 server
I have tried @doutriaux1's trick, and it does work! Congratulations! The interactive mode in software opengl is still painfully slow (even for such a simple plot), but it's better than nothing
Now, I can't conveniently tell my users to use this trick. Is there a way to make this work out-of-the-box? I could try @doutriaux1's other trick and remove the Note that in my case, our system has the same 19 version as uvcdat-2.10. Could the problem come from the libstdc++.so.6.0.21 file? I have checked where libstdc++ was coming from
|
@jypeter I will try and see if building all of our packages on linux w/o runtime libgcc fixes things. |
Actually, esmf needs libgcc at runtime, so we won't be able to use the trick... The other possibility is to add to the source activate code. Having some magic bash code that locates the system's |
@jypeter I'm building a conda version that might work out of the box. When it's built and up, can I ask you to try it as well? Thanks. |
Yep, no pb. Just give me the full conda command I should use. I'd rather not lose esmf, now that we have it at last! Though it proves to be harder to use than I expected. This is becoming my longest github issue |
@jypeter esmf is integrated in cdms though, so it should be transparent to you. |
@doutriaux1 Not sure if you found a solution for using a newer gcc on redhat-6. This is the advice that Ben, our build expert, gave us on Jan 20th.
|
@doutriaux1 and @aashish24 the following very simple plot works fine on our old servers, but crashes python in a most violent way (with a VERY verbose output, not the usual simple core dumped) on our new CentOS 7 servers, with both 2.8 and 2.10
No problem with cdms2 in other non graphical scripts. I see lots of references to vtk in the long crash output I get. Note that, as usual, I'm connected to the servers with a VM. But with the same VM the plot works fine on the old servers, and fails on the new ones
Old servers
New servers
when I type x.plot(i10), a graphical window opens, and then quickly closes, and my terminal is swamped by the following output
You can access the full error dump I got for 2.8 and 2.10 in this folder
Let me know if you need more information on the server hardware/software, etc... All our disks are now on these servers, so I'd rather like to be able to run vcs there... Must be something wrong with opengl or something related