rpm --rebuilddb causes all future rpm functions to segfault/yum to hang #20
strace for rpm --rebuilddb and for rpm -q rpm after rebuild: https://gist.github.com/sirredbeard/c5c0e9aefdd10d08a1e33014ddc930d3
My money's on BerkeleyDB's mmap use. Could in principle be worked around with local-distro patches to libdb.
Do the straces above resemble other mmap failures we have seen in BerkeleyDB? This workaround seems promising: microsoft/WSL#1812 (comment)
No, not in the sense that I can point to a hard fail like the issues you cited. [Bear in mind, that database could be borked in subtle ways before you even did the rebuild.] Yes, that work-around is promising, or at least a promising line of inquiry. I'm not convinced that whizzter's point 2 from the incipient "mmap's problem" issue was addressed, but then I've never done a test case to prove it isn't, either. Doing that 1MB padding is basically the same as what I was talking about with openldap's code here. That particular code is a discount BDB-alike backend (not the real Berkeley/Oracle thing), but same idea. This has been the standard work-around for not being able to extend an mmap'ed file.
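For concreteness, the 1MB padding amounts to recreating the BDB environment file at a fixed size up front, so the mmap behind it never needs to extend. A minimal sketch, using the dd invocation that appears later in this thread (only __db.001 shown; whether the other __db.### files need the same treatment is an assumption):

dd if=/dev/zero of=/var/lib/rpm/__db.001 bs=1M count=1   # recreate the env cache file at a fixed 1MB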
Thank you for your help. We are going to look at the work-around. It could also be a general rpm database issue like you mention, but otherwise basic rpm, yum, and dnf work fine. The segfault in rpm -q rpm occurs on a write, so I also wonder if we're dealing with a permissions issue here.
Wouldn't be my first guess, but who knows. From the strace, you're probably hitting the cause of the "another error" mentioned (whatever it is). Also be real cautious, because errors in WSL that are related to open handles look like permission problems but really have nothing to do with permissions. Unless they have to do with permissions, mind you, in which case disregard everything I just said.
Has anyone tested whether rebuilding libdb in WLE would fix it? My understanding is that libdb does a feature test for mmap extension support and falls back automatically when it isn't available.
Before we get to rebuilding libdb, which we can do if we have to, let's look at less invasive means, like padding the rpm db. This is what I have tried:
returns:
Here is strace on the above rpm -qa after rebuild and then padding. I also tried padding before the rpm rebuild:
Returns:
Here is strace on that rpm --rebuilddb with padding first.
Which looks a whole lot like this, don't it:
Per your strace: you could try the same work-around. Or, if you were feeling highly motivated, attack it head on and try rebuilding libdb.
Rebuilding libdb on WSL per @Conan-Kudo's suggestion, though I took his word that there is an auto fallback for mmap:
results in:
We could also, per @therealkenc's suggestion, rebuild libdb with a patch that forces HAVE_MMAP_EXTEND to be undefined. Also, @crramirez suggested the following:
I can confirm that @crramirez's approach works to allow yum and rpm to work again after rpm --rebuilddb. It complains a little bit at the end. It does not remove the checksum message.
If it helps, we had exactly this problem on RHEL 7.3, but upgrading to 7.5 magically solved our problem:
The version of RPM you are using is not that different, so I wonder if there is something else going on, but I'm not aware of much.
@daviesalex Huh. Interesting. Currently I can reproduce the bug on RHEL 7.6 and SL 7.6.
I tried placing a DB_CONFIG in /var/lib/rpm containing:
Still no change in behavior.
@sirredbeard rpmdb config is controlled via rpm macros, not via DB_CONFIG. If you want to override the setting, write out a macros file (e.g. /etc/rpm/macros) containing:

# Override the bdb dbi config for WSL
# Cf. https://github.com/WhitewaterFoundry/WLE/issues/20
%_dbi_config nommap %{?__dbi_other}

That should do the trick.
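If you want to sanity-check that the override is being picked up, rpm can expand the macro for you (a hypothetical verification step, not from the thread):

rpm -E '%_dbi_config'   # should print: nommap <whatever __dbi_other expands to>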
So we have a workaround that can fix broken rpm databases. That is very good. I still don't know if we can say for certain whether this is the mmap syscall issue, because the usual indicators are not there in the straces. If I am wrong, please correct me on any of this.

@daviesalex was able to avoid this issue in a docker pull from 7.5 and he is sending me that to look over and compare tomorrow. That touches on the possibility that this is still an issue with our build. I reviewed our kickstart files and build scripts and we don't touch /var/lib/rpm or anything related. However, some of the error messages I encountered when debugging were similar to errors reported in some Docker builds, so it is still a possibility. BTW, the solution in most of those cases was to run rpm --rebuilddb.

The issue still seems to be centered around BerkeleyDB, though. There have been suggestions to patch libdb. For several reasons that is a worst-case scenario; I would much rather just make a simple change to the build image if possible.

One possible route: it turns out we can tweak BerkeleyDB quite extensively with config files. This may be a better option than rebuilding libdb. Maybe. How DB_CONFIG works: https://docs.oracle.com/cd/E17076_05/html/programmer_reference/env_db_config.html
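To make that concrete, a DB_CONFIG dropped into /var/lib/rpm is just a list of environment directives, one per line. A hypothetical example (not the exact file we tried, which is lost above; the small cache size follows the hint Oracle gives later in this thread):

# /var/lib/rpm/DB_CONFIG -- illustrative only
set_cachesize 0 1048576 1   # cap the env cache at 1MB in one region, so it never has to grow its mmap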
I think I might be onto something... Also @Conan-Kudo, I am pretty sure rpm does read DB_CONFIG, because a typo in a DB_CONFIG will throw a message even when yum is called for a yum install.
@sirredbeard Well, you can also set it via the rpmdb flags as I described in #20 (comment), which also automatically applies to chroots created by tools like mock.
@Conan-Kudo I see. We'll test both approaches if this bears fruit.
@therealkenc |
Sure way to find out would be to copy that database after the --rebuilddb and try it on a real kernel. This all said, I am coming around to the idea it isn't the mmap issue.
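That cross-check might look like this (a sketch; host name and paths are placeholders):

tar -C /var/lib -czf rpmdb-after-rebuild.tgz rpm      # inside WSL, right after --rebuilddb
scp rpmdb-after-rebuild.tgz realbox:                  # ship it to a machine with a real kernel
# then on the real kernel:
tar -C /var/lib -xzf rpmdb-after-rebuild.tgz
rpm -qa > /dev/null && echo "db reads fine here"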
@daviesalex sent over some documentation on how they built a working RHEL 7.5 image without this issue from a docker pull. We will look at that. We will also look at the patches openSUSE has made, though advice from Red Hat is that we don't want to go there. In the meantime, the following DB_CONFIG in /var/lib/rpm seems to address this issue some of the time. More testing is needed:
The above still results in:
Following up on @therealkenc's idea here, I have generated a CentOS-based build of WLinux Enterprise to compare with a known good CentOS server install. You can download an .appx of my build here: https://1drv.ms/u/s!AspPK83V8Sf2hvEoloVoF9rwgpcq5A. /var/lib/rpm in this build is set to:
Here are zips of /var/lib/rpm, default and after rpm --rebuilddb: I overwrote /var/lib/rpm on the known good CentOS server with both sets of /var/lib/rpm from above and both worked fine, no errors.
Well, I did say that would be the sure way to find out. If you've got a database that works fine on a real kernel, then…
Not sure I follow, @therealkenc. This issue is equally reproducible across CentOS, Scientific Linux, Oracle Linux, and current RHEL 7.6. They all have identical versions of packages; RHEL's may be slightly newer. I used a CentOS WLE for this test so it would match my known good CentOS VPS. Only openSUSE has figured this out. Red Hat folks say they did it by patching extensively. On my list of plans is to diff the code from the two distros for rpm and libdb and see if I can find anything. I have also reached out to some known Berkeley DB experts and the folks at Oracle. If we get an answer that it's a syscall issue, then we can talk to Craig and Ben.
Here are some procmon logs taken at various points, described in the file names.
openSUSE's rpm is built differently. For one, they have a vendored copy of Berkeley DB 4.8, with a few patches for it to be built inside the RPM tree. The other changes they make to the rpm package related to the rpmdb are the following:

The major difference is that openSUSE rpm uses global locking with an rpm-level lock. Perhaps the nofsync might be meaningful? I hope that's not the case, but if it is, that's going to suck, because it reduces the reliability of rpmdb considerably...
Sorry, we crossed wires. Works on your CentOS VPS. Doesn't work on your CentOS WSL. Your quote "both worked fine, no errors" threw me. Your "known good" scenario being a VPS was lost in translation. [This was compounded by a comment (not yours) that implied there exist working rpm-based distros on WSL.]
@Conan-Kudo - Did they (SUSE) go to the trouble of statically linking a vendored copy of Berkeley DB 4.8 etc etc specifically to make WSL happy, or for other reasons?
@therealkenc Nah, they've been doing that for many, many years. The last time that stuff was touched was in 2011, I believe. Predates WSL by quite a bit. :)
That is precisely what Red Hat said too. It just so happens to avoid this issue as well. Even if we don't fix this completely, rpm still works for 99% of use. If the rpmdb becomes corrupted, users will be limited to simply resetting WLE, with the option to rebuild and use @crramirez's workaround to restore enough functionality to move data off if required. Does anything jump out at you in the procmon logs? I am adding .etl files for each of the 4 captured events too. Would you mind taking a look? I appreciate it. etls_rpmbuild_withbdconfig.zip
Give me a couple of days; maybe the weekend. I have a soft spot for the particular problem. I didn't take a run at this previously because I don't actually use any rpm-based distros, and we're probably going to end up at WSL dupe#something for the effort. Anyway, right now I am ass deep in mouse problems on a CentOS VM in prep for tracking the diverge. It's a start.
Thank you so much, I really appreciate it.
Bollocks. There's a lot of misinformation and misunderstanding going on here. DB_PRIVATE has nothing to do with static linking. The problem with it is that it effectively disables all BDB-level locking on concurrent access, because the locking data lives in the environment (those __db.* files), which is shared by all clients accessing the database, but DB_PRIVATE makes it use an in-memory environment. Which is fine, IFF you take care of the locking by other means.

Which Suse does in their rpm: their patches effectively replace the finer-grained locking of BDB with an rpm-level fcntl() lock that permits multiple readers but only a single writer to the database. Which is still absolutely fine, but in order to preserve the ability to perform rpm queries from within rpm scriptlets, they now need to "suspend" the writer lock during scriptlets. Which is where I think it gets icky.

Like I said in an earlier comment, AIUI the only "magic" with Suse's rpm working inside WSL is that DB_PRIVATE mode, because it avoids the much trickier to implement (for a kernel, thus WSL here) shared memory map mode.
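For anyone unfamiliar with that model, here is a minimal C sketch of a multiple-reader/single-writer fcntl() lock of the kind described (this is not Suse's actual patch; the lock file path is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Advisory whole-file lock: F_RDLCK admits many concurrent readers,
   F_WRLCK admits exactly one writer and excludes all readers. */
static int db_lock(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;                  /* F_RDLCK, F_WRLCK, or F_UNLCK */
    fl.l_whence = SEEK_SET;            /* offset 0, length 0 = whole file */
    return fcntl(fd, F_SETLKW, &fl);   /* block until the lock is granted */
}

int main(void)
{
    int fd = open("/var/lib/rpm/.rpm.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (db_lock(fd, F_WRLCK) == 0) {   /* single writer enters here */
        /* ... database mutation would happen here ... */
        db_lock(fd, F_UNLCK);          /* release for the next client */
    }
    close(fd);
    return 0;
}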
It's probably worth updating this issue with our findings (some of which we have shared privately with @sirredbeard).

We had this issue with our RHEL 7.3 image and effectively gave up on WSL, as I fear many have. We also spoke to RH and got the same support response: SUSE have hacked up their yum, RH are not willing to do so, and it was impossible that RHEL would ever work on WSL. They pointed us to https://bugzilla.redhat.com/show_bug.cgi?id=1668380 which ends with "Yeah, it "works" because they carry a patch to the shared environment of Berkeley DB (essentially disabling BDB level locking on concurrent access, the same as we do for unprivileged users) and then a bunch of other patches to try and deal with the consequences.". RH have stated that they have no intention of making changes to RHEL to make this work, and no engineering resources to work on it, which is a shame but good to have clarity on.

Last week I tried to re-deploy WSL using our RHEL 7.5 image (and a clean non-corp install of Windows), and the problem is not happening. We then realized that it works perfectly well on corporate Windows installs too - somehow our RHEL 7.5 install is now working reliably with yum. We now have tens of users for WSL and this problem has not happened to any of them in nearly a week. The question is why, when others are having this problem on RHEL/SL 7.6.

One thing we changed between the 7.3 and 7.5 images was adding the yum ovl plugin. We did this not because of WSL; we were having some bdb issues after running yum in all 7.5 docker images (on Linux) when the docker storage driver was overlay2, the current default. Either running rpm --rebuilddb after any yum operations or using the ovl plugin made those bdb issues "go away". On WSL, removing yum-plugins-ovl does not break it, so this could be a red herring, but I thought I'd mention it.

We have a lightly modified version of an older version of this script to generate our RHEL7 base images: https://github.com/moby/moby/blob/master/contrib/mkimage-yum.sh

Somehow, though, something about what we have done is working, and some sort of bisect should be able to figure out what it is. I'm about to travel for a few weeks so won't have any time to look at this, but hopefully these details help others.
@daviesalex I forwarded what you sent over to @nunix to see if he could duplicate your success. He does a lot of work making WSL images with Docker. He is still hitting the segfault issues. That's not to say this isn't a valuable line of inquiry that could yield some results, but we are going to need to work on this more. Here is what he reported back to me via Twitter DM:
I then asked him:
He replied:
Note: dd'ing __db.001 works, as documented here, but still creates rpmdb checksum issues after a yum install.
At the suggestion of an Oracle dev on Berkeley DB, I have opened a thread on the Oracle forums here that summarizes this issue and links back here. The post is still pending moderator approval.
Our RH TAM and I did a little more digging here. We came to this issue originally because we simply could not get yum to work (yum install, etc.) on a new install - at all. This made WSL totally unusable for users (for RHEL). I have now realized that we have not in fact fixed the "rpm --rebuilddb" issue, BUT the way we built our image means that yum just works and no users ever really need to run that. What is odd is that things like rpm --initdb exit 0 (as does --rebuilddb) BUT do not in fact work; shell output:

Notice that the final rpm returns nothing - the DB is at this point broken. I did not find the workaround effective. It's unclear to me why we had this problem even before running rebuilddb on 7.3 and don't have it now, but most likely that is a legitimate "how we built the image / how the RPM DB ended up at the end of it" type issue. For the purposes of this/the berkeleydb issue, our experience is noise.

However, for the purpose of "I want to make this work", the way we did this is probably effective for others - clone a working image with a working RPM DB, and to all intents and purposes it works to the point that your users just won't notice (how often do you run an RPM rebuild in normal operation on a desktop?). We might even do something horrific like alias rpm --rebuilddb to print "Don't do this" in our image (sketched below), to really ensure it, because as noted above the workaround to dd the files does not actually seem to work for us, so if you run it once, you are back to starting with a clean image ;-)

Thanks for writing up the berkeleydb issue - I'll keep an eye on that too. We would obviously like this fixed properly long term too!
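For what it's worth, that guard is easy enough as a shell function (a sketch; an alias can't inspect arguments, so a function wrapping the real binary does it):

rpm() {
    case " $* " in
        *" --rebuilddb "*) echo "Don't do this on WSL" >&2; return 1 ;;
        *) command rpm "$@" ;;
    esac
}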
In regard to the work-around, I have found that it will not work if you have a yum/rpm command running in that WSL instance. So before doing the work-around you really should be sure to kill all of these off, or maybe a safer way would be to go to the Windows Task Manager and kill off all the init processes. Then start a new WSL instance and do the dd command work-around (see the sketch below).
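A sketch of that pre-flight check, assuming nothing fancier than pgrep is available:

pgrep -a -f 'yum|rpm'                                    # should print nothing before you proceed
dd if=/dev/zero of=/var/lib/rpm/__db.001 bs=1M count=1   # then apply the work-around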
Just been trying something and it seems to work, but please test it on your end:
rpm -q rpm # works
rpm --rebuilddb # works
rpm -q rpm # fails
dd if=/dev/zero of=/var/lib/rpm/__db.001 bs=1M count=1 # works
rpm -q rpm # works
yum install wget # fails <-- actually it installs but gets an error message still, so I set it as fail
rpm --setperms wget # fails
OK, until here everything seems broken; however, I continued as follows:
So re-installing the package "locally" seems to at least move things forward a bit.
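If it helps anyone retrace that, my guess at the shape of the "re-install locally" step (hypothetical; the actual commands were lost above) is something like:

yumdownloader wget                   # from yum-utils: fetch the rpm without installing it
rpm -Uvh --replacepkgs wget-*.rpm    # re-install straight from the local rpm, bypassing the yum transaction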
Here is an additional workaround that has been submitted: put this in DB_CONFIG
then try:
We have been working with partners on the Berkeley DB team at Oracle on this issue, sadly not the rpm team at Red Hat. The Oracle team thinks they have nailed down the issue to mmap handling in WSL after all:

"[WSL] has a bug in mmap when the underlying file is extended, the extended part of the mapping is actually mapped back to the beginning of the file. This is why BDB would crash when it extended the size of the file that backed the in-memory cache (one of the __db.### files), and why setting the cache size to a small value works as a work around" - Dr. Lauren Foutz, Oracle

See attached C code which replicates the issue: mmap_extend.c.txt

I have e-mailed Craig and Ben and asked if they would like us to open another bug on WSL or to go under one of the existing mmap-related issues. In the meantime we are working on a wrapper for rpm that will back up and restore good rpmdb components around commands that are known to break the rpmdb.
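For readers who skip the attachment, a minimal sketch of the pattern Dr. Foutz describes (this is not the attached mmap_extend.c, just an illustration of extend-under-mapping):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t page = 4096;
    char first;
    int fd = open("/tmp/extend_test", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* File starts one page long; map two pages so the mapping already
       covers the region the file will grow into. */
    ftruncate(fd, page);
    char *map = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    memset(map, 'A', page);            /* known marker in the first page */

    /* Extend the file under the live mapping, then write into the
       newly valid second page of the mapping. */
    ftruncate(fd, 2 * page);
    memset(map + page, 'B', page);
    msync(map, 2 * page, MS_SYNC);

    /* On the reported WSL bug, the extended part of the mapping aliases
       offset 0, so the 'B' writes clobber the start of the file. */
    pread(fd, &first, 1, 0);
    puts(first == 'A' ? "OK: file start intact"
                      : "BUG: extended writes landed at file start");

    munmap(map, 2 * page);
    close(fd);
    return 0;
}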
That linked repro is more-or-less a variation on this post from WSL#658 ("mmap's problem"), August 2016. Which is pretty much what I expected, and why I've been low energy, only to end up showing the same thing. Thanks for pursuing this so diligently.
Opened an issue in WSL main per direction of the Microsoft WSL team.
Per e-mail from the Microsoft WSL team, a fix is in the works for the underlying issue.
Agreed, thanks to everybody involved in getting this sorted out. The fix has been checked into Windows.
Update: This is a confirmed bug in WSL, not Berkeley DB.
Reproducible on RHEL, CentOS, Oracle Linux, and Scientific Linux.
To reproduce:
Install WLinux Enterprise. A signed build you can sideload on Windows 10 for testing this bug is available here; see here for the custom DB_CONFIG set on that image.
Set root password and create new default user.
su - into root.
Example 1: Type the following commands:
Expected result: rpm --rebuilddb would rebuild a working rpmdb.
Actual result:
Running rpm --rebuilddb breaks rpm and yum. Modifications to DB_CONFIG improve things somewhat, and there is a partial fix after the fact.
Theories:
What works:
Logs:
Thanks:
Thank you so far to @therealkenc @crramirez @Conan-Kudo @daviesalex and @pmatilai for working on this issue.