
Dive into fixing rpm-md #1127

Open · cgwalters opened this issue Dec 2, 2017 · 26 comments

@cgwalters
Member

cgwalters commented Dec 2, 2017

This is, and has long been, a big problem:

$ du -shc /var/cache/rpm-ostree/repomd/fedora/repodata/48986ce4583cd09825c6d437150314446f0f49fa1a1bd62dcfa1085295030fe9-primary.xml.gz
15M	/var/cache/rpm-ostree/repomd/fedora/repodata/48986ce4583cd09825c6d437150314446f0f49fa1a1bd62dcfa1085295030fe9-primary.xml.gz
15M	total
$ zcat /var/cache/rpm-ostree/repomd/fedora/repodata/48986ce4583cd09825c6d437150314446f0f49fa1a1bd62dcfa1085295030fe9-primary.xml.gz > /tmp/primary.xml
$ du -shc /tmp/primary.xml 
130M	/tmp/primary.xml
130M	total
$ 

It's insane. And the problem is that with rojig ♲📦, even for Atomic Host/Silverblue users who don't use package layering, we'll go back to "download 30+ MB of repodata, uncompress to 100+ MB" just to check whether there are any updates at all.

The plus side is that working on this will also benefit containers and traditional package-based systems, if we can do it right.

One idea I had is to "presolve" - a lot of this data is completely redundant dependencies. Take this chunk from the very first package I looked at, 0ad:

    <rpm:requires>
      <rpm:entry name="libstdc++.so.6()(64bit)"/>
      <rpm:entry name="libstdc++.so.6(CXXABI_1.3)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(CXXABI_1.3.5)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(CXXABI_1.3.8)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(CXXABI_1.3.9)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.11)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.14)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.15)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.18)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.19)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.20)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.21)(64bit)"/>
      <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.9)(64bit)"/>
   ...

But those are all provides of the libstdc++ package - and I don't think we're ever going to have different symbol versions provided by separate packages.

So doing a pass where we just drop redundant requires would probably make a notable difference.
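
Roughly, such a pruning pass might look like the sketch below (a minimal illustration; the `packages` structure is a hypothetical parse of primary.xml, not an existing rpm-ostree or libdnf API):

    # Hypothetical sketch of the "drop redundant requires" pass, assuming
    # `packages` is a list of dicts parsed from primary.xml, each with
    # "name", "requires", and "provides" lists of capability strings.
    from collections import defaultdict

    def prune_redundant_requires(packages):
        providers = defaultdict(set)  # capability -> names of packages providing it
        for pkg in packages:
            for cap in pkg["provides"]:
                providers[cap].add(pkg["name"])

        for pkg in packages:
            kept, pinned = [], set()
            for cap in pkg["requires"]:
                who = providers.get(cap)
                if who and len(who) == 1:
                    (sole,) = who
                    # The long tail of libstdc++ symbol-version requires all
                    # resolve to the same sole provider; one entry is enough.
                    if sole in pinned:
                        continue
                    pinned.add(sole)
                kept.append(cap)
            pkg["requires"] = kept
        return packages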

Taking that idea slightly farther, we could "presolve": 0ad has a requires on /bin/sh, but I bet that's already going to be a dependency of one of its library dependencies.

A user-visible result of this would be that yum info would no longer show all of the dependencies, but...eh.

Of course another thing to do is simply to split up the repos. For Atomic Host in particular we can omit all of the desktop apps, all of the -devel packages etc.

@dustymabe
Member

I still don't see why we don't use something like git (or even ostree) to download just the changes to the metadata (i.e. the diff) rather than downloading the entire thing every time.

@cgwalters
Member Author

Changing the format is a larger topic; I wouldn't say it's out of scope here, but it feels like we need to do both? Part of the inefficiency here is feeding 100MB of data to libsolv, complete with redundant dependencies.

AIUI @james-antill did some analysis versus Debian and he concluded that the "file dependencies" were a major part of the wire size. And yes holy cow, I just looked at a filelists.xml. I think my vote there would be to only do file entries for "entrypoints" like /usr/bin - there's really no sane scenario where an RPM package should Require: /usr/share/doc/GeographicLib-doc/html/C/annotated.html or whatever.
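
As a strawman, an "entrypoints only" filter could be as simple as the sketch below; the directory whitelist is illustrative, not an agreed policy:

    # Hypothetical filter for which file entries get published in the metadata:
    # only paths that packages realistically Require by filename are kept.
    ENTRYPOINT_DIRS = ("/usr/bin/", "/usr/sbin/", "/bin/", "/sbin/", "/etc/")

    def keep_file_entry(path: str) -> bool:
        """Return True if this file path should appear in primary/filelists."""
        return path.startswith(ENTRYPOINT_DIRS)

    # keep_file_entry("/usr/bin/lsb_release")  -> True
    # keep_file_entry("/usr/share/doc/GeographicLib-doc/html/C/annotated.html")  -> False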

Briefly though, AIUI Rust's cargo actually does use git for metadata in the "crate index". I can't find any documentation on the formats in a quick search though.

There's also a project from the dnf team on this somewhere, I forget the name.

Using libostree for this isn't crazy; we could do something like split each XML entry into its own file. But it's a fairly nontrivial new format to impose on the rpm side, and it would also be quite ironic to do after jigdo...

@Conan-Kudo

@cgwalters You're referring to the deltarepo project? https://github.com/rh-lab-q/deltametadata-prototype

There are some realistic cases where people do file requires on more than /usr/bin, but my vote would be that rpm-md should be extended to cover /usr/libexec in addition to /usr/bin, /usr/lib(64), and /etc. Maybe also /usr/include, which would more or less mean that only /usr/share wouldn't be included.

Debian does have a file list metadata now (that's the Contents file), but it's downloaded on-demand rather than always like we do.

The main trouble, as I understand it, is that it's not easy to "reset" the sack for on-demand fetching of filelists like Yum did. Once you're over that hurdle, it's really simple, as we can control what files are fetched for the solver cache in libdnf.

The other thing is that people do want a way to access all of the rpm-md repodata, including the changelogs, but we don't yet offer that capability. It's one of the pain points mentioned in the devel mailing list a while back when @ignatenkobrain solicited feedback about dropping Yum.

@ignatenkobrain

@cgwalters
Member Author

I quickly glanced and of course cargo prints out the relevant URL; browsing https://github.com/rust-lang/crates.io-index makes it fairly obvious how things work there.
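
For the curious, the index is a plain git repository of newline-delimited JSON: one file per crate, one JSON object per published version. A minimal reading sketch, assuming a local clone (crate names shorter than four characters use a different path scheme):

    # Minimal sketch of reading an entry from a clone of
    # https://github.com/rust-lang/crates.io-index.
    import json
    from pathlib import Path

    def latest_version(index_root: str, crate: str) -> str:
        # Path convention for names of length >= 4: e.g. "serde" -> se/rd/serde.
        path = Path(index_root) / crate[:2] / crate[2:4] / crate
        entries = [json.loads(line) for line in path.read_text().splitlines() if line]
        active = [e for e in entries if not e.get("yanked")]
        return active[-1]["vers"]  # lines are appended in publish order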

@cgwalters
Member Author

One thing we can do in jigdo mode is only download the primary metadata. Or at least avoid the filelists. There's also the (currently desktop specific I believe) appstream data we could skip.

@cgwalters
Member Author

However, doing this quickly blocks on #1114 ...

@cgwalters
Member Author

Let me briefly talk more about ostree vs git here. Where ostree is most valuable is when one uses its hardlink checkouts, which are really most useful for OS/container filesystem trees.

One can use libostree to store arbitrary data and read it via the API, but...it's just not what it's primarily designed to do. For example, libostree is also explicitly not intended to be a backup system.

Now obviously, as I mentioned above, Rust's cargo is using git to store metadata. I think that's generally a great idea; however, there are some details, such as the fact that OSTree uses SHA-256 while git is still on SHA-1. That's obviously going to be addressed in git...someday.

The other major thing about git is that when using it as we'd want to do in libdnf, I think we'd really want a shared library. And libgit2 is great but it's also not ABI stable, which is somewhat problematic in the context of a base operating system updater. libostree is fully API/ABI stable.

But my overall feeling here is that for metadata, git is probably a better choice.

@Conan-Kudo

Conan-Kudo commented Jan 11, 2018

Is there a particular reason you'd want to have repodata checkouts via git? Wouldn't something like casync or zsync help in alleviating the "download too much every time" thing?

Generally speaking, git checkouts of metadata don't make as much sense, because the point-in-time part isn't very helpful when it isn't part of the rest of the tree.

@cgwalters
Member Author

I just explicitly said above that I didn't see a use case for repodata hardlink checkouts, right? Which was an argument against libostree.

As far as improving network efficiency; yes, those would help, but so would ostree and git.

I think a bifurcation here is - are we defining a new on-disk format? I would argue to do so because...XML is not great. The way cargo uses a lot of split-up JSON seems pretty sane to me.

Anyway, this isn't quite the right place to debate this; the rpm-ecosystem list is.

@james-antill
Contributor

This started when I was on holiday, but I'm not sure how much it's worth me stating the obvious things ... mainly: everything is too big, something needs to change. Seven years ago we discussed this and Seth wrote it down: http://yum.baseurl.org/wiki/dev/NewRepoDataIdeas ... that doesn't take weak deps into account, but it's the same general idea: the only way to win is to ship less data.

For older data there is:

...note that since those were written Fedora has gone from ~24k packages to ~55k, with the obvious increases in metadata size.

@dustymabe
Member

Only way to win is to ship less data.

Ehh. I think only downloading data the client doesn't have is a win too. How much metadata really changes in two weeks?

@Conan-Kudo

Not much changes, really, and it's even less these days with the batched thing.

@james-antill
Contributor

only downloading data the client doesn't have is a win too

Yeh, that could result in shipping less data to the client.
I would be cautious about assuming deltas will solve the problem more easily though, as historically drpms took a long time to get into production and are far from perfect even now (and without being clever, deltas don't help some common use cases -- like running DNF inside a docker container).

how much metadata really changes in two weeks

This isn't easy to answer. Bodhi now has the graph things in https://bodhi.fedoraproject.org/releases/F27 which show updates over time, but deciding how much the metadata would change is much harder.
In theory, putting the XML in a git repo for a few weeks shows you something, but I'm not even sure the data is sorted correctly anymore ... so that might be misleading.

@james-antill
Contributor

In theory, putting the XML in a git repo for a few weeks shows you something

So the data does seem to be usable for this, and it's easy-ish to get the daily F27 composes from Dec. 10th. In that time the primary.xml.gz has grown from 1.8M to 2.5M.

Dec. 10th data:

 primary.xml | 360786 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 360786 insertions(+)

primary.xml.gz = 1.8M

Dec. 10th - Dec. 24th

 primary.xml | 329836 +++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 203456 insertions(+), 126380 deletions(-)

primary.xml.gz = 2.2M
diff.xml.gz = 2.2M

Dec. 24th - Jan 12th

 primary.xml | 233681 +++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 145512 insertions(+), 88169 deletions(-)

primary.xml.gz = 2.5M
diff.xml.gz = 1.6M

Dec. 10th to Jan 12th

 primary.xml | 416497 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 275458 insertions(+), 141039 deletions(-)

diff.xml.gz = 2.8M

@cgwalters
Member Author

@james-antill
Contributor

@cjao

cjao commented Aug 20, 2018

One idea I had is to "presolve" - a lot of this data is completely redundant dependencies...But those are all provides of the libstdc++ package

This is essentially what Debian does. Each packaged library (like libc) includes a "symbols" file that tracks which version of the package provides which symbol, and the (many-to-one) mapping from library symbols to library package names/versions is performed when the package is built. Thus, instead of declaring dependencies on many GLIBC_XYZ ABIs, a package would simply list in its metadata the appropriate minimum version of libc6.
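
To illustrate the shape of that mapping, here's a minimal sketch; the symbol table is made up for the example, not real Debian data:

    # Hypothetical sketch of the symbols-file approach: collapse many versioned
    # symbol requires into a single "libc6 >= X" style dependency at build time.
    SYMBOLS = {  # symbol -> first libc6 version providing it (illustrative values)
        "printf@GLIBC_2.2.5": "2.2.5",
        "memcpy@GLIBC_2.14": "2.14",
        "getrandom@GLIBC_2.25": "2.25",
    }

    def min_required_version(used_symbols, table=SYMBOLS):
        # A real implementation would use proper dpkg/rpm version comparison;
        # plain max() over strings is only good enough for this illustration.
        return max(table[sym] for sym in used_symbols)

    # A binary using memcpy@GLIBC_2.14 and getrandom@GLIBC_2.25 ends up with a
    # single dependency of roughly: libc6 (>= 2.25)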

@Conan-Kudo

Debian is only able to do that because of strict enforcement of package names, though. In the RPM ecosystem, we use common auto-generated Provides for matching instead. This adds a bit more data, but allows packages to be built in a portable manner (and yes, such packages do exist).

@james-antill
Contributor

This adds a bit more data

That is an interesting definition of "a bit". Actual data from a blank docker container (metadata needed to install "tree"):

Debian Stretch: ~7.8 MB
Fedora 28: ~84 MB

but allows packages to be built in a portable manner

Cool story.

@Conan-Kudo

@james-antill You don't have to be a jerk about this. It may surprise you, but there are plenty of people who do rely on that capability.

@mattdm

mattdm commented Aug 22, 2018

Yes please, let's keep the snark to friendly levels.

I think the point is: even if people rely on it, the cost is real and quite high. I'd really suggest working towards alternate solutions for the plugin cases you've highlighted. This will actually come more to the surface when we get yum-style "lazy download" for filesystem metadata in DNF, since the cost will be paid when any package with non-primary filename deps comes up, rather than by everyone up front.

@james-antill
Contributor

@Conan-Kudo This may sound nasty, but I've tried to stay with non-emotional factual language.

You are arguing that this is a significant feature that people are relying on, but:

  • I don't know of anyone within any rpm distribution who thinks it's an achievable goal to have an rpm compile on distro X and run on distro Y. Even having portable specfiles is controversial.

  • All third parties that I can think of ship different rpms for different distros. (Eg. postgresql.org)

  • Most of the major distributions have glaring differences that would make this near impossible (Eg. Fedora doesn't have /bin but OpenSuSE does).

  • Fedora and EL are very close distributions, even having the same developers working on the same core parts, but EPEL is explicitly a different repo, and a non-trivial number of Fedora maintainers turn off distributing packages for it.

A lot of the above is "IMO" and I don't know everything, so maybe somebody does produce rpms that install in a self-contained space (like /opt) and thus have minimal, well-defined dependencies AND use rpm requires/provides in a way that breaks @cgwalters' idea of presolving.
But even assuming that's true, I would bet a significant amount of money that the people who have the power to decide such things would take any feature that approximates "Fedora metadata is now 8MB but we broke portable rpms", and I don't think it would be a close vote.

@Conan-Kudo

Conan-Kudo commented Aug 22, 2018

I don't know of anyone within any rpm distribution who thinks it's an achievable goal to have an rpm compile on distro X and run on distro Y. Even having portable specfiles is controversial.

Weirdly enough, portable specs are only controversial in Fedora, which is ironic because they are somewhat common there. In openSUSE, portable specs do exist and are even more common because unlike Fedora, openSUSE has had cross-distro building capability for much, much, much longer.

All third parties that I can think of ship different rpms for different distros. (Eg. postgresql.org)

I'm disappointed in how weak your searching is. It's pretty easy to find examples.

Some "quick" examples: Chrome, Skype, Slack, Hipchat, Opera, VMware Workstation, Visual Studio Code, Atom, any app package made by Electron Builder, any app package produced by CPack, etc. There are even more things that I know of that build packages in a similar manner. PostgreSQL is actually an exception, and that's mostly because they still deal with the older distros where the differences are still more prominent. I imagine that server software is more likely to have more than one RPM variant while desktop software rarely does.

Most of the major distributions have glaring differences that would make this near impossible (Eg. Fedora doesn't have /bin but OpenSuSE does).

That is patently false. Even your example is wrong. I'm actively involved in Fedora, Mageia, and openSUSE, as well as several other RPM based distributions. It's definitely possible to make portable packages, and it's much easier now than it ever was before.

Fedora and EL are very close distributions, even having the same developers working on the same core parts, but EPEL is explicitly a different repo, and a non-trivial number of Fedora maintainers turn off distributing packages for it.

The reason why people don't build for EPEL is because our build system actually makes it more painful than it should be. I fully acknowledge that we will never fix this for reasons that are not worth discussing here.

A lot of the above is "IMO" and I don't know everything, so maybe somebody does produce rpms that install in a self-contained space (like /opt) and thus have minimal, well-defined dependencies AND use rpm requires/provides in a way that breaks @cgwalters' idea of presolving.
But even assuming that's true, I would bet a significant amount of money that the people who have the power to decide such things would take any feature that approximates "Fedora metadata is now 8MB but we broke portable rpms", and I don't think it would be a close vote.

I am willing to bet that you don't come out of your cave very often, because you seem to have ideas based on ~5-7 year old data.

Do you know why the Debian model is successful for Debian/Ubuntu? It's because there's virtually no variance in how Debian packages are constructed. Their policies are very strict on how packages are built and structured, which means that third-parties can rely on it. They can even predict the names of packages to some extent because of it. And Colin knows this very well, since he was a Debian Developer and created a package build mechanism for Debian. The downside of this model is that it's literally impossible to structure packages differently without breaking expectations.

The RPM model promotes the ability to support a variety of distro structures, the Debian model does not.

@cgwalters
Member Author

I understand your concerns but I think they'd be better addressed by trying to standardize specific metadata keys via Provides. (Intersecting this whole thread of course is the movement of applications to containers but let's ignore that for now)

It's worth gathering data on this. Looking at Chrome:

$ rpm -qp --requires ~/Downloads/google-chrome-stable_current_x86_64.rpm  |egrep '^/'
warning: /home/walters/Downloads/google-chrome-stable_current_x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 7fac5991: NOKEY
/usr/bin/lsb_release
/usr/sbin/update-alternatives
/usr/sbin/update-alternatives
/bin/sh
/bin/sh
/bin/sh
/bin/sh

So it totally complies with the "only file requires on /bin-type-paths" policy.

@cjao

cjao commented Jun 14, 2020

With Vagrant's default memory setting of 512 MB, dnf makecache consistently runs out of memory on a fresh Fedora 32 cloud image. Memory usage balloons so much during processing of the downloaded metadata that the oom-killer steps in.
