Dive into fixing rpm-md #1127
Comments
I still don't see why we don't use something like git (or even ostree) to download just the changes to the metadata (i.e. the diff) rather than downloading the entire thing every time.
Changing the format is a larger topic; I wouldn't say it's out of scope here, but it feels like we need to do both? Part of the inefficiency here is feeding 100MB of data to libsolv, complete with redundant dependencies. AIUI @james-antill did some analysis versus Debian and he concluded that the "file dependencies" were a major part of the wire size. And yes, holy cow, I just looked at a ...

Briefly though, AIUI Rust's cargo actually does use git for metadata in the "crate index". I can't find any documentation on the formats in a quick search though. There's also a project from the dnf team on this somewhere, I forget the name.

Using libostree for this isn't crazy; we could do something like split each XML entry into its own file. But it's quite a nontrivial new format to impose on the rpm side, and it would also be quite ironic to do after jigdo...
@cgwalters You're referring to the deltarepo project? https://github.com/rh-lab-q/deltametadata-prototype

There are some realistic cases where people do file requires on more than ... Debian does have file list metadata now (that's the Contents file), but it's downloaded on-demand rather than always like we do. The main trouble, as I understand it, is that it's not easy to "reset" the sack for on-demand fetching of filelists like Yum did. Once you're over that hurdle, it's really simple, as we can control what files are fetched for the solver cache in libdnf.

The other thing is that people do want a way to access all of the rpm-md repodata, including the changelogs, but we don't yet offer that capability. It's one of the pain points mentioned on the devel mailing list a while back when @ignatenkobrain solicited feedback about dropping Yum.
I quickly glanced and of course ...
One thing we can do in jigdo mode is only download the primary metadata, or at least avoid the filelists. There's also the (currently desktop-specific, I believe) appstream data we could skip.
However, doing this quickly blocks on #1114 ...
Let me briefly talk more about ostree vs git here. Where ostree is most valuable is when one uses its hardlink checkouts, which are really most useful for OS/container filesystem trees. One can use libostree to store arbitrary data and read it via the API, but...it's just not what it's primarily designed to do. For example, libostree is also explicitly not intended to be a backup system.

Now, as I mentioned above, Rust's cargo is using git to store metadata. I think that's generally a great idea; however, there are some details, such as the fact that OSTree uses SHA-256 and not SHA-1. That's obviously going to be addressed in git...someday.

The other major thing about git is that when using it as we'd want to in libdnf, I think we'd really want a shared library. And libgit2 is great, but it's also not ABI stable, which is somewhat problematic in the context of a base operating system updater. libostree is fully API/ABI stable. But my overall feeling here is that for metadata, git is probably a better choice.
Is there a particular reason you'd want to have repodata checkouts via git? Wouldn't something like ... Generally speaking, git checkouts of metadata don't make as much sense, because the point-in-time part isn't very helpful when it isn't part of the rest of the tree.
I just explicitly said above that I didn't see a use case for repodata hardlink checkouts, right? Which was an argument against libostree. As far as improving network efficiency: yes, those would help, but so would ostree and git. I think a bifurcation here is: are we defining a new on-disk format? I would argue we should, because...XML is not great. The way cargo uses a lot of split-up JSON seems pretty sane to me. Anyway, this isn't quite the right place to debate this; the rpm-ecosystem list is.
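To make the "split-up JSON" idea concrete, here is a rough sketch in Python. The directory scheme, field names, and helper functions are invented for illustration and are not the actual cargo index format: each package gets its own small sharded file with one JSON record per line, so git-style tooling only has to transfer the entries that actually changed.

```python
import json
from pathlib import Path

def entry_path(root, name):
    """Shard entries into subdirectories by name prefix, so a fetch or a
    git-style diff only has to touch the files that actually changed."""
    prefix = name[:2] if len(name) >= 2 else name
    return Path(root) / prefix / name

def append_entry(root, name, version, requires):
    """Append one line-delimited JSON record describing a package version."""
    path = entry_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"name": name, "version": version, "requires": sorted(requires)}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

def load_entries(root, name):
    """Read back every recorded version of a package."""
    path = entry_path(root, name)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

if __name__ == "__main__":
    append_entry("repodata-split", "0ad", "0.0.22-1", ["libstdc++", "/bin/sh"])
    print(load_entries("repodata-split", "0ad"))
```

The real crates.io index uses a similar sharded, one-file-per-crate layout with line-delimited JSON, though its exact schema differs from this sketch.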
This started when I was on holiday, but I'm not sure how much it's worth me stating the obvious things ... mainly: everything is too big, something needs to change. Seven years ago we discussed this and Seth wrote it down: http://yum.baseurl.org/wiki/dev/NewRepoDataIdeas ... that doesn't take into account weak deps, but it's the same general idea: the only way to win is to ship less data. For older data there is: ... Note that since those were written, Fedora has gone from ~24k packages to ~55k, with the obvious increase in metadata size.
Ehh, I think only downloading data the client doesn't have is a win too. How much metadata really changes in two weeks?
Not much changes, really, and it's even less these days with the batched thing.
Yeah, that could result in shipping less data to the client.
This isn't easy to answer. Bodhi now has the graph things in https://bodhi.fedoraproject.org/releases/F27 which show updates over time, but deciding how much the metadata would change is much harder.
So the data does seem to be usable for this, and it's easy-ish to get the daily F27 composes from Dec. 10th. In that time the primary.xml.gz has grown from 1.8M to 2.5M.

Dec. 10th data:
primary.xml.gz = 1.8M

Dec. 10th - Dec. 24th:
primary.xml.gz = 2.2M

Dec. 24th - Jan 12th:
primary.xml.gz = 2.5M

Dec. 10th to Jan 12th:
diff.xml.gz = 2.8M
This is essentially what Debian does. Each packaged library (like libc) includes a "symbols" file that tracks which version of the package provides which symbol, and the (many-to-one) mapping from library symbols to library package names/versions is performed when the package is built. Thus, instead of declaring dependencies on many GLIBC_XYZ ABIs, a package would simply list in its metadata the appropriate minimum version of libc6 (for example, a binary using memcpy@GLIBC_2.14 just ends up depending on libc6 (>= 2.14)).
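A minimal sketch of that build-time collapse, with invented data and helper names (this is the idea, not dpkg-shlibdeps itself): given a symbols-to-version table for a library package, fold all the symbols a binary uses into one minimum-version dependency.

```python
# Hypothetical excerpt of a libc6 symbols table: versioned symbol -> the
# package version that first provided it (real symbols files carry the
# same kind of information, in a different textual format).
LIBC6_SYMBOLS = {
    "memcpy@GLIBC_2.2.5": "2.2.5",
    "memcpy@GLIBC_2.14": "2.14",
    "getrandom@GLIBC_2.25": "2.25",
}

def version_key(version):
    """Crude numeric ordering; real packaging tools use proper version
    comparison rules."""
    return tuple(int(part) for part in version.split("."))

def minimum_dependency(package, used_symbols, symbols_table):
    """Collapse per-symbol requirements into a single
    'package (>= version)' dependency."""
    needed = [symbols_table[s] for s in used_symbols if s in symbols_table]
    if not needed:
        return package
    return "{} (>= {})".format(package, max(needed, key=version_key))

if __name__ == "__main__":
    # A binary using these two symbols depends simply on libc6 (>= 2.14)
    # instead of carrying one requires entry per versioned symbol.
    print(minimum_dependency(
        "libc6", ["memcpy@GLIBC_2.14", "memcpy@GLIBC_2.2.5"], LIBC6_SYMBOLS))
```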
Debian is only able to do that because of strict enforcement of package names, though. In the RPM ecosystem, we use common auto-generated Provides for matching instead. This adds a bit more data, but allows packages to be built in a portable manner (and yes, such packages do exist).
That is an interesting definition of "a bit". Actual data from a blank docker container, metadata needed to install "tree":

Debian Stretch: ~7.8 MB
Cool story.
@james-antill You don't have to be a jerk about this. It may surprise you, but there are plenty of people who do rely on that capability.
Yes please, let's keep the snark to friendly levels. I think the point is: even if people rely on it, the cost is real and quite high. I'd really suggest working towards alternate solutions for the plugin cases you've highlighted. This will actually come more to the surface when we get yum-style "lazy download" for filesystem metadata in DNF, since the cost will be paid when any package with non-primary filename deps comes up, rather than by everyone up front.
@Conan-Kudo This may sound nasty, but I've tried to stay with non-emotional factual language. You are arguing that this is a significant feature that people are relying on, but:
A lot of the above is "IMO" and I don't know everything, so maybe somebody does produce rpms that: install in a self-contained space (like /opt) and thus have minimal, well-defined dependencies, AND use rpm requires/provides in a way that breaks @cgwalters' idea of presolving.
Weirdly enough, portable specs are only controversial in Fedora, which is ironic because they are somewhat common there. In openSUSE, portable specs do exist and are even more common because unlike Fedora, openSUSE has had cross-distro building capability for much, much, much longer.
I'm disappointed in how weak your searching is. It's pretty easy to find examples. Some "quick" examples: Chrome, Skype, Slack, Hipchat, Opera, VMware Workstation, Visual Studio Code, Atom, any app package made by Electron Builder, any app package produced by CPack, etc. There are even more things that I know of that build packages in a similar manner. PostgreSQL is actually an exception, and that's mostly because they still deal with the older distros where the differences are still more prominent. I imagine that server software is more likely to have more than one RPM variant while desktop software rarely does.
That is patently false. Even your example is wrong. I'm actively involved in Fedora, Mageia, and openSUSE, as well as several other RPM based distributions. It's definitely possible to make portable packages, and it's much easier now than it ever was before.
The reason people don't build for EPEL is that our build system actually makes it more painful than it should be. I fully acknowledge that we will never fix this, for reasons that are not worth discussing here.
I am willing to bet that you don't come out of your cave very often, because you seem to have ideas based on ~5-7 year old data. Do you know why the Debian model is successful for Debian/Ubuntu? It's because there's virtually no variance in how Debian packages are constructed. Their policies are very strict on how packages are built and structured, which means that third-parties can rely on it. They can even predict the names of packages to some extent because of it. And Colin knows this very well, since he was a Debian Developer and created a package build mechanism for Debian. The downside of this model is that it's literally impossible to structure packages differently without breaking expectations. The RPM model promotes the ability to support a variety of distro structures, the Debian model does not.
I understand your concerns, but I think they'd be better addressed by trying to standardize specific metadata keys via ...

It's worth gathering data on this. Looking at Chrome:
So it totally complies with the "only file requires on /bin-type-paths" policy.
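For anyone who wants to reproduce that kind of check, here is a small sketch (the package filename is a placeholder, and it simply shells out to rpm rather than using any bindings): it lists the path-style requires of a locally downloaded RPM so you can see whether they stay within /bin-type paths.

```python
import subprocess

def file_requires(rpm_path):
    """Return the path-style requires (entries starting with '/') of a
    local RPM, by shelling out to `rpm -qp --requires`."""
    out = subprocess.run(
        ["rpm", "-qp", "--requires", rpm_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted({line.strip() for line in out.splitlines()
                   if line.strip().startswith("/")})

if __name__ == "__main__":
    # Placeholder filename: point this at whatever package you downloaded.
    for req in file_requires("google-chrome-stable_current_x86_64.rpm"):
        print(req)
```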
With Vagrant's default memory setting of 512MB, ...
This is and has for a long time been a big problem:
It's insane. And the problem is that with rojig ♲📦, even for Atomic Host/Silverblue users who don't use package layering, we'll go back to "download 30+ MB of repodata, uncompress to 100+MB" just to check whether there are any updates at all.
The plus side is that our work on this will also benefit containers and traditional installs, if we can do it right.
One idea I had is to "presolve" - a lot of this data is completely redundant dependencies. Take this chunk from the very first package I looked at, 0ad:

But those are all provides of the libstdc++ package - and I don't think we're ever going to have different symbol versions provided by separate packages. So doing a pass where we just drop redundant requires would probably make a notable difference.
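A minimal sketch of one interpretation of that "drop redundant requires" pass, using a made-up in-memory representation rather than the real primary.xml schema: for each package, any requires entry that is a provide of another package gets collapsed into a single plain dependency on that provider, so the versioned libstdc++ symbols fold into one dependency on libstdc++.

```python
def presolve(packages):
    """One interpretation of the 'presolve' pass: for every package, replace
    requires that are provides of some other package with a single plain
    dependency on that providing package, and keep everything else as-is."""
    # Map each provide string back to the package that provides it.
    providers = {}
    for name, meta in packages.items():
        for prov in meta["provides"]:
            providers[prov] = name

    slimmed = {}
    for name, meta in packages.items():
        kept = set()
        for req in meta["requires"]:
            provider = providers.get(req)
            if provider and provider != name:
                kept.add(provider)   # collapse e.g. GLIBCXX_* symbols into libstdc++
            else:
                kept.add(req)        # keep file deps and anything unresolved
        slimmed[name] = kept
    return slimmed

if __name__ == "__main__":
    # Invented, heavily simplified data in the spirit of the 0ad example above.
    packages = {
        "libstdc++": {
            "provides": {"libstdc++.so.6()(64bit)",
                         "libstdc++.so.6(GLIBCXX_3.4)(64bit)",
                         "libstdc++.so.6(GLIBCXX_3.4.20)(64bit)"},
            "requires": set(),
        },
        "0ad": {
            "provides": {"0ad"},
            "requires": {"libstdc++.so.6()(64bit)",
                         "libstdc++.so.6(GLIBCXX_3.4)(64bit)",
                         "libstdc++.so.6(GLIBCXX_3.4.20)(64bit)",
                         "/bin/sh"},
        },
    }
    print(presolve(packages)["0ad"])  # {'libstdc++', '/bin/sh'}, order may vary
```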
Taking that idea slightly farther, do "presolve" - 0ad has a requires on /bin/sh, but that's already going to be a dependency of one of its library dependencies, I bet. A user-visible result of this would be that yum info would no longer show all of the dependencies, but...eh.

Of course another thing to do is simply to split up the repos. For Atomic Host in particular we can omit all of the desktop apps, all of the -devel packages, etc.