RFC: ensure-style options in NixOS modules #206467
Comments
Maybe this is best submitted to the https://github.com/NixOS/rfcs repository?
I do not plan to write an RFC on the subject right now; it's an RFC as in "request for comments" from the whole community. If it turns out we actually want to write a proper RFC, we can do it later. :)
You can write a silly, uninformed, non-community-consensus driven RFC all you want. I have done so three times :)
I agree that on the user-facing side of NixOS we should try to avoid convergent options where possible.
I personally find myself wanting that extra small bit from the NixOS modules, especially around managing state (Ansible and Terraform feel like way too heavy a dependency, especially since I already have NixOS). I get that trying to declaratively manage something that stores state is not perfect, but as others have mentioned it is already done throughout Nix. As a developer I often need to link …
Whether we want it or not, having some sort of state is inevitable. We still want to push as much of the configuration as possible to be congruent in the /nix/store, but it would be nice if we had tighter control over that state. On my machine, …
Similar to the …
Wow, that's horribly broken. We cannot base the decision on whether we should (not) have `ensure*` on that alone.

What's the alternative if we don't have `ensure*`?
Putting in the effort and making the module implementations more stateful, just like …

Nix isn't magic. Stateless or declarative doesn't mean no state. What Nix does is reduce the deployed-software variables (from traditional package-management entropy) into a single variable: a profile. It does not magically reduce the other variables; that would be called "catastrophic data loss".

If NixOS only manages the profile, we've done a shit job. And that's ok. This is OSS, and something is better than nothing, but damn. We take away control over individual services by making the whole system's software into a single variable, but then we don't help out with the actual setup and migrations? How's that supposed to be any better?

Now it's not all bad. We do have a good example: …
I totally agree with @roberth and I would go further: in my personal opinion, this is an open area of research to get things right, and Nix is uniquely positioned to experiment with something new in this regard (like designing ways to compose small primitives to converge such state). Of course, it's a matter of time and of striking a balance between "this option is hard, requires extra carefulness, scrutiny and tests" and "this option is easy and can be added in a harmless way", and that's why I am a bit picky about which ones to accept and which not, also because the data loss is really annoying.
It does and it is. 2 points:
Thinking about it, not even …

Maybe it's time to include the update process in the "what is" that @grahamc mentioned? What if we approach the problem as a generalization of the bootloader problem, and have nixos-rebuild declare a sort of "letter of intent" of pending stateful actions, each one monitored to hell and back as well as preferably undoable? One huge benefit would be that this would allow checking for more errors that wouldn't show up during eval, before actually eating them at runtime; see https://matrix.to/#/!aGqRytqbCECitOFhbt:nixos.org/$-Luxmact8b2jLtsAnJxBIZM9FDi2C7Gmx0ypVPLnF2E?via=matrix.org&via=lpc.events
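To make the "letter of intent" idea a bit more tangible, here is a purely hypothetical sketch; the option name `system.pendingStateActions` and its shape are invented for illustration only and do not exist in NixOS:

```nix
{ config, ... }:

{
  # Hypothetical option: a declarative list of pending stateful actions that
  # nixos-rebuild could print as a "letter of intent", sanity-check before
  # activation, and then execute one by one, each monitored and, where
  # possible, undoable.
  system.pendingStateActions = [
    {
      description = "Migrate the Gitea database schema";
      # Runs once, before the new gitea unit is started.
      run = "${config.services.gitea.package}/bin/gitea migrate";
      # Best-effort undo, so rolling back the system profile can also roll
      # this action back; null means "no automatic undo possible".
      undo = null;
    }
  ];
}
```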
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/39
…sql15

Closes NixOS#216989

First of all, a bit of context: in PostgreSQL (since version 15), newly created users don't have the CREATE privilege on the public schema of a database, even with `ALL PRIVILEGES` granted via `ensurePermissions`, which is how most of the DB users are currently set up "declaratively"[1]. This means e.g. a freshly deployed Nextcloud service will break early because Nextcloud itself cannot CREATE any tables in the public schema anymore.

The other issue here is that `ensurePermissions` is a mere hack. It's effectively a mixture of SQL code (e.g. `DATABASE foo` relies on how a value is substituted into a query; you'd have to parse a subset of SQL to actually know which objects permissions are granted on for a user).

After analyzing the existing modules I realized that in every case but a single exception[2], the UNIX system user is equal to the DB user is equal to the DB name, and I don't see a compelling reason why people would change that in 99% of the cases. In fact, some modules would even break if you changed that, because the declarations of the system user & the DB user are mixed up[3].

So I decided to go with something new which restricts the ways to use `ensure*` options rather than expanding them[4]. Effectively this means that:

* The DB user _must_ be equal to the DB name.
* Permissions are granted via `ensureDBOwnership` for an attribute set in `ensureUsers`. That way, the user is actually the owner and can perform `CREATE`.
* For such a postgres user, a database must be declared in `ensureDatabases`.

For anything else, custom state management should be implemented. This can either be `initialScript`, doing it manually outside of the module, or implementing proper state management for postgresql[5]; the current state of `ensure*` isn't even declarative, but a convergent tool, which is what Nix actually claims to _not_ do.

Regarding existing setups, there are effectively two options:

* Leave everything as-is (assuming that system user == DB user == DB name): then the DB user will automatically become the DB owner and everything else stays the same.
* Drop the `createDatabase = true;` declarations: nothing will change, because a removal of `ensure*` statements is ignored, so it doesn't matter at all whether this option is kept after the first deploy (and later on you'd usually restore from backups anyway). The DB user isn't the owner of the DB then, but for an existing setup this is irrelevant because CREATE on the public schema isn't revoked from existing users (it is only not granted to new users).

[1] Not really declarative though, because removals of these statements are simply ignored, for instance: NixOS#206467
[2] `services.invidious`: I removed the `ensure*` part temporarily because it IMHO falls into the category "manage the state on your own" (see the commit message). See also NixOS#265857
[3] e.g. roundcube had `"DATABASE ${cfg.database.username}" = "ALL PRIVILEGES";`
[4] As opposed to other changes that are considered a potential fix, but also add more things like collation for DBs or passwords that are _never_ touched again when changing those.
[5] As suggested in e.g. NixOS#206467
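To make the migration concrete, here is a rough before/after sketch of a NixOS configuration; the `nextcloud` names are only an example, and the option names follow the description above:

```nix
{
  services.postgresql = {
    enable = true;

    # Old style: free-form permission strings that are effectively SQL
    # fragments, and no database ownership for the user.
    #
    # ensureUsers = [{
    #   name = "nextcloud";
    #   ensurePermissions."DATABASE nextcloud" = "ALL PRIVILEGES";
    # }];

    # New style: the database is declared explicitly and the user with the
    # same name is made its owner, so it may CREATE in the public schema.
    ensureDatabases = [ "nextcloud" ];
    ensureUsers = [{
      name = "nextcloud";
      ensureDBOwnership = true;
    }];
  };
}
```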
In my experience at work, state management is hard. That Nix has a hard time dealing with it does not mean in the slightest that Nix is not good. With that out of the way, here are a few interesting state migration examples and principles I read about and rediscovered at work:
I’m sure I’m forgetting a lot here as I’m writing from memory. But I want to convey that IME database updates are rarely free and snappy and easy. I’m not sure how all these steps can happen in one deploy command, nor how Nix can solve this problem space. The main issue IMO is that Nix automatically produces a diff of old state to new state, which works great to deploy new stateless binaries. But to deploy stateful stuff, we should be able to let the user override the internal automatic diffing and describe how the migration should happen, in multiple discrete steps.

Also, it doesn’t help that when you update to a new nixpkgs commit, you update everything at once. It would be nice to be able to update to a new nixpkgs commit but apply changes with other strategies than « all at once », like one binary at a time.

Another tangential comment, which is in no way a criticism, is that we should invest more in our observability tooling. Most of the big open source projects I host do not provide a good story there. We should embrace extensive tracing and metrics on top of structured logging. This would help tremendously IMO.
Related: NixOS/rfcs#155
Thank you for your lengthy feedback; I mostly agree with you (except on some points where I don't see a compelling case to agree).
Yep, it's a classic problem and why we should probably invest in standardizing …
That is true, but this is a self-inflicted limitation. You don't need to pay the backup cost (though you should always pay it for disaster-recovery recipes anyway, right?); you can simply reuse your filesystem or rely on application-specific backup/snapshot technologies. It's not always clear to a user, but there's a treasure trove of technologies we are usually sitting on and not using; massaging filesystem snapshots to make use of them in this context is a trivial example of that. (And yes, you can make the database cooperate with flushing the pages, etc. It requires work, but it's not hard.)

The delta lost is unfortunate but also inevitable: why would you even expose your application to your users if you didn't finish validating the deployment? And if you do so because there is a weird bug in the application, we are in the set of cases where it will be almost impossible to automate any meaningful answer to the problem and manual work is required every time, so I would say that losing a delta of your state by rolling back a half-broken application is out of scope. We can tolerate broken applications, but it is very complicated, if not impossible, to tolerate half-broken applications. Failure modes are part of engineering, and failing hard is important. Failure to do so, well… will create disasters. Disasters require disaster recovery plans, and no automation can save us from that; we can just make it easier at most.
I respectfully disagree on "rarely free, snappy and easy": in my experience, they are almost always free, snappy and easy; they rarely fail! The problem is that when they do fail, it's rarely actionable, because people are not used to them failing. People are not aware of everything that can go wrong. That's understandable; it mostly goes right.

"Solving a problem space" makes little sense to me. The Nix expression language trivially decreases (as demonstrated by now) the difficulty of intertwining complex application dependencies at the meta-level in a reusable fashion, via the NixOS module system (sometimes called an expert system). The Nix expression language can also trivially decrease the difficulty of taming the state-convergence situation by providing abstractions to describe state convergence at the NixOS module level and letting it be an emergent (complex) system.

Of course, this is not for the faint of heart and will probably never be useful for A/Z replication, sharding, etc. use cases for now, as we don't even have "remote systemd" (which is key to enabling Kubernetes-style use cases natively with Nix), but I bet this can tremendously help for the rare cases where it fails, because those cases are usually simple and easy. Even backing up automatically before performing a state transition would be welcomed by many of us, because we have the backup storage; we just don't have the opportunity to enable such measures.
Nix is able to diff closures (of .drv files or realized store paths), not state. It is we who decide to give meaning to a diff of closures. For what it's worth, we can totally introduce more steps into the switch-to-configuration.pl logic to let users override the automatic diffing and perform policy-based deployments; I will again mention NixOS/nixops#1245.

There is a tension between NixOS being a normal operating system and NixOS fully embracing operation-style logic and offering it by default with an empty policy, which would make it trivial for people like us to implement our own policies on top of that. More advanced convergence engines could be built on top of the existing things; there's no reason to use the default one NixOS provides, and no reason to have a unique implementation. What we need, though, is to be able to capture the expressivity required to understand what it means to perform state convergence, and to express database migrations or anything else simply as an act of state convergence.
I don't think "one binary at a time" will ever make sense for NixOS; partial updates are physically impossible, for a good reason. What you are looking for, though, is a way to keep old systemd units running with their old paths and swap in their new versions one at a time. But you need to bring your own rollback policy in case a rollout fails.
I am not sure that I understand how it is related to the matter, though. This is a problem of the software you are running and should be tracked somewhere else (even in the issue tracker of the software you are using!). OTEL and whatnot are available in nixpkgs within reasonable limits; we are not really the place to hold upstreams accountable for this. :)
Thanks for answering and opening my eyes, see below. Btw, I didn’t copy here what I agree with.
Not disagreeing, but I just wanted to clarify that the cost I was thinking about at the time of writing was the time it takes to make a backup, which can be long. This makes deployments cumbersome if backups happen before deploying.
I never really considered using the filesystem for this. You are referring to snapshots like the ones in ZFS or LVM, right? I’ve been reading about that now and I imagine we could just create a snapshot every time we deploy, which is a very quick operation. This makes me want to redo my whole backup strategy 😁 I wonder if we could even have a dataset (in ZFS terminology) per application we deploy, which could even allow pretty seamless relocation of the app.
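A minimal sketch of that idea as a NixOS module, assuming ZFS with one dataset per application; the `tank/state/nextcloud` dataset and the `nextcloud-setup.service` ordering are assumptions made for illustration:

```nix
{ pkgs, ... }:

{
  # Take a cheap point-in-time ZFS snapshot of the application's state
  # dataset right before the application unit starts, i.e. on every deploy
  # that (re)starts it, so there is always something to roll back to.
  systemd.services.nextcloud-state-snapshot = {
    description = "Snapshot Nextcloud state before (re)deployment";
    before = [ "nextcloud-setup.service" ];
    wantedBy = [ "nextcloud-setup.service" ];
    serviceConfig.Type = "oneshot";
    script = ''
      ${pkgs.zfs}/bin/zfs snapshot \
        tank/state/nextcloud@deploy-$(date +%Y-%m-%d-%H%M%S)
    '';
  };
}
```

Per-application datasets like this would also pair naturally with `zfs send`/`zfs receive` for the relocation idea mentioned above.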
Makes sense.
I’m really curious how you do the kind of deploy I mentioned. Like you said, the issue is when they fail. We split each step into its own deploy so if something goes wrong, we know without a doubt which step failed. But also, reasoning about failure modes is easier if the deploy step does not have too many moving parts.
Ah yes. I meant the same. We can do whatever we want, but it’s not there yet. I expressed myself badly; I didn’t mean the current behavior is set in stone.
I mostly agree. I could imagine deploying updates to Nextcloud and Home Assistant independently because they don’t relate to each other. At first, I tried to deploy my server using disnix which does what you describe pretty well. It has some shortcomings I couldn’t push through so I switched to a more classic deploy system. But I really liked disnix.
Yes, I’m not sure how my last rant relates to the issue at hand. I think my point was that having proper monitoring helps to know whether a new deploy behaves correctly and informs whether we need to roll back or not. I’m sad so few apps I use have any introspection features.
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/what-about-state-management/37082/1
Context & motivation
In NixOS, a desirable property is that the current state of the system configuration is a pure function of the Nix expression evaluated.
For example, NGINX virtual hosts are directly a pure function of the Nix expressions describing them.
Another example would be that, under `users.mutableUsers = false;`, UNIX users are directly a pure function of the Nix expressions describing them, including their attributes (please correct me if this is wrong).

A non-example of this is `services.postgresql.ensureUsers`: it is possible to manually remove a PostgreSQL user and perform multiple rebuild switches without reviving the user in question, thereby creating a drift between the NixOS expression and the actual PostgreSQL configuration. The same goes for `hardware.ensurePrinters` (which attempts to reconcile the expression with reality, without removing any printer, though).

This can be generalized to all kinds of options that "prefill" data or state, which could also be seen as "static configuration state" (e.g. what my users are, what their permissions are, etc.) but could also have been dynamic.
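As an illustration of that drift (the `myapp` names here are made up), consider a configuration like this:

```nix
{
  services.postgresql = {
    enable = true;
    # Declared intent: this database and role should exist.
    ensureDatabases = [ "myapp" ];
    ensureUsers = [ { name = "myapp"; } ];
    # But a manually executed `DROP ROLE myapp;` is not revived by later
    # `nixos-rebuild switch` runs, and deleting the two lines above never
    # drops the role or the database either, so the actual PostgreSQL state
    # can silently drift away from the expression.
  };
}
```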
Recently, there has been some activity to extend `ensure`-style options to existing NixOS modules for the sake of usability and because, under "nice assumptions" (no manual removal, not too much buggy software, etc.), they would even respect the "purity" predicate given above (i.e. the configuration state is a pure function of the NixOS expression), see:

Arguments against the proliferation of these options
In #164235, @aanderse argued against these kinds of options because they break many assumptions people tend to have about a NixOS system, and recommended using tooling that would actually try to reconcile the state, e.g. Terraform (or NixOps, I would say).
Open questions
Personally, I would argue that there are some practical advantages to limiting the amount of tooling used to deploy a system, and the Terraform/NixOS integration is not necessarily optimal; therefore, I think this matter deserves more discussion.
(1) Who is using NixOS with strong assumptions based on the fact that they can derive more or less static configuration state from the NixOS expression? Is there a term to describe this property that we can start using in the community and document?
(2) Should we introduce a tainting mechanism whenever a property-breaking option is being used so that this group of users can isolate these systems? Should we do nothing and just try actively to remove these mechanisms? What is a good story for these competing needs?
(3) What is an acceptable way to perform these "reconciliation" operations à la Kubernetes/Terraform/Ansible in NixOS? Should we start work on a framework to contribute those to nixpkgs?
Another connected (though not directly related) problem is the "automatic migration" mechanism that tends to be present in NixOS modules for simplicity, but creates real issues in combination with the rollback feature: e.g. you upgrade Gitea, Gitea is broken, you roll back, and Gitea revision N - 1 is not forward compatible with the new DB schema, therefore Gitea is broken on the previous revision too. I do think answering the questions here would provide some insights regarding this problem too.