Settle on set of top level routes that are known to be outside the context of a silo #1383
Comments
I like the suggested approach. I hope there won't be much isolated to a single rack.

This is kind of a side note (and definitely not specific to this issue), but "fleet" isn't really the right term here. We use that in much of Nexus today to mean "global to this instance of the control plane". (This is my fault.) That's different than what RFD 24 calls a Fleet. This is more likely either a Region or an Availability Zone. I would love it if we could punt on resolving the question of whether it's "Region" or "Availability Zone", since it's a complicated question and in the MVP those will be the same thing. This is basically unrelated to this issue, except that if we put all this global stuff under "/fleet", we're baking the wrong term deeper into the product. So I just wonder if we should pick a different term here.

Terminology aside, I'm also less excited that end users don't really need to know about "fleet", and yet with this proposal a bunch of things that end users need seem like they'll be under "/fleet" (or whatever other term we pick).
I'm definitely fine with changing the terminology to whatever would make sense; I just don't have a great idea of what that would be. It seems like there's a layer higher than silos that spans potentially N racks, then the silos, then the contents of a silo, and potentially (but not necessarily) a single rack. The latter would probably only be useful to operators, but I'm fine with leaving it out in the context of this conversation.
I think it's worth iterating over the bunch of things and evaluating whether they are indeed at the right level. Let's take
One question I can answer is:
The intention is to scope these to what's currently called the fleet (AZ or Region). The only other thing I have to add is that I agree with Dave's point: this seems to imply most users will end up with routes scoped to
This makes me wonder who most users are. I think Steve mentioned updating RFD 78 🔒, but it'd probably be best to be more specific about who we're referring to.

Tags are definitely a great idea and I think we should absolutely do that, but I'm not necessarily convinced that alone is sufficient. Part of this is just understanding the scope of your change. This is my mental model for it (which may be wrong): hitting top-level things can impact the whole region/AZ/fleet and have potential trickle-down effects on every silo and therefore every org. Likewise, hitting silo-level things can impact multiple silos. A change isolated to a silo can only impact that silo. I could very easily see, say, getting an audit log of every API request made to
Further, I do think there is a cognitive cost to having to ask yourself "what is the context in which this thing happens?" It's incredibly likely that we'll have endpoints that do things at multiple levels. For example
I agree with the underlying problem of confusion that can arise here @zephraph, and thanks for bringing this up. I think you've better articulated why I've been disliking the top-level /images or /ip-pools (as our current examples) but haven't been able to articulate it well myself.

I'd like to throw out a different way of thinking about this, and a potentially different structural approach, than trying to use silo/fleet/etc. At the heart of it, I find trying to communicate the scope in this way somewhat confusing because, for example, a project / VPC may span a region. We may have Subnets in VPCs, etc. Conversely, while we're using /fleet to mean global, when we are multi-region I expect an image to be scoped at most to a region, and you'd probably have to import it into different regions (or have some automation there).

I do want to address a couple of different points here around what the scope of a silo is, because I think that's important to understand. In particular:
So, I think it's important to tease apart how silos would fit into a multi-rack world a little bit. To me, the actual silo config and log-in scope is one of the few things that we really think of as actually fleet-wide. That is, the set of silo configurations wouldn't vary based on region, AZ, or even rack. It's basically orthogonal to the actual scope that something operates in, in the sense of region/AZ/etc. My main point is that a silo already spans the entire fleet. A specific project is limited to at most a region right now (with specifics TBD). But when we talk to prospects, there isn't an idea right now that the auth / user bits of this would be limited this way, I think.

Regardless of that, I agree with the point about the cognitive cost of asking what the scope of something is. I believe we should have fewer top-level routes, and instead we should think about this based on role rather than on scope of impact. This ties into the general things I've been ranting about with respect to images and others. But let me get into more detail on the rationale, as I think this is important. If we think about different classes of users, I would break them down into three rough groups:
Today, everything under

To me there are two different things that we're conflating in a single endpoint at the top level today. We want something for (3) that you can create, destroy, and manipulate. However, for groups (1) and (2) these things are read-only or, worse, the set of "global" things may include things that you can't even use. The actual set of IP Pools, images, and other features that we're going to want to use is ultimately specific to a project. I really think that for groups (1) and (2) the only place this should have visibility is under

For operators, there are a lot of things that we haven't created yet. Instead of leaving everything at the top level, I would actually consider putting things under something like a
The nice thing about something like this is that it makes it very clear that you're now doing operational work and it will have a different scope. It also means that folks who are developers can just ignore an entire subtree, and it's declaring who it's for in a very obvious way. It'll be generally clear, I think, that anything under

One of the big reasons for this is that, because we always talk about resources being owned by a project and not by individuals, Nexus doesn't have enough context to figure out what's usable when hitting a global list for a given situation. It means the console and others would have to later ask the question "can I use this for this project?" separately for images when provisioning, and I think that gets more expensive, though I'll admit the full cost of authz isn't something I really know how to reason about. I know that the idea of having project-specific routes that end up being where you get all your images, IP pools, or other shared resources isn't one that a lot of people share, but I think in this case, when combined with things like a

The other thing I like about the role-based way of phrasing this is that we're basically going to be able to better communicate impact without changing routes around to describe whether they impact an AZ, a region, or the entire fleet. I think the scope of an API call will be much clearer this way. The one caveat here is that I haven't thought much about the general silo/user bit that we're dealing with. Anyway, I'd be curious about others' thoughts here and would like to talk more about it at some point. Maybe a discussion would help? I think this is a real problem, so I appreciate you bringing it up @zephraph.
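To make the role-based grouping concrete, here is a purely illustrative sketch in Python. The idea of a developer-facing, project-scoped view of shared resources alongside an operator-facing subtree comes from the discussion above; the exact path shapes (`/projects/...`, `/ops/...`) and function names are hypothetical, not the actual Nexus API.

```python
# Hypothetical sketch only: the same underlying resource (images, IP pools)
# appears under a project-scoped, developer-facing path and under an
# operator-facing "/ops" subtree. Paths are illustrative, not Nexus's API.

def project_scoped(project: str, resource: str) -> str:
    """Read-only view of shared resources, as a developer would see them."""
    return f"/projects/{project}/{resource}"

def operator_scoped(resource: str) -> str:
    """Mutable, operator-facing view of the same resources."""
    return f"/ops/{resource}"

print(project_scoped("web-frontend", "images"))  # /projects/web-frontend/images
print(operator_scoped("ip-pools"))               # /ops/ip-pools
```

The point of the split is that a developer can ignore the `/ops/` subtree entirely, while an operator knows that anything under it has operational scope.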
@rmustacc thanks for helping correct my mental model around silos. I think the responsibility-based approach of grouping things under
@rmustacc As an outsider looking in, your approach brings much clarity. I have no context for a fleet or silo (which I will soon read up on), but scoping images and ip-pools to projects and ops is rational and organizes these resources quite nicely. I appreciate the thought given to developers and operators, simply providing them with the resources they need based on their current context. Well done.
tl;dr: Regarding the specific idea of per-project routes like

This is kind of stream-of-consciousness, but I haven't figured out how to better structure it: suppose a new (external) LTS image is released for a popular distro. How do developers get access to it?

In some organizations, the Ops team that Robert mentioned might validate the image and then want to make it available. Would the Ops team want to update every Project (annoying, but maybe we're saying that's not what a Project Administrator wants anyway)? Or maybe they'd send a notification to Project Administrators to let them know that they can add it if they want. How does the Project Administrator add it? Presumably there's some flow for updating the images of a Project. Where does the list of options come from? Is it: a shared list of externally-provided images that Ops has approved (this would be what "/images" is today, I think)? A shared list of externally-provided images that anybody has imported? Or maybe the only way to add an externally-provided image to a Project is to provide parameters for importing the image directly from the external source, in which case each Project would separately import the image? This does make everything a lot simpler, especially the accounting, but I'd be afraid people would bristle at the duplication of work and storage space.

Suppose a developer decides they want to use some new externally-provided image, maybe a new distro that's not being used yet. How do they get access to provision VMs with it? Would we expect that they already have access to add new images to the Project? I would think so, but I certainly have no market data on this. If not, who in the real world would they be asking to do this for them? Are we not worried this might be seen as red tape?
(Even if I had the privileges, I think I'd be annoyed if I found I couldn't provision an Instance with an Image [that I had already used in some other Project] because I hadn't added it to the correct allow-list. But I can also see why security teams might want this.)

What if an administrator wants to remove an image from this list? Presumably they can't do that if there are any running Instances with that image? What if there are destroyed Instances that still reference the Image? How would someone looking at the destroyed Instance find that image's metadata? I think this is solvable, but it suggests decoupling "list [all] images and their metadata" from "list allowed images".

In terms of the concrete problem we're solving: is it that developers can be confused about what to deploy and we want to guide them? If so, this could be a suggestion instead of a restriction. That has the advantage that it doesn't require somebody to curate the list if they don't want to, and it doesn't create work for a developer who wants to try a new image. Or is the problem really that someone wants to carefully control which images can be used? Who is that, in practice?
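Dave's decoupling suggestion can be sketched directly: one query returns every image and its metadata (so destroyed Instances can still resolve the image they used), while a second returns only the images a project is currently allowed to provision from. Everything here, the types, names, and the allow-list shape, is a hypothetical illustration of that split, not real Nexus code.

```python
# Hypothetical sketch of decoupling "list all images and their metadata"
# from "list allowed images". All names and data shapes are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Image:
    id: str
    name: str

ALL_IMAGES = [Image("img-1", "ubuntu-22.04"), Image("img-2", "debian-12")]
# Per-project allow-list curated by an administrator (hypothetical).
PROJECT_ALLOW_LIST = {"proj-a": {"img-1"}}

def list_images() -> list[Image]:
    """Every image and its metadata, including ones no longer allowed.
    A destroyed Instance can still resolve its image through this list."""
    return ALL_IMAGES

def list_allowed_images(project: str) -> list[Image]:
    """Only the images this project may provision new Instances from."""
    allowed = PROJECT_ALLOW_LIST.get(project, set())
    return [img for img in ALL_IMAGES if img.id in allowed]
```

With this split, removing an image from a project's allow-list doesn't delete its metadata, so the "destroyed Instance still references the Image" case stays answerable.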
I think there's consensus on having a top-level thing like "/ops" or "/fleet" that would include things like:
Then there are resources that are "virtualized" within the Silo:
Are there other categories?
Just catching up on this issue. I'd just like to mention support bundles (or whatever they end up being called). I'm assuming they'd be added to the ops/fleet list? |
Yes, in general most rack / core software support would probably be under that. The actual specifics will get a little murky as fault management evolves and we have things that are specific to getting information and errors about projects and how that ties into in-console documentation and data. |
Let's migrate this conversation over to https://github.com/oxidecomputer/rfd/pull/457
Problem
It's not clear if a given route is scoped to a silo or fleet level
Context
A route naming question came up in #1329 that highlighted some current confusion in Nexus' public API. The crux of the issue is that it isn't clear if a top level route is scoped to the context of a silo or the context of a fleet.
/organizations is well understood to be scoped to a silo. Based on this section in RFD 234, it's intentional that silo-scoped resources don't have a silo prefix in the URL. So we've stated that we do want isolation, i.e. silo routes at the top level. In the operator API discussion 🔒 it was stated that we don't want a separate OpenAPI spec for public but non-silo-scoped routes. That matches the state we're in today, but still leaves confusion about how top-level resources are scoped.
For example:
/policy: fleet, or scoped to a silo? There's silos/{silo}/policy, which is scoped to a particular silo.
/ip-pools: fleet, or scoped to a silo?
/images: scoped to the fleet

Suggested solution
I believe we should settle on a finite set of consistent top-level routes that can be specifically denoted as being outside the context of a silo. Ideally the route prefix would indicate their scope (and we can enhance this with tags, extra docs, etc.).
Here's a starting suggestion of some top level route prefixes:
/fleet/: anything fleet-wide
/silos/: anything spanning/targeting a specific silo
/rack/: anything isolated to a single rack

The current changes would be something like /images -> /fleet/images.

My hope is that this will increase clarity around scoping, be more intuitive to customers, ease documentation burdens, and (maybe) even simplify permissions or permission testing.
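One way to see the clarity win: with the suggested prefixes, a route's scope is mechanically recoverable from its path. The prefix set (/fleet/, /silos/, /rack/) comes from the proposal above; the classifier function and its fallback for today's unprefixed silo routes are a hypothetical sketch.

```python
# Toy sketch: under the proposed prefixes, scope can be read off the path.
# The prefixes come from the proposal; this function is illustrative only.

def route_scope(path: str) -> str:
    if path.startswith("/fleet/"):
        return "fleet"
    if path.startswith("/silos/"):
        return "silo"
    if path.startswith("/rack/"):
        return "rack"
    # Unprefixed top-level routes (e.g. /organizations) are silo-scoped
    # by convention today, which is exactly the ambiguity being discussed.
    return "silo (implicit)"

print(route_scope("/fleet/images"))  # fleet
print(route_scope("/organizations"))  # silo (implicit)
```

Tags and docs can reinforce this, but the prefix alone answers the "what is the context in which this thing happens?" question without consulting anything else.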
See also