
Settle on set of top level routes that are known to be outside the context of a silo #1383

Closed
zephraph opened this issue Jul 8, 2022 · 13 comments
Labels
api Related to the API. nexus Related to nexus

Comments

@zephraph
Contributor

zephraph commented Jul 8, 2022

Problem

It's not clear whether a given route is scoped to the silo level or the fleet level.

Context

A route naming question came up in #1329 that highlighted some current confusion in Nexus' public API. The crux of the issue is that it isn't clear if a top level route is scoped to the context of a silo or the context of a fleet.

/organizations is well understood to be scoped to a silo. Based on this section in RFD 234 it's intentional that silo scoped resources don't have a silo prefix in the URL.

Identities for users, groups, and service accounts are always scoped to a particular Silo. As a result, there’s no need for Silos to appear in the API URL for Organizations, Projects, etc. This is an important isolation measure: there’s no way for users to even ask about resources in other Silos — they don’t have a way to refer to them.

So we've stated that we do want isolation, i.e. silo routes at the top level. In the operator API discussion 🔒 it was stated that we don't want a separate OpenAPI spec for public but non-silo-scoped routes. That matches the state we're in today, but it still leaves confusion about how top-level resources are scoped.

For example:

  • GET /policy... fleet or scoped to a silo? There's silos/{silo}/policy, which is the policy for a particular silo.
  • GET /ip-pools fleet or scoped to a silo?
  • GET /images scoped to the fleet

Suggested solution

I believe we should settle on a finite set of consistent top-level routes that are specifically denoted as being outside the context of a silo. Ideally the route prefix would indicate their scope (and we can enhance this with tags, extra docs, etc.).

Here's a starting suggestion of some top level route prefixes:

  • /fleet/ - Anything fleet wide
  • /silos/ - Anything spanning/targeting a specific silo
  • /rack/ - Anything isolated to a single rack

The current changes would be something like /images -> /fleet/images.
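For illustration, here's a minimal sketch (not Nexus code; the prefixes and scope names are the hypothetical ones above) of how a route's scope could be read mechanically off its top-level prefix:

```rust
/// Hypothetical scopes matching the proposed top-level prefixes.
#[derive(Debug, PartialEq)]
enum RouteScope {
    Fleet,       // anything fleet wide, e.g. /fleet/images
    Silo,        // anything spanning/targeting a specific silo, e.g. /silos/{silo}/policy
    Rack,        // anything isolated to a single rack
    CurrentSilo, // everything else is implicitly scoped to the caller's silo
}

/// Classify a request path by its first path segment.
fn scope_of(path: &str) -> RouteScope {
    match path.trim_start_matches('/').split('/').next() {
        Some("fleet") => RouteScope::Fleet,
        Some("silos") => RouteScope::Silo,
        Some("rack") => RouteScope::Rack,
        _ => RouteScope::CurrentSilo,
    }
}

fn main() {
    assert_eq!(scope_of("/fleet/images"), RouteScope::Fleet);
    assert_eq!(scope_of("/organizations"), RouteScope::CurrentSilo);
}
```

The point is just that a client, a doc generator, or an audit pipeline could all derive scope from the prefix alone, without extra metadata.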

Note
As conversations below suggest, these are likely the wrong top level endpoints. As @rmustacc pointed out, it might be better to add user context scoping like /ops/. Regardless of the top level endpoints chosen, the idea still remains that anything outside the scope of a silo would be grouped in a finite set of top level endpoints.


My hope is that this will increase clarity around scoping, be more intuitive to customers, ease documentation burdens, and (maybe) even simplify permissions or permission testing.

See also

@zephraph zephraph added nexus Related to nexus api Related to the API. labels Jul 8, 2022
@davepacheco
Collaborator

I like the suggested approach.

I hope there won't be much isolated to a single rack.

This is kind of a side note (and definitely not specific to this issue) but "fleet" isn't really the right term here. We use that in much of Nexus today to mean "global to this instance of the control plane". (This is my fault.) That's different than what RFD 24 calls a Fleet. This is more likely either a Region or an Availability Zone. I would love if we can punt on resolving the question of whether it's "Region" or "Availability Zone", since it's a complicated question and in the MVP those will be the same thing. This is basically unrelated to this issue except that if we put all this global stuff under "/fleet", we're baking this wrong term deeper into the product. So I just wonder if we should pick a different term here.

Terminology aside, I'm also less excited about the fact that users don't really need to know about "fleet", but with this proposal, a bunch of things that end users need seem like they'll be under "/fleet" (or whatever other term we pick).

@zephraph
Contributor Author

zephraph commented Jul 9, 2022

I'm definitely fine with changing the terminology to whatever would make sense. I just don't have a great idea of what that would be. It seems like there's a layer that's higher than silos that spans potentially N many racks, then the silos, then the contents of a silo, and potentially (but not necessarily) a single rack. The latter probably would only be useful to operators but I'm fine with leaving it out for the context of this conversation.

Terminology aside, I'm also less excited about the fact that users don't really need to know about "fleet", but with this proposal, a bunch of things that end users need seem like they'll be under "/fleet" (or whatever other term we pick).

I think it's worth iterating through that bunch of things and evaluating if they are indeed at the right level. Let's take /images for example. I really do wonder if providing a set of images that spans across silos is the right granularity. Or at least, if that's the granularity customers will generally operate at. It feels more likely that someone who can administer the silo might set up or customize a group of images for their business group. It doesn't feel like someone who is a silo admin should be touching anything at this "/fleet" (for lack of a better term) level.

@bnaecker
Collaborator

bnaecker commented Jul 9, 2022

One question I can answer is:

GET /ip-pools fleet or scoped to a silo?

The intention is to scope these to what's currently called the fleet (AZ or Region).

The only other thing I have to add is that I agree with Dave's point: this seems to imply most users will end up with routes scoped to /fleet or /silos when they may not know or care much about those concepts. I worry it'll end up becoming just noise to most users. As an alternative, could we include something akin to a tag, which does describe the scope explicitly, but not as part of the route itself? It's documentation, but explicit, and can be used programmatically.
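As a rough sketch of that tag idea (the view type and handler here are hypothetical, and the exact handler signature depends on the dropshot version in use): dropshot's `endpoint` macro takes a `tags` list that flows into the generated OpenAPI document, so scope could be declared there without touching the path.

```rust
use dropshot::{endpoint, HttpError, HttpResponseOk, RequestContext};

/// Hypothetical view type; the real Nexus type differs.
#[derive(serde::Serialize, schemars::JsonSchema)]
struct IpPool {
    name: String,
}

/// Scope is conveyed by an OpenAPI tag ("fleet") rather than by the path.
#[endpoint {
    method = GET,
    path = "/ip-pools",
    tags = ["fleet"],
}]
async fn ip_pool_list(
    _rqctx: RequestContext<()>,
) -> Result<HttpResponseOk<Vec<IpPool>>, HttpError> {
    Ok(HttpResponseOk(vec![]))
}
```

A consumer (console, CLI, docs) could then group or warn based on the tag, which helps documentation and tooling but doesn't make the scope visible in the URL itself.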

@zephraph
Contributor Author

zephraph commented Jul 9, 2022

this seems to imply most users

This makes me wonder who most users are. I think Steve mentioned updating RFD 78 🔒, but it'd probably be best to be more specific about who we're referring to.

Tags are definitely a great idea and I think we should absolutely do that. I'm not necessarily convinced that alone is sufficient though. Part of this is just understanding the scope of your change.

This is my mental model for it (which may be wrong)

Hitting top level things can impact the whole region/az/fleet and have potential trickle down effects to every silo and therefore every org. Likewise hitting silo level things can impact multiple silos. A change isolated to a silo can only impact that silo. I could very easily see say... getting an audit log of every API request made to /fleet/ or /silos/ for instance. The cost of the extra typing in this case seems to me to be worth the clarification of the scope of the action you're taking. Having it in the documentation is great, having a tag is good, but also having it in the route reinforces the whole notion.
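For example, the audit-log query described above would reduce to a prefix match; a sketch with a made-up `AuditEntry` type:

```rust
/// Hypothetical audit-log record; not a real Nexus type.
struct AuditEntry {
    path: String,
    actor: String,
}

/// Requests made against routes outside the caller's silo, i.e. anything
/// under the proposed /fleet/ or /silos/ prefixes.
fn out_of_silo_requests(log: &[AuditEntry]) -> Vec<&AuditEntry> {
    log.iter()
        .filter(|e| e.path.starts_with("/fleet/") || e.path.starts_with("/silos/"))
        .collect()
}
```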


@zephraph
Contributor Author

zephraph commented Jul 9, 2022

Further, I do think there is a cognitive cost to having to ask yourself "what is the context in which this thing happens?" It's incredibly likely that we'll have endpoints that do things at multiple levels. For example /silos/{silo}/users and /users where one is listing the users of a specific silo and the other is listing the users of the current silo. Without the explicit knowledge that anything without particular prefixes is executed in the context of a silo you're left always having to question what context something is executed in. Again, tags will definitely help... I'm just not yet convinced they're enough.

@rmustacc

rmustacc commented Jul 9, 2022

I agree with the underlying problem of confusion that can arise here @zephraph and thanks for bringing this up. I think you've better articulated why I've been disliking the top-level /images or /ip-pools (as our current examples) but not been able to articulate it well. I'd like to throw out a different way of thinking about this, and a potential structural approach, other than trying to use silo/fleet/etc. At the heart of it, I find trying to communicate the scope in this way somewhat confusing because, for example, a project / VPC may span a region. We may have Subnets in VPCs, etc. Conversely, while we're using /fleet to be global, when we are multi-region I expect an image to be scoped at most to a region, and you'd probably have to import it into different regions (or have some automation there).

I do want to address a couple of different points here around what the scope of a silo is because I think that's important to understand. In particular:

I'm definitely fine with changing the terminology to whatever would make sense. I just don't have a great idea of what that would be. It seems like there's a layer that's higher than silos that spans potentially N many racks, then the silos, then the contents of a silo, and potentially (but not necessarily) a single rack. The latter probably would only be useful to operators but I'm fine with leaving it out for the context of this conversation.

So, I think it's important to tease apart how silos would fit into a multi-rack world a little bit. To me, the actual silo config and log-in scope is one of the few things that we really think of as actually fleet-wide. That is, the set of silo configurations wouldn't vary based on region, AZ, or even rack. It's basically orthogonal to the actual scope that something operates in, in the sense of region/AZ/etc. My main point with this is that a silo already spans the entire fleet. A specific project is limited to at most a region right now (with specifics to be tbd). But when we talk to prospects, there isn't an idea right now that the auth / user bits of this would be limited this way, I think.

Regardless of that, I agree with the point about the cognitive cost of asking what the scope of something is. I believe we should have fewer top-level routes and instead think about this based on the role rather than the scope of impact. This kind of ties into the general things I've been ranting about with respect to images and others. But let me get into more detail on the rationale, as I think this is important.

If we think about different classes of users I would break it down into three rough groups:

  1. "Developers" -- These are folks who are working within a given project generally or multiple ones that have they have been granted access to.
  2. "SRE" / Project Admin -- This is really in my opinion a subset of (1). I call it out mostly because we often talk about it and I think it's worth thinking about the project admin here as someone who has slightly more perms than a developer, but on a given project. I realize that not all of this is realized or in scope initially as distinct, but I expect this is something that'll continue.
  3. "Operators" -- In this case I'm referring to folks who are managing all the infrastructure and who are responsible for things like managing capacity, hardware expansion, quota allocation, etc.

Today, everything under /projects is great as a way to encapsulate everything that (1) and (2) should need to do their job and it's specific to that project. Working under a project gives you a good amount of intuition on what the role is and what it's working with. However, this is why I dislike the current use of top-level /images, /ip-pools, etc.

To me there are two different things that we're conflating in a single endpoint at the top level today. We want something for (3) that you can create, destroy, and manipulate. However, for groups (1) and (2) these things are read-only, or worse, the set of "global" things may include things that you can't even use. The actual set of IP Pools, images, and other features that we're going to want to use is ultimately specific to a project. I really think that for groups (1) and (2) the only way this should have visibility is under /projects.

For operators, there are a lot of things that we haven't created yet. Instead of leaving everything at the top level, I would consider putting things under something like an /ops endpoint. For example, I would structure these things more like:

  • /ops/images
  • /ops/networking/ip-pools (This is under networking because we have a dozen more things here)
  • /.../projects/.../images
  • /.../projects/.../networking/ip-pools

The nice thing about something like this is that it makes it very clear that you're now doing operational work and it will have a different scope. It also means that folks who are developers can just ignore an entire subtree here, and it declares who it's for in a very obvious way. It'll be generally clear, I think, that anything under /ops has the broader impact that you were trying to get at @zephraph, without needing to try and teach people what a fleet or silo is. This does mean that the /projects-scoped images would only allow you to create custom images from volumes, or perhaps be read-only, because those are created a different way (e.g. by posting on a snapshot). It makes it clear which things you can use in this project, and makes it clear that if you have operator-level access you can go to an /ops endpoint to import it, where it has the side effect of (today) impacting all projects and making it visible. It also simplifies the console and CLI. It means there's only one URL they need to hit to get the valid list of things that can be used for this project, rather than us eventually having to figure out how to filter the global lists based on the context that you're acting in, e.g. a particular project.
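To make the split concrete, here's a hypothetical sketch (illustrative types only, not Nexus code) of the two views: GET /ops/images as the operator's full list, and GET /projects/{project}/images returning only what's usable from that project, so the console never has to filter a global list itself.

```rust
/// Hypothetical image record; `usable_in` stands in for whatever linkage
/// Nexus would actually use to tie an image to the projects that may use it.
struct Image {
    name: String,
    usable_in: Vec<String>,
}

/// Operator view, e.g. GET /ops/images: every imported image.
fn ops_list_images(all: &[Image]) -> Vec<&Image> {
    all.iter().collect()
}

/// Developer view, e.g. GET /projects/{project}/images: only the images
/// this project is allowed to use.
fn project_list_images<'a>(all: &'a [Image], project: &str) -> Vec<&'a Image> {
    all.iter()
        .filter(|img| img.usable_in.iter().any(|p| p == project))
        .collect()
}
```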

One of the big reasons for this is that, because we always talk about resources being owned by a project and not individuals, Nexus doesn't have enough context to figure out what's usable when hitting a global list in a given situation. It means the console and others would have to later ask the question "can I use this for this project?" when provisioning, separately for images, and I think that gets more expensive, though I'll admit the full cost of authz isn't something I really know how to reason about.

I know that the idea of having project-specific routes that end up being where you get all your images, ip pools, or other shared resources isn't one that a lot of people share, but I think in this case, when combined with things like an /ops prefix (or whatever you want to call it; maybe even sysops, which I don't like, or something else to further distinguish this), it will make things clearer and provide the goals that we're looking for about impact.

The other thing I like about using the role-based way of phrasing this is that we're basically going to be able to better communicate impact without changing routes around to describe whether they impact an AZ, region, or the entire fleet. I think the scope of a large chunk of API calls will be much clearer this way.

The one caveat here is that I haven't thought much about the general silo/user bit that we're dealing with, but anyway I'd be curious to hear others' thoughts here and would like to talk more about it at some point. Maybe a discussion would help? Anyway, I think this is a real problem, so I appreciate you bringing it up @zephraph.

@zephraph
Contributor Author

zephraph commented Jul 9, 2022

@rmustacc thanks for helping correct my mental model around silos.

I think the responsibility based approach of grouping things under /ops/ would work. That meets the baseline criteria of what I really want to see which is a clear top-level route (or routes) denoting anything outside the context of the user's current silo.

@ryaeng
Contributor

ryaeng commented Jul 11, 2022

@rmustacc As an outsider looking in, your approach brings much clarity. I have no context for a fleet or silo (which I will soon read up on) but scoping images and ip-pools to projects and ops is rational and organizes these resources quite nicely.

I appreciate the thought given to developers and operators that simply provides them with the resources they need based on their current context. Well done.

@davepacheco
Collaborator

tl;dr: Regarding the specific idea of per-project routes like $project/images and $project/ip-pools: I see the appeal of giving users a one-stop-shop for listing their options. I don't yet understand how this would work from a product perspective -- more on this below. I worry too that that convenience comes with costs: for the administrator who has to maintain that list, for the user who wants to provision something that's not in that list, and (more short-term) for us implementing a more complicated mechanism on a short timeline.

This is kind of stream-of-consciousness but I haven't figured out how to better structure this: suppose a new (external) LTS image is released for a popular distro. How do developers get access to it? In some organizations, the Ops team that Robert mentioned might validate the image and then want to make it available. Would the Ops team want to update every Project (annoying, but maybe we're saying that's not what a Project Administrator wants anyway?)? Or maybe they'd send a notification to Project Administrators to let them know that they can add it if they want. How does the Project Administrator add it? Presumably there's some flow for updating the images of a Project. Where does the list of options come from? Is it: a shared list of externally-provided images that Ops has approved (this would be what "/images" is today, I think)? a shared list of externally-provided images that anybody has imported? or maybe the only way to add an externally-provided image to a Project is to provide parameters for importing the image directly from the external source, in which case each Project would separately import the image? This does make everything a lot simpler, especially the accounting, but I'd be afraid people would bristle at the duplication of work and storage space.

Suppose a developer decides they want to use some new externally-provided image -- maybe a new distro that's not being used yet. How do they get access to provision VMs with it? Would we expect that they already have access to add new images to the Project? I would think so but I certainly have no market data on this. If not, who in the real world would they be asking to do this for them? Are we not worried this might be seen as red tape? (Even if I had the privileges, I think I'd be annoyed if I found I couldn't provision an Instance with an Image [that I had already used in some other Project] because I hadn't added it to the correct allow-list. But I can also see why security teams might want this.)

What if an administrator wants to remove an image from this list? Presumably they can't do that if there are any running Instances with that image? What if there are destroyed Instances that still reference the Image? How would someone looking at the destroyed Instance find that image's metadata? I think this is solvable but it suggests decoupling "list [all] images and their metadata" from "list allowed images".

In terms of the concrete problem we're solving: is it that developers can be confused about what to deploy and we want to guide them? If so, this could be a suggestion instead of a restriction. That has the advantage that it doesn't require somebody to curate the list if they don't want to, and it doesn't create work for a developer who wants to try a new image. Or is the problem really that someone wants to carefully control which images can be used? Who is that, in practice?

@davepacheco
Collaborator

I think there's consensus on having a top-level thing like "/ops" or "/fleet" that would include things like:

  • viewing information about hardware, controlling hardware (e.g., remove things from service)
  • viewing and managing open problems
  • settings (e.g., SMTP server, NTP config)
  • configuration of upstream network connectivity
  • software upgrade (viewing what components are at what versions and managing upgrade plans)
  • knobs and switches around things like live migration, if any
  • silo configuration itself probably belongs here (list of silos and any resources for viewing or managing anything in them)
  • probably (?): IP pools
  • less clear: images

Then there are resources that are "virtualized" within the Silo:

  • /users
  • /organizations
  • probably any other self-service things that end users are likely to use?

Are there other categories?

@karencfv
Contributor

Just catching up on this issue. I'd just like to mention support bundles (or whatever they end up being called). I'm assuming they'd be added to the ops/fleet list?

@rmustacc

Yes, in general most rack / core software support would probably be under that. The actual specifics will get a little murky as fault management evolves and we have things that are specific to getting information and errors about projects and how that ties into in-console documentation and data.

@zephraph
Contributor Author

Let's migrate this conversation over to https://github.com/oxidecomputer/rfd/pull/457
