
[public-api] Initial scaffolding and draft #8683

Merged
merged 8 commits into from
Mar 26, 2022

Conversation

Contributor

@csweichel csweichel commented Mar 9, 2022

Description

This PR produces an initial draft of a public API for Gitpod. We will merge this PR once we're happy with a first draft design of the workspaces side of the public API.

Related Issue(s)

Fixes #8681
Part of #7900

Open Questions

We have a bunch of open questions we should explore/answer as part of this PR.

Error Handling

By default gRPC offers error codes. While the categories they introduce work well in practice, categorical errors alone (even with a message) often aren't enough.
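To make that concrete, here is a minimal client-side sketch of the default model; the categorical code is all the caller gets to branch on (the scenarios in the comments are illustrative, not part of this draft):

```go
import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// inspectErr branches on the categorical gRPC code of a failed call.
func inspectErr(err error) {
	switch status.Code(err) {
	case codes.NotFound:
		// but *what* was not found - the workspace, the project, the team?
	case codes.PermissionDenied:
		// which permission is missing, and on which resource?
	default:
		// a category plus a free-form message is all the default model gives us
	}
}
```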

Errors as part of the protocol definition

Failure modes and "unhappy paths" are where APIs get interesting. Callers need to know what to expect, and which error cases to handle. If we found a way to communicate those failure modes as part of the API spec, we'd enable future automatic tests and better documentation.

Right now the draft uses comments to describe possible errors. We could consider using custom "option"s for this use case. Either way, we'd have to build tooling to lint and test this.

Co-locating Workspace and Workspace Instance

In some places it's convenient to retrieve the workspace and last active instance at the same time, e.g. ListWorkspaces. Without this combination we're likely to see 1+n access patterns, where users ListWorkspaces and then GetLastActiveInstance for each workspace. Retrieving both at the same time can be reasonably expensive though and is not always necessary.

FieldMasks allow clients to specify what they expect in the response, akin to GraphQL. They are however a bit cumbersome to work with, and support for them in all languages is unclear.
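For illustration, a hedged sketch of what a field-mask-aware call could look like from Go; the request type and the `"workspace"` path are assumptions, not part of this draft:

```go
import "google.golang.org/protobuf/types/known/fieldmaskpb"

// Ask only for the workspace itself and skip the (potentially expensive)
// last active instance. The request type and field paths are assumed here.
req := &v1.ListWorkspacesRequest{
	FieldMask: &fieldmaskpb.FieldMask{Paths: []string{"workspace"}},
}
```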

Pagination

All List* calls currently support pagination, following a simple pagination model: in the response we return a next_page_token which can be passed back into the List* call to advance to the next page of the paginated result set.

It's unclear if we need such capabilities today - we might just as well get away with a result limit and not be able to read "further pages".
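As a sketch, consuming the token-based model from a client is a simple loop; the client and generated message types below are assumptions following the draft's naming:

```go
// listAll drains every page of ListWorkspaces; an empty next_page_token
// marks the last page.
func listAll(ctx context.Context, client v1.WorkspacesServiceClient) ([]*v1.Workspace, error) {
	var workspaces []*v1.Workspace
	token := ""
	for {
		resp, err := client.ListWorkspaces(ctx, &v1.ListWorkspacesRequest{PageToken: token})
		if err != nil {
			return nil, err
		}
		workspaces = append(workspaces, resp.GetResult()...)
		token = resp.GetNextPageToken()
		if token == "" {
			return workspaces, nil
		}
	}
}
```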

Linting

We'll have many authors working on this API in the future. Keeping a consistent style, and making sure we're not breaking things down the road, is important. Buf offers such capabilities OOTB. I've added buf linting to the code generation script. We could just as well consider making that part of our CI build.

We will want to add means to detect breaking changes (using buf) in our CI pipeline sooner rather than later.

Rich Errors

Instead of relying on the official error model, we could sport a richer error model. Google APIs, which rely heavily on gRPC, sport such a rich error model. We could re-use it; it is supported to some extent by client libraries (a client-side sketch follows the pro/con list below).

We will want to decide on the model early on, because introducing it later on (while maintaining backwards compatibility) would be painful.

👍 Pro:

  • gives us a way to detail the categorical errors and lets us introduce more fine-grained handling

👎 Con:
(copied from the gRPC docs)

  • Library implementations of the extended error model may not be consistent across languages in terms of requirements for and expectations of the error details payload
  • Existing proxies, loggers, and other standard HTTP request processors don’t have visibility into the error details and thus wouldn’t be able to leverage them for monitoring, metrics or other purposes
  • Additional error detail in the trailers interferes with head-of-line blocking, and will decrease HTTP/2 header compression efficiency due to more frequent cache misses
  • Larger error detail payloads may run into protocol limits (like max headers size), effectively losing the original error
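As referenced above, here is a client-side sketch of consuming the rich model: google.rpc.Status travels in the grpc-status-details-bin trailer, and Go's status package can unpack the typed details. Which detail types (if any) we would attach is exactly the open question:

```go
import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/status"
)

// describeErr unpacks typed error details from a failed call.
func describeErr(err error) {
	st := status.Convert(err)
	for _, detail := range st.Details() {
		switch d := detail.(type) {
		case *errdetails.BadRequest:
			for _, v := range d.GetFieldViolations() {
				_ = v.GetField() // fine-grained, machine-readable cause
			}
		case *errdetails.RetryInfo:
			_ = d.GetRetryDelay().AsDuration() // how long to back off
		}
	}
}
```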

How to test

Not much to test - more to discuss

Release Notes

Added first draft of a public API

@csweichel csweichel requested review from aledbf and akosyakov and removed request for aledbf March 9, 2022 16:06
Member

easyCZ commented Mar 10, 2022

Looking pretty good! Below are a couple of ideas (and the reasoning for them).

Domain model

I would recommend using the Kubernetes resource style for structuring the API - having a first-class spec, which is what the user/caller specifies and configures, and a status, which the system uses to report information back. Here's an example:

```proto
message Workspace {
    // House-keeping information, like first-class ownership, the project it belongs to, etc.
    string id = 1;
    string owner_id = 2;
    string project_id = 3;
    google.protobuf.Timestamp created = 4;

    // spec defines the requested specification of the Workspace - it is only modifiable
    // by the caller (user/system) and expresses the desired state of the resource
    WorkspaceSpec spec = 5;

    // status is only modifiable/settable by the system to communicate the currently observed
    // (not necessarily the most up to date - but eventually consistent) status of the system
    WorkspaceStatus status = 6;
}

message WorkspaceSpec {
    // context_url is the original context URL the workspace was created from
    string context_url = 1;

    // description is a human-readable description of the workspace
    string description = 2;

    // Only start the workspace if set to true
    bool start = 3;

    // when set, expresses a preference for regional affinity etc.
    string region = 4;

    // other
    ...
}

message WorkspaceStatus {
    // The workspace instance is what the system "assigns" to the workspace;
    // the user shouldn't set it, so it goes in the status
    WorkspaceInstance instance = 1;

    // Encodes the currently observed state of the Workspace - NEW, QUEUED, IMAGE_BUILDING etc.
    WorkspaceState state = 2;

    ...
}
```

Following this model does a couple of things:

  1. Makes it very explicit what the user can/should set
  2. Delineates responsibilities for resource reconciliation in the system
  3. The specification is very declarative, with the system doing magic to get to that state
  4. Enables you to push configurability to the spec - the system is then responsible for making it happen - for example, the spec specifies start = false, so this only creates a record but doesn't actually provision resources.
  5. Allows you to isolate the service from the rest of the system - it largely has most of the information it needs to operate in its own domain model.
  6. Explicitly communicates to the user that this resource is long running and that the resource progresses through states as it runs.

This approach also addresses the errors question. The error is always encoded in the status with details for both humans and computers.

RPCs

I would encourage trimming the RPC surface a bit more. Concretely:

```proto
service WorkspacesService {
    // Basic CRUD operations on the resource spec - this is quite easy to grasp
    // for most people familiar with REST

    // ListWorkspaces lists workspaces. You can filter for currently running ones
    // by expressing constraints/filters in the ListWorkspacesRequest
    rpc ListWorkspaces(ListWorkspacesRequest) returns ()
    rpc CreateWorkspace() returns ()

    // Some updates may not be possible - like moving a region within a given
    // Workspace ID (at least for now); in that case this would be rejected with an error
    rpc UpdateWorkspace() returns ()

    // Delete should be a soft delete - mark it as "to delete" and the system
    // begins termination, using status to update the consumer
    rpc DeleteWorkspace() returns ()

    // Subscribe
    rpc SubscribeToStatusUpdates(SubscribeReq) returns (stream WorkspaceStatus)

    rpc GetLogs(..) returns (...)
    rpc StreamLogs(...) returns (...)
}
```

Keeping the RPC surface trim eases integration - I spend less time figuring out which method to use (for example Create vs Start) and it's easier to transactionise.

Happy to go into more details or discuss this further. I still have limited context of the whole system so definitely looking for context and details.

@csweichel
Contributor Author

Thank you for the feedback - much appreciated :)

I had a read and tried to map this back.

  • spec vs status: for workspaces there isn't much of a spec that users could modify. It's all derived from the context_url. There is the workspace config, which comes closest to a spec and which can make sense to modify, but I've deliberately left that out of the API for now (it's not needed in 90% of use cases).
  • for workspace instances that's a slightly different story. Here a spec makes sense at start time, but modifying it at runtime doesn't. That spec would contain things like the IDE images, environment variables, or the region to start in. We could well wrap that in a Spec message.

Re CRUD/RPC:

  • I'm with you, we'll want to model things closely to known nomenclature (basically follow the principles outlined in the DD, which are CRUD). We have interesting edge cases though: StartWorkspace for example. In CRUD terms it's CreateWorkspaceInstance and then some, but with a host of constraints. StartWorkspace is much more descriptive of the operation.

  • I'm a proponent of following the rpc Foo (FooRequest) returns (FooResponse) {} convention. If that FooResponse just contains a single field that's fine. Benefits: those FooResponse messages can grow down the line; this way we can do that and remain backwards compatible. Also, it is very regular, i.e. as a user of the API I always know how the request/response types are called. Buf nicely enforces this convention with its linter.

  • we should find a naming convention for listening/subscribing to events and logs. Options are:

    1. ListenTo... which is the naming convention used in the old API
    2. SubscribeTo... which is the naming convention used in ws-manager
    3. Get... which would follow a more CRUD-like convention, but fails to convey the streaming nature of the call.

    I have no strong preference for any of those names, so long as we pick one and use it consistently.

@akosyakov
Member

I think it looks good for a first attempt. We will need APIs around user IDE settings management, but those can be added later.

@csweichel
Contributor Author

After a discussion with @easyCZ I've added structured errors using google.rpc.Status to the calls, and added linting rules which enforce those for all response messages.

Member

easyCZ commented Mar 11, 2022

Regarding the structured errors, my thinking has been to use a custom status representation rather than google.rpc.Status. Let me illustrate with an example.
I would expect an RPC to return a standard gRPC error (possibly mappable to HTTP status codes) when the underlying request cannot be serviced, for example due to:

  • Permissions
  • Resource not existing
  • Rate limits
  • Invalid Arguments

However, for requests which are valid, for example for something like Resource with ID: 123, I would expect the following response:

```proto
message ResourceSpec {
    ...
    bool delete = 10;
}

message ResourceStatus {
    // state is a machine-readable, high-level state of the resource - the last state observed by the controller
    string state = 1; // NEW, QUEUED, RUNNING, DELETING, COMPLETE, ERROR
    string details = 2; // human-readable details for each state - when state == ERROR, this can provide hints on how to fix it
    ...
}

message Response {
    ResourceSpec spec = 1;
    ResourceStatus status = 2;
}
```

All valid requests return this as the underlying Resource - that is, Get, Create, Delete and Update return it, and List returns a repeated list of it.
So now for an example:

```
GetResource(123)    -> Returns OK, even when the Resource is in an error state
DeleteResource(999) -> Returns NotFound
DeleteResource(123) -> Returns OK, { spec: { delete: true, ... }, status: { state: DELETING } }
```

So instead of using the default google.rpc.Status, we define whichever states make sense for our particular resource. Just like k8s Pods define their statuses.
Of course, we can use oneof definitions to better encapsulate an ErrorState with extra messaging etc.
I think using google.rpc.Status constrains us too much with respect to encoding the relevant domain of Gitpod into the API.
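A small sketch of the resulting two-layer handling on the client side; the resource and client names are illustrative only:

```go
resp, err := client.GetResource(ctx, &GetResourceRequest{Id: "123"})
if err != nil {
	// request-level failure: permissions, unknown ID, rate limits, invalid arguments
}
// the request itself was valid; errors of the resource live in its status
if resp.GetStatus().GetState() == "ERROR" {
	// resp.GetStatus().GetDetails() carries human-readable hints on how to fix it
}
```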

```
idempotency_token: "<random_string>",
context_url: "https://github.com/gitpod-io/gitpod",
prebuild: {
    if_available: true,
```
Member

this should be the default behavior.

### Prebuild aware with log support
```go
contextURL := "https://github.com/gitpod-io/gitpod"
prb := prebuilds.GetRunningPrebuild({ context_url: contextURL })
```
Member

I believe starting prebuilds and starting workspaces have so much in common, that we should rather generalize over them and make 'prebuild' a property on the workspaces API rather than duplicating everything in a prebuilds API.

Member

At least for listening to logs and status updates, we could use a prebuild's workspace instance. Also there might be an image build ongoing.

components/public-api/README.md
```proto
bool if_available = 3;
// uses a particular prebuild - the prebuild ID is scoped by the context URL.
// Use the PrebuildService to get ahold of the prebuild_id.
string prebuild_id = 4;
```
Member

Why would I explicitly pass a prebuild_id?

Contributor Author

Today's createWorkspace call is rather complicated because it mixes prebuild selection and workspace creation to provide a form of "transactional behaviour". By making prebuilds an explicit concern this becomes a lot easier, and we can factor prebuild selection out of workspace creation.

Member

@svenefftinge svenefftinge Mar 15, 2022

So I could create a workspace with a context URL and prebuild id that don't belong to each other?

Contributor Author

In the original design the prebuild_id would be matched against the context URL (basically scoped by the context URL). That feels very complicated.

In the most recent draft the "source" of the workspace is a oneof context_url or prebuild_id.

Member

@easyCZ easyCZ Mar 15, 2022

I think the Workspace APIs should NOT accept a prebuild ID. Here's why:

The API shouldn't necessarily encode how things work today, but much more the experience we want a user to have. We can use the "time to first 200" as a metric for the API UX.

With this in mind, we should eliminate ambiguity and concepts from the API calls which are not needed, or which the system can compute/discover for itself. In this case, I believe the system should be able to lookup if a prebuild for a context exists, and use it when it does - without explicitly defining the prebuild ID. This also means that the user doesn't need to learn the details of prebuilds, to receive their benefits - great ux is invisible.

As a user, it's easier for me to think of workspaces in terms of contexts, and not prebuilds. This means that it will be easier for me to only supply the context and not the prebuild id when I'm integrating. This in turn reflects badly on us because the user is not getting the benefits of prebuilds.

The added benefit is that if we discover a better way to optimize the experience that would replace prebuilds, then the API does not need to change - we're optimizing the workspace startup internally.

Contributor Author

I totally agree with the sentiment: focus on API UX. In this particular case, I think there are good reasons for making CreateAndStartWorkspace accept a prebuild_id:

  • we want to make prebuilds a first-class citizen within the API, hence making the use of prebuilds explicit in the API is desirable. The API itself forms a language which describes the concepts it operates on. In this case it makes it obvious that workspaces can be created from context URLs or prebuilds. And yes, prebuilds are created from context URLs, too :)
  • hiding the prebuilds away for this call makes the operation rather complicated: the API becomes "modal" like it is today, where the result of an operation is not always a created/started workspace (or error), but there are different operating modes which in turn require different responses.
  • we want to make sure users actually get the prebuild they just watched the logs for. If we just made the prebuild "implicit", we'd need to introduce "modal responses" (workspace created vs prebuild running) where the latter needs to provide info on how to listen to the logs (short of providing a prebuild ID :D ); and we'd need to maintain a "transaction" where we tie the prebuild log listener to the workspace start. Making those two concerns (log listening and prebuild use) explicit is easier.

components/public-api/gitpod/v1/workspaces.proto

codecov bot commented Mar 19, 2022

Codecov Report

Merging #8683 (f8edb19) into main (3ef1e86) will decrease coverage by 1.13%.
The diff coverage is n/a.

```
@@            Coverage Diff             @@
##             main    #8683      +/-   ##
==========================================
- Coverage   12.31%   11.17%   -1.14%     
==========================================
  Files          20       18       -2     
  Lines        1161      993     -168     
==========================================
- Hits          143      111      -32     
+ Misses       1014      880     -134     
+ Partials        4        2       -2     
```
| Flag | Coverage Δ |
| --- | --- |
| components-gitpod-cli-app | 11.17% <ø> (ø) |
| components-local-app-app-darwin-amd64 | ? |
| components-local-app-app-darwin-arm64 | ? |
| components-local-app-app-linux-amd64 | ? |
| components-local-app-app-linux-arm64 | ? |
| components-local-app-app-windows-386 | ? |
| components-local-app-app-windows-amd64 | ? |
| components-local-app-app-windows-arm64 | ? |

Flags with carried forward coverage won't be shown.

Impacted Files:
- components/local-app/pkg/auth/pkce.go
- components/local-app/pkg/auth/auth.go


Member

geropl commented Mar 23, 2022

wrt "encoding errors in the type system": Did we explore encoding error messages as return types? Like, treating protobuf as if it supported algebraic data types? As an example, GetWorkspaceResponse could become:

```proto
message GetWorkspaceResponse {
    oneof response {
        Workspace result = 1;
        NotFoundError not_found = 2;
        PermissionError permission_error = 3;
    }
}

message NotFoundError {
    string resource_type = 1;
    string id = 2;
}

message PermissionError {
    string resource_type = 1;
    string id = 2;
    string message = 3;
}
```

On the client side we would always have to explicitly check all errors before accessing the "actual" response. The result would also have to be checked for presence, for forward compatibility: if we add an error type later, old client definitions would see all fields empty. IMO that nicely encodes the "unknown error" case and establishes good client-side code patterns. 🙃

Example:

```go
resp, err := client.GetWorkspace(...)
if err != nil {
	// connection error case
}

if resp.GetResult() == nil {
	switch {
	case resp.GetNotFound() != nil:
		// handle domain error
	case resp.GetPermissionError() != nil:
		// handle domain error
	default:
		// unknown domain error (forward compatibility)
	}
}

// use resp.GetResult()
```

What I like as well is that we clearly separate "domain errors" ("the ID you requested does not return a result") from the generic protobuf errors, which are geared more towards service availability / a lower abstraction layer.

```proto
}

message CreateAndStartWorkspaceRequest {
    string idempotency_token = 1;
```
Member

❤️ THE single most important change that brings immediate benefit. Maybe I'm exaggerating... but maybe not. 😛
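For illustration, a sketch of the retry pattern such a token enables; the client and generated type names are assumptions based on this draft:

```go
token := uuid.NewString() // github.com/google/uuid; generated once, reused across retries

var resp *v1.CreateAndStartWorkspaceResponse
for attempt := 0; attempt < 3; attempt++ {
	var err error
	resp, err = client.CreateAndStartWorkspace(ctx, &v1.CreateAndStartWorkspaceRequest{
		IdempotencyToken: token,
		// ...
	})
	if err == nil {
		break // the server deduplicates on the token, so retries cannot double-create
	}
}
```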


```proto
oneof source {
    string context_url = 2;
    string prebuild_id = 3;
```
Member

Hm, this feels slightly odd... 🤔 cannot pinpoint it, though. Have to sleep over it.

Member

Why so?
As we now expose prebuild IDs on the dashboard, it would be nice to actually make use of them instead of going through all of the indirections.

```proto
message WorkspaceContext {
    // Explicit Git context
    message Git {
        string normalized_context_url = 1;
```
Member

@AlexTugarev AlexTugarev Mar 23, 2022

what is the normalized context URL?
let's use clone_url instead.
I just found out that some providers might use different URL patterns compared to the context URL, so you cannot just compute the clone URL from it; instead it's a result from external API services which we need to store.

@csweichel csweichel marked this pull request as ready for review March 25, 2022 14:32
@csweichel csweichel requested a review from aledbf as a code owner March 25, 2022 14:32
Member

@easyCZ easyCZ left a comment

Based on today's discussion between @csweichel, @geropl and myself, I'm happy to land this as is. The context of the discussion is below.

We'll land the PR as is and iterate. The APIs and structure are likely to change somewhat by the time this becomes an actual public API.

On the topic of errors: once this lands we'll put up two draft PRs to compare the options, and we'll choose based on that.


Successfully merging this pull request may close these issues.

Define gRPC interface files for a workspace API