
[public-api] Initial scaffolding and draft #8683

Merged
merged 8 commits into from
Mar 26, 2022

Conversation

Contributor

@csweichel csweichel commented Mar 9, 2022

Description

This PR produces an initial draft of a public API for Gitpod. We will merge this PR once we're happy with a first draft design of the workspaces side of the public API.

Related Issue(s)

Fixes #8681
Part of #7900

Open Questions

We have a bunch of open questions we should explore/answer as part of this PR.

Error Handling

By default gRPC offers error codes. While the categories they introduce work well in practice, categorical errors alone (even with a message) often aren't enough.
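To make that concrete, here is a minimal client-side sketch of the default model; the categorical code is all the caller gets to branch on (the scenarios in the comments are illustrative, not part of this draft):

```go
import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// inspectErr branches on the categorical gRPC code of a failed call.
func inspectErr(err error) {
	switch status.Code(err) {
	case codes.NotFound:
		// but *what* was not found - the workspace, the project, the team?
	case codes.PermissionDenied:
		// which permission is missing, and on which resource?
	default:
		// a category plus a free-form message is all the default model gives us
	}
}
```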

Errors as part of the protocol definition

Failure modes and "unhappy paths" are where APIs get interesting. Callers need to know what to expect, and which error cases to handle. If we found a way to communicate those failure modes as part of the API spec, we'd enable future automatic tests and better documentation.

Right now the draft uses comments to describe possible errors. We could consider using custom "option"s for this use case. Either way, we'd have to build tooling to lint and test this.

Co-locating Workspace and Workspace Instance

In some places it's convenient to retrieve the workspace and last active instance at the same time, e.g. ListWorkspaces. Without this combination we're likely to see 1+n access patterns, where users ListWorkspaces and then GetLastActiveInstance for each workspace. Retrieving both at the same time can be reasonably expensive though and is not always necessary.

FieldMasks allow clients to specify what they expect in the response, akin to GraphQL. They are however a bit cumbersome to work with, and support for them in all languages is unclear.
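For illustration, a hedged sketch of what a field-mask-aware call could look like from Go; the request type and the `"workspace"` path are assumptions, not part of this draft:

```go
import "google.golang.org/protobuf/types/known/fieldmaskpb"

// Ask only for the workspace itself and skip the (potentially expensive)
// last active instance. The request type and field paths are assumed here.
req := &v1.ListWorkspacesRequest{
	FieldMask: &fieldmaskpb.FieldMask{Paths: []string{"workspace"}},
}
```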

Pagination

All List* calls currently support pagination, following a simple pagination model: in the response we return a next_page_token which can be passed back into the List* call to advance to the next page of the paginated result set.

It's unclear if we need such capabilities today - we might just as well get away with a result limit and not be able to read "further pages".
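As a sketch, consuming the token-based model from a client is a simple loop; the client and generated message types below are assumptions following the draft's naming:

```go
// listAll drains every page of ListWorkspaces; an empty next_page_token
// marks the last page.
func listAll(ctx context.Context, client v1.WorkspacesServiceClient) ([]*v1.Workspace, error) {
	var workspaces []*v1.Workspace
	token := ""
	for {
		resp, err := client.ListWorkspaces(ctx, &v1.ListWorkspacesRequest{PageToken: token})
		if err != nil {
			return nil, err
		}
		workspaces = append(workspaces, resp.GetResult()...)
		token = resp.GetNextPageToken()
		if token == "" {
			return workspaces, nil
		}
	}
}
```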

Linting

We'll have many authors working on this API in the future. Keeping a consistent style, and making sure we're not breaking things down the road, is important. Buf offers such capabilities OOTB. I've added buf linting to the code generation script. We could just as well consider making that part of our CI build.

We will want to add means to detect breaking changes (using buf) in our CI pipeline sooner rather than later.

Rich Errors

Instead of relying on the official error model, we could sport a richer error model. Google APIs, which rely heavily on gRPC, sport such a rich error model. We could re-use it; it is supported to some extent by client libraries (a client-side sketch follows the pro/con list below).

We will want to decide on the model early on, because introducing it later on (while maintaining backwards compatibility) would be painful.

👍 Pro:

  • gives us a way to detail the categorical errors and lets us introduce more fine-grained handling

👎 Con:
(copied from the gRPC docs)

  • Library implementations of the extended error model may not be consistent across languages in terms of requirements for and expectations of the error details payload
  • Existing proxies, loggers, and other standard HTTP request processors don’t have visibility into the error details and thus wouldn’t be able to leverage them for monitoring, metrics or other purposes
  • Additional error detail in the trailers interferes with head-of-line blocking, and will decrease HTTP/2 header compression efficiency due to more frequent cache misses
  • Larger error detail payloads may run into protocol limits (like max headers size), effectively losing the original error
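As referenced above, here is a client-side sketch of consuming the rich model: google.rpc.Status travels in the grpc-status-details-bin trailer, and Go's status package can unpack the typed details. Which detail types (if any) we would attach is exactly the open question:

```go
import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/status"
)

// describeErr unpacks typed error details from a failed call.
func describeErr(err error) {
	st := status.Convert(err)
	for _, detail := range st.Details() {
		switch d := detail.(type) {
		case *errdetails.BadRequest:
			for _, v := range d.GetFieldViolations() {
				_ = v.GetField() // fine-grained, machine-readable cause
			}
		case *errdetails.RetryInfo:
			_ = d.GetRetryDelay().AsDuration() // how long to back off
		}
	}
}
```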

How to test

Not much to test - more to discuss

Release Notes

Added first draft of a public API

@csweichel csweichel requested review from aledbf and akosyakov and removed request for aledbf March 9, 2022 16:06
Member

easyCZ commented Mar 10, 2022

Looking pretty good! Below are a couple of ideas (and the reasoning for them).

Domain model

I would recommend using the Kubernetes resource style for structuring the API - having a first-class spec, which is what the user/caller specifies and configures, and a status, which the system uses to report information back. Here's an example:

```proto
message Workspace {
    // House-keeping information, like first-class ownership, the project it belongs to, etc.
    string id = 1;
    string owner_id = 2;
    string project_id = 3;
    google.protobuf.Timestamp created = 4;

    // spec defines the requested specification of the Workspace - it is only modifiable
    // by the caller (user/system) and expresses the desired state of the resource
    WorkspaceSpec spec = 5;

    // status is only modifiable/settable by the system to communicate the currently observed
    // (not necessarily the most up to date - but eventually consistent) status of the system
    WorkspaceStatus status = 6;
}

message WorkspaceSpec {
    // context_url is the original context URL the workspace was created from
    string context_url = 1;

    // description is a human-readable description of the workspace
    string description = 2;

    // Only start the workspace if set to true
    bool start = 3;

    // when set, expresses a preference for regional affinity etc.
    string region = 4;

    // other
    ...
}

message WorkspaceStatus {
    // The workspace instance is what the system "assigns" to the workspace;
    // the user shouldn't set it, so it goes in the status
    WorkspaceInstance instance = 1;

    // Encodes the currently observed state of the Workspace - NEW, QUEUED, IMAGE_BUILDING etc.
    WorkspaceState state = 2;

    ...
}
```

Following this model does a couple of things:

  1. Makes it very explicit what the user can/should set
  2. Delineates responsibilities for resource reconciliation in the system
  3. The specification is very declarative, with the system doing magic to get to that state
  4. Enables you to push configurability to the spec - the system is then responsible for making it happen - for example, the spec specifies start = false, so this only creates a record but doesn't actually provision resources.
  5. Allows you to isolate the service from the rest of the system - it largely has most of the information it needs to operate in its own domain model.
  6. Explicitly communicates to the user that this resource is long running and that the resource progresses through states as it runs.

This approach also addresses the errors question. The error is always encoded in the status with details for both humans and computers.

RPCs

I would encourage trimming the RPC surface a bit more. Concretely:

```proto
service WorkspacesService {
    // Basic CRUD operations on the resource spec - this is quite easy to grasp
    // for most people familiar with REST

    // ListWorkspaces lists workspaces. You can filter for currently running ones
    // by expressing constraints/filters in the ListWorkspacesRequest
    rpc ListWorkspaces(ListWorkspacesRequest) returns ()
    rpc CreateWorkspace() returns ()

    // Some updates may not be possible - like moving a region within a given
    // Workspace ID (at least for now); in that case this would be rejected with an error
    rpc UpdateWorkspace() returns ()

    // Delete should be a soft delete - mark it as "to delete" and the system
    // begins termination, using status to update the consumer
    rpc DeleteWorkspace() returns ()

    // Subscribe
    rpc SubscribeToStatusUpdates(SubscribeReq) returns (stream WorkspaceStatus)

    rpc GetLogs(..) returns (...)
    rpc StreamLogs(...) returns (...)
}
```

Keeping the RPC surface trim eases integration - I spend less time figuring out which method to use (for example Create vs Start) and it's easier to transactionise.

Happy to go into more details or discuss this further. I still have limited context of the whole system so definitely looking for context and details.

@csweichel
Contributor Author

Thank you for the feedback - much appreciated :)

I had a read and tried to map this back.

  • spec vs status: for workspaces there isn't much of a spec that users could modify. It's all derived from the context_url. There is the workspace config, which comes closest to a spec and which can make sense to modify, but I've deliberately left that out of the API for now (it's not needed in 90% of use cases).
  • for workspace instances that's a slightly different story. Here a spec makes sense at start time, but modifying it at runtime doesn't. That spec would contain things like the IDE images, environment variables, or the region to start in. We could well wrap that in a Spec message.

Re CRUD/RPC:

  • I'm with you, we'll want to model things closely to known nomenclature (basically follow the principles outlined in the DD, which are CRUD). We have interesting edge cases though: StartWorkspace for example. In CRUD terms it's CreateWorkspaceInstance and then some, but with a host of constraints. StartWorkspace is much more descriptive of the operation.

  • I'm a proponent of following the rpc Foo (FooRequest) returns (FooResponse) {} convention. If that FooResponse just contains a single field that's fine. Benefits: those FooResponse messages can grow down the line; this way we can do that and remain backwards compatible. Also, it is very regular, i.e. as a user of the API I always know how the request/response types are called. Buf nicely enforces this convention with its linter.

  • we should find a naming convention for listening/subscribing to events and logs. Options are:

    1. ListenTo... which is the naming convention used in the old API
    2. SubscribeTo... which is the naming convention used in ws-manager
    3. Get... which would follow a more CRUD-like convention, but fails to convey the streaming nature of the call.

    I have no strong preference for any of those names, so long as we pick one and use it consistently.

@akosyakov
Member

I think it looks good for a first attempt. We will need APIs around user IDE settings management, but those can be added later.

@csweichel
Contributor Author

After a discussion with @easyCZ I've added structured errors using google.rpc.Status to the calls, and added linting rules which enforce those for all response messages.

Member

easyCZ commented Mar 11, 2022

Regarding the structured errors, my thinking has been to use a custom status representation rather than google.rpc.Status. Let me illustrate with an example.
I would expect an RPC to return a standard gRPC error (possibly mappable to HTTP status codes) when the underlying request cannot be serviced, for example due to:

  • Permissions
  • Resource not existing
  • Rate limits
  • Invalid Arguments

However, for requests which are valid, for example for something like Resource with ID: 123, I would expect the following response:

```proto
message ResourceSpec {
    ...
    bool delete = 10;
}

message ResourceStatus {
    // state is a machine-readable, high-level state of the resource - the last state observed by the controller
    string state = 1; // NEW, QUEUED, RUNNING, DELETING, COMPLETE, ERROR
    string details = 2; // human-readable details for each state - when state == ERROR, this can provide hints on how to fix it
    ...
}

message Response {
    ResourceSpec spec = 1;
    ResourceStatus status = 2;
}
```

All valid requests return this as the underlying Resource - that is, Get, Create, Delete and Update return it, and List returns a repeated list of it.
So now for an example:

```
GetResource(123)    -> Returns OK, even when the Resource is in an error state
DeleteResource(999) -> Returns NotFound
DeleteResource(123) -> Returns OK, { spec: { delete: true, ... }, status: { state: DELETING } }
```

So instead of using the default google.rpc.Status, we define whichever states make sense for our particular resource. Just like k8s Pods define their statuses.
Of course, we can use oneof definitions to better encapsulate an ErrorState with extra messaging etc.
I think using google.rpc.Status constrains us too much with respect to encoding the relevant domain of Gitpod into the API.
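A small sketch of the resulting two-layer handling on the client side; the resource and client names are illustrative only:

```go
resp, err := client.GetResource(ctx, &GetResourceRequest{Id: "123"})
if err != nil {
	// request-level failure: permissions, unknown ID, rate limits, invalid arguments
}
// the request itself was valid; errors of the resource live in its status
if resp.GetStatus().GetState() == "ERROR" {
	// resp.GetStatus().GetDetails() carries human-readable hints on how to fix it
}
```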

```
idempotency_token: "<random_string>",
context_url: "https://github.com/gitpod-io/gitpod",
prebuild: {
    if_available: true,
```
Member

this should be the default behavior.

### Prebuild aware with log support
```go
contextURL := "https://github.com/gitpod-io/gitpod"
prb := prebuilds.GetRunningPrebuild({ context_url: contextURL })
```
Member

I believe starting prebuilds and starting workspaces have so much in common, that we should rather generalize over them and make 'prebuild' a property on the workspaces API rather than duplicating everything in a prebuilds API.

Member

At least for listening to logs and status updates, we could use a prebuild's workspace instance. Also there might be an image build ongoing.

components/public-api/README.md
```proto
bool if_available = 3;
// uses a particular prebuild - the prebuild ID is scoped by the context URL.
// Use the PrebuildService to get ahold of the prebuild_id.
string prebuild_id = 4;
```
Member

Why would I explicitly pass a prebuild_id?

Contributor Author

Today's createWorkspace call is rather complicated because it mixes prebuild selection and workspace creation to provide a form of "transactional behaviour". By making prebuilds an explicit concern this becomes a lot easier, and we can factor prebuild selection out of workspace creation.

Member

@svenefftinge svenefftinge Mar 15, 2022

So I could create a workspace with a context URL and prebuild id that don't belong to each other?

Contributor Author

In the original design the prebuild_id would be matched against the context URL (basically scoped by the context URL). That feels very complicated.

In the most recent draft the "source" of the workspace is a oneof context_url or prebuild_id.

Member

@easyCZ easyCZ Mar 15, 2022

I think the Workspace APIs should NOT accept a prebuild ID. Here's why:

The API shouldn't necessarily encode how things work today, but much more the experience we want a user to have. We can use the "time to first 200" as a metric for the API UX.

With this in mind, we should eliminate ambiguity and concepts from the API calls which are not needed, or which the system can compute/discover for itself. In this case, I believe the system should be able to lookup if a prebuild for a context exists, and use it when it does - without explicitly defining the prebuild ID. This also means that the user doesn't need to learn the details of prebuilds, to receive their benefits - great ux is invisible.

As a user, it's easier for me to think of workspaces in terms of contexts, and not prebuilds. This means that it will be easier for me to only supply the context and not the prebuild id when I'm integrating. This in turn reflects badly on us because the user is not getting the benefits of prebuilds.

The added benefit is that if we discover a better way to optimize the experience that would replace prebuilds, then the API does not need to change - we're optimizing the workspace startup internally.

Contributor Author

I totally agree with the sentiment: focus on API UX. In this particular case, I think there are good reasons for making CreateAndStartWorkspace accept a prebuild_id:

  • we want to make prebuilds a first-class citizen within the API, hence making the use of prebuilds explicit in the API is desirable. The API itself forms a language which describes the concepts it operates on. In this case it makes it obvious that workspaces can be created from context URLs or prebuilds. And yes, prebuilds are created from context URLs, too :)
  • hiding the prebuilds away for this call makes the operation rather complicated: the API becomes "modal" like it is today, where the result of an operation is not always a created/started workspace (or error), but there are different operating modes which in turn require different responses.
  • we want to make sure users actually get the prebuild they just watched the logs for. If we just made the prebuild "implicit", we'd need to introduce "modal responses" (workspace created vs prebuild running) where the latter needs to provide info on how to listen to the logs (short of providing a prebuild ID :D ); and we'd need to maintain a "transaction" where we tie the prebuild log listener to the workspace start. Making those two concerns (log listening and prebuild use) explicit is easier.

components/public-api/gitpod/v1/workspaces.proto

codecov bot commented Mar 19, 2022

Codecov Report

Merging #8683 (f8edb19) into main (3ef1e86) will decrease coverage by 1.13%.
The diff coverage is n/a.

```
@@            Coverage Diff             @@
##             main    #8683      +/-   ##
==========================================
- Coverage   12.31%   11.17%   -1.14%     
==========================================
  Files          20       18       -2     
  Lines        1161      993     -168     
==========================================
- Hits          143      111      -32     
+ Misses       1014      880     -134     
+ Partials        4        2       -2     
```
| Flag | Coverage Δ |
| --- | --- |
| components-gitpod-cli-app | 11.17% <ø> (ø) |
| components-local-app-app-darwin-amd64 | ? |
| components-local-app-app-darwin-arm64 | ? |
| components-local-app-app-linux-amd64 | ? |
| components-local-app-app-linux-arm64 | ? |
| components-local-app-app-windows-386 | ? |
| components-local-app-app-windows-amd64 | ? |
| components-local-app-app-windows-arm64 | ? |

Flags with carried forward coverage won't be shown.

Impacted Files:
- components/local-app/pkg/auth/pkce.go
- components/local-app/pkg/auth/auth.go


Member

geropl commented Mar 23, 2022

wrt "encoding errors in the type system": Did we explore encoding error messages as return types? Like, treating protobuf as if it supported algebraic data types? As an example, GetWorkspaceResponse could become:

```proto
message GetWorkspaceResponse {
    oneof response {
        Workspace result = 1;
        NotFoundError not_found = 2;
        PermissionError permission_error = 3;
    }
}

message NotFoundError {
    string resource_type = 1;
    string id = 2;
}

message PermissionError {
    string resource_type = 1;
    string id = 2;
    string message = 3;
}
```

On the client side we would always have to explicitly check all errors before accessing the "actual" response. The result would also have to be checked for presence, for forward compatibility: if we add an error type later, old client definitions would see all fields empty. IMO that nicely encodes the "unknown error" case and establishes good client-side code patterns. 🙃

Example:

```go
resp, err := client.GetWorkspace(...)
if err != nil {
	// connection error case
}

if resp.GetResult() == nil {
	switch {
	case resp.GetNotFound() != nil:
		// handle domain error
	case resp.GetPermissionError() != nil:
		// handle domain error
	default:
		// unknown domain error (forward compatibility)
	}
}

// use resp.GetResult()
```

What I like as well is that we clearly separate "domain errors" ("the ID you requested does not return a result") from the generic protobuf errors, which are geared more towards service availability / a lower abstraction layer.

```proto
}

message CreateAndStartWorkspaceRequest {
    string idempotency_token = 1;
```
Member

❤️ THE single most important change that brings immediate benefit. Maybe I'm exaggerating... but maybe not. 😛
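For illustration, a sketch of the retry pattern such a token enables; the client and generated type names are assumptions based on this draft:

```go
token := uuid.NewString() // github.com/google/uuid; generated once, reused across retries

var resp *v1.CreateAndStartWorkspaceResponse
for attempt := 0; attempt < 3; attempt++ {
	var err error
	resp, err = client.CreateAndStartWorkspace(ctx, &v1.CreateAndStartWorkspaceRequest{
		IdempotencyToken: token,
		// ...
	})
	if err == nil {
		break // the server deduplicates on the token, so retries cannot double-create
	}
}
```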


```proto
oneof source {
    string context_url = 2;
    string prebuild_id = 3;
```
Member

Hm, this feels slightly odd... 🤔 cannot pinpoint it, though. Have to sleep over it.

Member

Why so?
As we now expose prebuild IDs on the dashboard, it would be nice to actually make use of them instead of going through all of the indirections.

```proto
message WorkspaceContext {
    // Explicit Git context
    message Git {
        string normalized_context_url = 1;
```
Member

@AlexTugarev AlexTugarev Mar 23, 2022

what is the normalized context URL?
let's use clone_url instead.
I just found out that some providers might use different URL patterns compared to the context URL, so you cannot just compute the clone URL from it; instead it's a result from external API services which we need to store.

@csweichel csweichel marked this pull request as ready for review March 25, 2022 14:32
@csweichel csweichel requested a review from aledbf as a code owner March 25, 2022 14:32
Member

@easyCZ easyCZ left a comment

Based on today's discussion between @csweichel, @geropl and myself, I'm happy to land this as is. The context of the discussion is below.

We'll land the PR as is and iterate. The APIs and structure are likely to change somewhat by the time this becomes an actual public API.

On the topic of errors: once this lands we'll put up two draft PRs to compare the options, and we'll choose based on that.


Successfully merging this pull request may close these issues.

Define gRPC interface files for a workspace API