
feat: Expose gRPC service to handle provisioning #758

Merged: 22 commits merged into main from feat/data-layer-management on Jun 9, 2024

Conversation

@morgsmccauley (Collaborator) commented May 29, 2024

This PR adds the DataLayerService, which allows for provisioning the "Data Layer" of a QueryAPI Indexer. "Data Layer" here refers to the Postgres database and Hasura GraphQL endpoint. I wanted to move away from the more general "provisioning" term as it is a bit overloaded and may get conflated with other provisioning steps related to Executors (which may become more involved with Firecracker). This PR doesn't replace the existing provisioning flow yet; I'll do that in a follow-up PR.

As provisioning can take some time, rather than keeping the request/connection open while we wait, the Provision command triggers the process and returns immediately. The CheckProvisioningStatus command can then be used to poll for completion.
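For illustration, here's a rough sketch of how a caller (e.g. Coordinator) might drive the two commands; the client shape and field names here are assumptions for the example, not the exact generated gRPC API:

```typescript
// Illustrative polling flow only; the real generated client is callback-based
// and the exact request/response field names may differ.
async function provisionDataLayer (client: any, accountId: string, functionName: string, schema: string): Promise<void> {
  // Provision kicks off the task server-side and returns immediately
  await client.Provision({ account_id: accountId, function_name: functionName, schema });

  // Poll CheckProvisioningStatus until the task settles
  while (true) {
    const response = await client.CheckProvisioningStatus({ account_id: accountId, function_name: functionName });
    if (response.status === 'FAILED') throw new Error('Provisioning failed');
    if (response.status === 'COMPLETE') return; // assumed name for the terminal success state
    await new Promise((resolve) => setTimeout(resolve, 1000)); // back off before polling again
  }
}
```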

On top of the new service, the following changes have been made:

  • the gRPC server/ directory has been slightly restructured to accommodate additional services, i.e. DataLayerService; as a result you'll see lots of Runner files being moved around
  • Provisioner now takes ProvisioningConfig, which is a subset of IndexerConfig. On the gRPC side we don't have a full IndexerConfig, nor do we need it, so I created the subset to make it possible to provision with only the data we need. IndexerConfig extends ProvisioningConfig, so it's not a breaking change for existing use-cases (a rough sketch follows this list).
  • I've removed all the unused code in Provisioner
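To illustrate the config relationship: this is a rough sketch only; apart from accountId, functionName, and schema, the fields are assumptions rather than the exact definitions in this PR.

```typescript
// Sketch of the subset/extension relationship described above.
class ProvisioningConfig {
  constructor (
    public readonly accountId: string,
    public readonly functionName: string,
    public readonly schema: string // SQL used to provision the Data Layer
  ) {}
}

// The full config extends the provisioning subset, so existing
// IndexerConfig call sites keep working unchanged.
class IndexerConfig extends ProvisioningConfig {
  constructor (
    accountId: string,
    functionName: string,
    schema: string,
    public readonly code: string // assumed field: the indexer's function code
  ) {
    super(accountId, functionName, schema);
  }
}
```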

@morgsmccauley linked an issue on May 29, 2024 that may be closed by this pull request
@morgsmccauley force-pushed the feat/data-layer-management branch from e8d71b5 to 5daab08 on May 29, 2024 04:38
@morgsmccauley marked this pull request as ready for review on May 29, 2024 04:40
@morgsmccauley requested a review from a team as a code owner on May 29, 2024 04:40
@gabehamilton (Collaborator) commented:

Where did all the provisioner.ts code move to? Provisioning the logs & metadata.

@morgsmccauley (Collaborator, Author) replied:

> Where did all the provisioner.ts code move to? Provisioning the logs & metadata.

The removed code was a provision hook for existing Indexers, and is no longer needed. We still provision logs/metadata for new indexers in the main provisionUserApi() method.

@darunrs (Collaborator) left a review:

Most of my comments are conversational. Happy to proceed with the current PR as it's mainly establishing the foundation, but I'd love to get a clearer picture of where we're headed with the service, with schema editing and future QueryApi migrations in mind.

Also, having multiple gRPC dispatch destinations from the same server running on one port was a surprise. Very cool!

FAILED = 3;
}

message ProvisionResponse {
Collaborator:

So my assumption is that the DataLayer service will also be the location where we intend to action schema editing. So, just wanted to know if that's indeed what we would end up doing? If so, perhaps it makes sense to also have some standardized DataAction field which states the different things we might be doing:

  1. Publishing new indexer
  2. Actioning schema update
  3. Migrating indexer from one version to another (e.g. An update like adding logs tables)

I believe our intended behavior in Coordinator would be different based on if there is a failure in any of the above cases.

@morgsmccauley (Author):

Yes, my intention is for this service to manage all "Data Layer" related tasks: provisioning, deprovisioning, deleting data, etc.

I'm not sure I follow what this DataAction field would do. Regardless, I'm usually against adding things we don't need right now. I'd prefer to do that when we add those features, as we'll probably get the shape wrong if we do it prematurely.

@@ -2,6 +2,53 @@ import crypto from 'crypto';
import { type StartExecutorRequest__Output } from '../generated/runner/StartExecutorRequest';
import { LogLevel } from '../indexer-meta/log-entry';

class BaseConfig {
Collaborator:

Nice idea! We can easily enrich the original config for different purposes.

string function_name = 2;
}

enum ProvisioningStatus {
Collaborator:

Will this status be used for any other data layer tasks we will need to do in the future? Such as schema edits or migrating Indexers between versions. We can rename this to TaskStatus or something. Or, if this status is specific to provisioning, should we distinguish between retryable failure modes and manual intervention failure modes? Such as failing to create a PG DB vs Hasura failed to track foreign keys?

@morgsmccauley (Author):

TaskStatus is a good idea, but I'm not quite sure if we'd re-use this for the other methods. I can rename it if/when it does get re-used :).

In regards to failure modes, yes, we should eventually have that, but I think that can also be added later. I wanted to go for a 1:1 refactor, just so I can have Coordinator control provisioning; we can build on the robustness later.

createDataLayerService(undefined, tasks).Provision(call, callback);
});

it('should return ALREADY_EXISTS if the task has already completed', (done) => {
Collaborator:

What kind of scenario is this testing? Is this Coordinator trying to start the same task on its next loop, or Coordinator trying to start this task when it first connects to the DataLayer service? We are still going to store the state of an indexer's data in Redis as a source of truth, right?

@morgsmccauley (Author):

Not really testing any specific scenario. It's unlikely this will happen, but it provides a fail-safe in case we do something wrong in Coordinator :)

Comment on lines +13 to +15
public failed: boolean;
public pending: boolean;
public completed: boolean;
Collaborator:

Why are there three different booleans exposed instead of one Status enum or something?

@morgsmccauley (Author):

Oh yeah true, I could just use an enum.

I started the refactor, but I think the current state actually reads better, i.e. we can just do task.completed or task.pending. With an enum we need to pass the type around and do task.status === TaskStatus.COMPLETED.

Which do you think is better? I'll merge as is but can refactor later if you prefer the other :)
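For reference, a minimal sketch of the two shapes being compared (the class names are illustrative):

```typescript
// Option A: separate boolean flags, as in the current PR
class TaskWithFlags {
  public pending = true;
  public completed = false;
  public failed = false;
}

// Option B: a single status enum
enum TaskStatus {
  PENDING,
  COMPLETED,
  FAILED,
}

class TaskWithStatus {
  public status: TaskStatus = TaskStatus.PENDING;
}

// Call sites then read as:
//   if (task.completed) { ... }                        // Option A
//   if (task.status === TaskStatus.COMPLETED) { ... }  // Option B
```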

Collaborator:

Gotcha. I guess it depends on what we end up doing with this list of statuses. If we want to add another one, it would be more difficult. Similarly, if we are passing them around, the parameter type will be a simple boolean, which is fine, but I think TaskStatus more readily describes what this boolean actually is, without needing to encode that in the variable name itself. If we intend to make it a part of another enum, that will also help. Basically, if the usage is simple, I think I also am cool with multiple values. But if we want to do more with them or use them frequently, I feel an enum makes more sense.


type ProvisioningTasks = Record<string, ProvisioningTask>;

const generateTaskId = (accountId: string, functionName: string): string => `${accountId}:${functionName}`;
Collaborator:

This will ensure that DataLayer cannot spin up more than one task for one indexer. I believe this should make sense even going forward when we add more tasks.

tasks: ProvisioningTasks = {}
): DataLayerHandlers {
return {
CheckProvisioningStatus (call: ServerUnaryCall<CheckProvisioningStatusRequest__Output, ProvisionResponse>, callback: sendUnaryData<ProvisionResponse>): void {
Collaborator:

Should this return if a task exists or if the indexer was ever provisioned? If it is the former, maybe we can rename this to CheckProvisioningTaskStatus?

The way I currently envision this service being used is that Coordinator reads the permanent Indexer provisioning status from Redis. It then starts a provisioning task if needed. We need to handle partial provisioning failures somehow, since not all provisioning errors are retryable and the service could also just restart in the middle of successfully provisioning something. Previously, Runner was doing this unilaterally by checking Postgres and Hasura. I'm wondering if that logic should still exist in DataLayer. It's obviously easier to write, but it would also be nice if we had that information stored in Redis so Coordinator could instead be the one to make that decision.

Then we need to extend this to have other tasks, like I mentioned earlier. So, we should probably figure out a JSON format (or something else) to store in Redis that makes sense.
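Purely as a strawman for that format discussion, the per-indexer record in Redis could look something like this (every field name here is hypothetical; nothing in this PR implements it):

```typescript
// Hypothetical shape of a per-indexer data-layer record stored in Redis.
interface DataLayerState {
  provisioned: boolean;       // has the Postgres DB / Hasura endpoint been created?
  schemaVersion: number;      // bumped by schema edits and migrations
  lastTask?: {
    type: 'PROVISION' | 'SCHEMA_UPDATE' | 'MIGRATION';
    status: 'PENDING' | 'COMPLETE' | 'FAILED';
    updatedAt: string;        // ISO timestamp
    error?: string;           // populated for non-retryable failures
  };
}

// e.g. stored under a key like `data-layer:<account_id>:<function_name>`
// and serialized with JSON.stringify before being written to Redis.
```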

@morgsmccauley (Author):

This only returns relevant information if the task exists, so good point, I will rename it to CheckProvisioningTaskStatus.

Perhaps we should return a random taskId, which Coordinator then uses to check the status. That would be clearer, but there wouldn't be any reason to have multiple tasks for a given indexer, so maybe not 🤔 I'll take a deeper look at this when I integrate with Coordinator.

> Runner was doing this unilaterally by checking Postgres and Hasura. I'm wondering if that logic should still exist in DataLayer

Yes, this still exists: if you call Provision on an Indexer which has already been provisioned, it will check Postgres/Hasura and return ALREADY_EXISTS.

return;
};

provisioner.fetchUserApiProvisioningStatus(provisioningConfig).then((isProvisioned) => {
Collaborator:

So seeing this, it seems like the design is to have Coordinator call Provision on all indexers and have DataLayer check if the Data Layer already exists, caching the result of that as a successful run of the task. Do you think storing the status of an indexer's data in Redis makes sense over using our current function? My issue with the current function is that reading from Hasura is slow and prone to transient failures, especially when many requests to metadata are made. I think Hasura must be doing something behind the scenes to ensure metadata state across all instances is matching, which slows down any calls to it. It makes sense to have Hasura be our source of truth though. What do you think?

@morgsmccauley (Author):

Yes, we should definitely store the provisioning state in Redis, but we should still have this here as a safeguard in case there is a bug in Coordinator which makes us call this again.

Even though we are the only consumers of this API, we may still make mistakes, so we should design it so it can't be abused 😅.
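As a rough sketch of that safeguard (the handler shape here is simplified and assumed; fetchUserApiProvisioningStatus and the ALREADY_EXISTS response come from the excerpt above):

```typescript
import { status as grpcStatus } from '@grpc/grpc-js';

// Sketch only: guard against duplicate Provision calls, even if Coordinator
// mistakenly re-sends one. Types are simplified to `any` for illustration.
function handleProvision (
  provisioner: any,
  provisioningConfig: any,
  startTask: () => void,
  callback: (err: any, response?: any) => void
): void {
  provisioner.fetchUserApiProvisioningStatus(provisioningConfig)
    .then((isProvisioned: boolean) => {
      if (isProvisioned) {
        // The Data Layer already exists: reject rather than re-provision
        callback({ code: grpcStatus.ALREADY_EXISTS, message: 'Data Layer is already provisioned' }, null);
        return;
      }
      startTask(); // kick off provisioning asynchronously
      callback(null, { status: 'PENDING' }); // assumed response shape; return immediately per the polling design
    })
    .catch((error: Error) => {
      callback({ code: grpcStatus.INTERNAL, message: error.message }, null);
    });
}
```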

@darunrs (Collaborator) commented Jun 2, 2024:

Also, sorry I took so long with the review. Really wanted to look at it during the trip but when I saw the PR content, I wanted to be thorough with it. 😅

@morgsmccauley merged commit c38fe9c into main on Jun 9, 2024 (3 checks passed)
@morgsmccauley deleted the feat/data-layer-management branch on June 9, 2024 21:25
Successfully merging this pull request may close these issues.

Expose endpoint from Runner to control provisioning