
feat: Expose gRPC service to handle provisioning #758

Merged: 22 commits merged into main from feat/data-layer-management on Jun 9, 2024

Conversation

@morgsmccauley (Collaborator) commented May 29, 2024

This PR adds the DataLayerService, which allows for provisioning the "Data Layer" of a QueryAPI Indexer. "Data Layer" here refers to the Postgres database and Hasura GraphQL endpoint. I wanted to move away from the more general "provisioning" term as it is a bit overloaded and may get conflated with other provisioning steps related to Executors (which may become more involved with Firecracker). This PR doesn't replace the existing provisioning flow yet; I'll do that in a follow-up PR.

As provisioning can take some time, rather than keeping the request/connection open while we wait, the Provision command triggers the process and returns immediately. The CheckProvisioningStatus command can then be used to poll for completion.
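For illustration, here's a rough sketch of how a caller (e.g. Coordinator) might drive the two commands; the client shape and field names here are assumptions for the example, not the exact generated gRPC API:

```typescript
// Illustrative polling flow only; the real generated client is callback-based
// and the exact request/response field names may differ.
async function provisionDataLayer (client: any, accountId: string, functionName: string, schema: string): Promise<void> {
  // Provision kicks off the task server-side and returns immediately
  await client.Provision({ account_id: accountId, function_name: functionName, schema });

  // Poll CheckProvisioningStatus until the task settles
  while (true) {
    const response = await client.CheckProvisioningStatus({ account_id: accountId, function_name: functionName });
    if (response.status === 'FAILED') throw new Error('Provisioning failed');
    if (response.status === 'COMPLETE') return; // assumed name for the terminal success state
    await new Promise((resolve) => setTimeout(resolve, 1000)); // back off before polling again
  }
}
```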

On top of the new service, the following changes have been made:

  • the gRPC server/ directory has been slightly restructured to accommodate additional services, i.e. DataLayerService; as a result you'll see lots of Runner files being moved around
  • Provisioner now takes ProvisioningConfig, which is a subset of IndexerConfig. On the gRPC side we don't have a full IndexerConfig, nor do we need it, so I created the subset to make it possible to provision with only the data we need. IndexerConfig extends ProvisioningConfig, so it's not a breaking change for existing use-cases (a rough sketch follows this list).
  • I've removed all the unused code in Provisioner
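To illustrate the config relationship: this is a rough sketch only; apart from accountId, functionName, and schema, the fields are assumptions rather than the exact definitions in this PR.

```typescript
// Sketch of the subset/extension relationship described above.
class ProvisioningConfig {
  constructor (
    public readonly accountId: string,
    public readonly functionName: string,
    public readonly schema: string // SQL used to provision the Data Layer
  ) {}
}

// The full config extends the provisioning subset, so existing
// IndexerConfig call sites keep working unchanged.
class IndexerConfig extends ProvisioningConfig {
  constructor (
    accountId: string,
    functionName: string,
    schema: string,
    public readonly code: string // assumed field: the indexer's function code
  ) {
    super(accountId, functionName, schema);
  }
}
```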

@morgsmccauley linked an issue on May 29, 2024 that may be closed by this pull request
@morgsmccauley force-pushed the feat/data-layer-management branch from e8d71b5 to 5daab08 on May 29, 2024 04:38
@morgsmccauley marked this pull request as ready for review on May 29, 2024 04:40
@morgsmccauley requested a review from a team as a code owner on May 29, 2024 04:40
@gabehamilton (Collaborator) commented:

Where did all the provisioner.ts code move to? Provisioning the logs & metadata.

@morgsmccauley (Collaborator, Author) replied:

> Where did all the provisioner.ts code move to? Provisioning the logs & metadata.

The removed code was a provision hook for existing Indexers, and is no longer needed. We still provision logs/metadata for new indexers in the main provisionUserApi() method.

@darunrs (Collaborator) left a review:

Most of my comments are conversational. Happy to proceed with the current PR as it's mainly establishing the foundation, but I'd love to get a clearer picture of where we're headed with the service, with schema editing and future QueryApi migrations in mind.

Also, having multiple gRPC dispatch destinations from the same server running on one port was a surprise. Very cool!

FAILED = 3;
}

message ProvisionResponse {
Collaborator:

So my assumption is that the DataLayer service will also be the location where we intend to action schema editing. So, just wanted to know if that's indeed what we would end up doing? If so, perhaps it makes sense to also have some standardized DataAction field which states the different things we might be doing:

  1. Publishing new indexer
  2. Actioning schema update
  3. Migrating indexer from one version to another (e.g. An update like adding logs tables)

I believe our intended behavior in Coordinator would be different based on if there is a failure in any of the above cases.

@morgsmccauley (Author):

Yes, my intention is for this service to manage all "Data Layer" related tasks: provisioning, deprovisioning, deleting data, etc.

I'm not sure I follow what this DataAction field would do. Regardless, I'm usually against adding things we don't need right now. I'd prefer to do that when we add those features, as we'll probably get the shape wrong if we do it prematurely.

@@ -2,6 +2,53 @@ import crypto from 'crypto';
import { type StartExecutorRequest__Output } from '../generated/runner/StartExecutorRequest';
import { LogLevel } from '../indexer-meta/log-entry';

class BaseConfig {
Collaborator:

Nice idea! We can easily enrich the original config for different purposes.

string function_name = 2;
}

enum ProvisioningStatus {
Collaborator:

Will this status be used for any other data layer tasks we will need to do in the future? Such as schema edits or migrating Indexers between versions. We can rename this to TaskStatus or something. Or, if this status is specific to provisioning, should we distinguish between retryable failure modes and manual intervention failure modes? Such as failing to create a PG DB vs Hasura failed to track foreign keys?

@morgsmccauley (Author):

TaskStatus is a good idea, but I'm not quite sure if we'd re-use this for the other methods. I can rename it if/when it does get re-used :).

In regards to failure modes, yes, we should eventually have that, but I think that can also be added later. I wanted to go for a 1:1 refactor, just so I can have Coordinator control provisioning; we can build on the robustness later.

createDataLayerService(undefined, tasks).Provision(call, callback);
});

it('should return ALREADY_EXISTS if the task has already completed', (done) => {
Collaborator:

What kind of scenario is this testing? Is this Coordinator trying to start the same task on its next loop, or Coordinator trying to start this task when it first connects to the DataLayer service? We are still going to store the state of an indexer's data in Redis as a source of truth, right?

@morgsmccauley (Author):

Not really testing any specific scenario. It's unlikely this will happen, but it provides a fail-safe in case we do something wrong in Coordinator :)

Comment on lines +13 to +15
public failed: boolean;
public pending: boolean;
public completed: boolean;
Collaborator:

Why are there three different booleans exposed instead of one Status enum or something?

@morgsmccauley (Author):

Oh yeah true, I could just use an enum.

I started the refactor, but I think the current state actually reads better, i.e. we can just do task.completed or task.pending. With an enum we need to pass the type around and do task.status === TaskStatus.COMPLETED.

Which do you think is better? I'll merge as is but can refactor later if you prefer the other :)
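For reference, a minimal sketch of the two shapes being compared (the class names are illustrative):

```typescript
// Option A: separate boolean flags, as in the current PR
class TaskWithFlags {
  public pending = true;
  public completed = false;
  public failed = false;
}

// Option B: a single status enum
enum TaskStatus {
  PENDING,
  COMPLETED,
  FAILED,
}

class TaskWithStatus {
  public status: TaskStatus = TaskStatus.PENDING;
}

// Call sites then read as:
//   if (task.completed) { ... }                        // Option A
//   if (task.status === TaskStatus.COMPLETED) { ... }  // Option B
```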

Collaborator:

Gotcha. I guess it depends on what we end up doing with this list of statuses. If we want to add another one, it would be more difficult. Similarly, if we are passing them around, the parameter type will be a simple boolean, which is fine, but I think TaskStatus more readily describes what this boolean actually is, without needing to encode that in the variable name itself. If we intend to make it a part of another enum, that will also help. Basically, if the usage is simple, I think I also am cool with multiple values. But if we want to do more with them or use them frequently, I feel an enum makes more sense.


type ProvisioningTasks = Record<string, ProvisioningTask>;

const generateTaskId = (accountId: string, functionName: string): string => `${accountId}:${functionName}`;
Collaborator:

This will ensure that DataLayer cannot spin up more than one task for one indexer. I believe this should make sense even going forward when we add more tasks.

tasks: ProvisioningTasks = {}
): DataLayerHandlers {
return {
CheckProvisioningStatus (call: ServerUnaryCall<CheckProvisioningStatusRequest__Output, ProvisionResponse>, callback: sendUnaryData<ProvisionResponse>): void {
Collaborator:

Should this return if a task exists or if the indexer was ever provisioned? If it is the former, maybe we can rename this to CheckProvisioningTaskStatus?

The way I currently envision this service being used is that Coordinator reads the permanent Indexer provisioning status from Redis. It then starts a provisioning task if needed. We need to handle partial provisioning failures somehow, since not all provisioning errors are retryable and the service could also just restart in the middle of successfully provisioning something. Previously, Runner was doing this unilaterally by checking Postgres and Hasura. I'm wondering if that logic should still exist in DataLayer. It's obviously easier to write, but it would also be nice if we had that information stored in Redis so Coordinator could instead be the one to make that decision.

Then we need to extend this to have other tasks, like I mentioned earlier. So, we should probably figure out a JSON format (or something else) to store in Redis that makes sense.
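Purely as a strawman for that format discussion, the per-indexer record in Redis could look something like this (every field name here is hypothetical; nothing in this PR implements it):

```typescript
// Hypothetical shape of a per-indexer data-layer record stored in Redis.
interface DataLayerState {
  provisioned: boolean;       // has the Postgres DB / Hasura endpoint been created?
  schemaVersion: number;      // bumped by schema edits and migrations
  lastTask?: {
    type: 'PROVISION' | 'SCHEMA_UPDATE' | 'MIGRATION';
    status: 'PENDING' | 'COMPLETE' | 'FAILED';
    updatedAt: string;        // ISO timestamp
    error?: string;           // populated for non-retryable failures
  };
}

// e.g. stored under a key like `data-layer:<account_id>:<function_name>`
// and serialized with JSON.stringify before being written to Redis.
```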

@morgsmccauley (Author):

This only returns relevant information if the task exists, so good point, I will rename it to CheckProvisioningTaskStatus.

Perhaps we should return a random taskId, which Coordinator then uses to check the status. That would be clearer, but there wouldn't be any reason to have multiple tasks for a given indexer, so maybe not 🤔 I'll take a deeper look at this when I integrate with Coordinator.

> Runner was doing this unilaterally by checking Postgres and Hasura. I'm wondering if that logic should still exist in DataLayer

Yes, this still exists: if you call Provision on an Indexer which has already been provisioned, it will check Postgres/Hasura and return ALREADY_EXISTS.

return;
};

provisioner.fetchUserApiProvisioningStatus(provisioningConfig).then((isProvisioned) => {
Collaborator:

So seeing this, it seems like the design is to have Coordinator call Provision on all indexers and have DataLayer check if the Data Layer already exists, caching the result of that as a successful run of the task. Do you think storing the status of an indexer's data in Redis makes sense over using our current function? My issue with the current function is that reading from Hasura is slow and prone to transient failures, especially when many requests to metadata are made. I think Hasura must be doing something behind the scenes to ensure metadata state across all instances is matching, which slows down any calls to it. It makes sense to have Hasura be our source of truth though. What do you think?

@morgsmccauley (Author):

Yes, we should definitely store the provisioning state in Redis, but we should still have this here as a safeguard in case there is a bug in Coordinator which makes us call this again.

Even though we are the only consumers of this API, we may still make mistakes, so we should design it so it can't be abused 😅.
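As a rough sketch of that safeguard (the handler shape here is simplified and assumed; fetchUserApiProvisioningStatus and the ALREADY_EXISTS response come from the excerpt above):

```typescript
import { status as grpcStatus } from '@grpc/grpc-js';

// Sketch only: guard against duplicate Provision calls, even if Coordinator
// mistakenly re-sends one. Types are simplified to `any` for illustration.
function handleProvision (
  provisioner: any,
  provisioningConfig: any,
  startTask: () => void,
  callback: (err: any, response?: any) => void
): void {
  provisioner.fetchUserApiProvisioningStatus(provisioningConfig)
    .then((isProvisioned: boolean) => {
      if (isProvisioned) {
        // The Data Layer already exists: reject rather than re-provision
        callback({ code: grpcStatus.ALREADY_EXISTS, message: 'Data Layer is already provisioned' }, null);
        return;
      }
      startTask(); // kick off provisioning asynchronously
      callback(null, { status: 'PENDING' }); // assumed response shape; return immediately per the polling design
    })
    .catch((error: Error) => {
      callback({ code: grpcStatus.INTERNAL, message: error.message }, null);
    });
}
```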

@darunrs (Collaborator) commented Jun 2, 2024:

Also, sorry I took so long with the review. Really wanted to look at it during the trip but when I saw the PR content, I wanted to be thorough with it. 😅

@morgsmccauley merged commit c38fe9c into main on Jun 9, 2024 (3 checks passed)
@morgsmccauley deleted the feat/data-layer-management branch on June 9, 2024 21:25
Successfully merging this pull request may close these issues.

Expose endpoint from Runner to control provisioning