-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Instrument Runner Service #602
Conversation
try { | ||
this.database_connection_parameters = this.database_connection_parameters ?? | ||
await this.deps.provisioner.getDatabaseConnectionParameters(hasuraRoleName); | ||
} catch (e) { | ||
const error = e as Error; | ||
simultaneousPromises.push(this.writeLog(LogLevel.ERROR, functionName, blockHeight, 'Failed to get database connection parameters', error.message)); | ||
throw error; | ||
} finally { | ||
credentialsFetchSpan.end(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must end the spans ourselves. Spans that aren't explicitly ended are ended upon garbage collection. In the cases where we might expect a failure in an uncommon case, I used finally to ensure these spans end. In some cases, it's ok letting it be garbage collected since the trace itself isn't valuable.
await this.writeLog(LogLevel.ERROR, functionName, blockHeight, 'Error running IndexerFunction', error.message); | ||
throw e; | ||
} | ||
await this.tracer.startActiveSpan('run indexer code', async (runIndexerCodeSpan: Span) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use setActiveSpan when I anticipate more traces to be made inside the code block that I want to establish a child status for. In this case, any context.db or hasura calls made in the vm should be children of the run indexer code
span.
traceExporter: new TraceExporter(), | ||
spanProcessors: [new BatchSpanProcessor(new TraceExporter( | ||
{ | ||
projectId: process.env.GCP_PROJECT_ID ?? '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An empty project ID will send the traces nowhere. I figured this was the safer default to use.
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node'; | ||
|
||
export default function setUpTracerExport (): void { | ||
switch (process.env.TRACING_EXPORTER) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change here is what exporter we set below. The rest of the code is the same. So, swapping out the trace endpoint is actually quite cheap.
runner/src/stream-handler/worker.ts
Outdated
const postRunSpan = tracer.startSpan('Delete redis message and shift queue', {}, context.active()); | ||
parentPort?.postMessage({ type: WorkerMessageType.STATUS, data: { status: Status.RUNNING } }); | ||
await workerContext.redisClient.deleteStreamMessage(streamKey, streamMessageId); | ||
workerContext.queue.shift(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this errors will it escape this function's try/catch since it's not awaited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the queue shift? If so, yep I spotted that and put an await on it. startSpan is sync. It doesn't need an await. I'll be pushing the change up. I'm actually reverting to not modify the QueueMessage. Fewer changes to worker and putting the block height in the root span doesn't help much since we sample only 10% anyway. The block height is listed in a child span later.
- feat: Instrument Runner Service (#602) - Support WHERE col IN (...) in context.db.table.select and delete (#606) - feat: Include indexer name in context db build failure warning (#611) - Cache provisioning status (#607) - Fix ESLint on DmlHandler (#612) - fix: Substitution 'node-sql-parser' with a forked version until Apri 1st (#597) - feat: Add pgBouncer to QueryApi (#615) - feat: Expose near-lake-primitives to VM (#613) --------- Co-authored-by: Pavel Kudinov <[email protected]> Co-authored-by: Pavel Kudinov <[email protected]> Co-authored-by: Kevin Zhang <[email protected]> Co-authored-by: Morgan McCauley <[email protected]>
Runner is lacking instrumentation. It is responsible for many things and it's become hard to understand what tasks contribute to the overall latency of an indexer. In addition, we are now at a point where we need to drive down latencies to facilitate new * indexer use cases such as access keys.
I've chosen to instrument Runner with OpenTelemetry. Tracing generally requires 3 items: An instrumented service, a trace collector, and a trace visualizer. The service is responsible for collecting and transmitting trace data to the collector. The collector should be able to receive trace data with little fuss to prevent performance impacts to the instrumented service. The collector then processes the trace data and transmits the processed data to the visualizer. The visualizer visualizes trace data and allows for filtering on traces.
The benefit of OpenTelemetry over other options like Zipkin and Jaeger is that GCP already supports ingesting OpenTelemetry data. As such, we don't need to provision a collector ourselves, and can instead leverage GCP's existing collector & visualizer Tracing service. For local development, traces can be output to console, a Zipkin all-in-one container or to GCP (Requires Cloud Trace Agent role and specifying project ID). This is done by simply initializing the NodeSDK differently.
In addition, we do not want to enable traces in prod yet, so by not specifying any exporter. This creates a No-Op Trace Exporter which won't attempt to record traces.
No code changes were made changing code execution path. All tests pass with no changes, aside from having to replace snapshots due to changes in tabbing of mutation strings. I have manually verified mutation strings are still the same by stripping whitespace and checking against original.