feat: Add Metrics for Memory Statistics #554
Conversation
@@ -61,12 +73,22 @@ export default class StreamHandler {
      indexer.setStatus(functionName, 0, Status.STOPPED).catch((e) => {
        console.log(`Failed to set status STOPPED for stream: ${this.streamKey}`, e);
      });
      indexer.writeLog(functionName, this.executorContext.block_height, `Encountered error processing stream: ${this.streamKey}, terminating thread\n${error.toString()}`).catch((e) => {
It is possible for an indexer to crash before it even gets a block to process, at which point a V1 indexer will end up logging the failure under block height 0. A V2 indexer would log it under the start block height, which would be accurate.
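A minimal sketch of that fallback idea, assuming hypothetical field names (`lastProcessedBlockHeight` and `startBlockHeight` are illustrative, not the actual executor context fields):

```typescript
// Illustrative only: attribute crash logs to the configured start block height
// when no block has been processed yet, rather than defaulting to 0.
function blockHeightForCrashLog (
  lastProcessedBlockHeight: number | undefined, // undefined until a block arrives
  startBlockHeight: number
): number {
  return lastProcessedBlockHeight ?? startBlockHeight;
}
```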
        await sleep(100);
        continue;
      }
      try {
Added a try/catch just in case. Redis hasn't failed us yet, but if it ever does time out, I would like this code to retry rather than crash.
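Roughly the pattern being described, as a hedged sketch (`redisClient.xRead` and the surrounding names are placeholders, not the actual runner internals): if the Redis read throws, log it, back off briefly, and retry instead of letting the worker die.

```typescript
// Sketch only: retry a Redis stream read on failure instead of crashing the worker.
async function readStreamWithRetry (
  redisClient: { xRead: (key: string) => Promise<unknown> }, // placeholder client shape
  streamKey: string
): Promise<unknown> {
  while (true) {
    try {
      return await redisClient.xRead(streamKey);
    } catch (error) {
      console.error(`Failed to read stream: ${streamKey}, retrying`, error);
      await new Promise((resolve) => setTimeout(resolve, 100)); // brief backoff before retrying
    }
  }
}
```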
  }
}

async function blockQueueConsumer (workerContext: WorkerContext, streamKey: string): Promise<void> {
  const indexer = new Indexer();
  const isHistorical = workerContext.streamType === 'historical';
  let streamMessageId = '';
-  let indexerName = '';
+  let indexerName = streamKey.split(':')[0];
This was sneaky. We set indexerName only after reading the indexer config for the first time, which itself only happens once a block is ready to be processed. So the metrics written under finally were labelled with an empty indexerName for the short period before a block is ready (during a fresh start of Runner). Setting an initial value from the streamKey works, since it's the same value. Plus, we'll get to refactor this to just take in the input config afterwards.
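A simplified sketch of the shape described above (not the actual runner code; `recordQueueMetric` is a hypothetical stand-in for the real metrics call): seeding indexerName from the stream key up front means the metric written in `finally` is labelled correctly even if no block was ever processed.

```typescript
// Sketch only: indexerName is seeded immediately instead of starting as ''.
async function blockQueueConsumerSketch (streamKey: string): Promise<void> {
  const indexerName = streamKey.split(':')[0]; // same value the config would later provide
  try {
    // ... read config (previously the first place indexerName was set),
    // fetch blocks, run functions ...
  } finally {
    recordQueueMetric(indexerName); // labelled correctly even before the first block
  }
}

// Hypothetical helper standing in for the real metrics call.
function recordQueueMetric (indexerName: string): void {
  console.log(`metric labelled with indexerName=${indexerName}`);
}
```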
@@ -114,6 +119,8 @@ async function blockQueueConsumer (workerContext: WorkerContext, streamKey: stri
      }
      const block = queueMessage.block;
      currBlockHeight = block.blockHeight;
+     const blockHeightMessage: WorkerMessage = { type: WorkerMessageType.BLOCK_HEIGHT, data: currBlockHeight };
+     parentPort?.postMessage(blockHeightMessage);
I wanted to put this under finally, but decided to post it before runFunctions, as soon as we get the block height, so that if runFunctions somehow crashes the executor we attribute the error to the correct block height.
This could be me overcorrecting, though. We could also put it in finally and change the structure of WorkerMessage to carry both a block height AND metric data.
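One possible shape for that alternative, purely as a sketch (this is not the actual WorkerMessage definition; the type name and fields are invented for illustration):

```typescript
// Hypothetical combined message: posted once from `finally`, carrying both the
// last block height and the metric samples gathered by the worker.
interface CombinedWorkerMessage {
  type: 'BLOCK_HEIGHT_AND_METRICS';
  data: {
    blockHeight: number;
    metrics: Record<string, number>;
  };
}

const example: CombinedWorkerMessage = {
  type: 'BLOCK_HEIGHT_AND_METRICS',
  data: {
    blockHeight: 110000000,
    metrics: { heapUsedBytes: 123456789, prefetchQueueSize: 10 },
  },
};
```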
Hey @morgsmccauley, I need to get these in today to test fixes for the memory errors, since they keep breaking prod. I'm gonna merge them since the changes are minor. Please look over it again whenever you can!
Runner has previously crashed due to Out of Memory (OOM) errors. It's unclear which settings need to be increased to resolve the error, and whether the adjustments properly address the underlying problems. To help understand the limits of the Runner service, I've added metrics to capture individual worker heap sizes, which contribute largely to any OOM-related problems. In addition to getting each worker's usage, we can also sum them to get overall heap allocation and utilization. I've also logged metrics for the prefetch queue size so that we can understand the relationship between prefetching and the memory footprint.
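As a rough sketch of how a worker could sample its heap and report it to the parent thread (the message shape and metric names here are assumptions for illustration, not the exact implementation in this PR), summing these per-worker samples on the parent side yields the overall heap allocation and utilization:

```typescript
import { parentPort } from 'worker_threads';
import { getHeapStatistics } from 'v8';

// Sketch only: sample this worker's V8 heap plus the current prefetch queue
// size and post them to the parent, which can aggregate across workers.
function reportWorkerMetrics (prefetchQueueSize: number): void {
  const heap = getHeapStatistics();
  parentPort?.postMessage({
    type: 'METRICS', // hypothetical message type
    data: {
      heapUsedBytes: heap.used_heap_size,
      heapTotalBytes: heap.total_heap_size,
      prefetchQueueSize,
    },
  });
}
```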
In addition, some smaller tasks were addressed as well: