
feat(monitoring): add scheduler functionality #1383

Merged

Conversation

kishore03109
Contributor

@kishore03109 kishore03109 commented May 17, 2024

Problem

This is the second part of the monitoring feature that we want to build. This PR only adds the scheduler and the related infra needed for it to function. The scheduler makes the monitor run once every 5 minutes, so that on-calls can pick up any related alarms.

Adding the alarms is done in the downstream PR.

Solution

We use bullmq to conveniently create a queue, a worker, and a repeatable job across multiple instances. We also do some exponential backoff on retries, since it is a nice-to-have and easy to implement. The original /site-up code has since been refactored to return an err or an ok, depending on whether the configuration is ideal.
Unfortunately, this surfaced quite a number of edge cases. Because of this, a looser check (whether the Isomer logo is present) is used to determine if a site is up.
Even with this loose check, workplacelearning.gov.sg has modified their site to not have the Isomer logo, so we use a GrowthBook (gb) flag to whitelist this one odd site. If tomorrow we get an alarm for a site going down and the outage is expected to be prolonged, we can go to GrowthBook and change the config to whitelist it too.

Breaking Changes

  • Yes - this PR contains breaking changes
    • Details ...
  • No - this PR is backwards compatible with ALL of the following feature flags in this doc

Tests

Screenshot 2024-05-21 at 11 20 59 AM

On deployment, assert that you see these logs. It is OK for there to be multiple instances of this log (the count directly corresponds to the number of instances we have), since bullmq is smart enough to create only one queue and one repeatable job across multiple instances.

Deploy Notes

The corresponding infra PR should be deployed to production, and only then should the Redis host value be populated into the 1PW for production.

Additionally, post-approval of this PR, add two alarms: one for
Error running monitoring service and another for Monitoring service has failed. These fire when the job fails to be initialised and when a run hits a new error, respectively.

New environment variables:

  • REDIS_HOST : Redis host
    • added env var to 1PW + SSM script (fetch_ssm_parameters.sh)

New dependencies:

  • bullmq : scheduler of choice

Contributor Author

kishore03109 commented May 17, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch from e84c100 to 2b30191 Compare May 17, 2024 05:50
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from 071ff12 to 9df257d Compare May 17, 2024 05:57
@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch 11 times, most recently from 7d7545a to cc61787 Compare May 21, 2024 03:19
@kishore03109 kishore03109 marked this pull request as ready for review May 21, 2024 03:27
@kishore03109 kishore03109 requested a review from a team May 21, 2024 03:27
@seaerchin
Contributor

test failure from a missing keyCdn.apiKey! could i trouble you to fix this @kishore03109

@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch 3 times, most recently from 19063bb to ba69105 Compare May 21, 2024 07:59
@kishore03109
Contributor Author

test failure from a missing keyCdn.apiKey! could i trouble you to fix this @kishore03109

added from downstream pr

@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from fb7233d to d1fa7ea Compare May 24, 2024 02:15
@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch 2 times, most recently from abe2be2 to 77738c8 Compare May 24, 2024 02:18
@@ -133,6 +133,7 @@
"name": "REDIRECT_URI",
"valueFrom": "PROD_REDIRECT_URI"
},
{ "name": "REDIS_HOST", "valueFrom": "PROD_REDIS_HOST" },
Contributor

just wanted to check here! setting the environment on both backend and support implies that backend would also require this? and this is because MonitoringService got folded into LaunchesService right

Contributor Author

this has since been modified. only support needs this.

Contributor Author

doing a U-Turn here.

this seems to throw an error since they share the same .env code, so by default, this would lead to keys being undefined.

There are a couple of ways to solve this.

  1. mark the envs optional
  2. create separate env rules for support and backend
  3. just import the secrets for both
  4. mock some env var for the redis host and the keycdn key inline in the package.json (worried about the obfuscation of env vars here, where a developer might expect the keycdn API key to be valid when used in the backend code, only to find that it is some mock value)

to unblock this pr, going to keep it simple
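A minimal sketch of option 1 above (treat the env var as optional with a fallback), so that backend and support can load the same shared config without both defining the variable. The helper name is illustrative, not this repo's actual config.ts API:

```typescript
// Hypothetical helper: read an env var, falling back to a default when it is
// unset or empty, so code sharing the same config can load without defining it.
function getOptionalEnv(name: string, fallback: string): string {
  const value = process.env[name]
  return value === undefined || value === "" ? fallback : value
}

// e.g. the backend would fall back to localhost when REDIS_HOST is not set
const redisHost = getOptionalEnv("REDIS_HOST", "localhost")
```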

Contributor

sorry not quite understanding! why would they share the same env code?

Contributor Author

they share the same config.ts code? either we would separate them or make the env var optional, right?

Contributor

ah ok, sounds good

Contributor

i realised we give a default in config.ts so i think there's still value in removing. leave the choice to you but fwiw, removing here is clearer

Contributor Author

ok, i make var optional, then remove in task def

Contributor

i don't think we need to make it optional! i thought required w/ defaults will cause it to fall back to the default if it isn't specified?

if this isn't the case, both are about the same imo

}
)

const dailyCron = "0 0 9 * *"
Contributor

sanity check - this runs at 0900 on the host machine. do we know if the host machine's timezone is our timezone?

Contributor Author

wah. good catch, ECS timezone is UTC. BUT, following our recent incident, our actual monitor (uptime robot) sucks. am going to use a 5 min interval instead, lmk your thoughts on that.

Contributor

i think we should not. our status page is based off uptime robot + that's what we use to determine our metrics, so i think we should stick with it.

additionally, there's drift between what's on uptime robot and redirection/indirection (uptime robot has some that's not present on the other) so i think we should just go with uptime robot unless there are breaking issues w/ uptime robot
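The two scheduling options discussed in this thread could be expressed roughly as bullmq repeat options (newer bullmq uses `pattern`, older versions used `cron`; `tz` takes an IANA zone name). A hedged sketch, not the PR's actual code:

```typescript
// A cron pattern runs in the host's timezone (UTC on ECS) unless tz is set:
const dailyAtNineSgt = { pattern: "0 9 * * *", tz: "Asia/Singapore" }

// An interval-based repeat sidesteps timezones entirely:
const everyFiveMinutes = { every: 5 * 60 * 1000 } // milliseconds
```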

},
})

private readonly worker: Worker<unknown, string, string> | undefined
Contributor

we should aim to give MyData as the type (here it's given as unknown)

Contributor Author

do you know how to? the way i see this, this is a limitation of the library, as it allows jobs of any shape to be scheduled. in this case, we want to keep the type as unknown, and actually we are not using the data at all (since the invariant now is that any job is a monitoring job, so start the scheduled driver)

Contributor

the library exposes typings on Worker and Job.

  private readonly worker: Worker<number, number, string>
  this.worker = new Worker<number, number>
  job: Job<number, number>

these allowed me to type it correctly. since we're just using the queue as a cronjob, the value add from unknown -> void might not be that much. fine w/ either

@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from d1fa7ea to 381ac3b Compare June 6, 2024 07:04
@kishore03109 kishore03109 requested a review from a team June 6, 2024 07:04
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from eb096b3 to 61553f6 Compare June 18, 2024 02:34
@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch from 77738c8 to 783e8b0 Compare June 18, 2024 02:34
@@ -6,7 +6,7 @@ import { getNewGrowthbookInstance } from "@root/utils/growthbook-utils"

// Keep one GrowthBook instance at module level
// The instance will handle internal cache refreshes via a SSE connection
const gb = getNewGrowthbookInstance({
export const gb = getNewGrowthbookInstance({
Contributor

could i be annoying and suggest not exporting this? intentional choice so we have to go through the helper method

@@ -177,6 +179,10 @@ export default class MonitoringWorker {
}

driver() {
const gb = getNewGrowthbookInstance({
clientKey: config.get("growthbook.clientKey"),
subscribeToChanges: true,
Contributor

just a note here, because we set subscribeToChanges: true, the same issue where the config changes in-processing still remains.

this isn't really an issue for me because it only concerns whitelisted sites + whether this is enabled. leave it to you to make the final judgement. how it behaves now is that it initialises 2 new instances and gets the config at that time (whilst listening to changes); if we want to say that the config doesn't change at all in-processing, we could instead initialise growthbook once and get the config at the start and pass it downwards.

tl;dr: behaviour still doesn't change, which means that config can change in-processing. not a biggie for me but leave it to you to decide if you wanna make this change

Contributor Author

ok, was not aware of this implication!
changed to false
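The alternative described in the comment above (initialise once, snapshot the config at the start of a run, and pass it downwards) could look roughly like this; all names here are illustrative, not this repo's API:

```typescript
// Snapshot the feature config once at the start of a run so that flag changes
// mid-run cannot affect the sites already being processed.
interface MonitoringConfig {
  whitelistedSites: string[]
}

// `getFeatureValue` stands in for whatever reads the GrowthBook feature;
// it is called exactly once per run.
function snapshotConfig(getFeatureValue: (key: string) => string[]): MonitoringConfig {
  return { whitelistedSites: getFeatureValue("monitoring-whitelist") }
}

function isWhitelisted(cfg: MonitoringConfig, site: string): boolean {
  return cfg.whitelistedSites.includes(site)
}
```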

@seaerchin
Contributor

this seems like a single producer, potentially multiple consumer case but it should be ok since we don't care about ordering.

hmm wait, no this is why we are using bullmq, right? it implements some sort of locking mechanism to ensure that only one consumer (instance) is doing the job at one time?

how bullmq works is (according to their docs, at least) that a worker must continually inform the queue that it's still working on the job. if this doesn't happen (thread is always busy on worker), then the job will be stalled and it's possible to have another worker working on it.

but for clarification, this isn't what i'm concerned about! my comment was wrt out-of-order consumption of jobs. i think in this case, we don't really care, since the job is just to monitor all the sites and ordering doesn't really matter
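The lock/stall mechanism described above is governed by a few bullmq WorkerOptions. A hedged sketch of the relevant knobs (the values here are illustrative, not the library's defaults):

```typescript
// bullmq WorkerOptions fields controlling the stall behaviour described above.
const workerOptions = {
  lockDuration: 30_000,    // ms a job's lock is valid; the worker must renew it before expiry
  stalledInterval: 30_000, // ms between checks for jobs whose lock expired (stalled)
  maxStalledCount: 1,      // how many times a job may stall before being failed
}
```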

Contributor Author

@kishore03109 kishore03109 left a comment

@seaerchin
As discussed, throwing is deprioritised for now.
after upgrading the types, all tests pass. verified that the breaking changes were node upgrades, and since we are already using node18, this should not be an issue.


@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch from a40011c to 818488b Compare June 25, 2024 02:24
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from 8f6e62d to e16de47 Compare June 25, 2024 05:58
@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch from 818488b to 1e3ddc1 Compare June 25, 2024 05:58
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from e16de47 to 98001ce Compare June 27, 2024 01:37
@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch 4 times, most recently from 863fa37 to e708663 Compare June 27, 2024 04:51
Contributor Author

kishore03109 commented Jun 27, 2024

Merge activity

  • Jun 27, 2:08 AM EDT: @kishore03109 started a stack merge that includes this pull request via Graphite.
  • Jun 27, 2:10 AM EDT: Graphite rebased this pull request as part of a merge.
  • Jun 27, 2:14 AM EDT: @kishore03109 merged this pull request with Graphite.

Base automatically changed from 05-15-feat_monitoring_add_dns_reporter to develop June 27, 2024 06:08
@kishore03109 kishore03109 force-pushed the 05-17-feat_monitoring_add_scheduler_functionality branch from e708663 to 32ffc49 Compare June 27, 2024 06:09
@kishore03109 kishore03109 merged commit b565d58 into develop Jun 27, 2024
11 of 12 checks passed
@kishore03109 kishore03109 deleted the 05-17-feat_monitoring_add_scheduler_functionality branch June 27, 2024 06:14
This was referenced Jun 27, 2024
@dcshzj dcshzj mentioned this pull request Jun 27, 2024