feat(monitoring): add scheduler functionality #1383
@@ -133,6 +133,7 @@
"name": "REDIRECT_URI",
"valueFrom": "PROD_REDIRECT_URI"
},
{ "name": "REDIS_HOST", "valueFrom": "PROD_REDIS_HOST" },
just wanted to check here! setting the environment on both `backend` and `support` implies that `backend` would also require this? and this is because `MonitoringService` got folded into `LaunchesService`, right?
this has since been modified. only support needs this.
doing a U-Turn here.
this seems to throw an error since they share the same .env code, so by default, this would lead to keys being undefined.
There are a couple of ways to solve this.
- mark the envs optional
- create separate env rules for support and backend
- just import the secrets for both
- mock the env vars for the redis host and the keycdn key inline in the package json (worried about obscuring the env vars here, where a developer might expect the keycdn-api key to be valid when used in the backend code, only to find that it is some mock value)
to unblock this pr, going to keep it simple
sorry not quite understanding! why would they share the same env code?
they share the same `config.ts` code? either we separate them or make the env var optional, right?
ah ok, sounds good
i realised we give a default in `config.ts`, so i think there's still value in removing. leave the choice to you but fwiw, removing here is clearer
ok, i'll make the var optional, then remove it in the task def
i don't think we need to make it optional! i thought required w/ defaults will cause it to fall back to the default if it isn't specified?
if this isn't the case, both are about the same imo
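The open question here is whether a required variable with a default falls back to that default when the task definition omits it. A minimal sketch of that behaviour, assuming a hand-rolled loader rather than the project's actual `config.ts` (the `readEnv` helper, its options, and the `localhost` and `redis.internal` values are all hypothetical):

```typescript
// Hypothetical helper, not the project's config.ts: resolves one variable
// from an env map, falling back to a default when the variable is unset.
type EnvSpec = { default?: string; required: boolean }

function readEnv(
  env: Record<string, string | undefined>,
  name: string,
  spec: EnvSpec
): string | undefined {
  const raw = env[name]
  // Treat an empty string the same as "not set"
  const value = raw !== undefined && raw !== "" ? raw : spec.default
  if (value === undefined && spec.required) {
    throw new Error(`Missing required env var: ${name}`)
  }
  return value
}

// backend: REDIS_HOST is absent from the task definition, so the default applies
const backendHost = readEnv({}, "REDIS_HOST", {
  required: true,
  default: "localhost",
})

// support: the task definition supplies the real value
const supportHost = readEnv({ REDIS_HOST: "redis.internal" }, "REDIS_HOST", {
  required: true,
  default: "localhost",
})
```

Under these assumptions a required variable with a default never throws, so marking it optional only changes behaviour when no default exists. Whether the real config library behaves the same way should be checked against its documentation.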
src/monitoring/index.ts
Outdated
}
)

const dailyCron = "0 0 9 * *"
sanity check - this runs at 0900 on the host machine. do we know if the host machine's timezone is our timezone?
wah. good catch, ecs timezone is utc. BUT, following our incident, our actual monitor (uptime robot) sucks. am going to use a 5 min interval instead, lmk your thoughts on that.
i think we should not. our status page is based off uptime robot + that's what we use to determine our metrics, so i think we should stick with it.
additionally, there's drift between what's on uptime robot and redirection/indirection (uptime robot has some sites that aren't present on the other) so i think we should just go with uptime robot unless there are breaking issues w/ uptime robot
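One way to sanity-check a schedule string like the one in this thread is to spell out its fields and convert the intended local hour to UTC. The sketch below assumes the standard five-field cron order (minute, hour, day of month, month, day of week). Whether the scheduler in use reads the string this way, or expects a leading seconds field, is worth confirming against its docs. `describeCron`, `localHourToUtc`, and the +8 offset are hypothetical.

```typescript
// Assumes standard five-field cron: minute hour day-of-month month day-of-week.
interface CronFields {
  minute: string
  hour: string
  dayOfMonth: string
  month: string
  dayOfWeek: string
}

function describeCron(expr: string): CronFields {
  const parts = expr.trim().split(/\s+/)
  if (parts.length !== 5) {
    throw new Error(`expected 5 fields, got ${parts.length}`)
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = parts
  return { minute, hour, dayOfMonth, month, dayOfWeek }
}

// Converts a local wall-clock hour to UTC for a fixed offset (no DST handling).
function localHourToUtc(localHour: number, utcOffsetHours: number): number {
  return (((localHour - utcOffsetHours) % 24) + 24) % 24
}

// Under the five-field reading, hour is "0" and dayOfMonth is "9"
const asWritten = describeCron("0 0 9 * *")

// A daily 09:00 schedule puts the 9 in the hour field instead
const dailyAtNine = describeCron("0 9 * * *")

// On a UTC host, 09:00 in a hypothetical UTC+8 zone corresponds to hour 1
const utcHour = localHourToUtc(9, 8)
```

Read this way, the string in the diff would fire at midnight on the 9th of each month rather than daily at 0900, which is a second reason to double-check the expression alongside the host timezone.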
src/monitoring/index.ts
Outdated
},
})

private readonly worker: Worker<unknown, string, string> | undefined |
we should aim to give `MyData` as the type (here it's given as `unknown`)
do you know how to? the way that i see this, this is a limitation of the library, as it allows jobs of any shape to be scheduled. in this case, we want to keep the type as `unknown`, and we are not actually using the data at all (since the invariant now is that any job is a monitoring job, so start the scheduled driver)
the library exposes typings on `Worker` and `Job`.
private readonly worker: Worker<number, number, string>
this.worker = new Worker<number, number>
job: Job<number, number>
these allowed me to type it correctly. since we're just using the queue as a cronjob, the value add from `unknown` -> `void` might not be that much. fine w/ either
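To illustrate how the three type parameters in the snippet above line up (job data, return value, job name), here is a self-contained sketch. `TypedJob` and `TypedWorker` are simplified local stand-ins written for this example, not the library's classes, and `MonitoringData` and the `monitorSites` job name are hypothetical.

```typescript
// Simplified local stand-ins, NOT the real library classes.
interface TypedJob<Data, Result, Name extends string = string> {
  name: Name
  data: Data
  returnvalue?: Result
}

type Processor<Data, Result, Name extends string> = (
  job: TypedJob<Data, Result, Name>
) => Promise<Result>

class TypedWorker<Data, Result, Name extends string = string> {
  readonly queueName: string
  private readonly processor: Processor<Data, Result, Name>

  constructor(queueName: string, processor: Processor<Data, Result, Name>) {
    this.queueName = queueName
    this.processor = processor
  }

  // Runs the processor directly; a real queue would pull jobs from Redis.
  run(job: TypedJob<Data, Result, Name>): Promise<Result> {
    return this.processor(job)
  }
}

// Hypothetical payload: the cron-style job carries no data at all.
type MonitoringData = void

const worker = new TypedWorker<MonitoringData, string, "monitorSites">(
  "monitoring",
  // `job.name` is narrowed to the literal "monitorSites" by the third parameter
  async (job) => `ran ${job.name}`
)

const pending = worker.run({ name: "monitorSites", data: undefined })
```

With `void` as the data type the compiler rejects any processor that tries to read fields off `job.data`, which is the small safety gain over `unknown` being weighed in this thread.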
src/middleware/featureFlag.ts
Outdated
@@ -6,7 +6,7 @@ import { getNewGrowthbookInstance } from "@root/utils/growthbook-utils"

// Keep one GrowthBook instance at module level
// The instance will handle internal cache refreshes via a SSE connection
-const gb = getNewGrowthbookInstance({
+export const gb = getNewGrowthbookInstance({
could i be annoying and suggest not exporting this? intentional choice so we have to go through the helper method
src/monitoring/MonitoringWorker.ts
Outdated
@@ -177,6 +179,10 @@ export default class MonitoringWorker {
}

driver() {
const gb = getNewGrowthbookInstance({
clientKey: config.get("growthbook.clientKey"),
subscribeToChanges: true,
just a note here, because we set `subscribeToChanges: true`, the same issue where the config changes in-processing still remains.
this isn't really an issue for me because it only concerns whitelisted sites + whether this is enabled. leave it to you to make the final judgement. how it behaves now is that it initialises 2 new instances and gets the config at that time (whilst listening to changes); if we want to say that the config doesn't change at all in-processing, we could instead initialise growthbook once and get the config at the start and pass it downwards.
tl;dr: behaviour still doesn't change, which means that config can change in-processing. not a biggie for me but leave it to you to decide if you wanna make this change
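The alternative described above, reading the config once at the start and passing it downwards, could look roughly like this. It is a sketch only: `FeatureSource`, `MonitoringConfig`, `snapshotConfig`, `shouldCheck`, and the feature keys are hypothetical names and do not come from the GrowthBook SDK or this codebase.

```typescript
// Hypothetical stand-in for a feature-flag client whose values may change
// at any moment (for example via a live subscription).
interface FeatureSource {
  getValue<T>(key: string, fallback: T): T
}

// Snapshot taken once per monitoring run.
interface MonitoringConfig {
  readonly enabled: boolean
  readonly whitelistedSites: readonly string[]
}

function snapshotConfig(source: FeatureSource): MonitoringConfig {
  const enabled = source.getValue<boolean>("monitoring_enabled", false)
  const sites = source.getValue<string[]>("monitoring_whitelist", [])
  // Copy the array so later changes in the source are not visible
  return { enabled, whitelistedSites: [...sites] }
}

// Downstream code takes the snapshot, never the live client.
function shouldCheck(site: string, config: MonitoringConfig): boolean {
  const whitelisted = config.whitelistedSites.some((s) => s === site)
  return config.enabled && !whitelisted
}

// A mutable in-memory source to show that later changes are not observed
const live: Record<string, unknown> = {
  monitoring_enabled: true,
  monitoring_whitelist: ["workplacelearning.gov.sg"],
}
const source: FeatureSource = {
  getValue: <T>(key: string, fallback: T): T =>
    key in live ? (live[key] as T) : fallback,
}

const config = snapshotConfig(source)
live.monitoring_whitelist = [] // the remote config changes mid-run
const stillSkipped = !shouldCheck("workplacelearning.gov.sg", config)
```

The trade-off is the one raised in the comment: a snapshot gives every site in a run the same view of the config, at the cost of not picking up a whitelist change until the next run.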
ok, was not aware of this implication!
changed to false
how bullmq works is (according to their docs, at least) that a worker must continually inform the queue that it's still working on the job. if this doesn't happen (thread is always busy on worker), then the job will be stalled and it's possible to have another worker working on it. but for clarification, this isn't what i'm concerned about! my comment was wrt out of order consumption of jobs. i think in this case, we don't really care since the job is just to monitor all the sites and ordering doesn't really matter
@seaerchin
As discussed, throwing is deprioritised for now.
after upgrading the types, everything works. verified that the breaking changes were node upgrades, and since we are already using node18, this should not be an issue.
make env var optional, add scoped gb instance
Problem
This is the second part of the monitoring feature that we want to build. This PR only cares about adding a scheduler and the related infra needed for it to function. This will make the monitor run once every 5 mins, so that on-calls can pick up any related alarms from it.
Adding the alarms is done in the downstream PR.
Solution
Using bullmq to conveniently create a queue, a worker and a repeatable job over multiple instances. We do some level of exponential backoff retries since it is a nice-to-have and easy to implement. The original `/site-up` code has since been refactored to return an `err` or an `ok`, depending on whether the configuration is ideal.
Unfortunately, this caused quite a number of edge cases to pop up. Because of this, a looser check of whether the Isomer logo is present is being used to determine if a site is up.
Even with this loose check, we have a site, `workplacelearning.gov.sg`, that has been modified to not have the Isomer logo. Have used gb to whitelist this weird site. Potentially, if tomorrow we have an alarm for a site going down and the outage is expected to be prolonged, we can go to growthbook and change the config so that the site is whitelisted.
Breaking Changes
Tests
On deployment, assert that you see these logs. It is ok for there to be multiple instances of this log (it directly corresponds to the number of instances that we have) since bullmq is smart enough to only create one queue and one repeatable job over multiple instances.
Deploy Notes
The corresponding infra pr should be deployed to production, and only then should the redis host value be populated into the 1pw for production.
Additionally, post approval of this pr, add two alarms, one for `Error running monitoring service` and another for `Monitoring service has failed`. These are errors when the job has failed to be initialised, and when there is a new error.
New environment variables:
- `REDIS_HOST`: Redis host
`fetch_ssm_parameters.sh`)
New dependencies:
- `bullmq`: scheduler of choice
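The exponential backoff retries mentioned under Solution follow a simple doubling schedule. A minimal sketch, assuming a base delay that doubles on every failed attempt. `backoffDelayMs`, `retryWithBackoff`, and the numbers are illustrative and are not taken from this PR or from bullmq's internals.

```typescript
// Delay before retry number `attempt` (1-based): base, 2*base, 4*base, ...
function backoffDelayMs(baseMs: number, attempt: number): number {
  if (attempt < 1) throw new Error("attempt is 1-based")
  return baseMs * Math.pow(2, attempt - 1)
}

// Retries `task` up to `maxAttempts` times, sleeping between failures.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  maxAttempts: number,
  baseMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task()
    } catch (err) {
      lastError = err
      // No sleep after the final attempt; the error is rethrown below
      if (attempt < maxAttempts) {
        await sleep(backoffDelayMs(baseMs, attempt))
      }
    }
  }
  throw lastError
}

// Delays for three retries with a 1 second base: 1000, 2000, 4000
const delays = [1, 2, 3].map((n) => backoffDelayMs(1000, n))
```

In bullmq itself this is configured declaratively through the `attempts` and `backoff` job options rather than hand-written; the values chosen in this PR are not shown in the description.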