-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(monitoring): add dns reporter #1376
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @kishore03109 and the rest of your teammates on Graphite |
48fd2c3
to
071ff12
Compare
071ff12
to
9df257d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the structure here suffers abit from mixing concerns of different levels. what could be better is if these were split up and it's clear what each function is focusing on.
src/monitoring/index.ts
Outdated
} | ||
|
||
interface KeyCdnResponse { | ||
data: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this data
key seems abit extra? maybe the method returning this should just prune data
so that callers downstream don't have to prune it themselves
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interface RedirectionDomain { | ||
source: string | ||
target: string | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big deal but we do already have something similar (DNSRecord
) in siteInfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this one maps directly to what our csv format is in our github directory, so if we add target, papaparse throws an error
src/monitoring/index.ts
Outdated
const keyCdnApiKey = config.get("keyCdn.apiKey") | ||
|
||
return ResultAsync.fromPromise( | ||
fetch(`https://api.keycdn.com/zonealiases.json`, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use fetch
over axios
? take note that fetch is experimental and the associated status page recommends not using experimental APIs in production applications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, swapped to using axios
src/monitoring/index.ts
Outdated
]).andThen(([amplifyDeployments, redirectionDomains, keyCdnDomains]) => { | ||
this.monitoringServiceLogger.info("Fetched all domains") | ||
return okAsync( | ||
[...amplifyDeployments, ...redirectionDomains, ...keyCdnDomains].sort( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole chunk is pretty confusing to read as there's alot of repetition and magic numbers being used; wdyt about using sortBy
and just stripping www.
from both domainA
and domainB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this one might want to push back.
the point of this was to remove blind spots in our monitoring, which means we want to see all the domains that we host, this includes the www and the non www. This is just a utility function to group the root the subdomain and the apex domain together, and the rest sort alphabetically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this might be a miscommunication - since our comparison is always based on the stripped length (ie, www.isomer.gov.sg
is taken as equivalent to isomer.gov.sg
), why not just use sortBy(array, val => val.startsWith("www.") ? val.slice(4) : val
.
in this case, our comparator compares via the stripped value so both www.isomer.gov.sg
will be treated identically to isomer.gov.sg
. lmk if this makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oo this is neater code, thanks
.map((value) => | ||
reportCard.push({ | ||
...value, | ||
}) | ||
) | ||
.andThen(() => okAsync(reportCard)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if i'm reading this correctly but if we're doing a map
that's just a push
, what's stopping us from just returning this directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm concretely what are you suggesting ah? if we just return we will return the array directly, we will get an empty array since we resolve before the promise resolves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if we just return without the map
, it'll be in a ResultAsync
which we can ask downstream to consume isn't it? not sure if i'm misunderstanding you here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait ah value
is of type ReportCard
, but what we want is a list.
The distinction to make here is that ReportCard
is for one domain, and we want to map over every domains to get this.
src/monitoring/index.ts
Outdated
.andThen(() => okAsync(reportCard)) | ||
}) | ||
|
||
return ResultAsync.combineWithAllErrors(domainResolvers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually inspecting the signature for domainResolvers
, the error type is never
.
this is because we've discarded the error earlier (.orElse(() => okAsync([]))
). this seems to suggest that either
- this won't error out or that
- all errors are equivalent to an empty response?
wondering if this assumption is true and if it's not, whether there's value in actually logging the error rather than just giving up and returning empty arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we go with the second assumption. We are later intending to map this over /siteup for correctness, this is just a from of reporting for some level of sanity check that we have x number of domains under our control and they resolve to these dns values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i'm understanding you correctly here! if the intent is as a form of reporting, i think preserving the errors and formatting them -> displaying has value. discarding the errors seems to be counter to the intent of reporting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm we are not discarding any errors here. we are simply reporting that we could not resolve to anything by returning an empty string. this is like a dns dig on the terminal and getting that it resolves to nothing mah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could i just check how this feature is meant to be used? in my head, the flow is something like this:
- user queries
/siteup <domain>
- we hit the backend here
- we query for all the records
- if any error
- discard error and return []
- if no errors
- return the resolved records?
i think it's also worth nothing here that the code structure adopted here doesn't quite fit in with neverthrow
- we're meant to recover and fix errors (you can observe this through the example for combineWithAllErrors
here). this method actually discards successful results and returns only the errors.
i don't think there's any actionable wrt updating this to be idiomatic neverthrow
but it's something we should consider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user queries /siteup
we hit the backend here
we query for all the records
if any error
discard error and return []
if no errors
return the resolved records?
nope, siteup functionality is to remain as is. once we get a list of domains, the idea is to map through the list of domains via the site up so that every morning, we know if anything is not as what we expect it to be.
all this does is gets a list of domains -> then dig on them. note that the figuring out if the values returned form the dig are errors is not in this pr as of yet. again, the reason why we are ignoring this is because node:dig throws an error if it resolves to nothing, which does not mean that it is an error. it is valid, and expected for for eg isomer.gov.sg to not resolve anything for CAA results.
src/server.ts
Outdated
const monitoringService = new MonitoringService({ | ||
launchesService, | ||
}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could i check why's this initialised here and not the support
container?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm that would require creating launches service on the support container. no strong opinions on this, since this is a pretty light operation anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to probe - when would this latency/cost matter? the support container is not user facing so latency doesn't really matter as opposed to creating it on the app
container. (not that it's a big cost. i think the separation between user-facing and internal is more important here and what we want to maintain)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not that it's a big cost. i think the separation between user-facing and internal is more important here and what we want to maintain)
so i am assuming we dont care about singleton pattern in our codebase ya?
since in this case, launchesService will be created twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this might be due to a mistake in rebase! all shared services should be exported from "common/"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, moved already
generateReportCard(domains: IsomerHostedDomain[]) { | ||
const reportCard: ReportCard[] = [] | ||
|
||
const domainResolvers = domains.map(({ domain, type }) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this function appears to be doing domain resolution and reporting the resolved records.
should we be splitting the method up so it's clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm what are you concretely suggesting ah? the reporting is just a logging, so the function when seperated becomes just a one liner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeap, i think in my head, the flow is something like:
generateReportcard = (domains) => {
generateARecords().andThen(generateQuadA).andThen(generateCaa).andThen(generateCname).orElse(someError)
}
but this would potentially be expensive to do. ok w/ just shifting the domains out:
return generateDomains().andThen(() => {
this.monitoringServiceLogger.info({
message: "Report card generated",
meta: {
reportCard,
date: new Date(),
},
})
return okAsync(reportCard)
})
.orElse(() => okAsync([]))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shifted
) | ||
) | ||
.andThen((data: unknown) => { | ||
if (!isKeyCdnResponse(data)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need the data here since we want to type cast directly from an unknown
fb7233d
to
d1fa7ea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot to submit as review (sorry) - i think i'm not entirely sure how the errors will play out so i'm holding off on approval. if i could understand how you'd intend this code to function, i think it might be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential breakage with the plugin - should either update tsconfig
or drop
src/monitoring/index.ts
Outdated
// seems to be a bug in typing, this is a direct | ||
// copy paste from the octokit documentation | ||
// https://octokit.github.io/rest.js/v20#automatic-retries | ||
const OctokitRetry = Octokit.plugin(retry as any) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if this is required! saw your comment but going to the linked site doesn't seem to suggest this.
this might be due to our tsconfig
not changing, which seems to be required by the retry
library
i think we might want to update our tsconfig
and check again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @seaerchin, this has since been update by increasing the version number, thanks for the catch
@@ -112,6 +112,7 @@ | |||
"valueFrom": "STAGING_INCOMING_QUEUE_URL" | |||
}, | |||
{ "name": "JWT_SECRET", "valueFrom": "STAGING_JWT_SECRET" }, | |||
{ "name": "KEYCDN_API_KEY", "valueFrom": "STAGING_KEYCDN_API_KEY" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this just be required on support containers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
|
||
getAllDomains = () => | ||
ResultAsync.fromPromise( | ||
this.launchesRepository.findAll(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there domains outside of the launches table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, which is why we also fetch domains from keycdn
d1fa7ea
to
381ac3b
Compare
eb096b3
to
61553f6
Compare
@@ -98,6 +98,7 @@ | |||
"valueFrom": "PROD_ISOMERPAGES_REPO_PAGE_COUNT" | |||
}, | |||
{ "name": "JWT_SECRET", "valueFrom": "PROD_JWT_SECRET" }, | |||
{ "name": "KEYCDN_API_KEY", "valueFrom": "PROD_KEYCDN_API_KEY" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remember to add iether on pulumi or directly to ssm
src/monitoring/index.ts
Outdated
import LaunchesService from "@root/services/identity/LaunchesService" | ||
import promisifyPapaParse from "@root/utils/papa-parse" | ||
|
||
interface MonitoringServiceInterface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: MonitoringServiceProps
8f6e62d
to
e16de47
Compare
e16de47
to
98001ce
Compare
Merge activity
|
Problem
This is a first pr that is up to add some level of sane reporting.
While scheduling is part of this feature, it is not within the scope of this pr. This pr only adds (currently dead code) logic to grab the domains that we own in isomer, and do a dns dig. This is meant to be verbose, and in the future alarms can be added based on the results of this.
This is not meant to replace monitoring, it is just meant to fine tune some blind spots that uptime robot currently has + some sane checker during incident response to show history of dns records for a site that we manage.
I am opting to log it directly in our backend to keep things simple. will add alarms + the scheduler in subsequent prs.
Solution
grab ALL domains from keycdn + amplify + redirection records + log dns records on them.
Breaking Changes
Tests
in server.ts add:
monitoringService.driver()
should see this in the logs:
Deploy Notes
New environment variables:
KEYCDN_API_KEY
: to get all the zones that we own in keycdnREDIRECTION_REPO_GITHUB_TOKEN
: gh token to view redir repo(
fetch_ssm_parameters.sh
)New scripts:
script
: script detailsNew dependencies:
dependency
: dependency detailsNew dev dependencies:
dependency
: dependency details