Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Optimize canSendEmailOtp #1301

Merged
merged 7 commits into from
Apr 17, 2024
Merged

Conversation

timotheeg
Copy link
Contributor

@timotheeg timotheeg commented Apr 13, 2024

Problem

UserService.canSendEmailOtp() uses a whitelist to determine if emails are allowed to receive OTPs.

The whitelist contains

  1. domains that the input can be matched against by suffix match
  2. full email that the input can be matched again with exact match

At the moment the code basically download the full content of the whitelist table, and does a filter and check in code locally.

It would be much better to leverage the query capability of the DB, to ask if a whitelist entry exists and get a small yes/no type result.

Solution

Replace app-level code filtering by a single DB query.

The DB query will return: either:

  • the first whitelist entry that matches
  • nothing if there is no match

The PR will log the whitelist match, so we can verify if anything is wrong.

Behaviour and Results

Old query (for reference)

Sample trace
image

Sample trace for success
image
image
image

Sample trace for failure
image
image

Performance

This particularly is not very heavy, returning most of the time in less than 2 milliseconds (link to performance tracking graph here). The purpose of the PR is not to get a great performance gain but instead to 1) correct a bad pattern of fetching a large resultset to do filtering in code when that could be done in the query itself; 2) show that we can use raw queries to implement efficient queries.

But so concretely:

When no whitelist entry matches, the new query does a full table scan and returns nothing.

The previous query also did a full table scan to find active whitelist rows, but also incurred the transport cost to return the data, whether or not there was a whitelist match.

The new query is slightly heavier on the DB because it makes it do more work, but at the size of the whitelist table, it's negligible. The network transit time of the resulset should still be a net benefit in total time spent.

Overall, we expect the raw query time to be equivalent, and this can hopefully be visualized after release in this tracker in the isomer dashboard.

Important note:

With the whitelisting filtering logic done in the query rather than code, the unit tests are not representative of the behaviour any more. Great care should be spent on reviewing/testing the query, and ideally we should have synthetic tests that validate the behaviour as e2e tests (some happy path tests and some expected failure mode).

Tests

Note the traces for the query will be at this link.

  • Try to get an OTP for your ogp email address -> should succeed.
    • Locate trace for isomer-postgres with the new whitelist search query, check the log for the match (e.g. Email valid for OTP by whitelist)
  • Try to get an OTP for [email protected] -> should appear to succeed but not send an email
    • Locate trace for isomer-postgres with the new whitelist search query, check the log for the error (e.g. Error occurred when attempting to login user <email>)

@timotheeg timotheeg requested a review from a team April 13, 2024 03:11
@timotheeg timotheeg force-pushed the optimize-canSendEmailOtp branch from 22d859e to c9cedb7 Compare April 13, 2024 03:15
@timotheeg timotheeg mentioned this pull request Apr 13, 2024
WHERE
(expiry is NULL OR expiry >= NOW())
AND
CASE WHEN email ~ '^.+@'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: compared to the previous code, this is a simpler regexp. It detects the "full email", while the previous more complicated regexp was detecting the domains. Obviously the match strategies have been swapped accordingly.

const expected = false
// NOTE: This ends with gov.sg not .gov.sg (lacks a dot after the @)
const emailWithoutDot = "[email protected]"
MockSequelize.query.mockResolvedValueOnce([])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still be adding the mock for [{ email: ".gov.sg" }] - our test suite clears the db state between tests, so this test is checking for [email protected] against an empty whitelist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait what? the tests can populate the DB? I would love indeed to run the query for real rather than arbitrarily mocking a db response!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops sorry i meant the jest mocks are reset! So we should be mocking the return of query here as [{ email: ".gov.sg" }]

That said we do have population of DB for some of our tests in integration - we don't do this for the tests in each service which we consider unit tests

The /integration/Users.spec.ts test suite does cover some whitelist tests though (under the /email/otp suite), so if we want to test this against the db we could add this test there as well!

Copy link
Contributor Author

@timotheeg timotheeg Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I'll take a look at the integration tests (I could probably use some help for that 😅)

But so specifically in this case you are commenting on here, the mock does what we expect. For reference, the new query no longer return the full content of the table, it returns either:

  • no match if the input matches none of the whitelist entries
    OR
  • the first whitelist entry the input matches

These 2 distinct responses represent a boolean-ish yes/no state of whether the input is whitelisted.

And in this case, the input [email protected] is not expecting to match .gov.sg as a whitelist entry, so that cannot be in the mocked response.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right, I got the intent of the test wrong sorry! But yeah i think we can then introduce an integration test with this scenario (db contains .gov.sg, user tried to login via [email protected] which should not match)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comments to make it clearer why a raw query is used: db15356

@@ -1,5 +1,5 @@
import { ResultAsync, errAsync, okAsync } from "neverthrow"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we share the before and after trace in the pr description to see the difference and also maybe if the latency improvements

Copy link
Contributor Author

@timotheeg timotheeg Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we share the before and after trace in the pr description

The after traces are already there, I'll add a before trace as well.

if the latency improvements

Aye, so on this a few things:

  • The current query currently runs under 2ms most of the time, so the optimization isn't all that interesting from a latency perspective to be honest, and we are not expecting a big performance boost or anything
  • A dev pattern of downloading the full table content is bad generally, so the PR serves as illustration of how more complex query that are not necessarily supported by the ORM can be done. The ORM is just a tool, it should not be a reason to run inefficient queries.
  • I already added in the IsomerCMS dashboard a block to monitor the performance of the old query versus the new query for comparison here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we share the before and after trace in the pr description

Done! 😄 I added a sample trace for the before in the PR description and I updated the performance section too.

So again, in the grand scheme of things, this is a very cheap query so this PR is not bringing visible impact to users. It's more to illustrate how to query the DB with good patterns :)

Also Just for reference, here is a screenshot on the variability of max run time of the current query in prod. We'll be comparing the performance of the new query against that (link to widget here)

image

expiry: {
[Op.or]: [{ [Op.is]: null }, { [Op.gt]: new Date() }],

// raw query because ORMs suck!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious we used raw query here because the ORM way was more complicated or?

Copy link
Contributor Author

@timotheeg timotheeg Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, so this query has 2 query patterns which are "unusual" for typical ORM use-cases:

  1. a CASE WHEN in the WHERE clause
  2. a where condition of the form input like cell (with a concat() call thrown in), where most of the times, queries are the other way around cell like input

I wasn't sure if sequelize could even do 2), but even if it could, I reckon the sequelize syntax would make the query very hard to read (likely mixing object and literal syntax in one statement ?)

In addition to that, this query does not return entities as instance of models or fetches dependencies, we just want 1 whtelist record, if the input matches any, so the ORM brings no additional value here (minus maybe one level of indirection for the table name and field name).

There's also a bit of personal preference here too I suppose. Personally I think

(expiry is NULL OR expiry > NOW())

is much easier to read than:

[Op.or]: [{ [Op.is]: null }, { [Op.gt]: new Date() }],

and that's just a small part of the more complex statement in this case.

So I was keen to show that it's OK to have efficient readable raw queries in app code. ORM is just a tool, and should not get in the way of good query patterns.

But so with all that said, If you can write the query in sequelize syntax, and it reads well, I don't mind pushing that in at all. It's your code base after all! 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries Tim, agreed on the points you mentioned. I had some thoughts about mixing raw queries and ORMs, but understandability of code is more important than just have "standardised" code

Copy link
Contributor Author

@timotheeg timotheeg Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comment to better explain why the raw query was used: db15356

Admittedly the "ORMs suck" coment was tongue in cheek and clearly not very informative 😅 . So in fact I copied some of the rationale in the comments from the reply I gave you 😄

@alexanderleegs alexanderleegs merged commit c2cad6e into develop Apr 17, 2024
12 checks passed
@mergify mergify bot deleted the optimize-canSendEmailOtp branch April 17, 2024 04:57
@dcshzj dcshzj mentioned this pull request Apr 17, 2024
9 tasks
@timotheeg
Copy link
Contributor Author

Code is released, I'll be adding some info here to monitor results. I take screenshots because these spans will soon disappear from DD:

For now, p99 and max of the old request for the past ~2 weeks:

old query

max (link)
image

p99 (link)
image

new query, we need to wait for a week or to aggregate more data

max (link)
image

p99 (link)
image

TODO: take screenshot for the graphs for the new query in 2 weeks.

Notes:
As mentioned in the commen threads, this optimization is not a big gain from the user's perspective (most of the times both old and new query runs under 2ms!), but fetching a whole table to do filtering in code is not a good pattern, so the PR illustrate how the DB filtering capabilities can be better utilized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants