Feat: Optimize canSendEmailOtp #1301

timotheeg · 2024-04-13T03:11:45Z

Problem

UserService.canSendEmailOtp() uses a whitelist to determine if emails are allowed to receive OTPs.

The whitelist contains

domains that the input can be matched against by suffix match
full email that the input can be matched again with exact match

At the moment the code basically download the full content of the whitelist table, and does a filter and check in code locally.

It would be much better to leverage the query capability of the DB, to ask if a whitelist entry exists and get a small yes/no type result.

Solution

Replace app-level code filtering by a single DB query.

The DB query will return: either:

the first whitelist entry that matches
nothing if there is no match

The PR will log the whitelist match, so we can verify if anything is wrong.

Behaviour and Results

Old query (for reference)

Sample trace

Sample trace for success

Sample trace for failure

Performance

This particularly is not very heavy, returning most of the time in less than 2 milliseconds (link to performance tracking graph here). The purpose of the PR is not to get a great performance gain but instead to 1) correct a bad pattern of fetching a large resultset to do filtering in code when that could be done in the query itself; 2) show that we can use raw queries to implement efficient queries.

But so concretely:

When no whitelist entry matches, the new query does a full table scan and returns nothing.

The previous query also did a full table scan to find active whitelist rows, but also incurred the transport cost to return the data, whether or not there was a whitelist match.

The new query is slightly heavier on the DB because it makes it do more work, but at the size of the whitelist table, it's negligible. The network transit time of the resulset should still be a net benefit in total time spent.

Overall, we expect the raw query time to be equivalent, and this can hopefully be visualized after release in this tracker in the isomer dashboard.

Important note:

With the whitelisting filtering logic done in the query rather than code, the unit tests are not representative of the behaviour any more. Great care should be spent on reviewing/testing the query, and ideally we should have synthetic tests that validate the behaviour as e2e tests (some happy path tests and some expected failure mode).

Tests

Note the traces for the query will be at this link.

Try to get an OTP for your ogp email address -> should succeed.
- Locate trace for isomer-postgres with the new whitelist search query, check the log for the match (e.g. Email valid for OTP by whitelist)
Try to get an OTP for [email protected] -> should appear to succeed but not send an email
- Locate trace for isomer-postgres with the new whitelist search query, check the log for the error (e.g. Error occurred when attempting to login user <email>)

…e whole table

timotheeg · 2024-04-13T03:45:39Z

src/services/identity/UsersService.ts

+        WHERE
+          (expiry is NULL OR expiry >= NOW())
+          AND
+          CASE WHEN email ~ '^.+@'


Note: compared to the previous code, this is a simpler regexp. It detects the "full email", while the previous more complicated regexp was detecting the domains. Obviously the match strategies have been swapped accordingly.

alexanderleegs · 2024-04-16T01:45:39Z

src/services/identity/__tests__/UsersService.spec.ts

+      const expected = false
+      // NOTE: This ends with gov.sg not .gov.sg (lacks a dot after the @)
+      const emailWithoutDot = "[email protected]"
+      MockSequelize.query.mockResolvedValueOnce([])


I think we should still be adding the mock for [{ email: ".gov.sg" }] - our test suite clears the db state between tests, so this test is checking for [email protected] against an empty whitelist

Oh wait what? the tests can populate the DB? I would love indeed to run the query for real rather than arbitrarily mocking a db response!

Whoops sorry i meant the jest mocks are reset! So we should be mocking the return of query here as [{ email: ".gov.sg" }]

That said we do have population of DB for some of our tests in integration - we don't do this for the tests in each service which we consider unit tests

The /integration/Users.spec.ts test suite does cover some whitelist tests though (under the /email/otp suite), so if we want to test this against the db we could add this test there as well!

Ah I see, I'll take a look at the integration tests (I could probably use some help for that 😅)

But so specifically in this case you are commenting on here, the mock does what we expect. For reference, the new query no longer return the full content of the table, it returns either:

no match if the input matches none of the whitelist entries
OR

the first whitelist entry the input matches

These 2 distinct responses represent a boolean-ish yes/no state of whether the input is whitelisted.

And in this case, the input [email protected] is not expecting to match .gov.sg as a whitelist entry, so that cannot be in the mocked response.

Ah you're right, I got the intent of the test wrong sorry! But yeah i think we can then introduce an integration test with this scenario (db contains .gov.sg, user tried to login via [email protected] which should not match)

I've updated the comments to make it clearer why a raw query is used: db15356

harishv7 · 2024-04-16T02:29:32Z

src/services/identity/UsersService.ts

@@ -1,5 +1,5 @@
 import { ResultAsync, errAsync, okAsync } from "neverthrow"


Could we share the before and after trace in the pr description to see the difference and also maybe if the latency improvements

Could we share the before and after trace in the pr description

The after traces are already there, I'll add a before trace as well.

if the latency improvements

Aye, so on this a few things:

The current query currently runs under 2ms most of the time, so the optimization isn't all that interesting from a latency perspective to be honest, and we are not expecting a big performance boost or anything

A dev pattern of downloading the full table content is bad generally, so the PR serves as illustration of how more complex query that are not necessarily supported by the ORM can be done. The ORM is just a tool, it should not be a reason to run inefficient queries.

I already added in the IsomerCMS dashboard a block to monitor the performance of the old query versus the new query for comparison here

Could we share the before and after trace in the pr description

Done! 😄 I added a sample trace for the before in the PR description and I updated the performance section too.

So again, in the grand scheme of things, this is a very cheap query so this PR is not bringing visible impact to users. It's more to illustrate how to query the DB with good patterns :)

Also Just for reference, here is a screenshot on the variability of max run time of the current query in prod. We'll be comparing the performance of the new query against that (link to widget here)

harishv7 · 2024-04-16T02:30:38Z

src/services/identity/UsersService.ts

-        expiry: {
-          [Op.or]: [{ [Op.is]: null }, { [Op.gt]: new Date() }],
+
+    // raw query because ORMs suck!


Just curious we used raw query here because the ORM way was more complicated or?

Well, so this query has 2 query patterns which are "unusual" for typical ORM use-cases:

a CASE WHEN in the WHERE clause

a where condition of the form input like cell (with a concat() call thrown in), where most of the times, queries are the other way around cell like input

I wasn't sure if sequelize could even do 2), but even if it could, I reckon the sequelize syntax would make the query very hard to read (likely mixing object and literal syntax in one statement ?)

In addition to that, this query does not return entities as instance of models or fetches dependencies, we just want 1 whtelist record, if the input matches any, so the ORM brings no additional value here (minus maybe one level of indirection for the table name and field name).

There's also a bit of personal preference here too I suppose. Personally I think

(expiry is NULL OR expiry > NOW())

is much easier to read than:

[Op.or]: [{ [Op.is]: null }, { [Op.gt]: new Date() }],

and that's just a small part of the more complex statement in this case.

So I was keen to show that it's OK to have efficient readable raw queries in app code. ORM is just a tool, and should not get in the way of good query patterns.

But so with all that said, If you can write the query in sequelize syntax, and it reads well, I don't mind pushing that in at all. It's your code base after all! 😄

No worries Tim, agreed on the points you mentioned. I had some thoughts about mixing raw queries and ORMs, but understandability of code is more important than just have "standardised" code

I've updated the comment to better explain why the raw query was used: db15356

Admittedly the "ORMs suck" coment was tongue in cheek and clearly not very informative 😅 . So in fact I copied some of the rationale in the comments from the reply I gave you 😄

timotheeg · 2024-04-18T03:04:33Z

Code is released, I'll be adding some info here to monitor results. I take screenshots because these spans will soon disappear from DD:

For now, p99 and max of the old request for the past ~2 weeks:

old query

max (link)

p99 (link)

new query, we need to wait for a week or to aggregate more data

max (link)

p99 (link)

TODO: take screenshot for the graphs for the new query in 2 weeks.

Notes:
As mentioned in the commen threads, this optimization is not a big gain from the user's perspective (most of the times both old and new query runs under 2ms!), but fetching a whole table to do filtering in code is not a good pattern, so the PR illustrate how the DB filtering capabilities can be better utilized.

fix(canSendEmail): use a query to find match, rather than fetching th…

d58a297

…e whole table

timotheeg requested a review from a team April 13, 2024 03:11

timotheeg added 3 commits April 13, 2024 11:14

fix: update tests

8687b13

feat: log whitelist match for OTP email

978ec9e

fix: upgrade query to match the recent security fix

c9cedb7

timotheeg force-pushed the optimize-canSendEmailOtp branch from 22d859e to c9cedb7 Compare April 13, 2024 03:15

timotheeg mentioned this pull request Apr 13, 2024

fix: fix comments #1300

Closed

timotheeg commented Apr 13, 2024

View reviewed changes

fix: use greater-than (not gte) to match previous code

897bcbe

alexanderleegs reviewed Apr 16, 2024

View reviewed changes

harishv7 reviewed Apr 16, 2024

View reviewed changes

fix: better comments to explain the use of raw query

db15356

harishv7 approved these changes Apr 17, 2024

View reviewed changes

fix: add integration tests for whitelist exact rule matching

57e0a84

alexanderleegs merged commit c2cad6e into develop Apr 17, 2024
12 checks passed

mergify bot deleted the optimize-canSendEmailOtp branch April 17, 2024 04:57

dcshzj mentioned this pull request Apr 17, 2024

release v0.80.0 #1318

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: Optimize canSendEmailOtp #1301

Feat: Optimize canSendEmailOtp #1301

timotheeg commented Apr 13, 2024 •

edited

Loading

timotheeg Apr 13, 2024

alexanderleegs Apr 16, 2024

timotheeg Apr 16, 2024

alexanderleegs Apr 16, 2024

timotheeg Apr 16, 2024 •

edited

Loading

alexanderleegs Apr 16, 2024

timotheeg Apr 17, 2024

harishv7 Apr 16, 2024

timotheeg Apr 16, 2024 •

edited

Loading

timotheeg Apr 16, 2024

harishv7 Apr 16, 2024

timotheeg Apr 16, 2024 •

edited

Loading

harishv7 Apr 16, 2024

timotheeg Apr 17, 2024 •

edited

Loading

timotheeg commented Apr 18, 2024

		@@ -1,5 +1,5 @@
		import { ResultAsync, errAsync, okAsync } from "neverthrow"

Feat: Optimize canSendEmailOtp #1301

Feat: Optimize canSendEmailOtp #1301

Conversation

timotheeg commented Apr 13, 2024 • edited Loading

Problem

Solution

Behaviour and Results

Old query (for reference)

Performance

Important note:

Tests

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timotheeg Apr 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timotheeg Apr 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timotheeg Apr 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timotheeg Apr 17, 2024 • edited Loading

Choose a reason for hiding this comment

timotheeg commented Apr 18, 2024

old query

new query, we need to wait for a week or to aggregate more data

timotheeg commented Apr 13, 2024 •

edited

Loading

timotheeg Apr 16, 2024 •

edited

Loading

timotheeg Apr 16, 2024 •

edited

Loading

timotheeg Apr 16, 2024 •

edited

Loading

timotheeg Apr 17, 2024 •

edited

Loading