sql,server: cache password hashes and user info using a leasing protocol #58869

ajwerner · 2021-01-12T18:12:36Z

Is your feature request related to a problem? Please describe.

Today, authenticating a new connection requires reading the password hash (and performing hashing). In global clusters this requires at least one RTT on average. See GetUserHashedPassword, which reads from both system.users and system.role_options.

Describe the solution you'd like

For role membership, we currently utilize a cache. The way this cache works is that changes to role memberships increment the version of that table descriptor. Those changes then wait for the old version to drain. This is a critical performance improvement. We could do exactly the same thing for password hashes.

Describe alternatives you've considered

#58497 proposed using follower reads which both does not guarantee a local read and exposes weird stale reads.

Epic CRDB-7217

The text was updated successfully, but these errors were encountered:

rafiss · 2021-01-21T19:09:21Z

seems related to #36160 (which has a bit of discussion)

knz · 2021-06-25T14:30:27Z

More general proposal: #44134

the issue with a node-local cache is that a node that hasn't seen a SQL user yet will block even if another node has a cached entry.

ajwerner · 2021-06-25T14:33:44Z

#44134 has different tradeoffs. It provides increased availability at operational cost. I see both as valuable.

knz · 2021-06-25T14:37:55Z

Agreed both are valuable

lunevalex · 2021-06-25T15:00:31Z

Could switching these queries to use follower reads help prevent this? We would have to do this #57992 but it seems plausible.

lunevalex · 2021-06-25T15:14:09Z

Looks like this was discussed here #58497, ignore my comment.

@asubiotto

66919: sql: cache hashed password and auth info during authentication r=knz,RichardJCai a=rafiss See individual commits fixes #58869 --- This improves authentication performance in multiregion clusters and improves availability when the system range leaseholder goes down. I ran the test from #66768 and observed that there was no period of authentication unavailability. Previously, each authentication attempt would require one read from each of system.users and system.role_options. In multiregion clusters, this adds quite a bit of excess latency. Also, if those tables' leaseholders were not available, then authentication to all nodes would be affected. This cache is meant to alleviate both problems. It's kept up to date by updating the table descriptors' versions whenever authentication related information is changed with CREATE/ALTER/DROP ROLE. Release note (performance improvement): The latency of authenticating a user has been improved by adding a cache for lookups of authentication related information. Release note (ops change): There is now a server.authentication_cache.enabled cluster setting. The setting defaults to true. When enabled, this cache stores authentication-related data and will improve the latency of authentication attempts. Keeping the cache up to date adds additional overhead when using the CREATE, UPDATE, and DELETE ROLE commands. To minimize the overhead, any bulk ROLE operations should be run inside of a transaction. To make the cache more effective, any regularly-scheduled ROLE updates should be done all together, rather than occurring throughout the day at all times. Release note (security update): There is now a cache for per-user authentication related information. The data in the cache is always kept up-to-date because it checks if any change to the underlying authentication tables has been made since the last time the cache was updated. The cached data includes the user's hashed password, the NOLOGIN role option, and the VALID UNTIL role option. 67135: sql/pgwire: fix statement buffer memory leak when using suspended portals r=rafiss a=joesankey The connection statement buffer grows indefinitely when the client uses the execute portal with limit feature of the Postgres protocol, eventually causing the node to crash out of memory. Any long running query that uses the limit feature will cause this memory leak such as the `EXPERIMENTAL CHANGEFEED FOR` statement. The execute portal with limit feature of the Postgres protocol is used by the JDBC Postgres driver to fetch a limited number of rows at a time. The leak is caused by commands accumulating in the buffer and never getting cleared out. The client sends 2 commands every time it wants to receive more rows: - `Execute {"Portal": "C_1", "MaxRows": 1}` - `Sync` The server processes the commands and leaves them in the buffer, every iteration causes 2 more commands to leak. A similar memory leak was fixed by #48859, however the execute with limit feature is implemented by a side state machine in limitedCommandResult. The cleanup routine added by #48859 is never executed for suspended portals as they never return to the main conn_executor loop. After this change the statement buffer gets trimmed to reclaim memory after each client command is processed in the limitedCommandResult side state machine. The StmtBuf.Ltrim function was changed to be public visibility to enable this. While this is not ideal, it does scope the fix to the limitedCommandResult side state machine and could be addressed when the limitedCommandResult functionality is refactored into the conn_executor. Added a unit test which causes the leak, used the PGWire client in the test as neither the pg or pgx clients use execute with limit, so cant be used to demonstrate the leak. Also tested the fix in a cluster by following the steps outlined in #66849. Resolves: #66849 See also: #48859 Release note (bug fix): fix statement buffer memory leak when using suspended portals. @asubiotto @andreimatei Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: joesankey <[email protected]>

ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-schema-descriptors Relating to SQL table/db descriptor handling. A-security labels Jan 12, 2021

ajwerner mentioned this issue Jan 12, 2021

sql: use follower reads for internal authentication-related queries #58497

Closed

rafiss mentioned this issue Jan 21, 2021

meta-issue: improvements to password authentication #51601

Closed

rafiss mentioned this issue Mar 30, 2021

sql: add roach test / workload for testing connection latencies #62166

Merged

rafiss mentioned this issue Apr 8, 2021

pgwire: verifying user passwords is too aggressive #36160

Closed

rafiss added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) O-postmortem Originated from a Postmortem action item. labels May 12, 2021

jlinder added T-sql-observability T-server-and-security DB Server & Security labels Jun 16, 2021

rafiss removed the T-sql-observability label Jun 25, 2021

rafiss mentioned this issue Jun 27, 2021

sql: cache hashed password and auth info during authentication #66919

Merged

craig bot closed this as completed in d4516d5 Jul 8, 2021

knz added the A-authentication Pertains to authn subsystems label Jul 29, 2021

knz added the A-cc-enablement Pertains to current CC production issues or short-term projects label Jul 29, 2021

exalate-issue-sync bot removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Feb 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql,server: cache password hashes and user info using a leasing protocol #58869

sql,server: cache password hashes and user info using a leasing protocol #58869

ajwerner commented Jan 12, 2021 •

edited by elinorgarcia

Loading

rafiss commented Jan 21, 2021 •

edited

Loading

knz commented Jun 25, 2021

ajwerner commented Jun 25, 2021

knz commented Jun 25, 2021

lunevalex commented Jun 25, 2021

lunevalex commented Jun 25, 2021

sql,server: cache password hashes and user info using a leasing protocol #58869

sql,server: cache password hashes and user info using a leasing protocol #58869

Comments

ajwerner commented Jan 12, 2021 • edited by elinorgarcia Loading

rafiss commented Jan 21, 2021 • edited Loading

knz commented Jun 25, 2021

ajwerner commented Jun 25, 2021

knz commented Jun 25, 2021

lunevalex commented Jun 25, 2021

lunevalex commented Jun 25, 2021

ajwerner commented Jan 12, 2021 •

edited by elinorgarcia

Loading

rafiss commented Jan 21, 2021 •

edited

Loading