sql: replacement for NodeID #48009

tbg · 2020-04-24T09:35:17Z

SQL uses the NodeID in dozens of places, but it won't be available to SQL tenants since they're not running on a Node, and since we can't (don't want to) hand out new cluster-wide NodeIDs for each tenant process. We can use a bogus ID, but we need to be careful about semantics and so I would rather audit and replace it with something that is well-documented so that incorrect usage is avoided.

Uses of NodeID generally fall into a few categories:

actual NodeID (usually because of a node liveness or Gossip usage)
- jobs, jobs: needs to work without liveness and nodeID #47892
- DistSQL draining gossip key
- node_id() builtin (easily disabled on tenants)
- various range descriptor handling stuff
- various crdb_internals that won't be available to tenants anyway
desire for a unique identifier for the current SQL process (across all processes under the same tenant ID)
- sql descriptor leasing
- unique_rowid
- session/query IDs
- event logging (stashes a "target_id" in the JSON payload - sort of bogus, I think we could get rid of that if we wanted to), this JSON concept is pretty old and horrible anyway
clock uncertainty optimization (observe a timestamp of gateway node eagerly)
- see NewTxn and kv: observed timestamps and multi-tenancy #48008

The items in the first category will essentially not work under multi-tenants and will be disabled for Phase 2.
The third category is out of scope for Phase 2 but we're aware of there being a real technical problem to solve (#48008).

The second category needs a tenant instance ID instead of a NodeID. The simplest implementation here could be allocating these IDs off a per-tenant counter, similar to the table descriptor ID generator. Once we are starting tenant SQL servers with some optional persistence, we can persist the tenant instance ID locally and allocate fewer of them; however an int64 per tenant will last us over all of the pod restarts that we will ever need.

The strawman plan I have here is to made the NodeID optional with a deprecation process similar to #47972 (we expect stuff in category 1 to stick around for some time since most of it will be disabled in phase 2) and to introduce the tenant instance ID which will use the NodeID when it is available.

@nvanbenschoten and @asubiotto, is "(tenant) instance ID" a name that resonates? I'm avoiding "gateway ID" since a "gateway" in our definition is the SQL server serving a given transaction, and that does not matter here. Also, would you do something differently from what I've proposed here?

Also cc @ajwerner since jobs is affected

The text was updated successfully, but these errors were encountered:

ajwerner · 2020-04-24T20:31:43Z

This SGTM. In the short term should we just shove the tenant ID into the node ID spot and set a high order bit in it to make it clear that it's not a real node ID? Is that too hacky for our tastes?

tbg · 2020-04-28T08:19:45Z

At the time of writing, a NodeID of 9999 is hard-coded to tide us over. But one of my next PRs will replace it with an abstraction. Roughly you'll get a

interface {
  DeprecatedNodeID() roachpb.NodeID
  OptionalNodeID() (roachpb.NodeID, bool)
  TenantInstanceID() something.TenantInstanceID
}

The goal is then to remove all DeprecatedNodeID() calls. OptionalNodeID() sticks around for something like crdb_internal.node_metrics which has the nodeID as a column (or so I think, anyway, you get the idea - the system tenant probably has legit uses for the NodeID).

nvanbenschoten · 2020-04-28T17:03:07Z

is "(tenant) instance ID" a name that resonates?

I'm fine with either "tenant instance ID" or "tenant process ID".

Also, would you do something differently from what I've proposed here?

No, I think what you laid out makes sense. Items in category 1 will be disabled initially (other than #47892, which is tracked separately). Items in category 2 can use some other form of a per-tenant unique identifier. Items in category 3 can be disabled indefinitely.

an int64 per tenant

The only question I'm left with is whether we want to make this tenant instance ID globally unique or only unique per tenant. I suspect that the former would actually be easier to implement and will make things a little easier to debug. The only downsides I can see to that are:

faster ID usage - but we're still never going to overflow an int64.
a single global hotspot for ID allocation - can be addressed by some block allocation of IDs if this ever becomes a problem.

With the advent of multi-tenancy, SQL servers can optionally be decoupled from the (and share a) KV backend. As a consequence, the NodeID may not be available and its use across all components that need to function in a multi-tenant setup is deprecated. Yet, some components require a similar unique identifier for the SQL server they're a part of. This new identifier is the SQLInstanceID. At the time of writing, since we're not yet able to start true SQL tenant servers and only exercise parts of this functionality in unit testing, it is a randomized integer. It's possible that we'll replace it with a per-tenant or even global counter in production, putting it truly on par with the NodeID, to be able to give the simple guarantee that these IDs are never reused between different incarnations of the SQL process for a given tenant. This final decision is kicked down the road, see cockroachdb#48009. This commit introduces an IDContainer which will aid this process by wrapping both the NodeID and SQLInstanceID, very similar to the work carried out for Gossip in cockroachdb#47972. We do not yet audit all of the call sites to keep the initial diff container. Instead, IDContainer has a Get() method that makes it a drop-in replacement for the previously used NodeIDContainer. `Get ` will be removed in a future commit while moving all callers to either the `Optional{,Err}` or `DeprecatedNodeID` methods. Release note: None

tbg · 2020-07-08T12:54:07Z

We use the SQL Instance ID where needed and various cleanup issues are filed, so closing.

tbg added A-multitenancy Related to multi-tenancy A-multitenancy-blocker labels Apr 24, 2020

tbg assigned nvanbenschoten and tbg and unassigned nvanbenschoten Apr 29, 2020

asubiotto mentioned this issue May 27, 2020

distsql: transition distsql physical planner to use SQLInstanceIDs instead of NodeIDs #49596

Closed

asubiotto mentioned this issue Jun 10, 2020

sql/multitenancy: eventlog should use sql instance IDs to be compatible with multitenancy #50047

Closed

tbg removed the A-multitenancy-blocker label Jul 8, 2020

tbg closed this as completed Jul 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: replacement for NodeID #48009

sql: replacement for NodeID #48009

tbg commented Apr 24, 2020

ajwerner commented Apr 24, 2020

tbg commented Apr 28, 2020

nvanbenschoten commented Apr 28, 2020

tbg commented Jul 8, 2020

sql: replacement for NodeID #48009

sql: replacement for NodeID #48009

Comments

tbg commented Apr 24, 2020

ajwerner commented Apr 24, 2020

tbg commented Apr 28, 2020

nvanbenschoten commented Apr 28, 2020

tbg commented Jul 8, 2020