server: always enable sql_instances maintenance #95355
Conversation
This isn't so straightforward, is it? The existing instance infrastructure is going to lease an ID. That ID is not going to be the same as the node ID. That's going to be very confusing. I think we need an implementation of instance storage which always uses the node ID or something like that.
I... guess? I dunno, I feel like we need a pretty good reason to make it behave differently and I'm not sure this is one? It seems like we should just say kv node IDs are kv node IDs, and are meaningful in the realm of kv nodes, and sql server IDs are sql server IDs and are meaningful in the world of sql servers -- and the sql_instances table is the source of truth for who has which ID -- and that they have nothing to do with each other?
It's fine. That is what we were planning to do.
Let's let @knz weigh in. I'm fine with making them different, but it leads to some questions. In particular, it may become possible for the instance ID to change throughout the life of the process. With some more work, we could make it so that that doesn't really happen, but that's going to take some real work.
I guess I can make it the same for now if it lets us just merge this sooner rather than later and then have that debate in isolation. Let me see how that'll look.
Force-pushed from 1761922 to e033ed3
No, it's the other way around. We need to inject the node ID as instance ID explicitly. This is also what #84602 is about.
Okay, IDs are fixed to nodeIDs for now with a TODO suggesting we remove that at a later point. RFAL.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @dt)
pkg/server/server_sql.go
line 505 at r1 (raw file):
if isMixedSQLAndKVNode {
    cfg.podNodeDialer = cfg.nodeDialer
I believe this special case shouldn't be needed. Is it?
pkg/server/server_sql.go
line 813 at r1 (raw file):
nodeDialer = cfg.nodeDialer
// TODO(dt): any reason not to just always use the instance reader? And just
// pass it directly instead of making a new closure here?
I don't understand your question. How does the instance reader know about the decommissioned bit?
pkg/server/server_sql.go
line 837 at r1 (raw file):
instanceIDs := make([]roachpb.NodeID, len(instances))
for i, instance := range instances {
    instanceIDs[i] = roachpb.NodeID(instance.InstanceID)
How does this code know how to skip over decommissioned nodes?
(The answer to this question is probably what answers the one above too.)
pkg/sql/sqlinstance/instancestorage/instancestorage.go
line 187 at r1 (raw file):
// and avoiding conflicts. This is true today but may not be in more
// complex deployments.
if nodeID != 0 {
Please do not rely on a sentinel value as a conditional; instead, use a fixInstanceId bool parameter or something equivalently verbose, with an explanation in the function docstring.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @dt, and @knz)
pkg/server/server_sql.go
line 813 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I don't understand your question. How does the instance reader know about the decommissioned bit?
The TODO is just to revisit why there are two implementations; I'm inclined to agree that they appear different (one skips decom), so I didn't eliminate them now, just left a TODO to evaluate it later.
pkg/server/server_sql.go
line 837 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
How does this code know how to skip over decommissioned nodes?
(The answer to this question is probably what answers the one above too.)
No idea; I didn't write or move this, but if it is good enough for sql pods, it seems like it should be good enough for non-pods, which was my point above. If it isn't good enough for pods, then we should fix that, but that's not really what this PR is about.
pkg/sql/sqlinstance/instancestorage/instancestorage.go
line 187 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Please do not rely on a sentinel value as a conditional; instead, use a fixInstanceId bool parameter or something equivalently verbose, with an explanation in the function docstring.
This signature was getting cumbersome enough, so I decided I'd rather just split it into two different public API methods (and then I don't really mind the sentinel value if it is strictly package-internal to their common implementation helper).
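For illustration, here is a minimal, self-contained sketch of that split -- two public entry points sharing a private helper, so the sentinel stays package-internal. The type and method names below are simplified stand-ins, not the actual cockroach signatures:

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the real sqlinstance types; names are illustrative only.
type InstanceInfo struct {
	InstanceID int32
	SessionID  string
	Addr       string
}

type Storage struct {
	nextID int32
}

// CreateInstance leases the next free instance ID (the SQL-pod path).
func (s *Storage) CreateInstance(ctx context.Context, sessionID, addr string) (InstanceInfo, error) {
	return s.createInstanceRow(ctx, sessionID, addr, 0 /* no pinned ID: lease the next one */)
}

// CreateNodeInstance pins the instance ID to the KV node ID (the in-process path).
func (s *Storage) CreateNodeInstance(ctx context.Context, sessionID, addr string, nodeID int32) (InstanceInfo, error) {
	return s.createInstanceRow(ctx, sessionID, addr, nodeID)
}

// createInstanceRow is the shared helper; the zero-nodeID sentinel stays
// private to the package, which is the compromise described above.
func (s *Storage) createInstanceRow(ctx context.Context, sessionID, addr string, nodeID int32) (InstanceInfo, error) {
	id := nodeID
	if id == 0 {
		s.nextID++
		id = s.nextID
	}
	return InstanceInfo{InstanceID: id, SessionID: sessionID, Addr: addr}, nil
}

func main() {
	var s Storage
	pod, _ := s.CreateInstance(context.Background(), "sess-1", "10.0.0.1:26257")
	node, _ := s.CreateNodeInstance(context.Background(), "sess-2", "10.0.0.2:26257", 7)
	fmt.Println(pod.InstanceID, node.InstanceID) // 1 7
}
```

The point of this shape is that callers never see the zero-ID sentinel; they pick the method that says what they mean.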
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/server/server_sql.go
line 505 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I believe this special case shouldn't be needed. Is it?
I don't really know; it was here before. I just pulled the instance storage stuff out of the conditional and otherwise left it as-is, since I didn't want to change anything not strictly required here, given the proximity to stability and my desire to get this done ASAP and move on to the feature work it is blocking.
Previously the system.sql_instances table was only maintained by SQL servers that were operating in "pod" mode, i.e. not in mixed KV and SQL process nodes, where KV-level liveness and gossip provide an alternative means of node discovery that the SQL layer can use when searching for other SQL instances. However, this inconsistency makes it difficult to write correct SQL-level code for remote-node discovery and interaction: in some cases such code needs to consult the instances list, and in others the KV liveness store, which, combined with the complexities of doing so around initialization, dependency injection, etc., can become hard to maintain.
My opinion is that it seems better to not give shared-process tenants rows in system.sql_instances. I know not everybody seems to agree, but that seems to me like the simple answer. If there are no rows in that table, there are no questions about them. For example, there's no question about whether the instance ID should be the same as the node ID (btw, my opinion on that one is that they should definitely be the same; making them different would be a big regression to me). Another big question is when these rows should be created. This dovetails with an area that I've generally felt needs some thought/decision: when do shared-process tenant servers get created? I think things are in flux; I think @knz wants to create them dynamically, the first time a SQL or HTTP or RPC connection comes along to a node for a particular tenant. But then, if we also create the sql_instance row at that time (like I think your PR does, right?), DistSQL would not find any instances around it. Which would be different from the non-UA clusters, for the worse.
Yet another question is when these rows should be deleted. We're already forced to deal with node status (alive, dead, decommissioned, etc.) - and I'd note that we haven't been known to do a great job at it - so why reinvent it for sql instances?
So, I for one would rather we push on making better abstractions for listing nodes that better hide whether sql_instances is used or not. I see you've run into some difficulties there, and I remember everything being messy, but I imagine that we could make it better.
this inconsistency
There are other inconsistencies between shared-process tenants and SQL pods, particularly around how they deal with their "liveness/sql sessions". SQL pods heartbeat a liveness session, and the instance ID is tied to that session ID. If there's no session ID, then do we leave the session_id field of sql_instances empty? I think that's ugly and forces the code dealing with these instance IDs to understand the different types of clusters.
There's a related discussion about whether shared-process tenants should also get sessions, in the name of consistency. I, for one, would be even more against that than against giving them sql_instances.
Additionally, such a design precludes a cluster where some SQL instances are in mixed KV nodes and some are not, as the non-KV nodes would have no way to discover the KV ones. Such deployments are not currently possible but could be in the future.
Indeed. I discussed this a bit in the past with Werner and Rafa, and nobody seemed to care about such clusters. So, I feel like we can leave this hanging.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
No, I'm registering at server startup, whether I'm a pod or in-proc. I'm registering using the node ID if I'm in-proc, and the next available ID if I'm a pod (since a single cluster never mixes these, conflicts are not a concern). More broadly, on the other points in this paragraph, I don't really see this line of argument about it being complex/inconsistent/confusing to reason about when there is or isn't a row or whatever -- if anything I think this change makes it simpler: if you started a sql server, i.e. a server that I might want to know about when I'm scheduling a distsql flow, then you put it in the table that I'll look in to find the sql servers, and do so linked to the session that'll let me know that that server is still alive and well. That's it, end of story, no "well, if it is..." things we have to handle specially. It doesn't matter to me if that sql server is in a kv proc or a sql pod -- that's irrelevant to the question of what the sql servers are, where (locality-wise) they are, and how I can contact them to ask them to run my flows.
All sql servers -- in proc or standalone -- get sessions. The jobs system, for example, depends on this, as job leases are tied to those sessions. Having this be uniform is good and I don't want to change that. Perhaps we could go further though and unify the idea of a sql liveness session with a sql_instance row, so they become one and the same. But that's for another day. For now I just want to make sql-process-discovery -- including the sessions in use by those processes -- uniform across in-proc and pod.
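To make the "job leases are tied to sessions" point concrete, here is a small hedged sketch of the pattern (hypothetical names, not the actual jobs-system code): a job claim records the claimant's liveness session, and the claim only holds while that session is live:

```go
package main

import "fmt"

// A job claim carries the claiming server's liveness session; the job is only
// considered held while that session is still live.
type job struct {
	ID             int64
	ClaimSessionID string
}

// canAdopt reports whether another SQL server may take over this job.
func canAdopt(j job, sessionIsLive func(string) bool) bool {
	// A job whose claim session has expired is up for re-adoption, regardless
	// of whether the claimant was an in-process server or a SQL pod.
	return j.ClaimSessionID == "" || !sessionIsLive(j.ClaimSessionID)
}

func main() {
	live := map[string]bool{"sess-a": true}
	j := job{ID: 1, ClaimSessionID: "sess-b"} // the claimant's session has expired
	fmt.Println(canAdopt(j, func(s string) bool { return live[s] })) // true
}
```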
like I think your PR does, right?
No, I'm registering at server startup, whether I'm a pod or in-proc. I'm registering using the node ID if I'm in-proc, and the next available ID if I'm a pod (since a single cluster never mixes these, conflicts are not a concern).
I'm not sure, but I think "server startup" is not what you think it is. You've changed newSQLServer(), right? That doesn't get called when the node starts; it gets called later, dynamically, when the serverController has decided to instantiate a particular tenant. Right?
If there's no session ID, then do we leave the session_id field of sql_instances empty
All sql servers -- in proc or standalone -- get sessions. The jobs system, for example, depends on this, as job leases are tied to those sessions. Having this be uniform is good and I don't want to change that. Perhaps we could go further though and unify the idea of a sql liveness session with a sql_instance row, so they become one and the same. But that's for another day. For now I just want to make sql-process-discovery -- including the sessions in use by those processes -- uniform across in-proc and pod.
You're right. I had forgotten how stuff works.
Still, at least for shared-process tenants, I think it's a bad idea to tie the instance ID to a particular session.
So when you do this:
instance, err = s.sqlInstanceStorage.CreateNodeInstance(
    ctx, session.ID(), session.Expiration(), s.cfg.AdvertiseAddr, s.cfg.SQLAdvertiseAddr, s.distSQLServer.Locality, nodeId)
I don't think we should be referencing the session. We don't want this instance ID to ever "expire"; there's no reason for it to expire. If it were to expire, various things that want a non-expired session would stop working for no good reason. It just so happens that the primary user of the instance ID (the unique_rowid() function) does not bother to check the expiration status, but that's simply a bug for serverless (#90459). (Btw, for stand-alone tenants, the process is shut down when the session expires precisely because the instance ID has become invalid; we don't do that shutdown for shared-process tenants.)
So if we do give instance IDs to shared-proc tenants, I think their ids should be decoupled from the session (but I maintain my opinion for now that we shouldn't create instance IDs at all).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
Can you elaborate? Why did it expire? If it expires, it implies that this server has stopped being able to reach and update liveness, which implies it is no longer healthy and functioning. We tie things to liveness sessions specifically because we want to know just that -- is that node, or whatever we tied to it, still alive and healthy. If an in-proc SQL server has stopped being able to update its liveness session then I probably don't really want it to keep being the coordinator for my job or assign any new flow processors to it, since something is clearly wrong with it.
My motivation here is specifically that I need a mechanism for discovery of all the sql servers on which I could schedule a flow, where they are located, as well as their current session, so that I can tie the transfer of work to them to that session -- if they go away while the transfer is happening, I don't want to leave the transferred work orphaned, so I explicitly want to transfer it with a specific liveness session attached.
We have exactly the thing I need -- the list of all sql servers, where they are located, how to contact them, and their associated liveness session -- already implemented in the sql_instances table, and we have all the code to maintain it. And we know it all works and we'll keep maintaining it because we rely on it for MT clusters. But then I discover that this is only run for MT servers, not in-proc servers. There is no persisted directory of in-proc servers and their current liveness session, so I'd need to go add a new RPC to ask each server for its session, and even getting the list of sql servers now involves hitting raw kv keys instead of a system table. Having this be wildly different just because of the process the server lives in seems prone to making it harder to use, with more code and more bugs. I also don't really see "some things don't check liveness" as a reason not to have the list. Like, sure, then we should fix them! But surely having two different ways to check liveness and knowing which you need to check is much worse, isn't it?
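As a rough illustration of the read path being argued for here, a planner would list instances, keep only those whose liveness session is still live, and schedule flows on the survivors. The types and field names below are hypothetical stand-ins for the sqlinstance reader, not its real API:

```go
package main

import "fmt"

// Hypothetical, simplified shapes for illustration only; the real reader lives
// in pkg/sql/sqlinstance and its field names differ.
type instanceRow struct {
	InstanceID int32
	Addr       string
	Locality   string
	SessionID  string
}

// planTargets keeps only the servers whose liveness session is still live;
// work handed to a server is tied to its session, so an expired session would
// leave transferred work orphaned.
func planTargets(instances []instanceRow, sessionIsLive map[string]bool) []instanceRow {
	var out []instanceRow
	for _, in := range instances {
		if sessionIsLive[in.SessionID] {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	instances := []instanceRow{
		{InstanceID: 1, Addr: "10.0.0.1:26257", Locality: "region=us-east1", SessionID: "sess-a"},
		{InstanceID: 2, Addr: "10.0.0.2:26257", Locality: "region=us-west1", SessionID: "sess-b"},
	}
	fmt.Println(planTargets(instances, map[string]bool{"sess-a": true}))
}
```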
Can you elaborate? Why did it expire? If it expires, it implies that this server has stopped being able to reach and update liveness, which implies it is no longer healthy and functioning.
I also don't really see "some things don't check liveness" as a reason not to have the list. Like, sure, then we should fix them! But surely having two different ways to check liveness and knowing which you need to check is much worse, isn't it?
In shared-proc tenants, for the purposes of unique_rowid(), whether or not the sql server's session is expired does not matter. We're able to generate unique IDs just fine whether or not we're able to heartbeat the session record. We're also able to generate unique IDs just fine even if we can't heartbeat the node's liveness record, or even if the local storage device is unresponsive, etc. We shouldn't tie together things that don't need to be tied.
The reason why we have to tie unique_rowid to the session expiration in standalone tenants is that some other sql server might take your instance ID away, at which point you don't have the ability to generate unique IDs any more.
My motivation here is specifically that I need a mechanism for discovery of all the sql servers on which I could schedule a flow, where they are located, as well as their current session, so that I can tie the transfer of work to them to that session -- if they go away while the transfer is happening, I don't want to leave the transferred work orphaned, so I explicitly want to transfer it with a specific liveness session attached.
I'm not generally against the use of the sql sessions. You can use them as far as I'm concerned. I'm not entirely convinced they needed to exist for single-tenant and shared-proc tenant servers, since it seems to me that they very much overlap in functionality with the node liveness record and its epoch, but whatever. So I take back my comment about how I don't like shared-proc tenants having them; I had forgotten that even single-tenant processes have them.
The instance IDs, though, are another matter in my opinion.
We have exactly the thing I need -- the list of all sql servers, where they are located, how to contact them and their associated liveness session - already implemented in the sql_instances table and have all the code to maintain it.
Except I think we don't. Does my note about how I don't think the sql_instances table is populated at node start time, like I think you assume it is, give you pause?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
For mixed kv and sql nodes, it looks like it is called at start, isn't it? pkg/server/server.go:891? And those are the only ones I'm suggesting changing -- we already enable this for any non-mixed servers.
Sure, but why make SQL users deal with the complexity of using one of two different systems depending on where they're running, when they could just always use one that works in both? Why use KV node listing to find sql servers when they're colocated, and the instance table when they're not, vs just always using the instance table? If there's some argument that the instance table, or sql liveness, isn't good enough and we should use the better kv system when it is available instead, that seems like a problem for our pods, where it is their only choice. So if it has to be good enough for pods to use, why not just use it all the time? Fewer branches and conditionals and special cases means less code, easier feature development, and fewer bugs.
Here's my take: while node IDs and epochs are similar and overlapping, they are not identical and they have different APIs. It's nice to not have to build adapters everywhere, and, for better or for worse, we have sql sessions in all the processes. After some listening, I find David's point about wanting to discover these sessions compelling. I also find the arguments for uniformity compelling. What I don't love is that in the short term we'll get confusing new IDs in play. I don't want to have to do the mapping of instance to node IDs while we don't need to. Does anybody expect a world with a mixture of kv nodes and sql pods of the same tenant any time soon? My proposal is that we indeed leverage the sql instance subsystem in kv nodes to propagate the information in the same way, but we don't do the same leasing dance -- we just plop the node ID into the instance ID slot and store the session. That way it'll behave uniformly on the reading path, and it'll unify the two concepts explicitly in the case where the server is in a kv node.
I believe the code you point to is only relevant for the sql server corresponding to the system tenant.
In the case of sql instance IDs, I think the differences between these IDs are significant enough that it's not a good idea for shared-proc tenants to use them:
So ultimately it's all subjective and there are tradeoffs, etc., but my opinion is that we'd end up with worse code by trying to hide the differences in instance IDs between shared-proc tenants and dedicated tenant pods while using the
Btw, as far as I'm concerned, I don't think you particularly need my nod of approval here, in particular if you get something that both Rafa and Werner are happy with. I don't feel any ownership here, fyi.
There is one other thing to note: in a world with repaving, the node ID can get above 15 bits without too much work. Once that becomes true, all claims about
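For context on the 15-bit remark, here is a minimal sketch of the assumed unique_rowid-style layout (the exact encoding in cockroach may differ): the low 15 bits carry the instance/node ID and the higher bits carry a time component, so an ID wider than 15 bits no longer fits in its slot:

```go
package main

import (
	"fmt"
	"time"
)

const idBits = 15 // assumed width reserved for the instance/node ID

// makeRowID packs a time component and an instance ID into one int64.
// This is only illustrative; it is not the real unique_rowid implementation.
func makeRowID(instanceID int32, now time.Time) int64 {
	ts := now.UnixMilli() // illustrative time component
	return (ts << idBits) | int64(instanceID)&(1<<idBits-1)
}

func main() {
	fmt.Printf("%064b\n", makeRowID(42, time.Now()))
	// An instance ID >= 1<<15 would be truncated by the mask above, which is
	// why node IDs growing past 15 bits would invalidate such claims.
}
```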
After a lengthy discussion with @andreimatei and @ajwerner offline, we concluded a few things/observations to state explicitly:
I'm sure I'm missing some things since we chatted for the better part of two hours and this is a brief summary from recollection, but I think the big take-aways:
Added a test showing the session ID being kept up to date after the session is expired, both when the instance row is removed in the interim and when it is not (they're the same, due to Put, but it seemed good to show both).
What in this PR needs to change in light of the recent mini-RFC?
My read is nothing. This diff is about populating the tables when you are running, and that RFC is about when you are or are not running, but it doesn't change what we do here.
So after discussions today/yesterday/etc -- are we good here? Anything else to do? |
Reviewed 2 of 4 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt)
pkg/sql/sqlinstance/instancestorage/instancestorage.go
line 176 at r4 (raw file):
nodeID roachpb.NodeID,
) (instance sqlinstance.InstanceInfo, _ error) {
nit: empty line
TFTRs!
bors r=knz
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/sql/sqlinstance/instancestorage/instancestorage.go
line 176 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: empty line
fixed, thanks!
Build failed (retrying...)
Build succeeded
Fixes #95571.
Previously the system.sql_instances table was only maintained by SQL servers that were operating in "pod" mode, i.e. not in mixed KV and SQL process nodes, where KV-level liveness and gossip provide an alternative means of node discovery that the SQL layer can use when searching for other SQL instances. However, this inconsistency makes it difficult to write correct SQL-level code for remote-node discovery and interaction: in some cases such code needs to consult the instances list, and in others the KV liveness store, which, combined with the complexities of doing so around initialization, dependency injection, etc., can become hard to maintain.
Additionally, such a design precludes a cluster where some SQL instances are in mixed KV nodes and some are not, as the non-KV nodes would have no way to discover the KV ones. Such deployments are not currently possible but could be in the future.
Instead, this change enables maintenance of the sql_instances table by all SQL servers, whether running in their own processes or embedded in a KV storage node process. This paves the way for making the means of discovery of SQL servers uniform across all SQL server types: they will all be able to simply consult the instances list to find any other SQL servers, regardless of where those SQL servers are running.
A follow-up change could simplify DistSQLPhysicalPlanner, specifically the SetupAllNodesPlanning method that has two different implementations due to the previous inconsistency in the available APIs.
Release note: none.
Epic: CRDB-14537