-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Defer security auto-configuration on initial node startup #82574
Defer security auto-configuration on initial node startup #82574
Conversation
Pinging @elastic/es-security (Team:Security) |
It is deemed a bug and targeting v8 past FF because it impacts user experience which is one of the core goals of Security ON by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments.
I think the change is good, but having to do it by blocking/sleeping threads isn't ideal.
If there's reasons why it needs to be this way, let me know, but I think we can shuffle things a little bit and avoid having threads waiting on signals/timers.
nodeStartedSignal.await(); | ||
// a lot of stuff runs when a node just started, and the autoconfiguration is not time-critical | ||
// and nothing else depends on it; be a good sport and wait a couple | ||
Thread.sleep(9000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there are reason for sleep
rather than a scheduled thread?
It's minor issue but it's better not to use up one on the generic threads sleeping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I overlooked the existence of the threadpool.schedule
method.
I have pushed e0a33f3 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the above I actually forgot to remove the sleep, I've later pushed 459d6a9 .
@@ -1282,6 +1288,11 @@ public void onIndexModule(IndexModule module) { | |||
); | |||
} | |||
|
|||
@Override | |||
public void onNodeStarted() { | |||
this.nodeStartedSignal.countDown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we signal rather than just moving the call to InitialNodeSecurityAutoConfiguration.maybeGenerateEnrollmentTokensAndElasticCredentialsOnNodeStartup
into this method?
I presume it's because we'd need to push all the arguments somewhere, but with this approach we've got a generic thread sitting waiting on a countdown when it could just be deferred until the right part of the lifecycle.
I understand that we're trying to do the smallest thing here, so maybe it's OK for now, but I think it would be better to make InitialNodeSecurityAutoConfiguration
a real object, and store it in a field, and then call execute()
on it from onNodeStarted
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we signal rather than just moving the call to InitialNodeSecurityAutoConfiguration.maybeGenerateEnrollmentTokensAndElasticCredentialsOnNodeStartup into this method?
securityIndexManager.onStateRecovered
must be invoked before the cluster state recovers, whcih means somewhere in the Node
constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clearer, there are technically two conditions that must be met to trigger this last part of auto-configuration:
- the cluster state is just recovered and does not contain the
.security
index (which might later be created by other processes) - the node is started
In practice, in my tests, these happen almost simultaneously.
In order to detect 1), a Security Index manager listener has to be registered before cluster state recovery starts, otherwise it might miss the notification for the state recovery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that we're trying to do the smallest thing here, so maybe it's OK for now, but I think it would be better to make InitialNodeSecurityAutoConfiguration a real object, and store it in a field, and then call execute() on it from onNodeStarted.
I have replaced the CountdownLatch
with a ListenableFuture
, 4902844 .
This way no thread is blocked, instead a listener is registered, which can be called by anyone on any thread because itself only schedules an action.
How did we decide on the 9 seconds wait time ? It’s not that I have a bette suggestion or another number in mind but whether or not this is enough is mainly dependant on how much time the geo-up download stuff takes and this varies with each user’s connection speed , server availability etc last time we discussed this in the project meeting , I remember folks being in general not pro a fixed amount of delay because of not knowing what amount makes sense and instead wanted to focus on making the template installation output and the geo ip output smaller by consolidating messages ie Since geoip is unpredictable but largely resolved, I’m happy if 9s pulls us through the template output in a normal case (but I suspect this is also cpu/ram dependant) and we can remove / revisit when template output is adjusted in general I’m +1 , just wanted to give some additional context. I can take a proper loook on Monday too when I’ll have a access to a laptop again but happy to merge this before if Tim approves |
I have chosen it. I am familiar with that story, but I don't know what conclusion to take from it. |
I realise that , I was just curious as to the why more than the who :) as in why 9, not 5 or 12
The reason I brought it up is that we had discussed it and decided not to do it (add a standard delay) but now we’re doing it. I’m fine with it, as I was when putting this forward as an option originally and we can revisit it in the future. |
I've not been very methodical when choosing the value. Usually only one or two downloaded geo dbs (there are 3 in total, 6 log lines) produce log lines before the auto-conf details are printed. I was content that the information was visible on the smallish terminal size.
I'm not sure which is more reasonable: to merge the log lines or to delay the autoconf info. I don't have any favorite between the two. So I picked the one that was floated first when we discussed this issue last. |
The low hanging fruit there is gone. The geo-ip messages have been reduced, but it's still too long for our needs and there's no chance of making them shorter in 8.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I suspect we'll want to revisit this (somewhere around 8.2?) but this change makes progress for 8.0
) The new security autoconfiguration in version 8 prints information to the terminal. This information is required in order to access the node (eg the generated elastic user password). But this contends for screen lines with the normal log output of a starting node, to the extent that the security autoconfiguration can pass by unnoticed by the user. This PR addresses this concern by: * making sure that auto-configuration is triggered only after the node is started (ClusterPlugin#onNodeStarted). Auto-configuration requires the http bind address, which is guaranteed to be available only after this point in time. * Additionally deferring the auto-configuration by a fixed time amount (9 sec), even after all the depended resources have finished initializing.
…82733) The new security autoconfiguration in version 8 prints information to the terminal. This information is required in order to access the node (eg the generated elastic user password). But this contends for screen lines with the normal log output of a starting node, to the extent that the security autoconfiguration can pass by unnoticed by the user. This PR addresses this concern by: * making sure that auto-configuration is triggered only after the node is started (ClusterPlugin#onNodeStarted). Auto-configuration requires the http bind address, which is guaranteed to be available only after this point in time. * Additionally deferring the auto-configuration by a fixed time amount (9 sec), even after all the depended resources have finished initializing.
The new security autoconfiguration in version 8 prints information to the terminal. This information is required in order to access the node (eg the generated elastic user password). But this contends for screen lines with the normal log output of a starting node, to the extent that the security autoconfiguration can pass by unnoticed by the user.
For example, it can be easily missed in the case of a small-ish terminal window (32x110 px):
But it is observable on a big one (66x222 px):
This PR addresses this concern by:
ClusterPlugin#onNodeStarted
). Auto-configuration requires the http bind address, which is guaranteed to be available only after this point.Thread.sleep(9000)
), even after all the depended resources have finished initializing. This is ugly but the alternative to wait for other components to do their thing (eg installing index templates, or downloading of the geo-ip database), even though auto-configuration does not depend on then, is not a better alternative either.Here are the results for the small terminal (32x110 px):
And for the bigger one (66x222 px):
Related: #82364