Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer security auto-configuration on initial node startup #82574

Merged

Conversation

albertzaharovits
Copy link
Contributor

@albertzaharovits albertzaharovits commented Jan 13, 2022

The new security autoconfiguration in version 8 prints information to the terminal. This information is required in order to access the node (eg the generated elastic user password). But this contends for screen lines with the normal log output of a starting node, to the extent that the security autoconfiguration can pass by unnoticed by the user.

For example, it can be easily missed in the case of a small-ish terminal window (32x110 px):
before_32_110

But it is observable on a big one (66x222 px):
before_66_222

This PR addresses this concern by:

  • making sure that auto-configuration is triggered only after the node is started (ClusterPlugin#onNodeStarted). Auto-configuration requires the http bind address, which is guaranteed to be available only after this point.
  • additionally deferring the auto-configuration by a fixed time amount (Thread.sleep(9000)), even after all the depended resources have finished initializing. This is ugly but the alternative to wait for other components to do their thing (eg installing index templates, or downloading of the geo-ip database), even though auto-configuration does not depend on then, is not a better alternative either.

Here are the results for the small terminal (32x110 px):
after_32_110

And for the bigger one (66x222 px):
after_66_222

Related: #82364

@albertzaharovits albertzaharovits self-assigned this Jan 13, 2022
@albertzaharovits albertzaharovits added :Security/Security Security issues without another label >bug v8.0.0 and removed v8.1.0 labels Jan 13, 2022
@elasticmachine elasticmachine added the Team:Security Meta label for security team label Jan 13, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@albertzaharovits
Copy link
Contributor Author

It is deemed a bug and targeting v8 past FF because it impacts user experience which is one of the core goals of Security ON by default.

Copy link
Contributor

@tvernum tvernum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.
I think the change is good, but having to do it by blocking/sleeping threads isn't ideal.

If there's reasons why it needs to be this way, let me know, but I think we can shuffle things a little bit and avoid having threads waiting on signals/timers.

nodeStartedSignal.await();
// a lot of stuff runs when a node just started, and the autoconfiguration is not time-critical
// and nothing else depends on it; be a good sport and wait a couple
Thread.sleep(9000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there are reason for sleep rather than a scheduled thread?
It's minor issue but it's better not to use up one on the generic threads sleeping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked the existence of the threadpool.schedule method.
I have pushed e0a33f3 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above I actually forgot to remove the sleep, I've later pushed 459d6a9 .

@@ -1282,6 +1288,11 @@ public void onIndexModule(IndexModule module) {
);
}

@Override
public void onNodeStarted() {
this.nodeStartedSignal.countDown();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we signal rather than just moving the call to InitialNodeSecurityAutoConfiguration.maybeGenerateEnrollmentTokensAndElasticCredentialsOnNodeStartup into this method?

I presume it's because we'd need to push all the arguments somewhere, but with this approach we've got a generic thread sitting waiting on a countdown when it could just be deferred until the right part of the lifecycle.

I understand that we're trying to do the smallest thing here, so maybe it's OK for now, but I think it would be better to make InitialNodeSecurityAutoConfiguration a real object, and store it in a field, and then call execute() on it from onNodeStarted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we signal rather than just moving the call to InitialNodeSecurityAutoConfiguration.maybeGenerateEnrollmentTokensAndElasticCredentialsOnNodeStartup into this method?

securityIndexManager.onStateRecovered must be invoked before the cluster state recovers, whcih means somewhere in the Node constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clearer, there are technically two conditions that must be met to trigger this last part of auto-configuration:

  1. the cluster state is just recovered and does not contain the .security index (which might later be created by other processes)
  2. the node is started

In practice, in my tests, these happen almost simultaneously.

In order to detect 1), a Security Index manager listener has to be registered before cluster state recovery starts, otherwise it might miss the notification for the state recovery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that we're trying to do the smallest thing here, so maybe it's OK for now, but I think it would be better to make InitialNodeSecurityAutoConfiguration a real object, and store it in a field, and then call execute() on it from onNodeStarted.

I have replaced the CountdownLatch with a ListenableFuture, 4902844 .
This way no thread is blocked, instead a listener is registered, which can be called by anyone on any thread because itself only schedules an action.

@jkakavas
Copy link
Member

How did we decide on the 9 seconds wait time ? It’s not that I have a bette suggestion or another number in mind but whether or not this is enough is mainly dependant on how much time the geo-up download stuff takes and this varies with each user’s connection speed , server availability etc
On the other hand, we don’t want to wait too much because we don’t want to risk the user losing focus from the terminal assuming “it’s done”

last time we discussed this in the project meeting , I remember folks being in general not pro a fixed amount of delay because of not knowing what amount makes sense and instead wanted to focus on making the template installation output and the geo ip output smaller by consolidating messages ie
#81640

Since geoip is unpredictable but largely resolved, I’m happy if 9s pulls us through the template output in a normal case (but I suspect this is also cpu/ram dependant) and we can remove / revisit when template output is adjusted

in general I’m +1 , just wanted to give some additional context. I can take a proper loook on Monday too when I’ll have a access to a laptop again but happy to merge this before if Tim approves

@albertzaharovits
Copy link
Contributor Author

How did we decide on the 9 seconds wait time ?

I have chosen it.

I am familiar with that story, but I don't know what conclusion to take from it.

@jkakavas
Copy link
Member

I have chosen it.

I realise that , I was just curious as to the why more than the who :) as in why 9, not 5 or 12
For instance, I’d go for the minimum amount of seconds that gets us after the template output. Could 9 be any shorter ?
I’m not trying to argue for the sake of argument, just trying to understand and maybe help tweak this a bit in favor of users ( if that can even be done )

I am familiar with that story, but I don't know what conclusion to take from it.

The reason I brought it up is that we had discussed it and decided not to do it (add a standard delay) but now we’re doing it. I’m fine with it, as I was when putting this forward as an option originally and we can revisit it in the future.

@albertzaharovits
Copy link
Contributor Author

I realise that , I was just curious as to the why more than the who :) as in why 9, not 5 or 12
For instance, I’d go for the minimum amount of seconds that gets us after the template output. Could 9 be any shorter ?
I’m not trying to argue for the sake of argument, just trying to understand and maybe help tweak this a bit in favor of users ( if that can even be done )

I've not been very methodical when choosing the value. Usually only one or two downloaded geo dbs (there are 3 in total, 6 log lines) produce log lines before the auto-conf details are printed. I was content that the information was visible on the smallish terminal size.

last time we discussed this in the project meeting , I remember folks being in general not pro a fixed amount of delay because of not knowing what amount makes sense and instead wanted to focus on making the template installation output and the geo ip output smaller by consolidating messages ie

I'm not sure which is more reasonable: to merge the log lines or to delay the autoconf info. I don't have any favorite between the two. So I picked the one that was floated first when we discussed this issue last.

@tvernum
Copy link
Contributor

tvernum commented Jan 18, 2022

wanted to focus on making the template installation output and the geo ip output smaller by consolidating messages

The low hanging fruit there is gone. The geo-ip messages have been reduced, but it's still too long for our needs and there's no chance of making them shorter in 8.0

Copy link
Contributor

@tvernum tvernum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I suspect we'll want to revisit this (somewhere around 8.2?) but this change makes progress for 8.0

@albertzaharovits albertzaharovits merged commit 05ddae8 into elastic:master Jan 18, 2022
@albertzaharovits albertzaharovits deleted the print-enrollment-saliently branch January 18, 2022 10:45
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Jan 18, 2022
)

The new security autoconfiguration in version 8 prints information to the terminal.
This information is required in order to access the node (eg the generated elastic
user password). But this contends for screen lines with the normal log output of
a starting node, to the extent that the security autoconfiguration can pass by
unnoticed by the user.

This PR addresses this concern by:
* making sure that auto-configuration is triggered only after the node is started
(ClusterPlugin#onNodeStarted). Auto-configuration requires the http bind address,
which is guaranteed to be available only after this point in time.
* Additionally deferring the auto-configuration by a fixed time amount (9 sec),
even after all the depended resources have finished initializing.
elasticsearchmachine pushed a commit that referenced this pull request Jan 18, 2022
…82733)

The new security autoconfiguration in version 8 prints information to the terminal.
This information is required in order to access the node (eg the generated elastic
user password). But this contends for screen lines with the normal log output of
a starting node, to the extent that the security autoconfiguration can pass by
unnoticed by the user.

This PR addresses this concern by:
* making sure that auto-configuration is triggered only after the node is started
(ClusterPlugin#onNodeStarted). Auto-configuration requires the http bind address,
which is guaranteed to be available only after this point in time.
* Additionally deferring the auto-configuration by a fixed time amount (9 sec),
even after all the depended resources have finished initializing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Security/Security Security issues without another label Team:Security Meta label for security team v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants