-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
APM server monitoring #32515
APM server monitoring #32515
Conversation
/cc @chrisronline |
Pinging @elastic/es-core-infra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts about creating a new index. I'm definitely more against creating a new one than for, but it does follow a pattern.
"index.number_of_replicas": 0, | ||
"index.number_of_shards": 1 | ||
}, | ||
"version": 7000001, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: When this is backported, it needs to share the same version as the other templates.
} | ||
} | ||
}, | ||
"beats_stats": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should roll APM server stuff into the Beats index. We already explicitly ignore apm-server
data for all Beats monitoring UI, so it would help to avoid an extra template and set of indices, and the index is going to share a majority of the same fields with Beats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've already run into a lot of confusion with using beats
within apm-server
- I think users who monitor their indices and don't use beats will be very surprised to see a beats index present.
Are there any security concerns that would push towards separating the indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, @ruflin, @graphaelli and I discussed this briefly off PR. The advantages I see of giving APM server its own set of monitoring indices are:
- We could, in the future, create a separate built-in security role for APM server monitoring and restrict it's write access to only APM server monitoring indices.
- The mapping for APM server monitoring indices can grow to fit APM server needs without "polluting" the general beats mapping.
- (Relatively minor) It removes the filtering in the Monitoring UI code for excluding APM server documents when looking at Beats monitoring and, vice versa, for only including APM server documents when looking at APM server monitoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The very big negative for adding a new index is that it creates a drain on Elasticsearch resources. Chances are, users won't have a lot of apm-server
instances reporting to the Monitoring cluster, so creating a separate index for such a low scale use case means we're adding the shards for what will likely be about 24 * 60 * 60 / 10 = 8640
documents per day. At the same time, we'll require Elasticsearch to search each one of those indices for data whenever we try to summarize the stack views.
That's extremely wasteful, especially for small instances/clusters that can generally host monitoring data, but are now getting pushed closer to their limits because of the added load from threading the searches, joining the data in memory, growing the cluster state, and adding the segments.
I think users who monitor their indices and don't use beats will be very surprised to see a beats index present.
We could get around this fear by either documenting it, or updating the index pattern for .monitoring-beats-6-*
to be .monitoring-beats-apm-6-*
. We could do this for 6.5 and add an alias to .monitoring-beats-6-alias
, to the template, so that there is no impact to the UI.
We could, in the future, create a separate built-in security role for APM server monitoring and restrict it's write access to only APM server monitoring indices.
We can still do this. The product user (Beats, Kibana, Logstash) writes indirectly to the monitoring data, so adding a user specific to APM does not require its own index. When Beats gains the ability to monitor APM then we can restrict access by adding a suffix to the index (e.g., .monitoring-beats-apm-6-apm-*
) and limit the user to that, if we so choose. It would be ideal, even if we do split the indices, to think of them as a .monitoring-*
index pattern rather than the subsets. This should help us, in the future, to avoid this kind of problem. We never should have separated Kibana into its own index. We did that for curation and this, but it's useless on its own, tiny, and never curated separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a chat about this with @ycombinator last week and we expect that apm-server metrics potentially diverge from the Beats ones and we would prefer to keep it separate to have the flexibility in the future.
For the resource usage problem we should rather look into using ILM to have less indices and shards in my opinion instead of mixing multiple products into 1 index.
The other extreme we could go is having the data form all products in 1 index as we currently do with Metricbeat, but we should not start to mix both options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we expect that apm-server metrics potentially diverge from the Beats ones and we would prefer to keep it separate to have the flexibility in the future.
Isolating data by the prefixed type
has always worked for stack monitoring and nothing about that should be impacted by such deviations unless they try to share a type
(e.g., beats_stats
) while changing the meaning of specific fields. I don't expect that to happen (versus augmenting it, which is fine).
For the resource usage problem we should rather look into using ILM to have less indices and shards in my opinion instead of mixing multiple products into 1 index.
I agree that we should be using ILM/rollover when it is released, regardless of what we do here. However, at least for the short term, to avoid expanding the amount of data that we're storing on existing monitoring clusters, we will want to configure ILM to prune data based on time to match existing behavior (7 days by default). Longer term, particularly once Beats takes over, we can hopefully revisit this solution to be based more on size. Also, adding the shards and extra index mappings is an unnecessary impact that I really think that we should avoid.
The other extreme we could go is having the data form all products in 1 index as we currently do with Metricbeat, but we should not start to mix both options.
I strongly think that we should move in that direction. With Beats, if we / the user decides that they want their data isolated by product, then it's as simple as modifying the index like how I am suggesting to add the -apm
suffix. It really does boil down to a huge mistake that we aren't already doing this.
Think about the impact on the cluster from sharing the index relative to Beats-based usage:
- Searches use the same shards
- Bulk activity can use the same shards
Versus separate indices:
- Searches use separate shards
- Index mappings are separated (growing the cluster state and routing table)
- Bulk activity cannot be optimized further than by product
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chatted with @ruflin off-PR. For now, for the reasons that @pickypg has articulated in this thread, I'm going to not create a separate index (and template) for APM server monitoring documents. In the longer run, it's likely that we'll be changing indices altogether, perhaps consolidating all monitoring data into metricbeat indices; we can always revisit the indexing strategy then. So for now I'm going to update this PR to remove the index-related changes.
I will, however, keep the changes for a new built-in role and user for APM server monitoring, since asking users/admins to use beats users/roles could be easily confusing (as opposed to users/admins seeing monitoring indices with beats
in the name as those are a bit more behind the scenes).
"type": "keyword" | ||
}, | ||
"name": { | ||
"type": "keyword" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on marking this and other fields not relevant to apm-server as enabled: false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we could even just remove such fields from the mapping entirely for now, and add them as/when we need them. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that require removing them from the libbeat reporting for apm-server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think so because the mapping uses dynamic: false
, so any un-mapped fields will be simply ignored and not indexed (as opposed to if the mapping were to use dynamic: strict
, in which case un-mapped fields would cause errors).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The belief is right. Anything unmapped just tags along in the _source
but is not used by ES.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So @graphaelli, which fields specifically is it okay to remove from the mapping? Let me know and I'll take them out of this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of metrics.libbeat.config.*
can go as we don't support modules or reloading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ycombinator Perhaps we can also make it possible to not show up in the events if not there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruflin Can you clarify? Do you mean, we can make it so that Beats (APM server) doesn't ship these fields to Monitoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ycombinator Yes, but not only applies to apm-server as other beats also don't have these fields. But it's not important at the moment I would say.
I think we will also have to add a user for apm-server to Elasticsearch. |
++ will add built-in role and user, analogous to |
d73a203
to
23241b4
Compare
Pinging @elastic/es-security |
14e403b
to
aa65190
Compare
@@ -4,7 +4,7 @@ | |||
== elasticsearch-setup-passwords | |||
|
|||
The `elasticsearch-setup-passwords` command sets the passwords for the built-in | |||
`elastic`, `kibana`, `logstash_system`, and `beats_system` users. | |||
`elastic`, `kibana`, `logstash_system`, `beats_system`, and `apm_server_system` users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to call the user apm-server_system
? I mainly worry that users might mistype it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't mixing the dash and the underscore make it more easy for users to mistype the username?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be argued both ways and probably both options are going to cause trouble. Go with the option you feel more comfortable with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@graphaelli Why don't you break the tie here, since this username is intended for APM customers after all 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apm_server_system
looks great, thanks
Looks like the failing test is related:
|
Yeah, I can fix that test, but I'll have to "fix" it again if we decide not to go with a separate index. So I was hoping to get that resolved first. |
3a38edc
to
9985f11
Compare
@pickypg @ruflin @graphaelli I've updated this PR per #32515 (comment). Ready for your review. Additionally, could someone from @elastic/es-security review as well, please? Thanks! |
@@ -63,7 +64,7 @@ | |||
public class SetupPasswordTool extends LoggingAwareMultiCommand { | |||
|
|||
private static final char[] CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789").toCharArray(); | |||
public static final List<String> USERS = asList(ElasticUser.NAME, KibanaUser.NAME, LogstashSystemUser.NAME, BeatsSystemUser.NAME); | |||
public static final List<String> USERS = asList(ElasticUser.NAME, KibanaUser.NAME, LogstashSystemUser.NAME, BeatsSystemUser.NAME, APMServerSystemUser.NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line probably needs shortening. See CI error:
21:07:59 [ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/esnative/tool/SetupPasswordTool.java:67: Line is longer than 140 characters (found 160). [LineLength]
3ebd1f5
to
88875c1
Compare
Now CI found a new place to complain about line length:
|
88875c1
to
44f0bb7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. My only thought is around whether we need server
in the username, but I'm fine with or without it.
@graphaelli what's your opinion on this? I put |
0df5f98
to
3a5cef1
Compare
@jaymode CI is green again, mind taking a (hopefully final) peek? Thanks! Shaunak |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Adding new MonitoredSystem for APM server * Teaching Monitoring template utils about APM server monitoring indices * Documenting new monitoring index for APM server * Adding monitoring index template for APM server * Copy pasta typo * Removing metrics.libbeat.config section from mapping * Adding built-in user and role for APM server user * Actually define the role :) * Adding missing import * Removing index template and system ID for apm server * Shortening line lengths * Updating expected number of built-in users in integration test * Removing "system" from role and user names * Rearranging users to make tests pass
Backported to:
|
* This was broken by elastic#32515 since the 5.x versions were removed between PR creation and merge
* This was broken by #32515 since the 5.x versions were removed between PR creation and merge
* This was broken by elastic#32515 since the 5.x versions were removed between PR creation and merge
* master: Adjust BWC version on mapping version Token API supports the client_credentials grant (#33106) Build: forked compiler max memory matches jvmArgs (#33138) Introduce mapping version to index metadata (#33147) SQL: Enable aggregations to create a separate bucket for missing values (#32832) Fix grammar in contributing docs SECURITY: Fix Compile Error in ReservedRealmTests (#33166) APM server monitoring (#32515) Support only string `format` in date, root object & date range (#28117) [Rollup] Move toBuilders() methods out of rollup config objects (#32585) Fix forbiddenapis on java 11 (#33116) Apply publishing to genreate pom (#33094) Have circuit breaker succeed on unknown mem usage Do not lose default mapper on metadata updates (#33153) Fix a mappings update test (#33146) Reload Secure Settings REST specs & docs (#32990) Refactor CachingUsernamePassword realm (#32646)
* 6.x: Introduce mapping version to index metadata (#33147) Move non duplicated actions back into xpack core (#32952) HLRC: Create server agnostic request and response (#32912) Build: forked compiler max memory matches jvmArgs (#33138) * Added breaking change section for GROUP BY behavior: now it considers null or empty values as a separate group/bucket. Previously, they were ignored. * This is part of backporting of #32832 SQL: Enable aggregations to create a separate bucket for missing values (#32832) [TEST] version guard for reload rest-api-spec Fix grammar in contributing docs APM server monitoring (#32515) Support only string `format` in date, root object & date range (#28117) [Rollup] Move toBuilders() methods out of rollup config objects (#32585) Accept Gradle build scan agreement (#30645) Fix forbiddenapis on java 11 (#33116) Run forbidden api checks with runtimeJavaVersion (#32947) Apply publishing to genreate pom (#33094) Fix a mappings update test (#33146) Reload Secure Settings REST specs & docs (#32990) Refactor CachingUsernamePassword realm (#32646)
This PR:
apm_system
built-in role with privileges to call the bulk Monitoring API, analogous to thebeats_system
built-in role.apm_system
built-in user with theapm_system
role, analogous to thebeats_system
built-in user.