-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RAC][Rule Registry] Index bootstrapping: make it more robust #106425
Comments
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
Pinging @elastic/security-solution (Team: SecuritySolution) |
Hey everyone, I removed this ticket from the backlog of the Detection Rules area. We (@elastic/security-detections-response-rules) are not the owners anymore (however feel free to still ping us if you have any tech questions about the ticket). Ownership of this ticket and other tickets related to rule_registry (like #101016) now goes to the Detection Alerts area ( |
Transferring again to @elastic/response-ops as they now own the rule registry implementation. |
Parent ticket: #101016
Summary
We identified a few weak spots in the current implementation of index bootstrapping, that could represent potential issues and bugs. Let's address them to make the logic more resilient.
1. Index name creation & update logic
Currently, index name creation and update logic is split into two pieces.
rule_registry/server/rule_data_client/index.ts
contains index name creation logic:rule_registry/server/rule_data_plugin_service/utils.ts
contains index name update logic:incrementIndexName
accepts index names created bycreateWriteTargetIfNeeded
and returnsundefined
if the name doesn't match the pattern. However, the relationship between the two functions is not explicit. And it could lead to bugs if they become unsynchronized. A more robust option would be to have a single function that appends-000001
if the name doesn't contain it or increments the number otherwise.Also, it should be flexible and not rely on 6 last digits, because there can be edge cases in general (#102586 (comment)).
2. Index mapping update logic
Index mapping update logic has several white spots. Add my concerns as comments to the
updateAliasWriteIndexMapping
method.3. Error handling during bootstrapping
Currently, indices bootstrapping considered to be successful even if some requests failed. There is no way to programmatically retrieve the outcome of index template initialization as we only log errors to console. See initialisation in
ecurity_solution/server/plugin.ts
for example:It could lead to a situation when index bootstrapping fails, but
RuleDataClient
doesn't know anything about it and writes documents into an index that doesn't have proper mappings. In the worst case, dynamic mappings would be applied to the indexed documents, leading to unexpected behavior, like broken aggregations in the application code.A safer approach would be to disable all write operations if index bootstrap failed. This is probably should be done on the
rule_registry
library level.4. Retry logic during bootstrapping
We should also add some retry logic to the bootstrapping logic.
I talked to Elasticsearch team, and they recommended to add retries to component and index template updates, or IMO wherever else we could get
{ acknowledged: false }
from ES during index bootstrapping.Here's the log of our discussion:
Q: In Kibana (specifically, RAC project), we started to use the new template APIs for bootstrapping our indices (PR for reference #102586). This code is not well tested and not production ready, but it’s going to be used from Security and Observability solutions in the upcoming releases, so I’d like to make sure that we implemented it correctly. Specifically, we started to use:
PUT _index_template/...
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-template.htmlPUT _component_template/...
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-component-template.htmlWe make sure to wait for the responses from these endpoints before any other code will be able to write anything to the corresponding indices. So any writing happens only if these calls succeed.
In those responses, we normally get 200 OK with the body:
I wasn’t able to find it in the docs or elsewhere, so I’d like to double-check two things.
The first thing is about
acknowledged: true
:acknowledged: false
in a 200 OK response?The second thing is about race conditions between index mgmt and indexing requests. Is it possible to get a race condition between successfully acknowledged template mgmt requests, and the following (potentially, immediately after that) attempt to write a document to an index which name matches the name of the updated index template?
POST /_index_template/_simulate API
); second, if this fails, we do a rollover of the alias.When developing locally, we’ve got into some weird broken state where the concrete index ended up with dynamic mappings instead of mappings from the index template. Most likely, this was caused by our incorrectly implemented code + constant server reloads during development. However, I decided to double-check if any race conditions are possible in general.
A:
That means that the change has been acknowledged by the master and accepted with a cluster state update
You can get "false" for the acknowledged, if, when publishing the new cluster state, the request times out before enough nodes have acknowledged and accepted the cluster state
there is not any listed errors, so generally a good rule to consider would be to retry in that case
if the template update response have "acknowledged: true", then retrieving those settings through something like the simulate API or creating a new index will use the just updated template
Q: If this question even makes sense, how many nodes is enough though? Does acknowledged: true guarantee that any subsequent write request will be handled by a node that received the most recent cluster state with the new mappings etc?
Just want to make sure there’s no way to get a race condition, leading to some weird state like let’s say a node which doesn’t know about a newly created index, processes write request and creates an index with dynamic mappings instead of mappings specified in the templates.
A: You shouldn't have to worry about this assuming the request is acknowledged: true, because that means the master node has acknowledged it, and all those requests (index creation, mappings updates) go through the master node
Further details
The text was updated successfully, but these errors were encountered: