Watcher: Possible areas of future work #34546
Comments
Pinging @elastic/es-core-infra
Not sure this is the right place, but my wishlist is:
This adds the ability to execute an action for each element that occurs in an array; for example, you could send a dedicated slack action for each search hit returned from a search. There is also a limit on the number of actions executed, which is hardcoded to 100 right now, to prevent watches from running forever. The watch history logs each action result and the total number of actions that were executed. Relates #34546
Foreach & Transform inconsistency: after a transform has run, execution continues on the `payload` variable (`Payload payload = ctx.payload();` ... `payload = transformResult.payload();` at Line 153 in 8109a2c), but when a `foreach` path is specified, the execution works on the original `ctx` instead of the `payload` variable (`Object object = ObjectPath.eval(path, toMap(ctx));` at Line 173 in 8109a2c).
@spinscale Please make the executing Elasticsearch node name available as a ctx variable. This would be helpful for debugging. It would also be nice to have a dedicated watcher node role to control the nodes on which watches are allowed to run.
Please see https://www.elastic.co/blog/distributed-watch-execution-elasticsearch-6.0 - you can use shard allocation filtering for dedicated watch execution nodes, and the node_id is stored in the watch history. Hope this helps!
@spinscale Thank you for the shard allocation hint, this is very helpful. The problem with the node_id is that I would need it in the context, i.e. to use it in the content sent by the watch. Is there any solution available for this? I have tried to get it directly with Painless, e.g. by reading the hostname system variable, but without success.
@spinscale One more thing. I have tried to use metadata variables in search inputs to define the index and set params, without success. Am I doing something wrong, or is this not supported?
This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed. |
Fix tests
Right now there are way too many broken watch tests and open issues about them. They need to be triaged and fixed.
Upgrade from 5.x may break when using stored scripts
Due to a change in how stored scripts are parsed in 6.x, a watch stored in 5.x may cease to work after the 6.x upgrade. The issue is described in #33058. This comment describes the result of a discussion, basically doing some form of read-repair. However this approach requires a proper upgrade strategy when going to 7.x.
Storing Execution State as part of the Watch status
This is one of the most requested features. Right now every watch has a stateless execution: it is loaded, executed, a watch history entry gets created, and the status gets stored together with the watch. However, if a watch execution would like to know what its previous states were (for example to detect flapping of the condition and only trigger an alert if the last three conditions were true), then one has to query the watch history and extract this from there, if it is stored there at all.
It would be really useful if there were a way to store some kind of small state within the watch status. This state could contain arbitrary data structures (keys, values, maps, lists), but should not exceed a certain size to keep it reasonably small.
One open question is when this state should be set: before or after the condition is executed? This could be circumvented by simply setting it within any scripts. So you could have a script condition that does something like this.
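A sketch of what such a script condition might look like, assuming the proposed `ctx.state` object existed; the `failed_checks` key is purely illustrative:

```json
{
  "condition": {
    "script": {
      "source": "long n = ctx.state.containsKey('failed_checks') ? ctx.state.failed_checks : 0; n = ctx.payload.hits.total > 0 ? n + 1 : 0; ctx.state.failed_checks = n; return n >= 3;"
    }
  }
}
```

This would only fire once the condition has held for three consecutive executions, addressing the flapping scenario described above.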
When the watch status is stored, we could simply serialize the `ctx.state` object and put it into the watch status as well, and of course load it again on the next execution.
Prevent execution hotspots
Good getting started issue. This is mainly about thread pool/execution management and not too much about Watcher.
If you are using the `interval`-based schedule and you have 1800 watches running every 5m, in theory you have a window of 300 seconds, meaning you could run 6 watches per second, which is not much. In practice however we currently load all watches at roughly the same time and count 5 minutes from loading, so an execution hotspot is created that can result in watches not being executed because the thread pool queue size has been exceeded.
We could find fancy algorithms to even out the execution, but something simpler, like catching the rejected executions and feeding them back into the trigger service to be picked up on the next execution, might already be enough.
Silence a watch for some time
Good getting started issue.
The ability to silence a watch for e.g. 20 minutes is a frequently requested feature. With the current infrastructure we could implement this using the existing Ack Watch API. One could add an expiry date or time, which then gets stored as a throttle period in the watch status.
More details can be found at #34038
Allow black/whitelisting of URLs the HTTP client used for inputs/actions can access
Good getting started issue.
Watcher uses its own HTTP client that currently can send any data anywhere. Users may want to restrict this by configuring an explicit blacklist, reducing the possible endpoints, especially if regular users are allowed to add watches.
More details can be found at #29937
Add support for HTTP basic auth passwords in keystore
Good getting started issue.
For slack/pagerduty/email we can store the passwords of an account in the keystore. For HTTP basic auth however we can only store the password in the watch itself. Maybe we need something similar here: this would allow configuring a setting like `xpack.notification.http.foo.password`. All that would then be needed in the watch is a reference to this password via some placeholder, e.g. `::es_keystore::foo`, instead of the actual password.
Issue at #30686
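A hypothetical watch input using such a keystore reference might look like this; the `::es_keystore::foo` placeholder, URL, and username are illustrative, not an existing feature:

```json
{
  "input": {
    "http": {
      "request": {
        "url": "https://internal.example.org/service/_stats",
        "auth": {
          "basic": {
            "username": "watcher_user",
            "password": "::es_keystore::foo"
          }
        }
      }
    }
  }
}
```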
Use ILM for history indices
Good getting started issue.
Right now the cleanup of watch history indices is done by monitoring, as the code already existed there. Luckily we will soon have ILM in core, allowing us to reuse it and fix this properly. See #34073
Improve watch parsing
Parsing of watches is in general too lenient and needs to be improved. See #34073, #31852, #29746, #35189
Stop strict parsing of JSON structures sent to 3rd parties
Whenever a 3rd party provider updates their API and adds new fields, we are playing catchup with them. Especially slack is known for adding new features over time. As the watch API parsing is strict and parses all fields, new fields in the upstream API have to be added to the watcher parsing code as well.
I think we should loosen this restriction and just take the JSON as-is and send it over to 3rd parties; ensuring that it is valid JSON seems sufficient to me and will save us development cycles over time.
Related issues are #34073, #31032, #30090
Improve watch index mappings
Metadata fields are searchable, which can end up in mapping conflicts. See #30542
The watch history should probably use index time sorting (see #30069); on the other hand this could slow down indexing and increase watch execution times.
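For illustration, index time sorting is configured at index creation time. A sketch of what the history index template settings might contain (the sort field name is an assumption about the history mapping):

```json
{
  "settings": {
    "index.sort.field": "trigger_event.triggered_time",
    "index.sort.order": "desc"
  }
}
```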
UI bugfixes
There are some outstanding issues with the UI that require fixing and/or discussion.
Prevent executions on single shard due to hashing function
This issue can only happen in certain configurations, but one could end up with watches being executed on one shard only (or on the number of primary shards, if it is > 1). See #32636
Monitoring integration
The watches in monitoring need to be examined and fixed. See #33649 and #33051
Run action for each element in an array
Good getting started issue. Requires some sync with PM first.
Allow running an action for each element of an array; for example, for each hit of a search response you may want to create a separate slack message.
An action would look like this
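A sketch of how such an action might look, assuming a `foreach` field on the action that points at an array in the payload and runs the action once per element (field names here follow the proposal, not a shipped feature):

```json
{
  "actions": {
    "slack_per_hit": {
      "foreach": "ctx.payload.hits.hits",
      "max_iterations": 100,
      "slack": {
        "message": {
          "text": "Hit {{ctx.payload._id}} in index {{ctx.payload._index}}"
        }
      }
    }
  }
}
```

Within each iteration, `ctx.payload` would refer to the current array element rather than the whole payload.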
I started a branch at some point but never had the time to keep working on it and add proper tests.
SMTP CA configuration needs to be aligned with other TLS configuration
Configuring TLS for SMTP is different because it is using the javax mail classes. We should align this with the regular security configuration, based on settings. See #30090
Allow to read cluster state metadata in watch payload
A recently added feature in Elasticsearch allows storing user-defined cluster metadata, see #33325. It might be interesting to expose this in the watch payload.
Allow to read output of earlier executed actions
The list of actions is executed sequentially. Many customers have a use-case where they want to create a ticket in an external system as the first action and then send a link to that ticket in a second email action. However, the current actions run independently from each other, and one cannot access the data generated by any earlier action. We should look at whether we can allow this by having something like a `ctx.payload_actions` field.
One potential issue: the order of the actions is not guaranteed, as this is not a list but a key-value mapping, so the order depends on the JSON client inserting the watch.
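A hypothetical sketch, assuming a `ctx.payload_actions` map keyed by action id (neither the field nor the response shape shown here exists today; URLs and addresses are illustrative):

```json
{
  "actions": {
    "create_ticket": {
      "webhook": {
        "method": "POST",
        "url": "https://ticketing.example.org/api/tickets",
        "body": "{\"title\": \"{{ctx.payload.hits.total}} errors found\"}"
      }
    },
    "notify_with_link": {
      "email": {
        "to": "ops@example.org",
        "subject": "Ticket created",
        "body": "Ticket link: {{ctx.payload_actions.create_ticket.response.body.url}}"
      }
    }
  }
}
```

This would also require a defined execution order, which today is not guaranteed given the key-value mapping of actions.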
Have a story around watch deployments
This is also an ongoing discussion item for us. Right now we always assume that watches are just deployed into the cluster and that's it, but we should take a step back and think about how watches can be managed before they hit the cluster. Maybe integrating with Terraform is an interesting thing to do, maybe something else is out there that could help. Integrating with an SCM, plus templating, are also things that need to be kept in mind. Sometimes you only want to change a part of your query and that's it, similar to what we do with our monitoring watches, where we (very basically) just change the cluster uuid to query for when several clusters are being monitored.
Reduce heap usage of cron schedules
The data structure used to store the information about when a watch should be run is quite a memory hog due to using lots of `TreeSet<Integer>` structures. There are other cron data structures that use only bitsets and need roughly 3% of what the existing one requires. However those do not support all the features (like "run every second Friday of the month"), but the amount of memory saved might make us go with a hybrid approach here.
Track long running/stuck watches
This is lowest priority. Right now the user has no indication of long running watches, except by using the watcher stats API and checking the stack traces manually. If a watch exposed itself as a Task, then the Tasks API could be used to track long running watches, instead of writing our own code.
Parse only watch status in IndexingListener
This is lowest priority. Currently a watch is parsed twice when being stored: once on the coordinating node receiving the watch, and once in the `WatcherIndexingListener` in order to be scheduled. At that point we basically only need to access the watch status to check if it is active, and that's it. There is no need to parse the whole watch again and, for example, try to compile its scripts.
Prevent storing of unbounded watch records
See #35997
Refactor out duplicate classes with TextTemplates
We currently have a ton of classes that have a TextTemplate-backed version of all of a class's methods, and another version with the actual string/int accessors. Much of this can be cleaned up and deduplicated, so that text templates are only used at render time.