[ML] Add a response mechanism to ML controller command processing #62823

droberts195 · 2020-09-23T12:09:05Z

When the ML Java code needs to start one of the ML native processes (autodetect, normalize or data_frame_analyzer) it sends a command to the controller process telling it to spawn the required process. Currently the communications are one way only - the JVM sends a command to the controller and assumes it will be actioned immediately. There is no mechanism for the controller to respond when it has actioned the command. This seemed reasonable in the initial design because the controller is completely dedicated to starting and killing processes, and these were assumed to be very fast operations.

We have observed that when security software is running on a machine, spawning a new process can take a very long time: over 20 seconds has been observed between the command being received in the controller and the resulting posix_spawn call returning. This invalidates the assumption that commands issued to the controller by the JVM will be near instantaneous. It causes a problem because the timeout waiting for the named pipes to connect starts immediately after the command is issued, but the process may not actually start until considerably later.

Therefore, there is a need for the controller to be able to report back to the ES JVM when each command sent to it has been actioned. The ES JVM should then not try to connect the named pipes to a process until the controller has reported that it has actually spawned that process. This will mean that the configured timeout for connecting the named pipes is measured from a more appropriate point in time.
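To make the timing argument concrete, here is a minimal sketch comparing the two starting points for the pipe-connect timer. The 20 second spawn delay is the figure reported above; the 10 second connect timeout, the 1 second pipe setup time and the `PipeTimeoutTiming` class itself are assumptions made up for illustration, not values or code from Elasticsearch.

```java
import java.time.Duration;

/** Illustrates why the starting point of the pipe-connect timeout matters. */
final class PipeTimeoutTiming {

    /**
     * Returns true if the process has its pipes ready before the connect timeout expires.
     * All instants are expressed as durations since the JVM sent the spawn command.
     */
    static boolean connectsInTime(Duration timerStart, Duration connectTimeout,
                                  Duration spawnDelay, Duration pipeSetup) {
        Duration pipesReady = spawnDelay.plus(pipeSetup);
        Duration deadline = timerStart.plus(connectTimeout);
        return pipesReady.compareTo(deadline) <= 0;
    }

    public static void main(String[] args) {
        Duration connectTimeout = Duration.ofSeconds(10); // assumed value
        Duration spawnDelay = Duration.ofSeconds(20);     // delay reported in this issue
        Duration pipeSetup = Duration.ofSeconds(1);       // assumed value

        // Current behaviour: the timer starts as soon as the command is sent.
        System.out.println(connectsInTime(Duration.ZERO, connectTimeout, spawnDelay, pipeSetup));
        // Proposed behaviour: the timer starts once the controller confirms the spawn.
        System.out.println(connectsInTime(spawnDelay, connectTimeout, spawnDelay, pipeSetup));
    }
}
```

With these assumed numbers the first call reports a missed deadline (pipes ready at 21s, deadline at 10s) and the second reports success (deadline at 30s), even though the process behaved identically in both cases.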

elasticmachine · 2020-09-23T12:09:06Z

Pinging @elastic/ml-core (:ml)

droberts195 · 2020-09-29T14:31:07Z

Probably the biggest problem with making this change is coordination of changes between C++ and Java without breaking every ML test that uses native processes.

There are basically two ways it could be done:

1. Muting tests
2. A sequence of PRs that starts by adding leniency

The first option would look like this:

1. Prepare the full C++ and Java changes on branches, testing locally on a machine where both are built together
2. Mute every single ML test that uses native processes
3. Merge the C++ PR and wait for a snapshot build
4. Merge the Java PR
5. Unmute all the ML tests that use native processes

The second option would look like this:

1. Change C++ to optionally take a command ID - if first token is numeric it's the command ID; if not present that's OK
2. Change Java so that timeout on connecting results pipe isn't fatal initially if there is a command pipe (hack to be removed later)
3. Change Java to try to connect a results pipe to the controller, and if it works listen for responses to commands
4. Change Java to send command IDs with each command
5. Change C++ to make the command ID compulsory
6. Change C++ to connect a results pipe in controller and reply to commands when they are completed
7. Change Java to wait for responses before moving on from requesting a process be started to the next stages during actions that require a process be started
8. Change Java to require that controller successfully connects a results pipe
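The leniency in the first step of the second option (an optional leading command ID) could look roughly like the sketch below. It is written in Java only to keep the examples in one language; in the plan above this parsing would live on the C++ side. The tab separator, the `start` verb, the nine-digit limit and the `LenientCommandParser` class are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical parser that accepts commands with or without a leading numeric ID. */
final class LenientCommandParser {

    /** A parsed command: an optional ID plus the remaining tokens (verb and arguments). */
    static final class ParsedCommand {
        final OptionalInt id;
        final List<String> tokens;

        ParsedCommand(OptionalInt id, List<String> tokens) {
            this.id = id;
            this.tokens = tokens;
        }
    }

    static ParsedCommand parse(String line) {
        // Assumption: tokens are tab-separated.
        List<String> tokens = Arrays.asList(line.split("\t"));
        String first = tokens.get(0);
        // Up to nine digits always fits in an int, so parseInt cannot overflow here.
        if (first.matches("\\d{1,9}")) {
            return new ParsedCommand(OptionalInt.of(Integer.parseInt(first)),
                    tokens.subList(1, tokens.size()));
        }
        // No numeric first token: treat it as an old-style command without an ID.
        return new ParsedCommand(OptionalInt.empty(), tokens);
    }

    public static void main(String[] args) {
        ParsedCommand withId = parse("7\tstart\t./autodetect");
        ParsedCommand withoutId = parse("start\t./autodetect");
        System.out.println(withId.id + " " + withId.tokens);
        System.out.println(withoutId.id + " " + withoutId.tokens);
    }
}
```

One caveat with this kind of leniency is that a command whose verb happened to be all digits would be misread as an ID, which is presumably one reason the plan makes the ID compulsory in a later step.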

Given the complexity of the second option I am favouring the first. The risk is that while the ML tests are muted somebody else breaks something. This risk could be mitigated by merging the changes over a weekend. The PRs would need to be approved in advance, then the merge steps could be done with very little time spent during the weekend itself, and if everything went to plan the tests would be unmuted by Monday morning.

This change makes the controller process respond to each command it receives with a document indicating whether that command was successfully executed or not. This response will be used by the Java side of the connection to determine when it is appropriate to move on to the next phase of the action that the controller command was part of. For example, when starting a process and connecting named pipes to it, it is best that the named pipe connections are not attempted until the process is confirmed to be started. Relates elastic/elasticsearch#62823

This change makes threads that send a command to the ML controller process wait for it to respond to the command. Previously such threads would block until the command was sent, but not until it was actioned. This was on the assumption that the sort of commands being sent would be actioned almost instantaneously, but that assumption has been shown to be false when anti-malware software is running. Relates elastic/ml-cpp#1520 Fixes elastic#62823
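As a rough illustration of how a sending thread might wait for the controller's response, here is a minimal sketch that correlates responses to commands by ID. It is not the code from either PR: the `ControllerResponseTracker` class, the `Response` fields (`id`, `success`, `reason`) and the method names are all hypothetical, since the descriptions above only say that the response indicates whether the command succeeded.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical tracker that lets a sending thread block until its command is actioned. */
final class ControllerResponseTracker {

    /** Assumed response shape; the real document format is not specified here. */
    static final class Response {
        final int id;
        final boolean success;
        final String reason;

        Response(int id, boolean success, String reason) {
            this.id = id;
            this.success = success;
            this.reason = reason;
        }
    }

    private final AtomicInteger nextId = new AtomicInteger(1);
    private final Map<Integer, CompletableFuture<Response>> pending = new ConcurrentHashMap<>();

    /** Called by the sending thread before it writes the command; returns the ID to send. */
    int register() {
        int id = nextId.getAndIncrement();
        pending.put(id, new CompletableFuture<>());
        return id;
    }

    /** Called by whichever thread reads responses from the controller. */
    void onResponse(Response response) {
        CompletableFuture<Response> future = pending.get(response.id);
        if (future != null) {
            future.complete(response);
        }
        // Responses for unknown or already timed-out commands are ignored.
    }

    /** Blocks until the command with this ID has been actioned or the timeout expires. */
    Response await(int id, long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        CompletableFuture<Response> future = pending.get(id);
        if (future == null) {
            throw new IllegalArgumentException("unknown command id " + id);
        }
        try {
            return future.get(timeout, unit);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pending.remove(id);
        }
    }
}
```

A caller would `register()` an ID, write the command with that ID, then call `await(...)` and only start connecting named pipes if the returned response reports success. Registering before sending avoids a race where the response arrives before anyone is waiting for it.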

droberts195 added the :ml Machine learning label Sep 23, 2020

This was referenced Sep 23, 2020

[ML] Processes that fail to connect to the JVM within a reasonable time should exit elastic/ml-cpp#1504

Closed

[ML] Consider spawning processes from dedicated threads in controller elastic/ml-cpp#1503

Closed

droberts195 self-assigned this Sep 24, 2020

This was referenced Oct 1, 2020

[ML] Make controller send responses for each command received elastic/ml-cpp#1520

Merged

[CI] Failure in yaml=reference/ml/anomaly-detection/apis/close-job/line_102 #48941

Closed

droberts195 mentioned this issue Oct 12, 2020

[ML] Wait for controller to respond to commands #63542

Merged

droberts195 closed this as completed in #63542 Oct 17, 2020
