Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add a response mechanism to ML controller command processing #62823

Closed
droberts195 opened this issue Sep 23, 2020 · 2 comments · Fixed by #63542
Closed

[ML] Add a response mechanism to ML controller command processing #62823

droberts195 opened this issue Sep 23, 2020 · 2 comments · Fixed by #63542
Assignees
Labels
:ml Machine learning

Comments

@droberts195
Copy link
Contributor

When the ML Java code needs to start one of the ML native processes (autodetect, normalize or data_frame_analyzer) it sends a command to the controller process telling it to spawn the required process. Currently the communications are one way only - the JVM sends a command to the controller and assumes it will be actioned immediately. There is no mechanism for the controller to respond when it has actioned the command. This seemed reasonable in the initial design because the controller is completely dedicated to starting and killing processes, and these were assumed to be very fast operations.

We have observed that when security software is running on a machine spawning a new process can take a very long time - over 20 seconds has been observed between the command being received in the controller and the resulting posix_spawn call returning. This invalidates the assumption that commands issued to controller by the JVM will be near instantaneous. It causes a problem because the timeout waiting for the named pipes to connect starts immediately after the command is issued, but the process may not actually start until considerably later.

Therefore, there is a need for controller to be able to report back to the ES JVM when each command sent to it has been actioned. Then the ES JVM should not try to connect the named pipes to a process until the controller has reported that it has actually spawned that process. This will mean that the configured timeout for connecting the named pipes is measured from a more appropriate point in time.

@droberts195 droberts195 added the :ml Machine learning label Sep 23, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Contributor Author

Probably the biggest problem with making this change is coordination of changes between C++ and Java without breaking every ML test that uses native processes.

There are basically two ways it could be done:

  1. Muting tests
  2. Sequence of PRs that start by adding leniency

The first option would look like this:

  1. Prepare the full C++ and Java changes on branches, testing locally on a machine where both are built together
  2. Mute every single ML test that uses native processes
  3. Merge the C++ PR and wait for a snapshot build
  4. Merge the Java PR
  5. Unmute all the ML tests that use native processes

The second option would look like this:

  1. Change C++ to optionally take a command ID - if first token is numeric it's the command ID; if not present that's OK
  2. Change Java so that timeout on connecting results pipe isn't fatal initially if there is a command pipe (hack to be removed later)
  3. Change Java to try to connect a results pipe to the controller, and if it works listen for responses to commands
  4. Change Java to send command IDs with each command
  5. Change C++ to make the command ID compulsory
  6. Change C++ to connect a results pipe in controller and reply to commands when they are completed
  7. Change Java to wait for responses before moving on from requesting a process be started to the next stages during actions that require a process be started
  8. Change Java to require that controller successfully connects a results pipe

Given the complexity of the second option I am favouring the first. The risk is that while the ML tests are muted somebody else breaks something. This risk could be mitigated by merging the changes over a weekend. The PRs would need to be approved in advance, then the merge steps could be done with very little time spent during the weekend itself, and if everything went to plan the tests would be unmuted by Monday morning.

droberts195 added a commit to droberts195/ml-cpp that referenced this issue Oct 1, 2020
This change makes the controller process respond to each command
it receives with a document indicating whether that command was
successfully executed or not.

This response will be used by the Java side of the connection to
determine when it is appropriate to move on to the next phase of
the action that the controller command was part of.  For example,
when starting a process and connecting named pipes to it it is
best that the named pipe connections are not attempted until the
process is confirmed to be started.

Relates elastic/elasticsearch#62823
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Oct 12, 2020
This change makes threads that send a command to the ML
controller process wait for it to respond to the command.
Previously such threads would block until the command was
sent, but not until it was actioned.  This was on the
assumption that the sort of commands being sent would be
actioned almost instantaneously, but that assumption has
been shown to be false when anti-malware software is
running.

Relates elastic/ml-cpp#1520
Fixes elastic#62823
droberts195 added a commit to elastic/ml-cpp that referenced this issue Oct 17, 2020
This change makes the controller process respond to each command
it receives with a document indicating whether that command was
successfully executed or not.

This response will be used by the Java side of the connection to
determine when it is appropriate to move on to the next phase of
the action that the controller command was part of.  For example,
when starting a process and connecting named pipes to it it is
best that the named pipe connections are not attempted until the
process is confirmed to be started.

Relates elastic/elasticsearch#62823
droberts195 added a commit that referenced this issue Oct 17, 2020
This change makes threads that send a command to the ML
controller process wait for it to respond to the command.
Previously such threads would block until the command was
sent, but not until it was actioned.  This was on the
assumption that the sort of commands being sent would be
actioned almost instantaneously, but that assumption has
been shown to be false when anti-malware software is
running.

Relates elastic/ml-cpp#1520
Fixes #62823
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants