Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Change way the job config is passed to autodetect #1253

Closed
droberts195 opened this issue May 15, 2020 · 3 comments
Closed

[ML] Change way the job config is passed to autodetect #1253

droberts195 opened this issue May 15, 2020 · 3 comments

Comments

@droberts195
Copy link
Contributor

For historical reasons the job config that autodetect uses is passed to it in a mixture of 2 Windows .ini file style config files and a few command line options.

It would make life much easier and would make the code clearer if it was passed as a single config file in JSON format containing the exact job config document from the .ml-config index.

Benefits would be:

  1. Much less code on the Java side - currently it has to translate the job config into a completely different format to pass to autodetect
  2. Less of a learning curve for new joiners - when looking at the config handling code in autodetect they will be able to look at job config examples from our public documentation
  3. Easier reproduction of problems directly using the autodetect process from the command line - currently when we get a job config from a support diag we have to manually translate it to the .ini formats and command line arguments in order to try to reproduce problems
@droberts195
Copy link
Contributor Author

One tricky thing about this change is how to keep the way Java passes the inputs to C++ working while the changes are made.

I think the way to do it is as follows:

  1. Add an option to the C++ code to accept a job config in JSON format. This could be a --config=/path/to/config.json argument. Initially nothing needs to be done with the supplied config, not even confirming that the file contains JSON.
  2. Change the Java code to pass --config=/path/to/config.json as well as all the arguments it currently passes. The JSON file it should save to disk (in the same temporary directory where it currently writes the limits, model plot and field config files) should contain the job config JSON in UTF-8 format (as created by Job.toXContent()) with no adjustments. No adjustments is critical to making debugging easier in the future.
  3. Change the C++ code to get its config information from the JSON config specified to --config instead of the mishmash of options it gets them from at the moment. This is obviously the main part of the work. Additionally the options that are no longer needed should be made optional (if they aren't already) but not completely removed.
  4. Change the Java code not to pass the redundant options: --limitconfig, --fieldconfig, --bucketspan, --jobid, --modelplotconfig, --latency, --summarycountfield, --delimiter, --multivariateByFields, --stopCategorizationOnWarnStatus, --maxQuantileInterval, --persistInterval. This should allow quite a lot of Java code to be deleted.
  5. Change the C++ code not to accept the redundant options: --limitconfig, --fieldconfig, --bucketspan, --jobid, --modelplotconfig, --latency, --summarycountfield, --delimiter, --multivariateByFields, --stopCategorizationOnWarnStatus, --maxQuantileInterval, --persistInterval.

Step 3 is by far the biggest of these steps. But the beauty of passing the config in both formats during the transition is that step 3 can be broken down into smaller steps, and these can even span multiple releases. (There is a slight cost if they span multiple releases in that anyone doing low level debugging of problems with that release will have to create both config formats. However, since the new config option takes the unmodified job config in JSON format the extra format is not hard to obtain.). We are only completely committed to this project at step 4.

One subtlety with the redundant options is that --timeformat option is still required. This is because the job config contains the time field format in Java notation, whereas the C++ processes require it in C++ format. So the C++ processes will have to ignore the time format in the JSON job config and use the override from the command in the unlikely event it's provided - it's not used in production, where Java passes the timestamps in the C++'s default format. So --timeformat is only for command line debugging using human readable CSV files.

@droberts195
Copy link
Contributor Author

Once all this work is done the command line arguments of autodetect will be roughly the same as those of data_frame_analyzer - compare and contrast https://github.com/elastic/ml-cpp/blob/499338d5fcc16f5a5548e07311973e5381071bc4/bin/autodetect/CCmdLineParser.cc and https://github.com/elastic/ml-cpp/blob/499338d5fcc16f5a5548e07311973e5381071bc4/bin/data_frame_analyzer/CCmdLineParser.cc to see how much cleaner it will be.

edsavage added a commit that referenced this issue Oct 16, 2020
Add an argument to autodetect to enable specifying a config file.
Eventually this will be used to read an autodetect job config in JSON
format but in this initial step it is unused.

Relates to #1253
edsavage added a commit that referenced this issue Oct 16, 2020
Add an argument to autodetect to enable specifying a config file.
Eventually this will be used to read an autodetect job config in JSON
format but in this initial step it is unused.

Relates to #1253
edsavage added a commit to edsavage/ml-cpp that referenced this issue Oct 16, 2020
Add an argument to autodetect to enable specifying a config file.
Eventually this will be used to read an autodetect job config in JSON
format but in this initial step it is unused.

Relates to elastic#1253
edsavage added a commit that referenced this issue Oct 16, 2020
Add an argument to autodetect to enable specifying a config file.
Eventually this will be used to read an autodetect job config in JSON
format but in this initial step it is unused.

Relates to #1253
Backports #1540
edsavage added a commit to elastic/elasticsearch that referenced this issue Oct 19, 2020
Write job config in the form of a JSON formatted file and pass its
location to autodetect as an additional command line option. This will
ultimately allow all relevant job config to be read from the JSON config
file and hence reduce the number of command line arguments for
autodetect.

Relates to elastic/ml-cpp#1253
edsavage added a commit to elastic/elasticsearch that referenced this issue Oct 20, 2020
Write job config in the form of a JSON formatted file and pass its
location to autodetect as an additional command line option. This will
ultimately allow all relevant job config to be read from the JSON config
file and hence reduce the number of command line arguments for
autodetect.

Relates to elastic/ml-cpp#1253
Backport of #63865
@edsavage edsavage added the Meta label Oct 30, 2020
@edsavage
Copy link
Contributor

edsavage commented Oct 30, 2020

edsavage added a commit to edsavage/ml-cpp that referenced this issue Nov 3, 2020
Add a parser for the recently added anomaly job configuration file, which
is in JSON format.

As an initial step to the configuration file replacing a number of
command line arguments, modify autodetect to use the jobId extracted by
the parser.

Relates to elastic#1253
edsavage added a commit that referenced this issue Nov 6, 2020
Add a parser for the recently added anomaly job configuration file, which
is in JSON format.

As an initial step to the configuration file replacing a number of
command line arguments, modify autodetect to use the jobId extracted by
the parser.

Relates to #1253
edsavage added a commit to edsavage/ml-cpp that referenced this issue Nov 6, 2020
Add a parser for the recently added anomaly job configuration file, which
is in JSON format.

As an initial step to the configuration file replacing a number of
command line arguments, modify autodetect to use the jobId extracted by
the parser.

Relates to elastic#1253
edsavage added a commit that referenced this issue Nov 9, 2020
Add a parser for the recently added anomaly job configuration file, which
is in JSON format.

As an initial step to the configuration file replacing a number of
command line arguments, modify autodetect to use the jobId extracted by
the parser.

Relates to #1253
Backports #1552
edsavage added a commit to edsavage/ml-cpp that referenced this issue Nov 9, 2020
Detector rules may possibly refer to rule filters that are defined
separately, outside of the anomaly job configuration. As an interim
solution pass the rule filters parsed from the old-style field config to
the new-style JSON parser.

Relates to elastic#1253
edsavage added a commit that referenced this issue Nov 9, 2020
Detector rules may possibly refer to rule filters that are defined
separately, outside of the anomaly job configuration. As an interim
solution pass the rule filters parsed from the old-style field config to
the new-style JSON parser.

Relates to #1253
edsavage added a commit to edsavage/ml-cpp that referenced this issue Nov 9, 2020
Detector rules may possibly refer to rule filters that are defined
separately, outside of the anomaly job configuration. As an interim
solution pass the rule filters parsed from the old-style field config to
the new-style JSON parser.

Relates to elastic#1253
edsavage added a commit that referenced this issue Nov 9, 2020
Detector rules may possibly refer to rule filters that are defined
separately, outside of the anomaly job configuration. As an interim
solution pass the rule filters parsed from the old-style field config to
the new-style JSON parser.

Relates to #1253
Backports #1562
edsavage added a commit that referenced this issue Jan 6, 2021
* Remove trace of the CFieldConfig class and any related files (e.g. test
configuration files).

* Enable the code to use JSON formatted files to obtain configuration
settings.

* Use the model plot configuration settings present in the anomaly job
configuration to initialise model plot properties.

Relates #1253
Backports #1644
edsavage added a commit that referenced this issue Jan 12, 2021
Prepare the ground for the Java process to stop sending soon to be
redundant command line arguments. In particular:

* Ensure that the persist interval, time field and categorisation stop
on warn properties are obtained from job config
* Perform the calculation of the random 'staggering interval' in C++ code (to be removed from Java in due course)

Backports #1648
Relates #1253
edsavage added a commit to elastic/elasticsearch that referenced this issue Jan 12, 2021
Do not pass the now unneeded command line arguments to autodetect and
remove any supporting code.

Remove staggering interval - now in C++

Relates elastic/ml-cpp#1253
edsavage added a commit to elastic/elasticsearch that referenced this issue Jan 13, 2021
Do not pass the now unneeded command line arguments to autodetect and
remove any supporting code.

Remove staggering interval - now in C++

Relates elastic/ml-cpp#1253
Backports #67344
edsavage added a commit that referenced this issue Jan 13, 2021
Be more lenient when parsing the background_persist_interval field as it
is a) optional and b) not present in the job config in the majority of
cases

relates #1253, #1648
edsavage added a commit to edsavage/ml-cpp that referenced this issue Jan 13, 2021
Be more lenient when parsing the background_persist_interval field as it
is a) optional and b) not present in the job config in the majority of
cases

relates elastic#1253, elastic#1648
edsavage added a commit that referenced this issue Jan 13, 2021
Remove the now redundant autodetect command line options and
exit with error message on seeing unknown command line arguments

Relates #1253
edsavage added a commit to edsavage/ml-cpp that referenced this issue Jan 13, 2021
Remove the now redundant autodetect command line options and
exit with error message on seeing unknown command line arguments

Relates elastic#1253
edsavage added a commit that referenced this issue Jan 13, 2021
Be more lenient when parsing the background_persist_interval field as it
is a) optional and b) not present in the job config in the majority of
cases

Relates #1253, #1648
Backports #1662
edsavage added a commit that referenced this issue Jan 13, 2021
Remove the now redundant autodetect command line options and
exit with error message on seeing unknown command line arguments

Relates #1253
Backports #1663
edsavage added a commit that referenced this issue Jan 15, 2021
Fix assertion failures shown up in debug builds of test cases.

Relates #1253 
Backports #1670
edsavage added a commit that referenced this issue Jan 21, 2021
* Expect configuration update requests to be received in JSON formatted
strings. 
* Remove support for old, ini file parsing.

Relates elastic/elasticsearch#67721, #1253
edsavage added a commit to edsavage/ml-cpp that referenced this issue Jan 21, 2021
* Expect configuration update requests to be received in JSON formatted
strings. 
* Remove support for old, ini file parsing.

Relates elastic/elasticsearch#67721, elastic#1253
edsavage added a commit that referenced this issue Jan 21, 2021
* Expect configuration update requests to be received in JSON formatted
strings. 
* Remove support for old, ini file parsing.

Relates elastic/elasticsearch#67721, #1253
Backports #1682
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this issue May 2, 2024
Write job config in the form of a JSON formatted file and pass its
location to autodetect as an additional command line option. This will
ultimately allow all relevant job config to be read from the JSON config
file and hence reduce the number of command line arguments for
autodetect.

Relates to elastic/ml-cpp#1253
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this issue May 3, 2024
Pass JSON files containing rule filters and scheduled
events configuration to autodetect.

This change must not be merged until after elastic/ml-cpp#1640 as the
C++ backend code must be able to safely consume the new command line
options.

Relates elastic/ml-cpp#1253
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this issue May 3, 2024
Do not pass the now unneeded command line arguments to autodetect and
remove any supporting code.

Remove staggering interval - now in C++

Relates elastic/ml-cpp#1253
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants