-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Add a file structure determination endpoint #33471
[ML] Add a file structure determination endpoint #33471
Conversation
This endpoint accepts an arbitrary file in the request body and attempts to determine the structure. If successful it also proposes mappings that could be used when indexing the file's contents, and calculates simple statistics for each of the fields that are useful in the data preparation step prior to configuring machine learning jobs.
Pinging @elastic/ml-core |
I'll add the docs and HLRC implementation in follow-up PRs. Next week I'll add the functionality to force some of the decisions the structure finder makes (to support interaction in the data prep UI), and as this will add more parameters it makes sense to defer the docs and HLRC client until this the full set of parameters is available. But I want to get the basic endpoint in soon so that @jgowdyelastic can develop against it. |
if (linesToSample != null && linesToSample <= 0) { | ||
validationException = addValidationError("lines_to_sample must be positive if specified", validationException); | ||
} | ||
if (sample == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may want to verify that sample.length() > 0
as well.
|
||
@Override | ||
protected Set<String> responseParams() { | ||
return Collections.singleton(FileStructure.EXPLAIN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for my own edification.
Does this "magically" take the query parameter the user provides and the passes it to the xcontent builder? I don't see that parameter directly referenced from the incoming request and it has be curious :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the RestRequest
parameters are automatically passed to the toXContent()
method - see
elasticsearch/server/src/main/java/org/elasticsearch/rest/action/RestToXContentListener.java
Line 47 in 6a3adbd
response.toXContent(builder, channel.request()); |
This method just stops the parameter that is only used for X-content formatting and not added to the action request from triggering the validation check for unused parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL
Thanks for the run down!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! A couple minor comments.
|
||
import static org.elasticsearch.action.ValidateActions.addValidationError; | ||
|
||
public class FileStructureAction extends Action<FileStructureAction.Response> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that this is different in 6.x
so make sure to fix compilation problems before backporting.
this.fileStructure = fileStructure; | ||
} | ||
|
||
public Response() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need this at all, it should not be public
@Override | ||
public ActionRequestValidationException validate() { | ||
ActionRequestValidationException validationException = null; | ||
if (linesToSample != null && linesToSample <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check against an upper bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the uploaded data contains fewer lines then we just sample all of what's been uploaded.
Therefore the number of lines that can be sampled is limited by the size of the ES HTTP receive buffer.
One problem with sampling many lines is the time taken to do the analysis. At some point I should probably add the option to limit the maximum time allowed. Some other functionality has this facility, like the Grok ingest processor. But I'd rather do this in a followup PR.
@dimitris-athanasiou I made the constructor package private rather than public as you suggested. Also, as we discussed verbally, I added "find" to the beginning of the action name as all REST actions should begin with verbs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* master: (30 commits) Include fallback settings when checking dependencies (elastic#33522) [DOCS] Fixing formatting issues in breaking changes CRUD: Disable wait for refresh tests with delete Test: Fix test name (elastic#33510) HLRC: split ingest request converters (elastic#33435) Logging: Configure the node name when we have it (elastic#32983) HLRC: split xpack request converters (elastic#33444) HLRC: split watcher request converters (elastic#33442) HLRC: add enable and disable user API support (elastic#33481) [DOCS] Fixes formatting error TEST: Ensure merge triggered in _source retention test (elastic#33487) [ML] Add a file structure determination endpoint (elastic#33471) HLRC: ML Forecast Job (elastic#33506) HLRC: split migration request converters (elastic#33436) HLRC: split snapshot request converters (elastic#33439) Make Watcher validation message copy/pasteable Removes redundant test method in SQL tests (elastic#33498) HLRC: ML Post Data (elastic#33443) Pass Directory instead of DirectoryService to Store (elastic#33466) Collapse package structure for metrics aggs (elastic#33463) ...
This endpoint accepts an arbitrary file in the request body and attempts to determine the structure. If successful it also proposes mappings that could be used when indexing the file's contents, and calculates simple statistics for each of the fields that are useful in the data preparation step prior to configuring machine learning jobs.
This endpoint accepts an arbitrary file in the request body and
attempts to determine the structure. If successful it also
proposes mappings that could be used when indexing the file's
contents, and calculates simple statistics for each of the fields
that are useful in the data preparation step prior to configuring
machine learning jobs.