Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add a file structure determination endpoint #33471

Merged
merged 6 commits into from
Sep 7, 2018

Conversation

droberts195
Copy link
Contributor

This endpoint accepts an arbitrary file in the request body and
attempts to determine the structure. If successful it also
proposes mappings that could be used when indexing the file's
contents, and calculates simple statistics for each of the fields
that are useful in the data preparation step prior to configuring
machine learning jobs.

This endpoint accepts an arbitrary file in the request body and
attempts to determine the structure.  If successful it also
proposes mappings that could be used when indexing the file's
contents, and calculates simple statistics for each of the fields
that are useful in the data preparation step prior to configuring
machine learning jobs.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@droberts195
Copy link
Contributor Author

droberts195 commented Sep 6, 2018

I'll add the docs and HLRC implementation in follow-up PRs.

Next week I'll add the functionality to force some of the decisions the structure finder makes (to support interaction in the data prep UI), and as this will add more parameters it makes sense to defer the docs and HLRC client until this the full set of parameters is available. But I want to get the basic endpoint in soon so that @jgowdyelastic can develop against it.

if (linesToSample != null && linesToSample <= 0) {
validationException = addValidationError("lines_to_sample must be positive if specified", validationException);
}
if (sample == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may want to verify that sample.length() > 0 as well.


@Override
protected Set<String> responseParams() {
return Collections.singleton(FileStructure.EXPLAIN);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for my own edification.

Does this "magically" take the query parameter the user provides and the passes it to the xcontent builder? I don't see that parameter directly referenced from the incoming request and it has be curious :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the RestRequest parameters are automatically passed to the toXContent() method - see

This method just stops the parameter that is only used for X-content formatting and not added to the action request from triggering the validation check for unused parameters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL
Thanks for the run down!

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple minor comments.


import static org.elasticsearch.action.ValidateActions.addValidationError;

public class FileStructureAction extends Action<FileStructureAction.Response> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that this is different in 6.x so make sure to fix compilation problems before backporting.

this.fileStructure = fileStructure;
}

public Response() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need this at all, it should not be public

@Override
public ActionRequestValidationException validate() {
ActionRequestValidationException validationException = null;
if (linesToSample != null && linesToSample <= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check against an upper bound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the uploaded data contains fewer lines then we just sample all of what's been uploaded.

Therefore the number of lines that can be sampled is limited by the size of the ES HTTP receive buffer.

One problem with sampling many lines is the time taken to do the analysis. At some point I should probably add the option to limit the maximum time allowed. Some other functionality has this facility, like the Grok ingest processor. But I'd rather do this in a followup PR.

@droberts195
Copy link
Contributor Author

@dimitris-athanasiou I made the constructor package private rather than public as you suggested.

Also, as we discussed verbally, I added "find" to the beginning of the action name as all REST actions should begin with verbs.

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@droberts195 droberts195 merged commit e42cc5c into elastic:master Sep 7, 2018
@droberts195 droberts195 deleted the structure_finder_endpoint branch September 7, 2018 16:42
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 7, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 8, 2018
* master: (30 commits)
  Include fallback settings when checking dependencies (elastic#33522)
  [DOCS] Fixing formatting issues in breaking changes
  CRUD: Disable wait for refresh tests with delete
  Test: Fix test name (elastic#33510)
  HLRC: split ingest request converters (elastic#33435)
  Logging: Configure the node name when we have it (elastic#32983)
  HLRC: split xpack request converters (elastic#33444)
  HLRC: split watcher request converters (elastic#33442)
  HLRC: add enable and disable user API support (elastic#33481)
  [DOCS] Fixes formatting error
  TEST: Ensure merge triggered in _source retention test (elastic#33487)
  [ML] Add a file structure determination endpoint (elastic#33471)
  HLRC: ML Forecast Job (elastic#33506)
  HLRC: split migration request converters (elastic#33436)
  HLRC: split snapshot request converters (elastic#33439)
  Make Watcher validation message copy/pasteable
  Removes redundant test method in SQL tests (elastic#33498)
  HLRC: ML Post Data (elastic#33443)
  Pass Directory instead of DirectoryService to Store (elastic#33466)
  Collapse package structure for metrics aggs (elastic#33463)
  ...
droberts195 added a commit that referenced this pull request Sep 10, 2018
This endpoint accepts an arbitrary file in the request body and
attempts to determine the structure.  If successful it also
proposes mappings that could be used when indexing the file's
contents, and calculates simple statistics for each of the fields
that are useful in the data preparation step prior to configuring
machine learning jobs.
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 14, 2018
droberts195 added a commit that referenced this pull request Sep 20, 2018
droberts195 added a commit that referenced this pull request Sep 20, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
lcawl added a commit to lcawl/elasticsearch that referenced this pull request Nov 12, 2018
lcawl added a commit that referenced this pull request Nov 12, 2018
lcawl added a commit that referenced this pull request Nov 12, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants