[ML][Data Frame] adds new pipeline field to dest config #43124
Conversation
Pinging @elastic/ml-core
This change may require adding BWC tests for data frames. If desired, I will address that in a separate PR.
```java
public DestConfig(final StreamInput in) throws IOException {
    index = in.readString();
    if (in.getVersion().onOrAfter(Version.CURRENT)) {
        // ...
```
This will be adjusted after backport
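To illustrate what the version gate is doing, here is a minimal sketch of version-gated, optional-field wire serialization outside the Elasticsearch codebase, using plain `java.io` and an explicit presence flag in place of Elasticsearch's `StreamInput`/`StreamOutput` helpers. All names and the version constant are illustrative, not the PR's actual code:

```java
import java.io.*;

// Sketch: old nodes neither write nor expect the new "pipeline" field,
// so both sides gate the optional field on the remote node's version.
final class DestConfigSketch {
    static final int PIPELINE_INTRODUCED_IN = 7_03_00; // stand-in for a release version constant

    final String index;
    final String pipeline; // may be null

    DestConfigSketch(String index, String pipeline) {
        this.index = index;
        this.pipeline = pipeline;
    }

    void writeTo(DataOutputStream out, int remoteVersion) throws IOException {
        out.writeUTF(index);
        if (remoteVersion >= PIPELINE_INTRODUCED_IN) {
            // optional value: presence flag first, then the payload
            out.writeBoolean(pipeline != null);
            if (pipeline != null) {
                out.writeUTF(pipeline);
            }
        }
    }

    static DestConfigSketch readFrom(DataInputStream in, int remoteVersion) throws IOException {
        String index = in.readUTF();
        String pipeline = null;
        if (remoteVersion >= PIPELINE_INTRODUCED_IN && in.readBoolean()) {
            pipeline = in.readUTF();
        }
        return new DestConfigSketch(index, pipeline);
    }

    // Serialize and immediately deserialize at a given wire version.
    static DestConfigSketch roundTrip(DestConfigSketch cfg, int version) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            cfg.writeTo(new DataOutputStream(bytes), version);
            return readFrom(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When the backport lands, the `Version.CURRENT` placeholder in the real code gets replaced by the first release that understands the field, which is what "adjusted after backport" refers to.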
```java
public boolean isValid() {
    return index.isEmpty() == false;
}

@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeString(index);
    if (out.getVersion().onOrAfter(Version.CURRENT)) {
        // ...
```
This will be adjusted after backport
I have been thinking more about how to handle pipelines in …
What's there so far looks good, but, as you said in a comment, we need to think about how to include this in the preview too. Otherwise, the preview becomes actively misleading.
```java
 * @return The {@link Builder} with index set
 */
public Builder setIndex(String index) {
    this.index = index;
```
Maybe use `Objects.requireNonNull(index)` here, so that the error the constructor would otherwise throw is traced to its root cause.
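The suggestion would look roughly like this. This is a sketch outside the real `Builder` class, with illustrative names: failing in the setter means the stack trace points at the caller that passed `null`, rather than an NPE surfacing later from the `DestConfig` constructor far from the bad call site.

```java
import java.util.Objects;

// Sketch of a builder that rejects null early, so the failure
// is attributed to the offending setter call, not a later constructor.
final class DestConfigBuilderSketch {
    private String index;

    DestConfigBuilderSketch setIndex(String index) {
        this.index = Objects.requireNonNull(index, "index must not be null");
        return this;
    }

    String index() {
        return index;
    }
}
```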
@droberts195 thinking more about … Consequently, even without a pipeline, the preview results should probably look like …
The current simulation returns formatted docs like: …

It seems to me that we want the two responses to be as similar as possible. So, we could: 1️⃣ nest the normal preview doc objects into a … I am not sure which would be better. When it comes to consuming the API, either works.
Ultimately, I have opted to use option 3️⃣ mentioned above. It keeps the preview response consistent with what is currently done. Additionally, we don't want to "advertise" the …
Looks good! I agree that making the output of preview match the final document structure as closely as possible is the best option.
My main comment is about the authorization aspect of this.
```java
String id = (String) doc.get(DataFrameField.DOCUMENT_ID_FIELD);
doc.keySet().removeIf(k -> k.startsWith("_"));
src.put("_source", doc);
src.put("_id", id);
```
Maybe also pass in `_index` (obtained from `config.getDestination().getIndex()`), so that if the pipeline accesses `_index` the simulation works like the real thing would.
```java
ClientHelper.executeWithHeadersAsync(threadPool.getThreadContext().getHeaders(),
    ClientHelper.DATA_FRAME_ORIGIN,
    client,
    SimulatePipelineAction.INSTANCE,
```
The action name for this is `cluster:admin/ingest/pipeline/simulate`, so previewing a data frame transform with a pipeline now requires that privilege. We should document this, and also test how easy it is to understand the resulting error message when a user has all the other privileges required to use the data frame transform except this one.

It's not ideal that using the data frame transform for real requires no special privilege beyond being able to index data, while simulating does. The cluster privilege that includes `cluster:admin/ingest/pipeline/simulate` is `manage_ingest_pipelines`, which is the same one that lets you delete ingest pipelines! As a follow-up, maybe we should enquire about the possibility of adding a `simulate_ingest_pipelines` cluster privilege that only allows simulation, not CRUD.
We also need to pass this information on to the UI team when the ability to add a pipeline gets added to the UI, as the UI might want to proactively prevent people who cannot simulate a pipeline from creating a data frame transform that uses one.
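For reference, until such a narrower privilege exists, a user who should be able to preview a transform that uses a pipeline would need a role along these lines. This is a rough sketch against the 7.x security API: the role name, index names, and exact privilege set are hypothetical and depend on the stack version.

```
PUT _security/role/df_previewer
{
  "cluster": [ "manage_data_frame_transforms", "manage_ingest_pipelines" ],
  "indices": [
    {
      "names": [ "my-source-*" ],
      "privileges": [ "read", "view_index_metadata" ]
    }
  ]
}
```

The point of the discussion above is that `manage_ingest_pipelines` grants far more than simulation, which is why running the simulate call under an internal origin (below) is attractive.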
@droberts195 what do you think of not using the user headers for `_simulate`? I am not sure of the reason behind requiring the special permission when the user can already index documents referring to the pipeline itself.
Yes, that's a good idea. If we run this action as the internal `_xpack` user, then it bypasses the need for the end user to have pipeline admin privileges. We are still requiring that they have permission to preview the data frame transform. If we consider simulating the pipeline an internal implementation detail of previewing the data frame transform, then it's reasonable to run it as `_xpack`.
LGTM
* [ML][Data Frame] adds new pipeline field to dest config
* Adding pipeline support to _preview
* removing unused import
* moving towards extracting _source from pipeline simulation
* fixing permission requirement, adding _index entry to doc

[ML][Data Frame] adds new pipeline field to dest config (#43124) (#43388)

* [ML][Data Frame] adds new pipeline field to dest config
* Adding pipeline support to _preview
* removing unused import
* moving towards extracting _source from pipeline simulation
* fixing permission requirement, adding _index entry to doc
* adjusting for java 8 compatibility
* adjusting bwc serialization version to 7.3.0
This PR adds a new field, `pipeline`, that references an existing ingest pipeline via its id. If set, the `DestConfig#pipeline` field will be set on each index request, meaning all documents being indexed will go through the defined pipeline.

Closes #43061
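For illustration, creating a transform that routes its output through a pipeline would look something like the following. This is a sketch against the 7.x data frame API: the transform, index, and pipeline names are hypothetical, and the `pivot` body is elided.

```
PUT _data_frame/transforms/my-transform
{
  "source": { "index": "my-source" },
  "dest": {
    "index": "my-dest",
    "pipeline": "my-ingest-pipeline"
  },
  "pivot": { ... }
}
```

With `dest.pipeline` set, every document the transform indexes into `my-dest` first passes through `my-ingest-pipeline`, exactly as if the document had been indexed with a `?pipeline=` parameter.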