[ML] Jindex: Rolling upgrade tests #35700
Conversation
Pinging @elastic/ml-core
I pushed another commit fixing the problems. When reviewing, please pay close attention to that commit.
@@ -95,6 +96,7 @@ protected void doExecute(Task task, GetJobsStatsAction.Request request, ActionLi
        List<GetJobsStatsAction.Response.JobStats> stats = new ArrayList<>();
        for (QueryPage<GetJobsStatsAction.Response.JobStats> task : tasks) {
            stats.addAll(task.results());
            Collections.sort(stats, Comparator.comparing(GetJobsStatsAction.Response.JobStats::getJobId));
I would think that it is better to sort this once after all the tasks have been added, instead of inside the for (task : tasks) loop; see the sketch below.
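A minimal, self-contained sketch of the suggestion, assuming simplified stand-in types (the JobStats and StatsPage records below are illustrative, not the real GetJobsStatsAction.Response.JobStats and QueryPage classes from the diff):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortOnceExample {

    // Illustrative stand-in for GetJobsStatsAction.Response.JobStats.
    record JobStats(String jobId) {}

    // Illustrative stand-in for QueryPage<JobStats>.
    record StatsPage(List<JobStats> results) {}

    static List<JobStats> collectStats(List<StatsPage> pages) {
        List<JobStats> stats = new ArrayList<>();
        for (StatsPage page : pages) {
            stats.addAll(page.results());
        }
        // Sort once, after every page has contributed its results,
        // instead of re-sorting the growing list on each iteration.
        stats.sort(Comparator.comparing(JobStats::jobId));
        return stats;
    }
}
```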
retest this please
run gradle build tests
retest this please |
Use job update to set model snapshot and established memory in autodetect results processor. This simplifies the code.
… updates have completed before close
Accounts for jobs created by the other tests etc
Look for datafeed and job config in the cluster state if not in task parameters
LGTM
private void finalizeIndexJobs(Collection<String> jobIds, ActionListener<AcknowledgedResponse> listener) {
    String jobIdString = String.join(",", jobIds);
    logger.debug("finalizing jobs [{}]", jobIdString);
nit: isn't the collection stringified to "[id1, id2, ...]" by the logger anyway? Passing it directly would also mean less garbage when debug logging is disabled.
Fair point. I'll remedy this in a later commit, as this is merged now.
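For context, a hedged sketch of the two alternatives being discussed, using Log4j-style parameterized logging; the class and method names are illustrative, not the actual Elasticsearch code:

```java
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LoggingExample {

    private static final Logger logger = LogManager.getLogger(LoggingExample.class);

    static void logJobIds(List<String> jobIds) {
        // Current form: the joined string is built even when debug logging is disabled.
        String jobIdString = String.join(",", jobIds);
        logger.debug("finalizing jobs [{}]", jobIdString);

        // Suggested form: the collection is only rendered (as "[id1, id2, ...]")
        // when the debug level is actually enabled, so no extra garbage otherwise.
        logger.debug("finalizing jobs {}", jobIds);
    }
}
```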
@@ -158,6 +158,10 @@ static void validate(String jobId, Job job) {
                           int maxMachineMemoryPercent,
                           MlMemoryTracker memoryTracker,
                           Logger logger) {
        if (job == null) {
            logger.debug("[{}] select node job is null", jobId);
        }
maybe assert instead?
job == null is valid for the rolling upgrade, as that field isn't present in the persistent task parameters for jobs created before v6.6.0. I added this to help me understand what was happening during the rolling upgrade tests; it probably shouldn't be in the released code, however.
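A rough sketch of why a null check (rather than an assert) fits here. The types and the fallback method below are hypothetical; they only illustrate the pattern described in this PR of reading the config from the cluster state when task parameters written by a pre-6.6.0 node do not carry it:

```java
import org.apache.logging.log4j.Logger;

public class JobResolutionSketch {

    // Hypothetical stand-ins; the real classes live in the Elasticsearch ML plugin.
    interface Job {}
    interface ConfigLookup {
        Job findJob(String jobId);
    }

    static Job resolveJob(String jobId, Job jobFromTaskParams, ConfigLookup clusterStateConfigs, Logger logger) {
        if (jobFromTaskParams == null) {
            // Persistent task parameters written by nodes before v6.6.0 do not include
            // the job config, so null is an expected state during a rolling upgrade;
            // an assert here would trip on mixed-version clusters.
            logger.debug("[{}] job not present in task params, looking it up in the cluster state", jobId);
            return clusterStateConfigs.findJob(jobId);
        }
        return jobFromTaskParams;
    }
}
```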
        String reason = "Not opening job [" + jobId + "] on node [" + nodeNameOrId(node)
            + "] version [" + node.getVersion() + "], because this node does not support " +
            "jobs of version [" + job.getJobVersion() + "]";
        logger.trace(reason);
nit: debug seems more suited to me
Remember that on a 100 node cluster each allocation will generate 100 messages similar to this one, which would be significant log spam. They get concatenated into the overall reason, which is stored in the cluster state if the persistent task exists (and returned in the error message in the case of this being called prior to opening).
All the other possible reasons for ruling out a node in this method also currently log at the trace level. I think they should all log at the same level, otherwise someone reading the logs could get a misleading picture of what is happening.
I would leave this as trace to match the others.
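A condensed sketch of the pattern being described, with hypothetical surrounding names: each node that is ruled out contributes a reason, every reason is logged at the same (trace) level, and it is the combined reason, not the individual log lines, that ends up in the cluster state or the error message:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.logging.log4j.Logger;

public class NodeSelectionSketch {

    // Hypothetical stand-in for a discovery node.
    interface Node {
        String name();
        boolean supportsJobVersion(String jobVersion);
    }

    static String explainRuledOutNodes(String jobId, String jobVersion, List<Node> nodes, Logger logger) {
        List<String> reasons = new ArrayList<>();
        for (Node node : nodes) {
            if (node.supportsJobVersion(jobVersion) == false) {
                String reason = "Not opening job [" + jobId + "] on node [" + node.name()
                    + "], because this node does not support jobs of version [" + jobVersion + "]";
                // Trace, to match the other rule-out reasons and avoid one log line
                // per node at higher levels on large clusters.
                logger.trace(reason);
                reasons.add(reason);
            }
        }
        // The concatenated reason is what gets stored or returned, regardless of log level.
        return String.join(", ", reasons);
    }
}
```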
            jobUpdateSemaphore.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            LOGGER.info("[{}] Interrupted acquiring update established model memory semaphore", jobId);
Is level info correct here?
Hmm, I don't think it's an error-level message, as the interrupt is not necessarily an error; perhaps debug. ML code does not interrupt threads, and the only usage I can find in ES core outside of tests is in CancellableThreads.
I doubt we will ever see this message outside of testing. The test framework interrupts zombie threads after each test, so that will probably be the only time we see it.
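For reference, a minimal self-contained sketch of the acquire/restore-interrupt pattern the snippet above uses; the class and method names are illustrative, and the debug level reflects the suggestion in this thread rather than the merged code:

```java
import java.util.concurrent.Semaphore;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class SemaphoreInterruptSketch {

    private static final Logger LOGGER = LogManager.getLogger(SemaphoreInterruptSketch.class);
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    void updateEstablishedModelMemory(String jobId, Runnable update) {
        try {
            jobUpdateSemaphore.acquire();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers further up the stack can react;
            // outside of tests this is only expected from zombie-thread cleanup.
            Thread.currentThread().interrupt();
            LOGGER.debug("[{}] Interrupted acquiring update established model memory semaphore", jobId);
            return;
        }
        try {
            update.run();
        } finally {
            jobUpdateSemaphore.release();
        }
    }
}
```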
Adds rolling upgrade tests specific to the migration project and re-enables all the previously muted upgrade tests. A milestone.
A number of minor issues had to be fixed to enable the tests:
- The .ml-config template needs to be installed before running the tests (see the sketch after this list)
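As an illustration of that fix, a hedged sketch of waiting for an index template using the low-level REST client in a test. The RestClient, Request, Response and ResponseException calls are from the standard client API; the surrounding class, timeouts, and the exact check the PR uses are assumptions:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

public class WaitForTemplateSketch {

    // Polls until GET /_template/.ml-config returns 200, or the attempts run out.
    static void waitForMlConfigTemplate(RestClient client) throws Exception {
        for (int attempt = 0; attempt < 30; attempt++) {
            try {
                Response response = client.performRequest(new Request("GET", "/_template/.ml-config"));
                if (response.getStatusLine().getStatusCode() == 200) {
                    return;
                }
            } catch (ResponseException e) {
                // 404 until the master node has installed the template; keep polling.
            }
            Thread.sleep(1000);
        }
        throw new AssertionError("timed out waiting for the .ml-config template to be installed");
    }
}
```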