-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Test] Add full cluster restart test for Rollup #31533
Conversation
Pinging @elastic/es-search-aggs |
assertThat(createRollupJobResponse.get("acknowledged"), equalTo(Boolean.TRUE)); | ||
|
||
// start the rollup job | ||
final Request startRollupJobRequest = new Request("POST", "_xpack/rollup/job/rollup-job-test/_start"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, you don't have to start the job if you just want to make sure the task is running. We start the persistent task when the job is created in the CreateRollupJob API.
Although I wonder if there might be a potentially rare timing issue here? The CreateRollupJob API returns when the persistent task framework acknowledges that task was created. But I believe there might be a lag between that and when the allocated task is actually created on the target node?
So it might be possible for the StartJob API (or assertRollupJob()
if starting is skipped) to fail because the allocated task hasn't started yet? Maybe we need an assertBusy
checking for the job, like below?
I may be wrong about that though, you're much more familiar with how the persistent tasks work :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, you don't have to start the job if you just want to make sure the task is running. We start the persistent task when the job is created in the CreateRollupJob API.
Ok, I've been fooled by the doc.
Although I wonder if there might be a potentially rare timing issue here?
Good point. I'll take a look at this, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, the docs are a little confusing with regards to how it works internally. The persistent/allocated Task is always running, start and stop just toggles internal state. If the task doesn't exist, the job doesn't exist basically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The rollup job is started so that we can ensure that it is correctly restarted after upgrade: it means that the state was correctly persisted on the old cluster and correctly deserialized back on the upgraded cluster, which is what we want to test.
@@ -415,7 +415,11 @@ private QueryBuilder createBoundaryQuery(Map<String, Object> position) { | |||
DateHistoGroupConfig dateHisto = job.getConfig().getGroupConfig().getDateHisto(); | |||
String fieldName = dateHisto.getField(); | |||
String rollupFieldName = fieldName + "." + DateHistogramAggregationBuilder.NAME; | |||
long lowerBound = position != null ? (long) position.get(rollupFieldName) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oi, I would not have thought of this in a million years. Thanks for fixing :)
Left a question about timing in the test, but otherwise this LGTM! Thanks @tlrx :) |
Thanks @tlrx for picking this up. It's a very important safety net. @droberts195 can you maybe confirm similar tests exist for ML? |
@polyfractal I updated the code according to your comments. Can you have another look, please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM :)
This pull request adds a full cluster restart test for a Rollup job. The test creates and starts a Rollup job on the cluster and checks that the job already exists and is correctly started on the upgraded cluster. This test allows to test that the persistent task state is correctly parsed from the cluster state after the upgrade, as the status field has been renamed to state in #31031. The test undercovers a ClassCastException that can be thrown in the RollupIndexer when the timestamp as a very low value that fits into an integer. When it's the case, the value is parsed back as an Integer instead of Long object and (long) position.get(rollupFieldName) fails.
This pull request adds a full cluster restart test for a Rollup job. The test creates and starts a Rollup job on the cluster and checks that the job already exists and is correctly started on the upgraded cluster. This test allows to test that the persistent task state is correctly parsed from the cluster state after the upgrade, as the status field has been renamed to state in #31031. The test undercovers a ClassCastException that can be thrown in the RollupIndexer when the timestamp as a very low value that fits into an integer. When it's the case, the value is parsed back as an Integer instead of Long object and (long) position.get(rollupFieldName) fails.
Thanks @polyfractal ! I also pushed the test in the 6.3 branch (1adf4f1) as I found it useful for 6.3 too. |
Not at the moment - I'll add one. |
Thx @droberts195 ! |
* 6.x: Fix broken backport of #31578 by adjusting constructor (#31587) ingest: Add ignore_missing property to foreach filter (#22147) (#31578) Add package pre-install check for java binary (#31343) Docs: Clarify sensitive fields watcher encryption (#31551) Watcher: Remove never executed code (#31135) Improve test times for tests using `RandomObjects::addFields` (#31556) Revert "Remove RestGetAllAliasesAction (#31308)" REST high-level client: add simulate pipeline API (#31158) Get Mapping API to honour allow_no_indices and ignore_unavailable (#31507) Fix Mockito trying to mock IOException that isn't thrown by method (#31433) (#31527) [Test] Add full cluster restart test for Rollup (#31533) Enhance thread context uniqueness assertion fix writeIndex evaluation for aliases (#31562) Add x-opaque-id to search slow logs (#31539) Watcher: Fix put watch action (#31524) [DOCS] Significantly improve SQL docs turn GetFieldMappingsResponse to ToXContentObject (#31544) TEST: Unmute testHistoryUUIDIsGenerated Ingest Attachment: Upgrade Tika to 1.18 (#31252) TEST: Correct the assertion arguments order (#31540)
* master: ingest: Add ignore_missing property to foreach filter (#22147) (#31578) Fix a formatting issue in the docvalue_fields documentation. (#31563) reduce log level at gradle configuration time [TEST] Close additional clients created while running yaml tests (#31575) Docs: Clarify sensitive fields watcher encryption (#31551) Watcher: Remove never executed code (#31135) Add support for switching distribution for all integration tests (#30874) Improve robustness of geo shape parser for malformed shapes (#31449) QA: Create xpack yaml features (#31403) Improve test times for tests using `RandomObjects::addFields` (#31556) [Test] Add full cluster restart test for Rollup (#31533) Enhance thread context uniqueness assertion [DOCS] Fix heading format errors (#31483) fix writeIndex evaluation for aliases (#31562) Add x-opaque-id to search slow logs (#31539) Watcher: Fix put watch action (#31524) Add package pre-install check for java binary (#31343) Reduce number of raw types warnings (#31523) Migrate scripted metric aggregation scripts to ScriptContext design (#30111) turn GetFieldMappingsResponse to ToXContentObject (#31544) Close xcontent parsers (partial) (#31513) Ingest Attachment: Upgrade Tika to 1.18 (#31252) TEST: Correct the assertion arguments order (#31540)
This pull request adds a full cluster restart test for a Rollup job. The test creates and starts a Rollup job on the cluster and checks that the job already exists and is correctly started on the upgraded cluster.
This test allows to test that the persistent task state is correctly parsed from the cluster state after the upgrade, as the
status
field has been renamed tostate
in #31031.The test undercovers a
ClassCastException
that can be thrown in theRollupIndexer
when the timestamp as a very low value that fits into an integer. When it's the case, the value is parsed back as an Integer instead of Long object and(long) position.get(rollupFieldName)
fails.@bleskes This is the follow up test you asked for in #31031 (comment)