[Ingest Manager] Make setupIngestManager wait if setup is in progress #70008
Conversation
export async function setupIngestManager(
  soClient: SavedObjectsClientContract,
  callCluster: CallESAsCurrentUser
) {
  const [installedPackages, defaultOutput, config] = await Promise.all([
This is from moving the body inside a try/catch.
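For readers skimming the diff, a minimal sketch of what "moving the body inside a try/catch" can look like; this is illustrative only, not the PR's exact code. `runSetupSteps` is a hypothetical stand-in for the real setup body, and the two `type` aliases are local stand-ins for the Kibana contracts.

```ts
// Illustrative sketch only; names other than setupIngestManager are hypothetical.
type SavedObjectsClientContract = unknown; // stand-in for the real Kibana type
type CallESAsCurrentUser = (...args: unknown[]) => Promise<unknown>; // stand-in

export async function setupIngestManager(
  soClient: SavedObjectsClientContract,
  callCluster: CallESAsCurrentUser
) {
  try {
    // the body that previously ran at the top level of the function
    await runSetupSteps(soClient, callCluster);
  } catch (error) {
    // with the body inside try/catch, a failure can be observed here
    // (the PR uses this point to settle the shared "setup in progress" state)
    console.error('setupIngestManager failed', error);
    throw error;
  }
}

// hypothetical placeholder for the real setup steps (default packages, output, config)
async function runSetupSteps(
  soClient: SavedObjectsClientContract,
  callCluster: CallESAsCurrentUser
): Promise<void> {
  void soClient;
  void callCluster;
}
```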
);
if (!packageShouldBeInstalled) {
  continue;
try {
Any reason we can't put this whole try/catch block into the `else` above, so we don't need to run all this code again since we now know that it ran/is running? Would speed things up a bit. I tested it out and it seems to work.
That would only run the setup one time, which isn't the current or previous behavior. I'm trying to avoid any changes to the prior flow/guarantees, or to make only the smallest incremental changes required.
We'll definitely revisit this and other parts of the setup in later PRs, but for now I want small, conservative changes.
Okay, but we were only ever running the entire setup twice because we couldn't track the first. Now there is no need. Otherwise this change is going to cause the Ingest Manager setup to take longer because we have to await the first call before running the whole of the second and awaiting that.
> Okay, but we were only ever running the entire setup twice because we couldn't track the first. Now there is no need.
Then maybe we can remove the second call? Either way, I'm starting with stability then we can move on to speed.
We still need the second call, which functions more like a check now, because its purpose is to not let the user interact with Ingest Manager until setup has completed.
I think we should talk about this outside this PR. In person or another issue. As far as I can tell, this PR fixes the known bugs and tests are passing. I'd like to take the win and do more in another effort.
They wouldn't remain and seem to cause an issue with ensureDefaultConfig or something default config related. Here's some logs from a failing run which had some additional logging:

<details><summary>Failing log</summary>

```
proc [kibana] spawing 3 calls to setupIngestManager
│ proc [kibana] has setupIngestStatus. is pending
│ proc [kibana] has setupIngestStatus. is pending
│ proc [kibana] has setupIngestStatus. is pending
│ proc [kibana] after await setupIngestStatus
│ proc [kibana] before promise.all
│ proc [kibana] after await setupIngestStatus
│ proc [kibana] before promise.all
│ proc [kibana] after await setupIngestStatus
│ proc [kibana] before promise.all
│ info [o.e.c.m.MetadataMappingService] [JFSIII.local] [.kibana_2/d9f18zusTYONXlP_UrRRCQ] update_mapping [_doc]
│ info [o.e.c.m.MetadataMappingService] [JFSIII.local] [.kibana_2/d9f18zusTYONXlP_UrRRCQ] update_mapping [_doc]
│ info [o.e.c.m.MetadataMappingService] [JFSIII.local] [.kibana_2/d9f18zusTYONXlP_UrRRCQ] update_mapping [_doc]
│ info [o.e.c.m.MetadataMappingService] [JFSIII.local] [.kibana_2/d9f18zusTYONXlP_UrRRCQ] update_mapping [_doc]
│ info [o.e.c.m.MetadataMappingService] [JFSIII.local] [.kibana_2/d9f18zusTYONXlP_UrRRCQ] update_mapping [_doc]
│ proc [kibana] after promise.all. get config.id config1
│ proc [kibana] after promise.all. get config.id config1
│ proc [kibana] after promise.all. get config.id config1
│ info [o.e.c.m.MetadataMappingService] [JFSIII.local] [.kibana_2/d9f18zusTYONXlP_UrRRCQ] update_mapping [_doc]
│ proc [kibana] setupIngestManager error { Error: [ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1]: [version_conflict_engine_exception] [ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1], with { index_uuid="d9f18zusTYONXlP_UrRRCQ" & shard="0" & index=".kibana_2" }
│ proc [kibana] at respond (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:349:15)
│ proc [kibana] at checkRespForFailure (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:306:7)
│ proc [kibana] at HttpConnector.<anonymous> (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)
│ proc [kibana] at IncomingMessage.wrapper (/Users/jfsiii/work/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4929:19)
│ proc [kibana] at IncomingMessage.emit (events.js:203:15)
│ proc [kibana] at endReadableNT (_stream_readable.js:1145:12)
│ proc [kibana] at process._tickCallback (internal/process/next_tick.js:63:19)
│ proc [kibana] status: 409,
│ proc [kibana] displayName: 'Conflict',
│ proc [kibana] message:
│ proc [kibana] '[ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1]: [version_conflict_engine_exception] [ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1], with { index_uuid="d9f18zusTYONXlP_UrRRCQ" & shard="0" & index=".kibana_2" }',
│ proc [kibana] path: '/.kibana/_update/ingest-agent-configs%3Aconfig1',
│ proc [kibana] query: { refresh: 'wait_for' },
│ proc [kibana] body:
│ proc [kibana] { error:
│ proc [kibana] { root_cause: [Array],
│ proc [kibana] type: 'version_conflict_engine_exception',
│ proc [kibana] reason:
│ proc [kibana] '[ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1]',
│ proc [kibana] index_uuid: 'd9f18zusTYONXlP_UrRRCQ',
│ proc [kibana] shard: '0',
│ proc [kibana] index: '.kibana_2' },
│ proc [kibana] status: 409 },
│ proc [kibana] statusCode: 409,
│ proc [kibana] response:
│ proc [kibana] '{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1]","index_uuid":"d9f18zusTYONXlP_UrRRCQ","shard":"0","index":".kibana_2"}],"type":"version_conflict_engine_exception","reason":"[ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1]","index_uuid":"d9f18zusTYONXlP_UrRRCQ","shard":"0","index":".kibana_2"},"status":409}',
│ proc [kibana] toString: [Function],
│ proc [kibana] toJSON: [Function],
│ proc [kibana] isBoom: true,
│ proc [kibana] isServer: false,
│ proc [kibana] data: null,
│ proc [kibana] output:
│ proc [kibana] { statusCode: 409,
│ proc [kibana] payload:
│ proc [kibana] { message:
│ proc [kibana] '[ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1]: [version_conflict_engine_exception] [ingest-agent-configs:config1]: version conflict, required seqNo [274], primary term [1]. current document has seqNo [275] and primary term [1], with { index_uuid="d9f18zusTYONXlP_UrRRCQ" & shard="0" & index=".kibana_2" }',
│ proc [kibana] statusCode: 409,
│ proc [kibana] error: 'Conflict' },
│ proc [kibana] headers: {} },
│ proc [kibana] reformat: [Function],
│ proc [kibana] [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/conflict' }
│ proc [kibana] setupIngestManager ok
│ proc [kibana] setupIngestManager ok
│ proc [kibana] has setupIngestStatus. is pending
│ proc [kibana] after await setupIngestStatus
│ proc [kibana] before promise.all
│ proc [kibana] after promise.all. get config.id config1
│ proc [kibana] setupIngestManager ok
│ info [o.e.x.s.a.r.TransportPutRoleAction] [JFSIII.local] updated role [fleet_enroll]
│ info [o.e.x.s.a.u.TransportPutUserAction] [JFSIII.local] updated user [fleet_enroll]
│ proc [kibana] setupFleet before getDefaultOutputId
│ proc [kibana] setupFleet after getDefaultOutputId aed7a97e-9212-4ff9-b562-8b45f085f0d6
│ proc [kibana] setupFleet before updateOutput aed7a97e-9212-4ff9-b562-8b45f085f0d6 { fleet_enroll_username: 'fleet_enroll',
│ proc [kibana] fleet_enroll_password: 'OWE3MTM3NzMtZDM4Zi00OGUwLWFkYmMtZDMwNzE0YjMwMGY2' }
│ proc [kibana] setupFleet before generateEnrollmentAPIKey { name: 'Default', configId: 'config1' }
│ proc [kibana] server error [09:30:33.314] Error: Internal Server Error
│ proc [kibana] at HapiResponseAdapter.toError (/Users/jfsiii/work/kibana/src/core/server/http/router/response_adapter.ts:129:19)
│ proc [kibana] at HapiResponseAdapter.toHapiResponse (/Users/jfsiii/work/kibana/src/core/server/http/router/response_adapter.ts:79:19)
│ proc [kibana] at HapiResponseAdapter.handle (/Users/jfsiii/work/kibana/src/core/server/http/router/response_adapter.ts:74:17)
│ proc [kibana] at Router.handle (/Users/jfsiii/work/kibana/src/core/server/http/router/router.ts:263:34)
│ proc [kibana] at process._tickCallback (internal/process/next_tick.js:68:7)
```

</details>
@elasticmachine merge upstream
Pinging @elastic/ingest-management (Team:Ingest Management)
@@ -30,70 +30,93 @@ import { appContextService } from './app_context';
const FLEET_ENROLL_USERNAME = 'fleet_enroll';
const FLEET_ENROLL_ROLE = 'fleet_enroll';

// the promise which tracks the setup
let setupIngestStatus: Promise<void> | undefined;
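For context, a minimal sketch of how a module-scoped promise like this can gate concurrent calls, assuming a hypothetical `runSetupSteps` for the real setup work; the PR's actual bookkeeping may differ in details.

```ts
// Illustrative sketch of the module-level promise guard; names other than
// setupIngestStatus are hypothetical.
let setupIngestStatus: Promise<void> | undefined;
let setupResolve: (() => void) | undefined;
let setupReject: ((error: Error) => void) | undefined;

export async function setupIngestManager(soClient: unknown, callCluster: unknown) {
  if (setupIngestStatus) {
    // another call started setup first: wait for it before proceeding
    await setupIngestStatus;
  } else {
    // first call: create the promise that later callers will await
    setupIngestStatus = new Promise<void>((resolve, reject) => {
      setupResolve = resolve;
      setupReject = reject;
    });
  }

  try {
    await runSetupSteps(soClient, callCluster);
    setupResolve?.(); // settle the shared promise so pending callers continue
  } catch (error) {
    setupReject?.(error as Error);
    throw error;
  }
}

// hypothetical placeholder for installing default packages, output, and config
async function runSetupSteps(soClient: unknown, callCluster: unknown): Promise<void> {
  void soClient;
  void callCluster;
}
```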
This is not going to work if we have multiple instances of Kibana, no? Can we use a saved object to keep the state of the setup?
Yeah, I wondered about the same things. From Slack the other day:

> I initially had some logic for dealing with pending & succeeded state, but the way we tried initially had issues in multi-Kibana setups (and tests, iirc). We might need to move the state out of JS and into ES or somewhere else that's durable & shared by multiple Kibanas.

I did this because it's the simplest change that could possibly work, but I do think we'll want to move it to Saved Objects eventually.

Along those lines and https://github.com/elastic/kibana/issues/63123, perhaps we want a GET route to match `/fleet/setup`?

Do you think these need to be done in this PR, or can they be discussed/implemented in a follow-up issue?
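If the state does move out of process memory, one purely hypothetical shape is a saved object used as a cross-instance marker. The saved object type, id, and attribute names below are invented for illustration and are not part of this PR; the interface is a minimal stand-in for the parts of `SavedObjectsClientContract` it uses.

```ts
// Hypothetical sketch: persist setup state in a saved object so multiple Kibana
// instances can coordinate. The SO type/id and attribute names are invented here.
interface IngestSetupStateAttributes {
  status: 'in_progress' | 'completed' | 'failed';
  started_at: string;
}

// minimal stand-in for the parts of SavedObjectsClientContract used below
interface MinimalSoClient {
  create<T>(type: string, attributes: T, options?: { id?: string }): Promise<unknown>;
}

const SETUP_STATE_TYPE = 'ingest-setup-state';
const SETUP_STATE_ID = 'default';

// Returns true if this instance "won" and should run setup; false if another
// instance already created the marker (setup is in progress or done elsewhere).
async function claimSetup(soClient: MinimalSoClient): Promise<boolean> {
  try {
    await soClient.create<IngestSetupStateAttributes>(
      SETUP_STATE_TYPE,
      { status: 'in_progress', started_at: new Date().toISOString() },
      { id: SETUP_STATE_ID } // create without overwrite conflicts if the doc already exists
    );
    return true;
  } catch (error) {
    return false;
  }
}
```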
I am in favor of doing it in this PR as I think this is probably not going to solve our problem
Does the failure in multi-kibana exist in the code before #69505, e.g. in 7.8?
I'll see how quickly I can do a dead-simple boolean but, like I said in the description, I think we'll want more fine-grained permissions and I'd like to avoid designing that during the PR or writing such short-lived code if this works. I know, "if" it works.
Given that this is a non-trivial problem, would it make sense to open follow-up issues for those, and merge this one so we can move on? @neptunian @nchaulet what do you think?
My issue is that the code no longer makes sense to me and it increases the setup time. Here, the second call now awaits the first call (which adds time) to wait for setup to complete, and then runs through the setup again even though we know the first has already completed successfully. I think it would make sense to clear this up in this PR. Since it does fix the race condition issue, I suppose we can have follow-up PRs after @nchaulet approves his request.
OK with this as a temporary solution that fixes local development, but we should have a follow-up issue.
Let's try to have a follow-up soon for the multi-Kibana issues @nchaulet mentioned; this would be a problem for 7.9 and the scaling concerns.
That's good feedback. I'll try to clear it up
This seems like a fundamental tradeoff we're making (trading speed for correctness). If we call the same endpoint/function multiple times, we can either execute them a) in parallel/concurrently or b) in series/consecutively.
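As a generic illustration of those two options (not tied to the PR's code):

```ts
// Generic illustration of (a) parallel vs (b) serial execution of the same async setup.
async function runConcurrently(setup: () => Promise<void>, calls: number): Promise<void> {
  // (a) all calls start at once; they can interleave and race on shared writes
  await Promise.all(Array.from({ length: calls }, () => setup()));
}

async function runInSeries(setup: () => Promise<void>, calls: number): Promise<void> {
  // (b) each call waits for the previous one; slower overall, but no interleaving
  for (let i = 0; i < calls; i++) {
    await setup();
  }
}
```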
That's the status quo behavior: multiple HTTP calls result in multiple calls to that function. I talked about this some in #70008 (comment), but we're making a change at the lowest level of the plugin. So much code was written around the existing behavior that I don't think it's wise to make large changes to it. This way we can see if this works as expected and move forward with further refactoring.
created #70333 for follow-up
@elasticmachine merge upstream
💛 Build succeeded, but was flaky
Test Failures: Kibana Pipeline / kibana-oss-agent / Firefox UI Functional Tests, test/functional/apps/visualize/input_control_vis/chained_controls.js: "visualize app input controls chained controls should create a seperate filter pill for parent control and child control"
To update your PR or re-run it, just comment with:
Going to merge b/c the failure seems unrelated to our plugin or this code and has failed before.
Okay, that's fine. But just to be clear, I expect multiple calls to the function, but not to execute what's in the
@neptunian I understand what you're saying. I left out a part
The current behavior has no idea whether a prior call succeeded, failed, or is pending. The logic is the same for each run: it calls the same functions each time. Those functions might exit early or do something else, but there is no "did setup run before" logic. I explained why I don't want to change that. At this point we're each repeating ourselves, so I don't think it's worth continuing here.
…elastic#70008) * Make setupIngestManager wait if another setup is in progress
Kicking off a backport for this to avoid potential conflicts with other incoming backports.
…#70008) (#70522) * Make setupIngestManager wait if another setup is in progress Co-authored-by: John Schulz <[email protected]>
Summary
closes #69843
Draft while waiting for CI to pass and for other tests to be added
Handle the case when `setupIngestManager` is called before the first call has had a chance to complete. Chose to `await` instead of returning early or erroring, since that's the closest to what we're doing now.

We likely want to eventually split the status up into more granular settings like "hasDefaultPackages", "hasDefaultOutput", "hasDefaultConfig", etc., but I think this works.
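A speculative sketch of what those "more granular settings" could look like, using only the flag names from the sentence above; nothing like this exists in the PR.

```ts
// Speculative only: a more granular setup status, using the flag names from the
// description above. Not part of this PR.
interface IngestSetupStatus {
  hasDefaultPackages: boolean;
  hasDefaultOutput: boolean;
  hasDefaultConfig: boolean;
}

function isSetupComplete(status: IngestSetupStatus): boolean {
  return status.hasDefaultPackages && status.hasDefaultOutput && status.hasDefaultConfig;
}
```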