
Hitting /api/status successfully takes a very long time after start #107300

Closed
cachedout opened this issue Jul 30, 2021 · 10 comments
Labels
bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@cachedout
Contributor

cachedout commented Jul 30, 2021

Kibana version:
Nightly snapshots

Elasticsearch version:
Nightly snapshots

Server OS version:
Linux

Browser version:
N/A

Browser OS version:
N/A

Original install method (e.g. download page, yum, from source, etc.):
Nightly snapshot
Describe the bug:

I am moving a conversation from #106749 to this issue at the request of @tylersmalley and re-tagging with @elastic/kibana-core instead of @elastic/kibana-operations.

Summary

In Observability, we run APM Integration Tests which have been failing quite frequently as a result of the status API not being available after the Kibana container starts.

We even raised the timeout we use to wait for this API to 5 minutes, and we're still seeing the same timeouts.

Of course, if we can't reliably count on Kibana to give us a signal that it has started successfully, this is quite problematic for the orchestration that our entire integration testing platform depends on.
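For reference, the readiness probe our orchestration runs is essentially the following (a minimal sketch only; the base URL, the five-minute deadline, and the poll interval are illustrative rather than the exact values used in the APM Integration Tests):

```ts
// Minimal sketch of the readiness probe (Node 18+, global fetch).
// KIBANA_URL, the deadline, and the poll interval are illustrative values,
// not the exact ones used in the APM Integration Tests.
const KIBANA_URL = 'http://kibana:5601';

async function waitForKibana(deadlineMs = 5 * 60 * 1000, intervalMs = 5_000): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < deadlineMs) {
    try {
      const res = await fetch(`${KIBANA_URL}/api/status`);
      if (res.ok) return; // 200 means Kibana reports itself available
      console.log(`GET /api/status returned ${res.status}, retrying...`);
    } catch (err) {
      console.log(`GET /api/status failed (${err}), retrying...`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Kibana did not become available within ${deadlineMs / 1000}s`);
}
```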

A few questions:

  1. Is @elastic/kibana-core tracking any regressions right now that could explain our occasional inability to hit the /api/status endpoint within 5 minutes of a Kibana container starting? (We see no errors coming from Kibana.)
  2. Is hitting this endpoint the best way to determine that Kibana is fully started and ready to receive requests? If not, could @elastic/kibana-core recommend an alternative approach?

We're definitely happy to continue providing any information that can help get this fixed, but we'd also like to see what can be done about prioritizing a fix. Though it's certainly not a high priority for end users (probably?), it's definitely having a negative impact on our ability to deliver reliable testing for our teams.

Logs

There are no logs after 2021-07-29T03:24:35+00:00. We wait 5 minutes for /api/status to respond, but there is no response. After the five-minute timeout, we bail out.

[2021-07-29T03:24:09.159Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["info","plugins-service"],"pid":1195,"message":"Plugin \"telemetry\" is disabled."}
[2021-07-29T03:24:09.160Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["info","plugins-service"],"pid":1195,"message":"Plugin \"telemetryManagementSection\" has been disabled since the following direct or transitive dependencies are missing, disabled, or have incompatible types: [telemetry]"}
[2021-07-29T03:24:09.161Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["info","plugins-service"],"pid":1195,"message":"Plugin \"userSetup\" is disabled."}
[2021-07-29T03:24:09.166Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["info","plugins-service"],"pid":1195,"message":"Plugin \"metricsEntities\" is disabled."}
[2021-07-29T03:24:09.255Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["info","http","server","Preboot"],"pid":1195,"message":"http server running at http://0.0.0.0:5601"}
[2021-07-29T03:24:09.312Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["warning","config","deprecation"],"pid":1195,"message":"plugins.scanDirs is deprecated and is no longer used"}
[2021-07-29T03:24:09.312Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["warning","config","deprecation"],"pid":1195,"message":"xpack.fleet.agents.kibana is deprecated and is no longer used"}
[2021-07-29T03:24:09.312Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["warning","config","deprecation"],"pid":1195,"message":"Config key [xpack.fleet.agents.elasticsearch.host] is deprecated and replaced by [xpack.fleet.agents.elasticsearch.hosts]"}
[2021-07-29T03:24:09.312Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["warning","config","deprecation"],"pid":1195,"message":"\"xpack.reporting.roles\" is deprecated. Granting reporting privilege through a \"reporting_user\" role will not be supported starting in 8.0. Please set \"xpack.reporting.roles.enabled\" to \"false\" and grant reporting privileges to users using Kibana application privileges **Management > Security > Roles**."}
[2021-07-29T03:24:09.527Z] {"type":"log","@timestamp":"2021-07-29T03:24:09+00:00","tags":["info","plugins-system","standard"],"pid":1195,"message":"Setting up [109] plugins: [translations,licensing,globalSearch,globalSearchProviders,banners,licenseApiGuard,usageCollection,xpackLegacy,taskManager,telemetryCollectionManager,telemetryCollectionXpack,kibanaUsageCollection,securityOss,share,screenshotMode,newsfeed,mapsEms,mapsLegacy,legacyExport,kibanaLegacy,embeddable,uiActionsEnhanced,expressions,charts,esUiShared,bfetch,data,savedObjects,visualizations,visTypeXy,visTypeVislib,visTypeTimelion,features,visTypeTagcloud,visTypeTable,visTypePie,visTypeMetric,visTypeMarkdown,tileMap,regionMap,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,timelion,home,searchprofiler,painlessLab,grokdebugger,graph,visTypeVega,management,watcher,upgradeAssistant,licenseManagement,indexPatternManagement,advancedSettings,discover,discoverEnhanced,dashboard,dashboardEnhanced,visualize,visTypeTimeseries,savedObjectsManagement,spaces,security,transform,savedObjectsTagging,lens,reporting,canvas,lists,ingestPipelines,fileUpload,maps,dataVisualizer,encryptedSavedObjects,dataEnhanced,dashboardMode,cloud,snapshotRestore,fleet,indexManagement,rollup,remoteClusters,crossClusterReplication,indexLifecycleManagement,eventLog,actions,alerting,triggersActionsUi,stackAlerts,ruleRegistry,osquery,ml,cases,timelines,securitySolution,observability,uptime,infra,monitoring,logstash,enterpriseSearch,console,apmOss,apm]"}
[2021-07-29T03:24:10.958Z] {"type":"log","@timestamp":"2021-07-29T03:24:10+00:00","tags":["info","plugins","taskManager"],"pid":1195,"message":"TaskManager is identified by the Kibana UUID: aca82d38-9d4e-401a-a31c-5b51156eb88d"}
[2021-07-29T03:24:21.977Z] {"type":"log","@timestamp":"2021-07-29T03:24:21+00:00","tags":["warning","plugins","security","config"],"pid":1195,"message":"Session cookies will be transmitted over insecure connections. This is not recommended."}
[2021-07-29T03:24:22.028Z] {"type":"log","@timestamp":"2021-07-29T03:24:22+00:00","tags":["warning","plugins","reporting","config"],"pid":1195,"message":"Generating a random key for xpack.reporting.encryptionKey. To prevent sessions from being invalidated on restart, please set xpack.reporting.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command."}
[2021-07-29T03:24:22.043Z] {"type":"log","@timestamp":"2021-07-29T03:24:22+00:00","tags":["warning","plugins","reporting","config"],"pid":1195,"message":"Chromium sandbox provides an additional layer of protection, but is not supported for Linux Red Hat Linux 8.4 OS. Automatically setting 'xpack.reporting.capture.browser.chromium.disableSandbox: true'."}
[2021-07-29T03:24:30.364Z] {"type":"log","@timestamp":"2021-07-29T03:24:30+00:00","tags":["info","plugins","ruleRegistry"],"pid":1195,"message":"Write is disabled, not installing assets"}
[2021-07-29T03:24:30.722Z] {"type":"log","@timestamp":"2021-07-29T03:24:30+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations..."}
[2021-07-29T03:24:31.237Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"Starting saved objects migrations"}
[2021-07-29T03:24:31.502Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana] INIT -> CREATE_NEW_TARGET. took: 82ms."}
[2021-07-29T03:24:31.515Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana_task_manager] INIT -> CREATE_NEW_TARGET. took: 91ms."}
[2021-07-29T03:24:31.757Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana_task_manager] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 243ms."}
[2021-07-29T03:24:31.789Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 286ms."}
[2021-07-29T03:24:31.822Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana_task_manager] MARK_VERSION_INDEX_READY -> DONE. took: 65ms."}
[2021-07-29T03:24:31.823Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana_task_manager] Migration completed after 399ms"}
[2021-07-29T03:24:31.856Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana] MARK_VERSION_INDEX_READY -> DONE. took: 68ms."}
[2021-07-29T03:24:31.856Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","savedobjects-service"],"pid":1195,"message":"[.kibana] Migration completed after 436ms"}
[2021-07-29T03:24:31.978Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","status"],"pid":1195,"message":"Kibana is now unavailable"}
[2021-07-29T03:24:31.979Z] {"type":"log","@timestamp":"2021-07-29T03:24:31+00:00","tags":["info","plugins-system","standard"],"pid":1195,"message":"Starting [109] plugins: [translations,licensing,globalSearch,globalSearchProviders,banners,licenseApiGuard,usageCollection,xpackLegacy,taskManager,telemetryCollectionManager,telemetryCollectionXpack,kibanaUsageCollection,securityOss,share,screenshotMode,newsfeed,mapsEms,mapsLegacy,legacyExport,kibanaLegacy,embeddable,uiActionsEnhanced,expressions,charts,esUiShared,bfetch,data,savedObjects,visualizations,visTypeXy,visTypeVislib,visTypeTimelion,features,visTypeTagcloud,visTypeTable,visTypePie,visTypeMetric,visTypeMarkdown,tileMap,regionMap,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,timelion,home,searchprofiler,painlessLab,grokdebugger,graph,visTypeVega,management,watcher,upgradeAssistant,licenseManagement,indexPatternManagement,advancedSettings,discover,discoverEnhanced,dashboard,dashboardEnhanced,visualize,visTypeTimeseries,savedObjectsManagement,spaces,security,transform,savedObjectsTagging,lens,reporting,canvas,lists,ingestPipelines,fileUpload,maps,dataVisualizer,encryptedSavedObjects,dataEnhanced,dashboardMode,cloud,snapshotRestore,fleet,indexManagement,rollup,remoteClusters,crossClusterReplication,indexLifecycleManagement,eventLog,actions,alerting,triggersActionsUi,stackAlerts,ruleRegistry,osquery,ml,cases,timelines,securitySolution,observability,uptime,infra,monitoring,logstash,enterpriseSearch,console,apmOss,apm]"}
[2021-07-29T03:24:32.040Z] {"type":"log","@timestamp":"2021-07-29T03:24:32+00:00","tags":["info","plugins","monitoring","monitoring"],"pid":1195,"message":"config sourced from: production cluster"}
[2021-07-29T03:24:34.266Z] {"type":"log","@timestamp":"2021-07-29T03:24:34+00:00","tags":["info","http","server","Kibana"],"pid":1195,"message":"http server running at http://0.0.0.0:5601"}
[2021-07-29T03:24:34.597Z] {"type":"log","@timestamp":"2021-07-29T03:24:34+00:00","tags":["info","plugins","monitoring","monitoring","kibana-monitoring"],"pid":1195,"message":"Starting monitoring stats collection"}
[2021-07-29T03:24:35.621Z] {"type":"log","@timestamp":"2021-07-29T03:24:35+00:00","tags":["info","plugins","reporting"],"pid":1195,"message":"Browser executable: /usr/share/kibana/x-pack/plugins/reporting/chromium/headless_shell-linux_x64/headless_shell"}
[2021-07-29T03:24:35.622Z] {"type":"log","@timestamp":"2021-07-29T03:24:35+00:00","tags":["warning","plugins","reporting"],"pid":1195,"message":"Enabling the Chromium sandbox provides an additional layer of protection."}
[2021-07-29T03:24:35.647Z] {"type":"log","@timestamp":"2021-07-29T03:24:35+00:00","tags":["info","plugins","reporting","store"],"pid":1195,"message":"Creating ILM policy for managing reporting indices: kibana-reporting"}
[2021-07-29T03:24:35.853Z] {"type":"log","@timestamp":"2021-07-29T03:24:35+00:00","tags":["info","plugins","securitySolution"],"pid":1195,"message":"Dependent plugin setup complete - Starting ManifestTask"}

Steps to reproduce:

  1. Use a nightly snapshot of Kibana
  2. Start the Kibana instance, pointing it at a running Elasticsearch server
  3. Hit the /api/status endpoint.

Expected behavior:

We expect to be able to either hit the endpoint within 5 minutes or see an error in the logs indicating why not.

@cachedout cachedout added the bug Fixes for quality problems that affect the customer experience label Jul 30, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Jul 30, 2021
@cachedout
Contributor Author

(Moving a comment by @tylersmalley here so the conversation isn't split)

Regarding whether hitting the status endpoint is the right approach to determine if Kibana is up: I know a common practice for load balancer configurations with Kibana is to hit the root / and check the status code. I wonder if that would be more reliable for you in the meantime, or if this is an issue with the HTTP server entirely.

We could do that, but we need to ensure that Kibana is fully functional, which is why we chose the status endpoint. If hitting the / endpoint successfully indicates that Kibana is fully operational and is a better approach than the status endpoint, we can certainly switch, but it would be good to confirm that Kibana really is fully operational at that point first. Trying to continue with our test framework before Kibana is fully operational would be nearly as problematic as not having Kibana up at all, unfortunately.
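To make the difference concrete, the two checks look roughly like this (a sketch only; the status-response field names below, overall.state in the legacy format and overall.level in the v8 format, are my assumptions and should be verified against the actual payload):

```ts
// Sketch: "HTTP server answers" vs. "Kibana reports itself fully available".
// The response field names (status.overall.state / status.overall.level) are
// assumptions about the legacy and v8 status formats; verify against your build.
async function rootIsUp(base: string): Promise<boolean> {
  // Load-balancer-style check: only cares that the HTTP server answers at all.
  const res = await fetch(base, { redirect: 'manual' });
  return res.status < 500;
}

async function statusIsAvailable(base: string): Promise<boolean> {
  const res = await fetch(`${base}/api/status`);
  if (!res.ok) return false;
  const body = await res.json();
  return (
    body?.status?.overall?.state === 'green' || // legacy format (assumed)
    body?.status?.overall?.level === 'available' // v8 format (assumed)
  );
}
```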

@tylersmalley tylersmalley added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jul 30, 2021
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jul 30, 2021
@liza-mae
Contributor

Wondering if elastic/beats#27036 is also related to this issue.

@joshdover
Contributor

@cachedout I just tried this with the latest 7.14.0 BCs and I'm not able to reproduce this issue with the default configuration. For me, it took 17s for Kibana to go into an available state, so I suspect that there is something specific to the configuration or environment this is happening in.

A few questions:

  • What Kibana configuration are you all using? Specifically, is anything related to task manager or alerting being customized?
  • Have you taken a look at the response from the /api/status API to see which components may be blocking Kibana from going into 'available'? FYI, you will get more detailed information with the newer format of this API: /api/status?v8format=true (see the sketch after this list).
  • Have you enabled verbose logging to see if there's anything obviously failing to initialize? Again, I'd focus on anything from task manager or alerting, as those are the components that tend to delay Kibana's availability.
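Something along these lines can help surface which components are holding the overall status back (a rough sketch only; the exact shape of the v8-format response, in particular where the core and plugins maps live and the per-entry level/summary fields, is an assumption to verify against your build):

```ts
// Rough sketch: print every service/plugin that is not yet 'available' in the
// v8-format status response. The response shape (status.core / status.plugins
// maps with level and summary per entry) is an assumption; adjust as needed.
async function reportBlockers(base = 'http://localhost:5601'): Promise<void> {
  const res = await fetch(`${base}/api/status?v8format=true`);
  const body = await res.json();
  const entries: Record<string, any> = {
    ...(body?.status?.core ?? {}),
    ...(body?.status?.plugins ?? {}),
  };
  for (const [name, status] of Object.entries(entries)) {
    if (status?.level !== 'available') {
      console.log(`${name}: ${status?.level} - ${status?.summary ?? ''}`);
    }
  }
  console.log(`overall: ${body?.status?.overall?.level}`);
}
```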

We made a few changes to this API recently, including a fix for an issue that slowed down reaching the available state. Let's be sure the version of Kibana you're using includes these fixes:

jbudz added a commit to jbudz/kibana that referenced this issue Aug 5, 2021
The socket timeout for testing whether the status page is available or
not is currently 30 seconds.  This test was disabled for being flaky.
Reproducing this locally hasn't been straightforward, but I am seeing
an average of ~20 seconds, which is close enough to the timeout that I'd
like to rule out machine differences.  This gives the status check 120
seconds before dropping the connection.

Related to elastic#106749 and elastic#107300
@cachedout
Contributor Author

cachedout commented Aug 9, 2021

Hi @joshdover

Thanks again for this reply.

What Kibana configuration are you all using? Specifically, anything related to task manager or alerting that are being customized?

We're using nightly container snapshots of Kibana which are configured exclusively through environment variables. Here's what we're setting with a few values snipped:

        "ELASTICSEARCH_HOSTS": "http://elasticsearch:9200",
        "SERVER_HOST": "0.0.0.0",
        "SERVER_NAME": "kibana.example.org",
        "TELEMETRY_ENABLED": "false",
        "XPACK_APM_SERVICEMAPENABLED": "true",
        "XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY": "<snipped>",
        "XPACK_FLEET_AGENTS_ELASTICSEARCH_HOST": "http://elasticsearch:9200",
        "XPACK_FLEET_AGENTS_KIBANA_HOST": "http://kibana:5601",
        "XPACK_FLEET_REGISTRYURL": "https://epr-snapshot.elastic.co",
        "XPACK_MONITORING_ENABLED": "true",
        "XPACK_SECURITY_ENCRYPTIONKEY": "<snipped>",
        "XPACK_XPACK_MAIN_TELEMETRY_ENABLED": "false",

Have you taken a look at the response from the /api/status API to see which components may be blocking Kibana from going into 'available'? FYI you will get more detailed information with the newer format of this API: /api/status?v8format=true

Though we see this issue happen regularly in CI, it remains pretty elusive when I try to reproduce it locally. This could well be an environmental issue in CI, but I have yet to sort out exactly what that might be.

Have you enabled verbose logging to see if there's anything obviously failing to initialize? Again, I'd focus on anything from task manager or alerting as those are the components that tend to delay Kibana's availability.

Will do.

Let's be sure the version of Kibana you're using includes these fixes

We are using the 8.0.0 nightly snapshots, so these fixes should be present.

For me, it took 17s for Kibana to go into an available state

In my local testing so far today, I have yet to see a single instance where it is anywhere near this quick. What typically happens is that once the HTTP server becomes available, cURL establishes an HTTP connection to Kibana, and that connection seems to be held open waiting for a response. While a valid response is eventually returned by Kibana, this typically happens after no less than 30 seconds, and I have seen it go as high as 47 seconds. It's still a bit of a stretch to say that a different environment might be taking as long as five minutes, but it's certainly possible.

Ideally, I'd think, if the API were not available, Kibana would return an error code right away instead of holding the connection to the client open. That said, it's still unclear to me whether these long waits are the actual root of the problem we're experiencing. It's important to note that we are seeing this somewhat intermittently, and I'll keep looking for a pattern in the failures. The Kibana configurations don't really vary, though, so this seems like a race condition of some kind.
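One workaround on our side might be to cap each individual request, so a connection that Kibana holds open is abandoned and retried rather than eating into the whole five-minute budget. A quick sketch (the 10-second cap is an arbitrary value I picked for illustration):

```ts
// Sketch of a fail-fast probe: abort any single /api/status request after a
// short cap so a held-open connection doesn't consume the whole retry budget.
// The 10-second cap is arbitrary.
async function probeOnce(url: string, capMs = 10_000): Promise<number | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), capMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return res.status;
  } catch {
    return null; // aborted or connection error: treat as "not ready yet"
  } finally {
    clearTimeout(timer);
  }
}
```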

I'll continue to update this issue as I learn more.

jbudz added a commit that referenced this issue Aug 16, 2021
The socket timeout for testing whether the status page is available or
not is currently 30 seconds.  This test was disabled for being flaky.
Reproducing this locally hasn't been straightforward, but I am seeing
an average of ~20 seconds, which is close enough to the timeout that I'd
like to rule out machine differences.  This gives the status check 120
seconds before dropping the connection.

Related to #106749 and #107300

Co-authored-by: Kibana Machine <[email protected]>
kibanamachine added a commit to kibanamachine/kibana that referenced this issue Aug 16, 2021
The socket timeout for testing whether the status page is available or
not is currently 30 seconds.  This test was disabled for being flaky.
Reproducing this locally hasn't been straightforward, but I am seeing
an average of ~20 seconds, which is close enough to the timeout that I'd
like to rule out machine differences.  This gives the status check 120
seconds before dropping the connection.

Related to elastic#106749 and elastic#107300

Co-authored-by: Kibana Machine <[email protected]>
kibanamachine added a commit that referenced this issue Aug 16, 2021
The socket timeout for testing whether the status page is available or
not is currently 30 seconds.  This test was disabled for being flaky.
Reproducing this locally hasn't been straightforward, but I am seeing
an average of ~20 seconds, which is close enough to the timeout that I'd
like to rule out machine differences.  This gives the status check 120
seconds before dropping the connection.

Related to #106749 and #107300

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Jonathan Budzenski <[email protected]>
@cachedout
Contributor Author

Update on this one: we're waiting for new nightly snapshots to be produced that include the changes from this PR, and hoping those resolve this issue.

@afharo
Member

afharo commented Aug 23, 2021

Thank you @cachedout! Let us know if we need to put more research into this issue.

@cachedout
Contributor Author

cachedout commented Aug 23, 2021

@afharo Sadly, we're still seeing this problem.

We're waiting a full five minutes for Kibana to return a response at this point. It's starting to seem less as if the status API is taking a long time and more as if all of Kibana has stopped responding entirely. I still haven't been able to reproduce anything like this outside of the regularly scheduled CI runs, so it's quite the mystery. I'll update this issue if we learn more, but right now I don't have anything particularly actionable for the Kibana folks to go on, unless other teams are seeing similar mysterious failures.

@cachedout
Contributor Author

@tylersmalley and @afharo

I now believe this is almost certainly a symptom of what's being tracked in #110583.

How would you like to proceed in terms of tracking this issue? Would you prefer to keep both issues open, or close this one in favor of #110583, which appears to be getting prioritized? 🤞

@afharo
Member

afharo commented Sep 7, 2021

cc @elastic/kibana-core since I'm on leave.

@pgayvallet
Contributor

pgayvallet commented Sep 7, 2021

How would you like to proceed in terms of tracking this issue? Would you prefer to keep both issues open or would you prefer to close this in favor of #110583 which appears as if it's going to be prioritized

#110583 is prioritized and currently in our 'next sprint queue', so if we know that this issue is a duplicate or a symptom of #110583, I think it's best to close this one and track #110583 instead.
