[telemetry] Stats api is slow on M1 chips #119477

Closed
Bamieh opened this issue Nov 23, 2021 · 7 comments · Fixed by #121437
Assignees
Labels
8.1 candidate Feature:Telemetry impact:critical This issue should be addressed immediately due to a critical level of impact on the product. investigating loe:medium Medium Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@Bamieh
Member

Bamieh commented Nov 23, 2021

[Needs debugging]

Original issue

The stats API is slow / hangs on M1 chips

curl -u elastic:changeme http://localhost:5601/api/stats?extended | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:00 --:--:--     0
curl: (52) Empty reply from server

Just from reading #20577, I'm wondering if maybe the Reporting plugin is used to power it? I'm running on an M1, and the only errors in my logs are:

[2021-11-08T14:37:09.545-06:00][ERROR][plugins.reporting] Error in Reporting start, reporting may not function properly
[2021-11-08T14:37:09.545-06:00][ERROR][plugins.reporting] Error: Unsupported platform: darwin-arm64
    at installBrowser (/Users/seanstory/Desktop/Dev/kibana/x-pack/plugins/reporting/server/browsers/install.ts:34:11)
    at initializeBrowserDriverFactory (/Users/seanstory/Desktop/Dev/kibana/x-pack/plugins/reporting/server/browsers/index.ts:32:27)
    at /Users/seanstory/Desktop/Dev/kibana/x-pack/plugins/reporting/server/plugin.ts:101:42
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

Context

I've done some debugging and this does not happen on Intel chips.
I tried to find anything in the code that behaves differently based on architecture, but didn't find anything interesting.

I do not have an M1 MacBook handy to test this, so @yakhinvadim offered to pair on Zoom to debug this issue.

CC @seanstory @yakhinvadim

@Bamieh Bamieh added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Telemetry loe:medium Medium Level of Effort impact:critical This issue should be addressed immediately due to a critical level of impact on the product. investigating 8.1 candidate labels Nov 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@Bamieh Bamieh self-assigned this Nov 23, 2021
@seanstory
Member

We've only noticed this with /stats?extended=true. Just /stats seems to work fine.

@Bamieh
Member Author

Bamieh commented Dec 1, 2021

I've debugged the issue with @yakhinvadim and we figured out the reason this is happening:

Reporting is failing to start due to Unsupported platform: darwin-arm64

The collector's isReady waits for this.pluginStart$ to emit an event, but this never happens because reporting start errors out. This causes the stats API to hang indefinitely while waiting for all collectors to be ready.

Reporting stack trace
[2021-11-30T08:40:58.952-08:00][ERROR][plugins.reporting] Error in Reporting start, reporting may not function properly
[2021-11-30T08:40:58.952-08:00][ERROR][plugins.reporting] Error: Unsupported platform: darwin-arm64
    at installBrowser (./kibana/x-pack/plugins/reporting/server/browsers/install.ts:34:11)
    at initializeBrowserDriverFactory (./kibana/x-pack/plugins/reporting/server/browsers/index.ts:32:27)
    at ./kibana/x-pack/plugins/reporting/server/plugin.ts:101:42
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
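
The hang described above can be modelled in a few lines. This is only a sketch under assumptions: `createDeferred` and `startPlugin` are hypothetical names, and a plain Promise stands in for the `pluginStart$` observable. It is not the Reporting plugin's actual code.

```typescript
// Hypothetical model of the hang; none of these names are Kibana's.
function createDeferred(): { promise: Promise<void>; resolve: () => void } {
  let resolve!: () => void;
  const promise = new Promise<void>((r) => {
    resolve = r;
  });
  return { promise, resolve };
}

function startPlugin(install: () => void): { isReady: () => Promise<boolean> } {
  const started = createDeferred();
  try {
    install(); // in the issue this throws "Unsupported platform: darwin-arm64"
    started.resolve(); // only reached when start succeeds
  } catch {
    // The error is swallowed (the real plugin logs it), and `started`
    // is never resolved or rejected.
  }
  return {
    isReady: async () => {
      await started.promise; // pends forever after a failed start
      return true;
    },
  };
}
```

When `install` throws, the promise returned by `isReady()` never settles, so anything that awaits every collector's readiness waits indefinitely.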

The solution I have for this is:

  1. Open a ticket for reporting to enable running reporting on M1 chips (not sure if we already have one for this).
  2. Fix the reporting collector's isReady rxjs pipe to time out after a certain time.
  3. Add new collector logic to return isReady: false after 10 seconds if a collector hangs (10s is arbitrary but should be enough). This way we can send usage in all cases.
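
Step 3 could look roughly like the sketch below. It is an assumption-laden illustration, not the merged fix: the `Collector` interface, `isReadyWithTimeout`, and `getReadyCollectors` are hypothetical names, and it uses `Promise.race` rather than an rxjs pipe to stay self-contained.

```typescript
// Hypothetical collector shape: only the isReady contract matters here.
interface Collector {
  type: string;
  isReady: () => Promise<boolean> | boolean;
}

// Resolve to false if isReady has not settled within timeoutMs.
// The 10s default mirrors the (arbitrary) value proposed above.
async function isReadyWithTimeout(
  collector: Collector,
  timeoutMs = 10_000
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    return await Promise.race([
      // Wrapping in then() also captures synchronous throws from isReady.
      Promise.resolve().then(() => collector.isReady()),
      timeout,
    ]);
  } catch {
    // A readiness check that throws is treated as "not ready".
    return false;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Check all collectors concurrently and keep only the ready ones,
// so one hanging collector cannot block the whole stats response.
async function getReadyCollectors(
  collectors: Collector[],
  timeoutMs = 10_000
): Promise<Collector[]> {
  const flags = await Promise.all(
    collectors.map((c) => isReadyWithTimeout(c, timeoutMs))
  );
  return collectors.filter((_, i) => flags[i]);
}
```

With this shape, a collector that never becomes ready is simply skipped after the timeout, and usage from the remaining collectors can still be sent.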

@mshustov
Contributor

mshustov commented Dec 1, 2021

> to emit an event but this never happens since monitoring start errors out.

Is it a monitoring or a reporting problem? The stack trace ends at installBrowser (./kibana/x-pack/plugins/reporting/server/browsers/install.ts:34:11)
@elastic/kibana-reporting-services shouldn't the Reporting plugin change its status to degraded?

this.logger.error(`Error in Reporting start, reporting may not function properly`);
this.logger.error(e);

@Bamieh Can we detect these cases automatically to prevent degradation in the future?

@Bamieh
Member Author

Bamieh commented Dec 1, 2021

@mshustov yes, reporting, not monitoring; I've fixed my comment. Can you clarify what you have in mind when you said 'detect these cases'? Every collector can have its own isReady logic. The solution I can think of at the collection level is to return false if the collector is taking a long time to get ready.

@afharo
Member

afharo commented Dec 1, 2021

Maybe fixing #97788 might tackle issues like this one? If the Telemetry service skips any reports if the status is degraded, we could catch this issue? Is it too drastic?

@mshustov
Contributor

mshustov commented Dec 1, 2021

> The solution I can think of at the collection level is to return false if the collector is taking a long time to get ready

yeah, something like that would work.

> If the Telemetry service skips any reports if the status is degraded, we could catch this issue?

Not likely. As you can see, Reporting doesn't set a degraded status, so we will have to rely on another signal. As per Ahmad's suggestion, we can have a readiness timeout, for example.
