[core.metrics] Add support for multiple processes in ops metrics & stats API; deprecate process field #104031

Closed
Tracked by #68626
lukeelmers opened this issue Jun 30, 2021 · 6 comments · Fixed by #109820
Labels
Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@lukeelmers
Member

lukeelmers commented Jun 30, 2021

Part of #68626

Summary

In order to support a multi-process Kibana (RFC: #94057), we need to update the /stats REST API to support metrics for more than one process.

This endpoint, registered from the usage_collection plugin, is getting these stats from Core's metrics service (getOpsMetrics$), which is also used in the monitoring plugin for stats collection.

The /stats response is problematic in that it contains a handful of process metrics which will differ from worker to worker:

{
  // ...
  "process": {
    "memory": {
      "heap": {
        "total_bytes": 533581824,
        "used_bytes": 296297424,
        "size_limit": 4345298944
      },
      "resident_set_size_bytes": 563625984
    },
    "pid": 52646,
    "event_loop_delay": 0.22967800498008728,
    "uptime_ms": 1706021.930404
  },
  // ...
}

As each request could be routed to a different worker, different results may come back each time.

Ultimately we'll need to extend the API to provide per-worker stats, but in the interim, we need to at least:

  1. Mark the existing process field as deprecated
  2. Implement a new field to replace it (which we need to discuss in this issue)
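The interim step could be sketched roughly like this (the `buildStatsResponse` helper and its input shape are hypothetical, for illustration only): until true per-worker stats exist, the new field would simply wrap the existing `process` object.

```typescript
// Hypothetical sketch: the new field initially duplicates the
// existing (deprecated) `process` object as a single-item array.
interface ProcessMetrics {
  pid: number;
  uptime_ms: number;
}

function buildStatsResponse(process: ProcessMetrics) {
  return {
    process,              // @deprecated; kept for BWC until the 8.0 breaking change
    processes: [process], // new field: the same data, array-shaped
  };
}
```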

One idea

The simplest approach we could take here would be to add a `processes` field: an array of objects in the same format as the existing `process`:

{
  // ...,
  // @deprecated
  "process": {
    "memory": {
      "heap": {
        "total_bytes": 533581824,
        "used_bytes": 296297424,
        "size_limit": 4345298944
      },
      "resident_set_size_bytes": 563625984
    },
    "pid": 52646,
    "event_loop_delay": 0.22967800498008728,
    "uptime_ms": 1706021.930404
  },
  // new field, for now would just duplicate the existing `process` as the only item in the array
  "processes": [
    {
      "memory": {
        "heap": {
          "total_bytes": 533581824,
          "used_bytes": 296297424,
          "size_limit": 4345298944
        },
        "resident_set_size_bytes": 563625984
      },
      "pid": 52646,
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404
    }
  ],
  // ...,
}

We could also consider adding some other human-readable identifier (so we don't just have pid), which could be used in the future for a worker.id.

Important to note here that I checked with the ECS team, and while there are patterns for nesting child process objects which point back to a parent (process.parent), there isn't a top-down pattern for doing something like process.children[]. So whatever we end up doing here would be a custom field from an ECS perspective, though we can still maintain the same ECS-compliant body for each individual process object.

The idea would be to then make the breaking change in 8.0, which would be handled as a follow-up task (#104124).

Scope

  • deprecate process field in ops metrics & stats API
  • introduce new field (design to be agreed upon)
  • investigate implications for Metricbeat & monitoring UI to understand what it would take to get them moved over to the new field by 8.0. Open issues if needed. cc @jasonrhodes
@lukeelmers lukeelmers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jun 30, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@lukeelmers
Member Author

@joshdover @mshustov @pgayvallet Since we were discussing this topic in the RFC thread, WDYT about the proposed design here?

Is processes[] too confusing? Could also do worker_processes[] and coordinator_process or something along those lines.

@joshdover
Contributor

Important to note here that I checked with the ECS team, and while there are patterns for nesting child process objects which point back to a parent (process.parent), there isn't a top-down pattern for doing something like process.children[]. So whatever we end up doing here would be a custom field from an ECS perspective, though we can still maintain the same ECS-compliant body for each individual process object.

It'd probably make sense for the Metricbeat implementation to simply ingest separate documents for each process and tie them together with process.parent. This seems more straightforward than having custom fields for subprocesses if the spec doesn't support it (yet).

Since we were discussing this topic in the RFC thread, WDYT about the proposed design here?

Is processes[] too confusing? Could also do worker_processes[] and coordinator_process or something along those lines.

Array design seems like the general direction we want to go. For differentiating the process types, maybe we add a type: 'coordinator' | 'server_worker' | 'task_worker' | 'reporting_worker' field to each process object in the array. That seems like a better design for consumers (like Metricbeat), letting them handle new, unknown process types gracefully without losing data.
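Sketched as TypeScript, each entry in the proposed array might look like the following. The `type` union mirrors the suggestion above; the interface name and `describeProcess` helper are illustrative, not actual Kibana code.

```typescript
// Hypothetical shape for one entry in the proposed `processes` array.
type ProcessType = 'coordinator' | 'server_worker' | 'task_worker' | 'reporting_worker';

interface OpsProcessMetrics {
  type: ProcessType;
  pid: number;
  memory: {
    heap: { total_bytes: number; used_bytes: number; size_limit: number };
    resident_set_size_bytes: number;
  };
  event_loop_delay: number;
  uptime_ms: number;
}

// A consumer can key off `type` without needing a separate field per process kind.
function describeProcess(p: OpsProcessMetrics): string {
  return `${p.type} (pid ${p.pid})`;
}
```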

@pgayvallet
Contributor

pgayvallet commented Jul 6, 2021

Is processes[] too confusing? Could also do worker_processes[] and coordinator_process or something along those lines.

Array design seems like the general direction we want to go.

I would have used a map instead of an array here, but maybe it doesn't make sense? I admit I don't have a precise idea of how the consumers are ingesting this data (and in terms of mappings, an array does make much more sense).

"processes": {
  "coordinator": { ... },
  "worker-1": { ... },
  "worker-2": { ... }
}

@lukeelmers
Member Author

lukeelmers commented Jul 6, 2021

It'd probably make sense for the metricbeat implementation to simply ingest separate documents for each process and tie them together with process.parent.

Yeah exactly. I guess my point was that while it would be ideal for our response to be as close to ECS as possible, there just isn't an ECS-compatible way to do what we need to do here. So my vote would be to get it as close to ECS as we can, and leave it to Metricbeat to handle ingestion.

I would have used a map instead of an array here, but maybe it doesn't make sense?

A map would make sense from a config standpoint -- it would be more consistent with how we are approaching clustering configs -- however I tend to lean toward Josh's suggestion above of keeping a single processes[] array with different types inside:

For differentiating the process types, maybe we add a type: 'coordinator' | 'server_worker' | 'task_worker' | 'reporting_worker' field to each process object in the array.

This has a few benefits:

  • During ingestion, easier handling of unknown process types (as Josh mentions), since you don't need to deal with finding another array for another type
  • Easier in terms of handling mappings (as Pierre mentions)
  • Feels like the most minimal deviation from ECS

So overall I think I'm +1 on Josh's proposal; however, since type is already an ECS categorization field, perhaps we should consider using a different field to avoid confusion (name?). EDIT: There's a process.name ECS field which would probably be suitable here.

It might also be nice if these types/names were consistent with the ones provided in the clustering config.

e.g. given this config:

node:
  enabled: true
  workers:
    foo:
      count: 2
      max_old_space_size: 1gb
    bar:
      count: 1
      max_old_space_size: 512mb

we could get these stats:

{
  // ...,
  "processes": [
    {
      "name": "coordinator",
      "pid": 52646,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
    {
      "name": "worker-foo-1",
      "pid": 52647,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
    {
      "name": "worker-foo-2",
      "pid": 52648,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
    {
      "name": "worker-bar-1",
      "pid": 52649,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
  ],
  // ...,
}
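The mapping from config to process names above could be sketched as follows. The `worker-<group>-<n>` naming scheme (plus a single `coordinator`) and the `processNames` helper are illustrative, taken from the example stats above rather than any agreed design.

```typescript
// Hypothetical helper: derive stable process names from a clustering
// config like the `node.workers` example above.
interface WorkerGroupConfig {
  count: number;
  max_old_space_size?: string;
}

function processNames(workers: Record<string, WorkerGroupConfig>): string[] {
  const names = ['coordinator'];
  for (const [group, cfg] of Object.entries(workers)) {
    for (let i = 1; i <= cfg.count; i++) {
      names.push(`worker-${group}-${i}`);
    }
  }
  return names;
}
```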

@lukeelmers
Member Author

Also, as @Bamieh pointed out, maybe while we are introducing a new field, we should consider moving to the new histogram-style event_loop_delays introduced in #101580.

This probably isn't something the metrics service needs to concern itself with, but it might make sense for the /stats endpoint.
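For context, a histogram-style delay metric can be collected with Node's built-in perf_hooks monitor; the field names in the returned object below are illustrative, not the exact shape #101580 settled on.

```typescript
// Sketch of histogram-style event loop delay collection via Node's
// built-in perf_hooks monitor (available since Node 11.10).
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

function eventLoopDelayStats() {
  // The histogram reports nanoseconds; convert to milliseconds.
  const toMs = (ns: number) => ns / 1e6;
  return {
    min: toMs(histogram.min),
    max: toMs(histogram.max),
    mean: toMs(histogram.mean),
    percentiles: {
      '50': toMs(histogram.percentile(50)),
      '95': toMs(histogram.percentile(95)),
      '99': toMs(histogram.percentile(99)),
    },
  };
}
```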
