[core.metrics] Add support for multiple processes in ops metrics & stats API; deprecate process field #104031

Closed
Tracked by #68626
lukeelmers opened this issue Jun 30, 2021 · 6 comments · Fixed by #109820
Labels
Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@lukeelmers
Member

lukeelmers commented Jun 30, 2021

Part of #68626

Summary

In order to support a multi-process Kibana (RFC: #94057), we need to update the /stats REST API to support metrics for more than one process.

This endpoint, registered from the usage_collection plugin, is getting these stats from Core's metrics service (getOpsMetrics$), which is also used in the monitoring plugin for stats collection.

The /stats response is problematic in that it contains a handful of process metrics which will differ from worker to worker:

{
  // ...
  "process": {
    "memory": {
      "heap": {
        "total_bytes": 533581824,
        "used_bytes": 296297424,
        "size_limit": 4345298944
      },
      "resident_set_size_bytes": 563625984
    },
    "pid": 52646,
    "event_loop_delay": 0.22967800498008728,
    "uptime_ms": 1706021.930404
  },
  // ...
}

As each request could be routed to a different worker, different results may come back each time.

Ultimately we'll need to extend the API to provide per-worker stats, but in the interim, we need to at least:

  1. Mark the existing process field as deprecated
  2. Implement a new field to replace it (which we need to discuss in this issue)
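The interim step could be sketched roughly like this (the `buildStatsResponse` helper and its input shape are hypothetical, for illustration only): until true per-worker stats exist, the new field would simply wrap the existing `process` object.

```typescript
// Hypothetical sketch: the new field initially duplicates the
// existing (deprecated) `process` object as a single-item array.
interface ProcessMetrics {
  pid: number;
  uptime_ms: number;
}

function buildStatsResponse(process: ProcessMetrics) {
  return {
    process,              // @deprecated; kept for BWC until the 8.0 breaking change
    processes: [process], // new field: the same data, array-shaped
  };
}
```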

One idea

The simplest approach we could take here would be to add a `processes` field: an array of objects in the same format as the existing `process`:

{
  // ...,
  // @deprecated
  "process": {
    "memory": {
      "heap": {
        "total_bytes": 533581824,
        "used_bytes": 296297424,
        "size_limit": 4345298944
      },
      "resident_set_size_bytes": 563625984
    },
    "pid": 52646,
    "event_loop_delay": 0.22967800498008728,
    "uptime_ms": 1706021.930404
  },
  // new field, for now would just duplicate the existing `process` as the only item in the array
  "processes": [
    {
      "memory": {
        "heap": {
          "total_bytes": 533581824,
          "used_bytes": 296297424,
          "size_limit": 4345298944
        },
        "resident_set_size_bytes": 563625984
      },
      "pid": 52646,
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404
    }
  ],
  // ...,
}

We could also consider adding some other human-readable identifier (so we don't just have pid), which could be used in the future for a worker.id.

Important to note here that I checked with the ECS team, and while there are patterns for nesting child process objects which point back to a parent (process.parent), there isn't a top-down pattern for doing something like process.children[]. So whatever we end up doing here would be a custom field from an ECS perspective, though we can still maintain the same ECS-compliant body for each individual process object.

The idea would be to then make the breaking change in 8.0, which would be handled as a follow-up task (#104124).

Scope

  • deprecate process field in ops metrics & stats API
  • introduce new field (design to be agreed upon)
  • investigate implications for Metricbeat & monitoring UI to understand what it would take to get them moved over to the new field by 8.0. Open issues if needed. cc @jasonrhodes
@lukeelmers lukeelmers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jun 30, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@lukeelmers
Member Author

@joshdover @mshustov @pgayvallet Since we were discussing this topic in the RFC thread, WDYT about the proposed design here?

Is processes[] too confusing? Could also do worker_processes[] and coordinator_process or something along those lines.

@joshdover
Contributor

Important to note here that I checked with the ECS team, and while there are patterns for nesting child process objects which point back to a parent (process.parent), there isn't a top-down pattern for doing something like process.children[]. So whatever we end up doing here would be a custom field from an ECS perspective, though we can still maintain the same ECS-compliant body for each individual process object.

It'd probably make sense for the Metricbeat implementation to simply ingest separate documents for each process and tie them together with process.parent. This seems more straightforward than having custom fields for subprocesses if the spec doesn't support it (yet).

Since we were discussing this topic in the RFC thread, WDYT about the proposed design here?

Is processes[] too confusing? Could also do worker_processes[] and coordinator_process or something along those lines.

Array design seems like the general direction we want to go. For differentiating the process types, maybe we add a type: 'coordinator' | 'server_worker' | 'task_worker' | 'reporting_worker' field to each process object in the array. That seems like a better design for consumers (like Metricbeat), letting them handle new, unknown process types gracefully without losing data.
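Sketched as TypeScript, each entry in the proposed array might look like the following. The `type` union mirrors the suggestion above; the interface name and `describeProcess` helper are illustrative, not actual Kibana code.

```typescript
// Hypothetical shape for one entry in the proposed `processes` array.
type ProcessType = 'coordinator' | 'server_worker' | 'task_worker' | 'reporting_worker';

interface OpsProcessMetrics {
  type: ProcessType;
  pid: number;
  memory: {
    heap: { total_bytes: number; used_bytes: number; size_limit: number };
    resident_set_size_bytes: number;
  };
  event_loop_delay: number;
  uptime_ms: number;
}

// A consumer can key off `type` without needing a separate field per process kind.
function describeProcess(p: OpsProcessMetrics): string {
  return `${p.type} (pid ${p.pid})`;
}
```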

@pgayvallet
Contributor

pgayvallet commented Jul 6, 2021

Is processes[] too confusing? Could also do worker_processes[] and coordinator_process or something along those lines.

Array design seems like the general direction we want to go.

I would have used a map instead of an array here, but maybe it doesn't make sense? I admit I don't have a precise idea of how the consumers are ingesting this data (and in terms of mappings, an array does make much more sense).

"processes": {
  "coordinator": { ... },
  "worker-1": { ... },
  "worker-2": { ... }
}

@lukeelmers
Member Author

lukeelmers commented Jul 6, 2021

It'd probably make sense for the metricbeat implementation to simply ingest separate documents for each process and tie them together with process.parent.

Yeah exactly. I guess my point was that while it would be ideal for our response to be as close to ECS as possible, there just isn't an ECS-compatible way to do what we need to do here. So my vote would be to get it as close to ECS as we can, and leave it to Metricbeat to handle ingestion.

I would have used a map instead of an array here, but maybe it doesn't make sense?

A map would make sense from a config standpoint -- it would be more consistent with how we are approaching clustering configs -- however I tend to lean toward Josh's suggestion above of keeping a single processes[] array with different types inside:

For differentiating the process types, maybe we add a type: 'coordinator' | 'server_worker' | 'task_worker' | 'reporting_worker' field to each process object in the array.

This has a few benefits:

  • During ingestion, easier handling of unknown process types (as Josh mentions), since you don't need to deal with finding another array for another type
  • Easier in terms of handling mappings (as Pierre mentions)
  • Feels like the most minimal deviation from ECS

So overall I think I'm +1 on Josh's proposal; however, since type is already an ECS categorization field, perhaps we should consider using a different field to avoid confusion (name?). EDIT: There's a process.name ECS field which would probably be suitable here.

It might also be nice if these types/names were consistent with the ones provided in the clustering config.

e.g. given this config:

node:
  enabled: true
  workers:
    foo:
      count: 2
      max_old_space_size: 1gb
    bar:
      count: 1
      max_old_space_size: 512mb

we could get these stats:

{
  // ...,
  "processes": [
    {
      "name": "coordinator",
      "pid": 52646,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
    {
      "name": "worker-foo-1",
      "pid": 52647,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
    {
      "name": "worker-foo-2",
      "pid": 52648,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
    {
      "name": "worker-bar-1",
      "pid": 52649,
      "memory": {...},
      "event_loop_delay": 0.22967800498008728,
      "uptime_ms": 1706021.930404,
    },
  ],
  // ...,
}
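The mapping from config to process names above could be sketched as follows. The `worker-<group>-<n>` naming scheme (plus a single `coordinator`) and the `processNames` helper are illustrative, taken from the example stats above rather than any agreed design.

```typescript
// Hypothetical helper: derive stable process names from a clustering
// config like the `node.workers` example above.
interface WorkerGroupConfig {
  count: number;
  max_old_space_size?: string;
}

function processNames(workers: Record<string, WorkerGroupConfig>): string[] {
  const names = ['coordinator'];
  for (const [group, cfg] of Object.entries(workers)) {
    for (let i = 1; i <= cfg.count; i++) {
      names.push(`worker-${group}-${i}`);
    }
  }
  return names;
}
```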

@lukeelmers
Member Author

Also, as @Bamieh pointed out, maybe while we are introducing a new field, we should consider moving to the new histogram-style event_loop_delays introduced in #101580.

This probably isn't something the metrics service needs to concern itself with, but it might make sense for the /stats endpoint.
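For context, a histogram-style delay metric can be collected with Node's built-in perf_hooks monitor; the field names in the returned object below are illustrative, not the exact shape #101580 settled on.

```typescript
// Sketch of histogram-style event loop delay collection via Node's
// built-in perf_hooks monitor (available since Node 11.10).
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

function eventLoopDelayStats() {
  // The histogram reports nanoseconds; convert to milliseconds.
  const toMs = (ns: number) => ns / 1e6;
  return {
    min: toMs(histogram.min),
    max: toMs(histogram.max),
    mean: toMs(histogram.mean),
    percentiles: {
      '50': toMs(histogram.percentile(50)),
      '95': toMs(histogram.percentile(95)),
      '99': toMs(histogram.percentile(99)),
    },
  };
}
```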
