MetricsRegistry and High Cardinality #1977

Closed
astorm opened this issue Feb 10, 2021 · 1 comment · Fixed by #2163

astorm (Contributor) commented Feb 10, 2021

We've received reports that the current implementation of our metrics registry can create problems for a service or application with a large number of distinct URLs. Many distinct URLs produce many distinct transaction names which, in turn, produce many unique "metric name + labels/dimensions" combinations, and over time these can overwhelm APM Server with traffic. This is particularly concerning when a low-traffic endpoint creates a metric once that is rarely incremented, but continues to be reported on every interval because it was seen once.

In other words, the High Cardinality problem.

The captureBreakdown method is where these problematic metrics originate.
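To illustrate the mechanics (this is a hypothetical sketch, not the agent's actual internals): breakdown metrics are effectively keyed by metric name plus labels, and the transaction name is one of those labels, so every new transaction name adds registry entries that are then kept and reported indefinitely.

    // Hypothetical sketch of a registry keyed on "metric name + labels";
    // every distinct transaction name adds keys that are never removed.
    class ToyMetricsRegistry {
      constructor () {
        this.counters = new Map() // "metric name + labels" -> count
      }

      _key (name, labels) {
        // Distinct transaction names produce distinct keys, so the map only grows.
        return name + '|' + JSON.stringify(labels)
      }

      increment (name, labels, value = 1) {
        const key = this._key(name, labels)
        this.counters.set(key, (this.counters.get(key) || 0) + value)
      }

      // Every reporting interval, all known keys are reported, even if unchanged.
      report () {
        return [...this.counters.entries()].map(([key, value]) => ({ key, value }))
      }
    }

    const registry = new ToyMetricsRegistry()
    registry.increment('transaction.duration.count', { 'transaction.name': 'tx id daa2' })
    registry.increment('transaction.duration.count', { 'transaction.name': 'tx id c572' })
    console.log(registry.report().length) // 2 distinct entries, and the set never shrinks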

Our current workaround is to recommend that users call the setTransactionName API method to reduce the number of unique transaction names being generated.
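For example, with a plain http server the transaction name can be derived from the route pattern rather than the raw URL (the route and server below are only illustrative):

    const apm = require('elastic-apm-node').start()
    const http = require('http')

    const server = http.createServer((req, res) => {
      // Collapse /users/1, /users/2, ... into a single low-cardinality
      // transaction name instead of one name per concrete URL.
      const name = /^\/users\/\d+$/.test(req.url) ? 'GET /users/:id' : 'GET unknown route'
      apm.setTransactionName(name)
      res.end('pong')
    })

    server.listen(3000)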

This issue is to track whether other users are seeing similar problems and to discuss possible workarounds (ejecting stale metrics, allowing configuration of which breakdown metrics are reported, etc.).

@github-actions github-actions bot added the agent-nodejs Make available for APM Agents project planning. label Feb 10, 2021
@AlexanderWert AlexanderWert added this to the 7.14 milestone Apr 26, 2021
@AlexanderWert AlexanderWert modified the milestones: 7.14, 7.15 Jun 14, 2021
@astorm astorm self-assigned this Jul 20, 2021
trentm (Member) commented Jul 21, 2021

Notes to myself on some details of the issue

Here is the base metricset that the Node.js APM agent will send every metricsInterval:

    {
        "metricset": {
            "samples": {
                "system.cpu.total.norm.pct": { "value": 0 },
                "system.memory.total": { "value": 68719476736 },
                "system.memory.actual.free": { "value": 36078088192 },
                "system.process.cpu.total.norm.pct": { "value": 0.06212579838674182 },
                "system.process.cpu.system.norm.pct": { "value": 0.007457950524313062 },
                "system.process.cpu.user.norm.pct": { "value": 0.05466784786242876 },
                "system.process.memory.rss.bytes": { "value": 41197568 },
                "nodejs.handles.active": { "value": 3 },
                "nodejs.requests.active": { "value": 0 },
                "nodejs.eventloop.delay.avg.ms": { "value": 0 },
                "nodejs.memory.heap.allocated.bytes": { "value": 17809408 },
                "nodejs.memory.heap.used.bytes": { "value": 11838984 },
                "nodejs.memory.external.bytes": { "value": 1730938 },
                "nodejs.memory.arrayBuffers.bytes": { "value": 234834 }
            },
            "timestamp": 1626884724383000,
            "tags": { "hostname": "pink.local", "env": "development" }
        }
    }
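
(For reference, how often this base metricset is sent is controlled by the agent's metricsInterval configuration, which can also be set via the ELASTIC_APM_METRICS_INTERVAL environment variable; the snippet below is only an illustrative start config.)

    // Illustrative only: shorten the metrics reporting interval from the
    // default of '30s'; '0s' disables metrics reporting entirely.
    const apm = require('elastic-apm-node').start({
      serviceName: 'my-service',
      metricsInterval: '10s'
    })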

After a transaction we get these breakdown metrics:

    {
        "metricset": {
            "samples": {
                "transaction.duration.count": { "value": 1 },
                "transaction.duration.sum.us": { "value": 4.57 },
                "transaction.breakdown.count": { "value": 1 }
            },
            "timestamp": 1626884754380000,
            "tags": { "hostname": "pink.local", "env": "development" },
            "transaction": { "name": "tx id daa2", "type": "request" }
        }
    }
    {
        "metricset": {
            "samples": {
                "span.self_time.count": { "value": 1 },
                "span.self_time.sum.us": { "value": 4.57 }
            },
            "timestamp": 1626884754380000,
            "tags": { "hostname": "pink.local", "env": "development" },
            "transaction": { "name": "tx id daa2", "type": "request" },
            "span": { "type": "app" }
        }
    }

which will be included in every metricsInterval metrics report from then on, even if their values are zero for that interval:

    {
        "metricset": {
            "samples": {
                "transaction.duration.count": { "value": 0 },
                "transaction.duration.sum.us": { "value": 0 },
                "transaction.breakdown.count": { "value": 0 }
            },
            "timestamp": 1626884784383000,
            "tags": { "hostname": "pink.local", "env": "development" },
            "transaction": { "name": "tx id daa2", "type": "request" }
        }
    }
    {
        "metricset": {
            "samples": {
                "span.self_time.count": { "value": 0 },
                "span.self_time.sum.us": { "value": 0 }
            },
            "timestamp": 1626884784383000,
            "tags": { "hostname": "pink.local", "env": "development" },
            "transaction": { "name": "tx id daa2", "type": "request" },
            "span": { "type": "app" }
        }
    }

If you have an app with fairly high cardinality of transaction names, as simulated in the following (see "HERE"):

    // Load the agent from the repo root (this script lives in the agent repo).
    const apm = require('./').start({
      cloudProvider: 'none',
      centralConfig: false,
      captureExceptions: false,
      logLevel: 'debug',
      metricsInterval: 10,
      apiRequestTime: 5
    })

    const crypto = require('crypto')
    const http = require('http')
    const hostname = '127.0.0.1'
    const port = 3000
    const server = http.createServer((req, res) => {
      // Give every request a (nearly) unique transaction name.
      apm.setTransactionName('tx id ' + crypto.randomBytes(2).toString('hex')) // <--- HERE
      res.writeHead(200, { 'content-type': 'text/plain' })
      res.end('pong')
    })
    server.listen(port, hostname, () => {
      console.log(`Server running at http://${hostname}:${port}/`)
    })

then it gets really bad after a number of requests. Every subsequent metricsInterval reporting period will add all of these metricsets (two per unique transaction name, most of them likely with zero values) to the intake request to APM server. E.g.:

[2021-07-21T16:32:29.434Z]  INFO: mockapmserver/23809 on pink.local: request (req.remoteAddress=::ffff:127.0.0.1, req.remotePort=59426, req.bodyLength=31241, res.bodyLength=2)
    POST /intake/v2/events HTTP/1.1
    accept: application/json
    user-agent: elasticapm-node/3.18.0 elastic-apm-http-client/9.8.1 node/12.22.1
    content-type: application/x-ndjson
    content-encoding: gzip
    host: localhost:8200
    connection: keep-alive
    transfer-encoding: chunked

    {"metadata":{"service":{"name":"elastic-apm-node","environment":"development","runtime":{"name":"node","version":"12.22.1"},"language":{"name":"javascript"},"agent":{"name":"nodejs","version":"3.18.0"},"version":"3.18.0"},"process":{"pid":33208,"ppid":30477,"title":"node","argv":["/Users/trentm/.nvm/versions/node/v12.22.1/bin/node","/Users/trentm/el/apm-agent-nodejs8/foo.js"]},"system":{"hostname":"pink.local","architecture":"x64","platform":"darwin"}}}
    {"metricset":{"samples":{"system.cpu.total.norm.pct":{"value":0.04418197725284334},"system.memory.total":{"value":68719476736},"system.memory.actual.free":{"value":35931152384},"system.process.cpu.total.norm.pct":{"value":0.0003277373258445795},"system.process.cpu.system.norm.pct":{"value":0.00012404342888618514},"system.process.cpu.user.norm.pct":{"value":0.00020369389695839435},"system.process.memory.rss.bytes":{"value":43663360},"nodejs.handles.active":{"value":3},"nodejs.requests.active":{"value":0},"nodejs.eventloop.delay.avg.ms":{"value":1.3614698885096708},"nodejs.memory.heap.allocated.bytes":{"value":12169216},"nodejs.memory.heap.used.bytes":{"value":10477320},"nodejs.memory.external.bytes":{"value":2358872},"nodejs.memory.arrayBuffers.bytes":{"value":872516}},"timestamp":1626885144427000,"tags":{"hostname":"pink.local","env":"development"}}}
    {"metricset":{"samples":{"transaction.duration.count":{"value":0},"transaction.duration.sum.us":{"value":0},"transaction.breakdown.count":{"value":0}},"timestamp":1626885144427000,"tags":{"hostname":"pink.local","env":"development"},"transaction":{"name":"tx id daa2","type":"request"}}}
    {"metricset":{"samples":{"span.self_time.count":{"value":0},"span.self_time.sum.us":{"value":0}},"timestamp":1626885144427000,"tags":{"hostname":"pink.local","env":"development"},"transaction":{"name":"tx id daa2","type":"request"},"span":{"type":"app"}}}
    {"metricset":{"samples":{"transaction.duration.count":{"value":0},"transaction.duration.sum.us":{"value":0},"transaction.breakdown.count":{"value":0}},"timestamp":1626885144427000,"tags":{"hostname":"pink.local","env":"development"},"transaction":{"name":"tx id c572","type":"request"}}}
    {"metricset":{"samples":{"span.self_time.count":{"value":0},"span.self_time.sum.us":{"value":0}},"timestamp":1626885144427000,"tags":{"hostname":"pink.local","env":"development"},"transaction":{"name":"tx id c572","type":"request"},"span":{"type":"app"}}}
    ...
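
One possible mitigation (and the approach taken by the commit referenced below) is to stop sending counting metrics whose count is zero and to evict them from the registry. A minimal sketch of that pruning idea, with hypothetical names rather than the agent's actual implementation:

    // Hypothetical sketch: before building the intake payload, skip metricsets
    // whose counting samples are all zero and evict them from the registry so
    // they are not reported again unless they are incremented anew.
    function pruneAndCollect (registry) {
      const toReport = []
      for (const [key, metricset] of registry.entries()) {
        const values = Object.values(metricset.samples).map((s) => s.value)
        if (values.every((v) => v === 0)) {
          registry.delete(key) // evict: stale transaction names stop accumulating
        } else {
          toReport.push(metricset)
        }
      }
      return toReport
    }

    // Example with a plain Map standing in for the registry:
    const registry = new Map([
      ['tx id daa2', { samples: { 'transaction.duration.count': { value: 0 } } }],
      ['tx id c572', { samples: { 'transaction.duration.count': { value: 3 } } }]
    ])
    console.log(pruneAndCollect(registry).length) // 1 -- only the non-zero metricset
    console.log(registry.size) // 1 -- the zero-count entry was evicted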

astorm added a commit that referenced this issue Jul 21, 2021
* feat: no longer sends counting metrics with a count of zero and removes them from the metrics cache/registry

Closes: #1977
dgieselaar pushed a commit to dgieselaar/apm-agent-nodejs that referenced this issue Sep 10, 2021
* feat: no longer sends counting metrics with a count of zero and removes them from the metrics cache/registry

Closes: elastic#1977