Gunicorn with multiple workers breaks metrics #3885

Open
abstractOwl opened this issue Apr 26, 2024 · 5 comments
Labels
bug Something isn't working

Comments

abstractOwl commented Apr 26, 2024

Describe your environment
Python 3.8.5
opentelemetry-distro[otlp]==0.44b0
Gunicorn 22.0.0

Steps to reproduce

  1. Run a Gunicorn Flask app as described in the documentation, specifying multiple workers.
  2. Visit a route that emits metrics several times.
  3. View the metrics in a metrics portal.

What is the expected behavior?
The counter increases monotonically with each visit to the route.

What is the actual behavior?
The counter metric jumps between the values reported by each worker, because meter state is not shared across workers.
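The jumping can be modeled without the SDK at all: each forked worker accumulates counter state in its own process, and each collection interval exports one worker's total. A pure-Python sketch (class and function names here are illustrative, not OTel API):

```python
class Worker:
    """Stands in for one forked Gunicorn worker: its counter lives only
    in its own process, so nothing is shared between workers."""

    def __init__(self):
        self.counter = 0  # per-worker meter state

    def handle_request(self):
        self.counter += 1  # the route handler increments its own copy

    def export(self):
        # Cumulative export, like aggregation_temporality == 2 below.
        return self.counter


def scrape_sequence(n_requests=23, n_workers=2):
    workers = [Worker() for _ in range(n_workers)]
    for i in range(n_requests):
        workers[i % n_workers].handle_request()  # load-balanced dispatch
    # Each worker exports its own total; a backend that sees identical
    # resource attributes treats these as one series that flips between
    # the values instead of summing them.
    return [w.export() for w in workers]
```

With 23 requests spread over 2 workers, `scrape_sequence()` returns totals of 12 and 11, matching the alternating 11/12 values in the exporter output below.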

Example from console exporter (the worker.id resource attribute is set differently for each worker):

{
    "resource_metrics": [
        {
            "resource": {
                "attributes": {
                    "telemetry.sdk.language": "python",
                    "telemetry.sdk.name": "opentelemetry",
                    "telemetry.sdk.version": "1.23.0",
                    "service.instance.id": "localhost",
                    "worker.id": "885db",
                    "service.name": "otel-demo",
                    "telemetry.auto.version": "0.44b0"
                },
[…]
                    "metrics": [
                        {
                            "name": "test.counter",
                            "description": "",
                            "unit": "",
                            "data": {
                                "data_points": [
                                    {
                                        "attributes": {},
                                        "start_time_unix_nano": 1714162262042565771,
                                        "time_unix_nano": 1714162322001856183,
                                        "value": 11
                                    }
                                ],
                                "aggregation_temporality": 2,
                                "is_monotonic": true
                            }
                        }
                    ],
{
    "resource_metrics": [
        {
            "resource": {
                "attributes": {
                    "telemetry.sdk.language": "python",
                    "telemetry.sdk.name": "opentelemetry",
                    "telemetry.sdk.version": "1.23.0",
                    "service.instance.id": "localhost",
                    "worker.id": "818d7",
                    "service.name": "otel-demo",
                    "telemetry.auto.version": "0.44b0"
                },
                "schema_url": ""
            },

[…]

                    "metrics": [
                        {
                            "name": "test.counter",
                            "description": "",
                            "unit": "",
                            "data": {
                                "data_points": [
                                    {
                                        "attributes": {},
                                        "start_time_unix_nano": 1714162262869065166,
                                        "time_unix_nano": 1714162322111501132,
                                        "value": 12
                                    }
                                ],
                                "aggregation_temporality": 2,
                                "is_monotonic": true
                            }
                        }
                    ],

Additional context
This issue is caused by each Gunicorn worker having to be instrumented separately under the fork process model. I think most developers would expect multiple workers on the same host to be treated as the same instance for metrics purposes.

A hacky workaround is to use [hostname]:[pid] as service.instance.id and then aggregate on a separate hostname resource attribute in your metrics portal, but that doesn't seem like a great solution.
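The workaround can be sketched as follows. The helper name and the "service.hostname" attribute are illustrative (the latter is not a semconv attribute); the returned dict would be fed to the SDK's Resource when each worker initializes, e.g. from a Gunicorn post_fork hook:

```python
import os
import socket


def instance_resource_attributes():
    """Hypothetical helper: build a per-worker instance id, plus a
    separate hostname attribute to aggregate on in the backend."""
    hostname = socket.gethostname()
    return {
        # Unique per worker process, so cumulative series don't collide.
        "service.instance.id": f"{hostname}:{os.getpid()}",
        # Illustrative, non-semconv attribute used purely for aggregation.
        "service.hostname": hostname,
    }
```

The cost, as noted above, is that every query and dashboard then has to aggregate away the pid dimension.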

abstractOwl added the "bug" label Apr 26, 2024
abstractOwl (Author) commented:

I see that this issue was referenced previously in this closed issue: #93

ocelotl (Contributor) commented May 9, 2024

@abstractOwl appending the pid could be a valid solution, curious why you think it is a hacky approach to this issue?

abstractOwl (Author) commented:

@ocelotl I'd prefer a solution where we didn't have to add an extra filter to all of our queries - it breaks many of the default metrics views/generated dashboards that are offered by observability vendors and is overall an inconvenience.

bowenng commented Jul 2, 2024

Any thoughts on adding aggregation capabilities to the Prometheus exporter in the OTel Collector?

I observed that, as long as each forked process has its own resource specified, the "debug" exporter prints the correct metrics per resource (i.e. per process). So it seems theoretically possible to aggregate across all the resources in the Prometheus exporter.

I think the issue is that the Prometheus exporter ignores the resource and does no aggregation.
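The aggregation described above could, in principle, look like this toy reduction: drop the per-worker resource attribute and sum cumulative counter points whose resources then become identical. A sketch only, not the Collector's actual exporter code:

```python
def aggregate_across_resources(points, drop_keys=("worker.id",)):
    """points: [{'resource': {...attrs...}, 'value': n}, ...].
    Sum values for points whose resources are identical after dropping
    the per-worker keys."""
    totals = {}
    for p in points:
        attrs = {k: v for k, v in p["resource"].items() if k not in drop_keys}
        key = tuple(sorted(attrs.items()))  # hashable resource identity
        totals[key] = totals.get(key, 0) + p["value"]
    return totals
```

Applied to the two exports above (values 11 and 12 under different worker.id attributes), this yields a single series with value 23. Note this is only straightforward for cumulative sums; real spatial re-aggregation also has to handle counter resets and delta temporality.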

xrmx (Contributor) commented Jul 9, 2024

We had an internal discussion about this. Given this excerpt from note [1] of https://opentelemetry.io/docs/specs/semconv/attributes-registry/service/:

    For applications running behind an application server (like unicorn), we do not recommend using one identifier for all processes participating in the application.
    Instead, it's recommended each division (e.g. a worker thread in unicorn) to have its own instance.id.

shouldn't we add a way to set a sensible service.instance.id by default?
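One possible default along those lines, sketched here as a hypothetical function (not current SDK behavior): generate a random UUID per process at SDK initialization, so every forked worker automatically reports as its own instance, per the semconv note.

```python
import uuid


def default_service_instance_id():
    # Evaluated once per process at SDK init: each forked Gunicorn
    # worker gets a distinct, stable-for-its-lifetime instance id.
    return str(uuid.uuid4())
```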

Other SDKs:
