Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastapi: fix wrapping of middlewares #3012

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Nov 15, 2024

MRE:

import anyio
import fastapi
import httpx
import uvicorn
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

resource = Resource(attributes={SERVICE_NAME: '🐛'})

traceProvider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(ConsoleSpanExporter())
traceProvider.add_span_processor(processor)
trace.set_tracer_provider(traceProvider)

app = fastapi.FastAPI()

@app.get('/ok')
async def ok() -> int:
    return 123

@app.get('/error')
async def error() -> int:
    raise Exception('An error occurred!')

FastAPIInstrumentor.instrument_app(app)


async def main() -> None:
    config = uvicorn.Config(app=app, port=9999, log_config=None)
    server = uvicorn.Server(config)

    async with anyio.create_task_group() as tg:
        tg.start_soon(server.serve)
        await anyio.sleep(1)  # Wait for the server to start
        async with httpx.AsyncClient(base_url='http://localhost:9999') as client:
            response = await client.get('/ok')
            assert response.status_code == 200, f'{response.status_code}: {response.text}'
            response = await client.get('/error')
            assert response.status_code == 500,  f'{response.status_code}: {response.text}'
            tg.cancel_scope.cancel()


if __name__ == '__main__':
    anyio.run(main)

With this change it shows up properly

Run this and you'll see that there is no "http.status_code": 500 in the logs.

@adriangb adriangb requested a review from a team as a code owner November 15, 2024 15:35
Copy link
Contributor

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice comment and test!

@xrmx
Copy link
Contributor

xrmx commented Nov 19, 2024

@adriangb Please take a look at the failing tests

# are handled.
# This should not happen unless there is a bug in OpenTelemetryMiddleware, but if there is we don't want that
# to impact the user's application just because we wrapped the middlewares in this order.
stack = ServerErrorMiddleware(stack)
Copy link
Contributor

@lzchen lzchen Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, is the purpose of this pr to handle exceptions generated from the OpenTelemetryMiddleware itself (if there is a bug) or unhandled exceptions thrown from the request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! If there is a bug in OpenTelemetryMiddleware without this double wrapping then the server might not send a 500 to the client and instead abruptly disconnect. In practice I believe every ASGI server I know of does also have it's own handling of unhandled exceptions such that this would not be the case, but I'm not sure we want to rely on that + there would still be a noticeable change (different response content at least, maybe different headers, etc.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since asgi middleware does not have exception handling of requests on it's own, wouldn't legitimate exceptions that are raised from the request itself (that we would want to collect telemetry from) be swallowed up by the new middleware? Or does ServerErrorMiddleware ONLY handle 500s? Apologies, I'm not very familiar with the inner workings of it.

@Kludex
Copy link
Contributor

Kludex commented Nov 20, 2024

fixes #795

@adriangb adriangb force-pushed the fix-asgi-middleware branch from e406be8 to 3d8dd4f Compare November 27, 2024 00:29
@@ -170,7 +170,8 @@ def setUp(self):
self._instrumentor = otel_fastapi.FastAPIInstrumentor()
self._app = self._create_app()
self._app.add_middleware(HTTPSRedirectMiddleware)
self._client = TestClient(self._app)
self._client = TestClient(self._app, base_url="https://testserver:443")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the root cause of tests failures. It turns out every request was being made twice because it was getting redirected to https. Before this PR that wasn't being instrumented correctly, so this was not being caught! I think that's another major bug this PR is fixing.

@adriangb adriangb requested a review from lzchen November 27, 2024 00:35
@adriangb
Copy link
Contributor Author

adriangb commented Nov 27, 2024

@adriangb Please take a look at the failing tests

@xrmx see #3012 (comment)

@adriangb
Copy link
Contributor Author

@Kludex could you update the Starlette instrumentations after we finish up this PR? I'm assuming they have the same bugs.

@Kludex
Copy link
Contributor

Kludex commented Nov 27, 2024

The pipeline has been running for some hours... 👀

app = ServerErrorMiddleware(app)
return app

app._original_build_middleware_stack = app.build_middleware_stack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do you think it would be possible to use wrapt for monkeypatching instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not. This is simpler and works just fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xrmx genuine question: what would be the benefit of that in this case?

@vanyae-cqc
Copy link

vanyae-cqc commented Nov 27, 2024

Hello! Just wondering if you'd expect this to enable getting the trace_id in a FastAPI exception_handler?

This has been discussed in more detail here: open-telemetry/opentelemetry-python#3477, but currently OpenTelemetryMiddleware's context seems to be removed by the time an exception gets to the ServerErrorMiddleware.

Would be great to enable users to report back a trace_id when they get an unexpected 500 status response. Have tried to verify this PR provides this by running it locally but haven't had any luck - could be something wrong in my setup though.

[Update]: Have now verified this achieves the above behaviour - this fix is greatly appreciated!

@adriangb
Copy link
Contributor Author

Fixed issue with test shutdown hanging. All checks passing now.

@adriangb adriangb requested a review from xrmx November 27, 2024 16:20
Copy link
Contributor

@xrmx xrmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust you on the fastapi knowledge but would be nice to add a test to check that the change does what it's supposed to fix

@adriangb
Copy link
Contributor Author

adriangb commented Dec 3, 2024

Ups looks like that got lost, I had added it in 4220b09

@adriangb adriangb force-pushed the fix-asgi-middleware branch from 7497d32 to f825c76 Compare December 3, 2024 16:04
@adriangb
Copy link
Contributor Author

adriangb commented Dec 3, 2024

Added back and updated the branch

tracer_provider=None,
meter_provider=None,
excluded_urls=None,
tracer_provider: TracerProvider | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could you use typing.Optional to be consistent with most other places in the code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other places are out dated.

What should happen is to start switching the other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard agreement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context the from __future__ import annotations let us use this syntax

@xrmx
Copy link
Contributor

xrmx commented Dec 4, 2024

ICYMI the added test is red

@adriangb
Copy link
Contributor Author

adriangb commented Dec 4, 2024

Thanks. I hadn't understood how the tests are parametrized and that _create_app is overridden in a subclass.

@adriangb
Copy link
Contributor Author

adriangb commented Dec 13, 2024

@xrmx @lzchen can we get this merged please? I believe this is a major issue that deserves attention.

@adriangb adriangb requested review from lzchen and xrmx December 13, 2024 15:24
app = ServerErrorMiddleware(app)
return app

app._original_build_middleware_stack = app.build_middleware_stack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xrmx genuine question: what would be the benefit of that in this case?

Comment on lines +175 to +178
# run the lifespan, initialize the middleware stack
# this is more in-line with what happens in a real application when the server starts up
self._exit_stack = ExitStack()
self._exit_stack.enter_context(self._client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build_middleware_stack this PR patches is called on the startup event - which the TestClient only runs if called within a context manager, that's why @adriangb is using ExitStack here. FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants