Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't exit ticker thread on collection failure #3788

Closed
wants to merge 2 commits into from

Conversation

rbtz-openai
Copy link
Contributor

@rbtz-openai rbtz-openai commented Mar 19, 2024

Description

Currently, if collect raises an uncaught error, _ticker stops and service's metrics disappear. This PR makes it more robust for the one-off-failures. Persistent failures would also be more visible in logs.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Does This PR Require a Contrib Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@rbtz-openai rbtz-openai requested a review from a team March 19, 2024 03:19
@srikanthccv
Copy link
Member

Can you help me understand when/where the exception is raised? During the metrics collection, one potential place unexpected Exception could be raised in async instruments which we already addressed here

except Exception: # pylint: disable=broad-except
_logger.exception(
"Callback failed for instrument %s.", self.name
)
.

@rbtz-openai
Copy link
Contributor Author

Can you help me understand when/where the exception is raised?

This is mostly a catchall for things that are not caught by individual callbacks.

We observed collection ticker exiting on exceptions in Exponential Histograms, e.g. the one in:

@srikanthccv
Copy link
Member

That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.

@rbtz-openai
Copy link
Contributor Author

rbtz-openai commented Mar 22, 2024

That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.

This is mostly a behavioral question: in steady state this should never happen. But in cases when something unexpected happens (new feature or regression) how do we want SDK's PeriodicExportingMetricReader to behave? Should it stop on error or should it try its best to continue? Current PR, assumes that it is better to be loud and retry.

@rbtz-openai rbtz-openai closed this Apr 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants