
Fix orphaned spans on Celery workers #822
Open
robrap opened this issue Oct 8, 2024 · 4 comments

robrap (Contributor) commented Oct 8, 2024

We're seeing orphaned spans on several of our Celery workers, identifiable as service entry spans that are not operation_name:celery.run:

| Service | Operation names |
| --- | --- |
| edx-edxapp-lms-workers | celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode |
| edx-edxapp-cms-workers | celery.apply, get_course.split_modulestore, django.template.render |
| edx-ai-translations-worker | celery.apply |
| license-manager-worker | celery.apply |
| edx-video-encode-manager-worker | celery.apply |

We noticed this because of missing code owner tags on root spans. We've restricted our monitors to just celery.run root spans, but we still want to fix this because about 10% of traces are broken.
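
For context on why non-celery.run service entry spans look wrong here: with the standard dd-trace-py Celery integration, celery.apply is the producer-side span created wherever apply_async/delay is called, and celery.run is the worker-side span wrapping task execution. A minimal sketch (the app and task names below are hypothetical, not actual edx-platform code):

```python
from celery import Celery

# Assumes the service is instrumented with dd-trace-py (e.g. via ddtrace-run),
# which patches Celery automatically.
app = Celery("edx", broker="redis://localhost:6379/0")

@app.task
def transmit_learner_data(user_id):
    # Executed on the worker: ddtrace wraps this in a celery.run span,
    # which should be the local root (service entry span) of the worker's
    # portion of the trace.
    ...

def handle_web_request(user_id):
    # Called in the webapp: ddtrace emits a celery.apply span here, as a
    # child of the Django request span, tagged with the webapp's service.
    transmit_learner_data.apply_async(args=[user_id])
```

Under that model, celery.run should be the only local root on a worker; any other operation showing up as a service entry span there has lost its parent, which is what we're calling an orphan.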

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support. Update: Their proposed fix: DataDog/dd-trace-py#11272

robrap added this to Arch-BOM Oct 8, 2024
robrap converted this from a draft issue Oct 8, 2024
robrap (Contributor, Author) commented Oct 8, 2024

@timmc-edx: I wanted to bring your attention to Alex's note from Slack:

> Edit: ah right, it's in the edx-edxapp-lms-workers APM. But why is the apply_async operation associated with the worker? Isn't the webapp the thing that calls apply_async() and does the redis operation to put the task and its args/kwargs on the queue?

This might be additional information about what's off in the spans you'll be looking into. If not, you can ignore it as far as this ticket is concerned, other than eventually reporting back your findings. Thanks.
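
For context on the question in the quote: apply_async can also be called from inside a worker, when one task enqueues another. In that case the producer-side celery.apply span is created in the worker process and is therefore tagged with the worker's service. A minimal sketch (same assumptions as above; the task names are hypothetical stand-ins for the real edx-platform tasks):

```python
from celery import Celery

app = Celery("edx", broker="redis://localhost:6379/0")

@app.task
def transmit_learner_data(user_id):
    ...

@app.task
def recalculate_grade(user_id):
    # Executed by a worker: ddtrace wraps this in a celery.run span
    # (service: the worker, e.g. edx-edxapp-lms-workers).
    #
    # Enqueueing a follow-up task from inside the worker creates a
    # celery.apply span as a child of that celery.run span, also tagged
    # with the worker's service. If the parent link is lost, the
    # celery.apply shows up as an orphaned local root on the worker.
    transmit_learner_data.apply_async(args=[user_id])
```

This is consistent with the drilldown below, where most worker-side celery.apply spans are children of another task's celery.run.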

timmc-edx (Member) commented Oct 8, 2024

Numbers across resources

On prod LMS for the past 2 days, all celery.apply resource names that are at least sometimes top-level [query], in descending order of frequency:

[A] integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
[S] openedx.core.djangoapps.credentials.tasks.v1.tasks.send_grade_to_credentials
[S] integrated_channels.integrated_channel.tasks.transmit_single_learner_data
[A] openedx.core.djangoapps.programs.tasks.award_program_certificates
[S] openassessment.workflow.tasks.update_workflow_for_submission_task
[S] lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones
[S] openedx.core.djangoapps.programs.tasks.award_course_certificate
[S] openassessment.workflow.tasks.update_workflows_for_ora_block_task
[S] lms.djangoapps.certificates.tasks.generate_certificate
[A] openassessment.workflow.tasks.update_workflows_for_course_task
[S] openedx.core.djangoapps.notifications.tasks.send_notifications
[A] lms.djangoapps.grades.tasks.compute_grades_for_course_v2
[S] lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
[S] lms.djangoapps.bulk_email.tasks.send_course_email
[A] openedx.core.djangoapps.programs.tasks.revoke_program_certificates

Key: [A] = always top-level; [S] = sometimes top-level, sometimes child

Numbers drilldown

Filtering on the most common top-level celery.apply over what I hope is a representative smaller time window in prod (no spikes):

  • env:prod resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
    • @_top_level:* (local root)
      • operation_name:celery.run service:edx-edxapp-lms-workers 35,065
      • operation_name:celery.apply service:edx-edxapp-lms-workers 3,104
    • -@_top_level:* (child)
      • operation_name:celery.apply service:edx-edxapp-lms-workers 31,711
      • operation_name:celery.apply service:edx-edxapp-lms 30

So we have:

  • A handful of celery.apply on the webapp; three orders of magnitude fewer than the total on the workers. The worker ones are probably mostly kicked off as a result of a different celery task. No celery.run on the webapp, as expected, since tasks are enqueued with apply_async.
  • On the worker, about as many celery.apply as celery.run.
  • On the worker, celery.apply is about 10x as common as a child span as it is as a root span. (Spot check: the parent is lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3.)
  • The total number of celery.apply spans on the worker (root + child) approximately matches the celery.run count, to within 1%; see the quick check below.
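
To make the ratios above concrete, a quick back-of-the-envelope check using only the counts quoted for the non-spike window:

```python
# Span counts from the non-spike drilldown above.
run_root    = 35_065  # celery.run, local root, workers
apply_root  = 3_104   # celery.apply, local root, workers
apply_child = 31_711  # celery.apply, child, workers
apply_web   = 30      # celery.apply, child, webapp

apply_total = apply_root + apply_child           # 34,815
print(apply_child / apply_root)                  # ~10.2: apply is ~10x more common as a child than as a root
print(abs(run_root - apply_total) / run_root)    # ~0.007: totals agree to within 1%
print(apply_total / apply_web)                   # ~1,160: ~3 orders of magnitude more than on the webapp
```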

However, over a different time period that contained some spikes:

  • env:prod resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
    • @_top_level:* (local root)
      • operation_name:celery.run service:edx-edxapp-lms-workers 1,179
      • operation_name:celery.apply service:edx-edxapp-lms-workers 937
    • -@_top_level:* (child)
      • operation_name:celery.apply service:edx-edxapp-lms-workers 1,470
      • operation_name:celery.apply service:edx-edxapp-lms 1

Here, the celery.apply count on the worker (937 + 1,470 = 2,407) is over twice the celery.run count (1,179), and celery.apply is about as common as a child (1,470) as it is as a root (937).

Trace-level analysis

The transmit task has a recalculate span as parent. Analysis of this relationship over a recent time period:

a = env:prod service:edx-edxapp-lms-workers resource_name:lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
b = env:prod service:edx-edxapp-lms-workers resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
  • When they are in a parent/child relationship (a => b), the parent is always celery.run and the child is celery.apply.
  • I'd like to check whether there are cases where the parent does not have this child (a && NOT(a => b)), but Datadog has a bug that makes it silently return zero results. So instead, we'll have to look for traces where a and b are present or missing anywhere in the trace, and assume that when both are present, they're always in the parent/child relationship.
    • a && b 1.58k
    • a && NOT(b) 6.62k
    • NOT(a) && b 4.58k
  • It's possible that a doesn't always call b—or that b is being orphaned some of the time. Unclear.

timmc-edx (Member) commented
Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support.

timmc-edx (Member) commented Oct 10, 2024

Answering the question "do other celery workers have this problem?":

  • Services that have top-level celery.run, which are therefore probably celery workers: Service entry spans, by service, matching env:prod operation_name:celery.run
  • The subset of those that also have other top-level spans: Service entry spans, by service and operation name, matching env:prod -operation_name:celery.run service:(a OR b OR ...)

Results, filtered down to edX services:

| Service | Operation names |
| --- | --- |
| edx-edxapp-lms-workers | celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode |
| edx-edxapp-cms-workers | celery.apply, get_course.split_modulestore, django.template.render |
| edx-ai-translations-worker | celery.apply |
| license-manager-worker | celery.apply |
| edx-video-encode-manager-worker | celery.apply |

So edxapp has multiple kinds of top-level spans, but the other workers have at most celery.apply. (That doesn't necessarily mean that edxapp is different, though, other than having a wide variety of spans that can be orphaned.)

timmc-edx changed the title from "Discovery: Do we have orphaned celery.apply spans" to "Fix orphaned spans on Celery workers" Oct 15, 2024
timmc-edx moved this from In Progress to Blocked in Arch-BOM Oct 15, 2024