Remap shared services for edx-edxapp in DD #737
Comments
Answers from DD Support ticket: Coming back from engineering to let you know answers to your queries:
Downsides:
Preferred Option:
Regarding service Mapping and operation names:
In this case I would recommend:
|
The following ticket is related to this ticket, because it would drop some of the services that need to be remapped (e.g. |
@timmc-edx: Should this wait on the trace concatenation fix so we don't bring in a lot of new spans? Not sure how this will affect things. |
Hmm... I don't know that the two will interact that much. |
I'll be using https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping to build out something ADR-like and track testing in edx-platform. |
@timmc-edx: I noticed the following chart on the Service Catalog will have varying usefulness depending on the direction we choose.
- The default view is for Downstream Services.
|
OK, I think I've come around to suffixing—both because of how things will work in DD's UI and because I realized I'd made an incorrect assumption earlier that edxapp was like other services. :-P It wouldn't be |
...actually no, that doesn't make sense at all, as these are client side spans and we want to distinguish LMS vs. CMS client-side traffic. None of the options we have really make sense, so I'm going to open a ticket with Datadog to see how they would recommend solving this. |
Opened https://help.datadoghq.com/hc/en-us/requests/1873842 with options and pros/cons as we understand them, asking for guidance. |
I've indicated interest in the Inferred Services Beta using the form linked at the top of https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/ |
@timmc-edx: Should this be marked as blocked while waiting in DD? Also, should we follow-up with Shannon from DD (when she is back) to see if we can get help moving this along? |
Yes, moved to blocked. Agreed that we should ask Shannon to look into why we haven't gotten a response after I requested access to the beta. |
I sent an email to DD, and we'll see what happens. Also added the |
We should be in the beta program now. |
Combined with agent changes, this will allow for unified service naming so that we can query DB metrics for a service. When enabled for an environment, this change will change all spans in edxapp to use the same service name rather than integration-specific names like `service:defaultdb`. See edx/edx-arch-experiments#737 and https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/?tab=python
We've configured Inferred Services on stage. Before moving to edge and prod I want to check if anyone has monitors or dashboards that are using
Method: I used
The only edX thing I could find was monitor 145569239 "prod elasticsearch cluster latency" from the Cosmonauts team, referencing
Edit: Monitor has been adjusted to remove the |
Moving to blocked for the moment; I found a bug in ddtrace that makes Inferred Services somewhat unappealing for production use (django.cache spans are still getting service:django instead of the base service name). I've reported it to Datadog in https://help.datadoghq.com/hc/en-us/requests/1958899 along with a likely fix, and hopefully they can release the fix soon. |
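One way to spot this kind of leftover is to scan exported span data for any span whose service is not the base service. The sketch below is only an illustration: the span dictionaries, their `name` and `service` keys, and the sample values are hypothetical stand-ins for whatever span export is available, not a Datadog or ddtrace API.

```python
def spans_off_base_service(spans, base_service):
    """Return the spans whose service differs from the expected base service."""
    return [span for span in spans if span.get("service") != base_service]


# Hypothetical sample data shaped like the bug described above:
# the cache span still reports service "django".
sample_spans = [
    {"name": "django.request", "service": "edx-edxapp-lms"},
    {"name": "django.cache", "service": "django"},
    {"name": "mysql.query", "service": "edx-edxapp-lms"},
]

leftovers = spans_off_base_service(sample_spans, "edx-edxapp-lms")
for span in leftovers:
    print(f"{span['name']} still reports service:{span['service']}")
```

An empty result for an environment would suggest that every integration is reporting under the base service name.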
We should be unblocked from deploying to prod once edx/configuration#131 rolls out. It's a temporary fix, though, and I've added reverting that to the A/C. |
It's on in stage, edge, and prod, but not sandboxes yet. See edx/edx-arch-experiments#737
The problem is that default DD configuration for mysql, django (cache), defaultdb, requests, and potentially other libraries has each of these going to a shared service of the same name that is used across all of our IDA services.
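To make the collision concrete, here is a minimal Python sketch that groups spans by service name under two naming schemes. It is an illustration only: the span tuples, the `RENAMES` table, and both naming functions are hypothetical, and this is not how ddtrace assigns names internally.

```python
from collections import defaultdict

# Hypothetical spans: (emitting IDA service, default integration service name).
SPANS = [
    ("edx-edxapp-lms", "django"),
    ("edx-edxapp-cms", "django"),
    ("edx-edxapp-lms", "defaultdb"),
    ("edx-edxapp-cms", "defaultdb"),
]

# Assumed rename for readability: the django integration's spans are cache spans.
RENAMES = {"django": "cache"}


def default_naming(ida, default_name):
    """Default behavior described above: every IDA shares the integration's name."""
    return default_name


def suffixed_naming(ida, default_name):
    """Suffixing option: one service per IDA and dependency."""
    return f"{ida}-{RENAMES.get(default_name, default_name)}"


def group_by_service(spans, name_for):
    """Map each resulting service name to the set of IDAs that emit spans into it."""
    groups = defaultdict(set)
    for ida, default_name in spans:
        groups[name_for(ida, default_name)].add(ida)
    return dict(groups)


shared = group_by_service(SPANS, default_naming)
split = group_by_service(SPANS, suffixed_naming)
print(shared)  # each default name collects spans from both LMS and CMS
print(split)   # each suffixed name belongs to exactly one IDA
```

Under the default scheme a single `django` or `defaultdb` service mixes traffic from every IDA, which is why metrics for one IDA's database or cache usage cannot be isolated by service name alone.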
Problems with this default approach:
Acceptance Criteria
`django.requests`, etc.).
`timmc/datadog-local-testing` is updated)
`DD_DJANGO_CACHE_SERVICE_NAME` once IDAs are on ddtrace 2.19.1 or higher. (See private ticket https://help.datadoghq.com/hc/en-us/requests/1958899 and fix PR DataDog/dd-trace-py#11843, "fix: add schematize service name for django caching".)
`peer_tags_aggregation`/`DD_APM_PEER_TAGS_AGGREGATION` and `compute_stats_by_span_kind`/`DD_APM_COMPUTE_STATS_BY_SPAN_KIND` once we've upgraded to at least https://github.com/DataDog/datadog-agent/releases/tag/7.60.0 in both EC2 and k8s envs, since those are now enabled by default.

Notes:
`DD_SERVICE_MAPPING` is a simpler method of remapping, rather than using a variety of different settings.
`django` to `cache` might be risky, because `django` could be used for something else in the future, so in this particular case we might want to use `DD_DJANGO_CACHE_SERVICE_NAME` instead.
`service:edx-edxapp-lms` (spans would still have a different operation_name), or `service:edx-edxapp-lms-cache` (a new DD service catalog service).
`edx-edxapp-lms` if it no longer defaults to `operation_name:django.requests`. See these DD docs for configuring the primary operation.
`service:edx-edxapp-lms`
`service:edx-edxapp-lms-cache` (was `django`)
`service:edx-edxapp-lms-defaultdb`
`service:edx-edxapp-lms-workers` rather than `service:edx-edxapp-workers-lms`.
`service:edx-edxapp-workers-lms-cache`, etc., and if we used a search like `service:edx-edxapp-lms*` to pick up all services related to `service:edx-edxapp-lms`, we would not want that to also pick up the worker service and all its sub-services.
`elasticsearch`, `redis`, `read_replicadb`. Please review in DD to get the full list of affected dependencies. Note that some general shared services (e.g. aws.s3) probably should remain shared, but this could be discussed.

ADRish page: https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping
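As a rough sketch of the suffixing option in the notes, the Python below builds a `DD_SERVICE_MAPPING` value and checks which service names a prefix search would pick up. It rests on several assumptions: that `DD_SERVICE_MAPPING` takes comma-separated `from:to` pairs, that `fnmatch`-style matching is a fair stand-in for Datadog's wildcard search, and that the helper names and service lists (which are hypothetical) resemble the real setup.

```python
from fnmatch import fnmatchcase


def build_service_mapping(base_service, dependencies):
    """Build a DD_SERVICE_MAPPING-style string of comma-separated from:to pairs."""
    return ",".join(f"{dep}:{base_service}-{dep}" for dep in dependencies)


# django is left out on purpose: the notes suggest DD_DJANGO_CACHE_SERVICE_NAME for it.
LMS_DEPENDENCIES = ["defaultdb", "read_replicadb", "redis", "elasticsearch"]
lms_mapping = build_service_mapping("edx-edxapp-lms", LMS_DEPENDENCIES)
print(lms_mapping)


def search(pattern, services):
    """Approximate a wildcard service search with case-sensitive glob matching."""
    return [name for name in services if fnmatchcase(name, pattern)]


# Hypothetical catalog containing both candidate worker naming schemes.
SERVICES = [
    "edx-edxapp-lms",
    "edx-edxapp-lms-cache",
    "edx-edxapp-lms-defaultdb",
    "edx-edxapp-lms-workers",        # would be swept up by the LMS prefix search
    "edx-edxapp-workers-lms",        # stays out of the LMS prefix search
    "edx-edxapp-workers-lms-cache",
]

lms_hits = search("edx-edxapp-lms*", SERVICES)
print(lms_hits)
```

With the `workers-lms` ordering, a prefix search for the LMS service and its sub-services leaves the worker services out, which is the behavior the notes ask for.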