Add detail for GCP Cloud Run Job execution #3378

damemi · 2023-04-10T17:58:55Z

Changes

Add a note for Cloud Run Jobs on how to identify faas.version.

Related issues
Downstream PR proposing this semconv change: GoogleCloudPlatform/opentelemetry-operations-go#465

Related OTEP(s) #

damemi · 2023-04-10T18:06:45Z

/cc @punya @jsuereth @aabmass @psx95

damemi · 2023-04-14T11:35:25Z

ping @carlosalberto ptal

semantic_conventions/resource/faas.yaml

Oberon00

I read the linked page https://cloud.google.com/run/docs/managing/job-executions and it seems to confirm what I already suspected from the name "job execution". It does not really fit the version, it sounds like it would better fit the instance or invocation, or maybe cloud run jobs don't fit FaaS conventions at all.

A version should be something more or less permanent , that you can refer to, e.g. when you invoke a function (these conventions are about functions, not jobs), some version gets selected, and a particular version may be launched repeatedly.

E.g. when I read the docs here:

https://cloud.google.com/run/docs/managing/jobs#delete

Deleting a job cancels all running job executions of that job.

A job execution is something running, that's what an instance or invocation is, a version is something that can be run (multiple times).

See also https://cloud.google.com/run/docs/resource-model#job-executions and https://cloud.google.com/run/docs/resource-model#jobs

So I would say that CloudRun jobs don't really fit FaaS conventions well, but if we want to force them into the FaaS conventions, the job execution would match both an invocation_id and instance. I'm leaning slightly towards invocation_id, but not sure

damemi · 2023-04-18T12:23:44Z

Versions, instances, and invocations are all functionally similar (you could even argue they are the same) in the isolated context of Jobs, due to the inherent nature of Jobs. In effect, each version corresponds to one or more instances triggered by an invocation. So, you could map Execution to any one of version, instance, or invocation based solely on how it behaves. What's critical is what it represents.

In the broader context of Cloud Run, if you consider a job to be just a short-lived, single-request Service, then the concepts of version, instance, and invocation are equivalent:

Invocation: User request made to the application.
Instance: Container instance ID within the application.
Version: Each new deployment of the application.

We can rule out invocation, because that is not what we want to represent with this attribute. This attribute is meant to represent the Job itself, not the request made to it.

Instance maps better to Execution (because each Task runs as its own instance). But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute. Mapping Instance to both Version and Instance is redundant and leaves no room for us to provide the Execution.

So with Invocation ruled out, and Instance mapped to Instance, that logically leaves (Job)Execution to map to (Service)Revision. We at the very least need somewhere to expose the Job Execution. And from a Cloud Run user's perspective, it would be expected in the same place as Service Revision.

For users on Cloud Run, the same resource detector should be able to fill in the same fields whether for a Cloud Run Service or Cloud Run Job. If only for consistency with the existing Services conventions, we should not fill in fields based on different metadata sources within Cloud Run.

So I would say that CloudRun jobs don't really fit FaaS conventions well

I understand that I'm being very self-referential by using the Service conventions to map to Jobs. I'm assuming that these existing conventions set a precedent for the inclusion of Cloud Run apps as FaaS concepts. So if the conventions for Services are valid, then this proposal for Jobs is also valid. Because you can, in theory, reduce a Service to a Job, and a Job to a Function.

Oberon00 · 2023-04-18T12:35:00Z

Actually "invocation" was just recently renamed and used to be named "execution" #3209. I think there is also not really a difference between "the Job itself" and "the request made to it". After all, the same applies to the invocation_id: It identifies the executing invocation, not the request causing the invocation (for example, on AWS Lambda, the request ID is generated by the Lambda backend, not by the requester).

Regarding "instance":

But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute.

What is that own attribute? The term "instance" has many meanings, but a "faas.instance" is an instance of a FaaS execution environment, whatever that encompasses (even multiple "instances" of something else). According to the Google Documentation, a job may invoke multiple containers, so container IDs cannot be used in that attribute anyway (since then a single faas.invocation_id would span multiple faas.instances which is not how these concepts are supposed to be used)

In the end, it might also be worth it to introduce a special attribute gcp.job_execution. I just don't see how "version" would fit.

damemi · 2023-04-18T13:21:16Z

Actually "invocation" was just recently renamed and used to be named "execution" #3209.

I don't know the background for this change, but that PR description says it was changed because execution was "too generic and confusing". Maybe I'm misinterpreting, but that implies to me that the name change is intentionally meant to focus it on one thing (the request?)

Is there any discussion on how the term "invocation" was decided, or a definition of it for this context? Josh's #3378 (comment) defined it as the request, so that's what I'm basing this on.

I think there is also not really a difference between "the Job itself" and "the request made to it".

The Job is a resource that exists in the GCP Cloud Run API and handles requests. The request is an event initiated by a user to that resource. I think the fact that execution was renamed to invocation to make the term more specific means we should draw the distinction between the two, which I'm comparing to the definition in this section on the difference between instance and invocation.

But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute.

What is that own attribute?

faas.instance

The term "instance" has many meanings, but a "faas.instance" is an instance of a FaaS execution environment, whatever that encompasses (even multiple "instances" of something else).

A container is the instance of a Cloud Run execution environment. In Cloud Run Services, this container can serve multiple invocations. In Jobs, the container only lives for a single invocation but is technically still the execution environment.

According to the Google Documentation, a job may invoke multiple containers, so container IDs cannot be used in that attribute anyway (since then a single faas.invocation_id would span multiple faas.instances which is not how these concepts are supposed to be used)

As a span attribute, does invocation_id cross multiple instances? eg, a user calls a function that calls another function, is that the same invocation_id across both of those instances, or is the invocation_id meant to change whenever it hits a new instance?

Oberon00 · 2023-04-18T14:34:11Z

that execution was renamed to invocation to make the term more specific means we should draw the distinction between the two

The renaming definitely did not have the intention to restrict the meaning of the attribute such that execution and invocation could now be two separate things. At least that is my interpretation. @tylerbenson do you agree? Or should we re-introduce execution (or execution_id) as an additional attribute? I don't think so...

But I can see that the current definition of invocation_id is unclear with regard to your use case. You could adapt it though.

Oberon00 · 2023-04-18T14:38:00Z

As a span attribute, does invocation_id cross multiple instances? eg, a user calls a function that calls another function, is that the same invocation_id across both of those instances, or is the invocation_id meant to change whenever it hits a new instance?

It would usually be different for each function, but that might depend on the cloud provider. The attribute was originally modeled after the AWS Lambda Request ID https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html#runtimes-api-next.

Oberon00 · 2023-04-18T14:39:47Z

But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute.

What is that own attribute?

faas.instance

OK, now I get that I still don't get this aspect, sorry to keep bugging you here, but I think it might be important. So, what do you set as value here? The instance ID of the host that executes the container? Is it then possible that the same Job execution spans multiple instances?

damemi · 2023-04-18T15:15:28Z

OK, now I get that I still don't get this aspect, sorry to keep bugging you here, but I think it might be important. So, what do you set as value here?

No problem. We use the InstanceID() gce metadata function https://pkg.go.dev/cloud.google.com/go/compute/metadata#InstanceID which returns the current VM's numeric instance ID, which in Cloud Run is a Unique identifier of the container instance https://cloud.google.com/run/docs/container-contract#metadata-server, iow the container ID.

Is it then possible that the same Job execution spans multiple instances?

Yes

The renaming definitely did not have the intention to restrict the meaning of the attribute such that execution and invocation could now be two separate things. At least that is my interpretation. @tylerbenson do you agree? Or should we re-introduce execution (or execution_id) as an additional attribute?

I don't mean that the renaming meant to split execution and invocation. I mean that it meant to more clearly define invocation as a request, rather than the execution. My argument here is that the execution is already represented by the version in Cloud Run.

It would usually be different for each function, but that might depend on the cloud provider. The attribute was originally modeled after the AWS Lambda Request ID https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html#runtimes-api-next.

That doc convinces me more that invocation is meant to map to the request event.

Oberon00 · 2023-04-18T15:27:53Z

Is it then possible that the same Job execution spans multiple instances?

Yes

I think this is not correct then. You should probably use host.id to store that. faas.instance is meant for the exectution environment of the function call, not part of it. That we even have to consider the difference, is a hint that faas conventions might really not be suited for CloudRun jobs.

That doc convinces me more that invocation is meant to map to the request event.

You may be right.

Regarding the version though, maybe a litmus test if you can use it could be "can you invoke the version multiple times?". If not, then it is probably not a version.

damemi · 2023-04-18T15:31:05Z

I think this is not correct then. You should probably use host.id to store that. faas.instance is meant for the exectution environment of the function call, not part of it.

This is the container ID on Cloud Run. That is the execution environment. The Host is not visible to the user, and is not appropriate for this.

Regarding the version though, maybe a litmus test if you can use it could be "can you invoke the version multiple times?". If not, then it is probably not a version.

Is this a requirement, or can a version auto-increment with each request?

Oberon00 · 2023-04-18T15:35:00Z

Is this a requirement, or can a version auto-increment with each request?

If the version auto-increments with every request, it would be something that can be used to uniquely identify the request, which would be usable as as invocation_id...

It is not a requirement that is codified in the semantic conventions (yet), but is for the version concept I had in my mind.

Regarding host.id, it is defined as:

Unique host ID. For Cloud, this must be the instance_id assigned by the cloud provider. For non-containerized systems [...]

But there is also container.id here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/container.md

tylerbenson · 2023-04-18T15:40:42Z

(Can I suggest adding feedback as inline comments so it is easier to thread the conversation better. This format makes it difficult to follow the conversation and respond to individual items. I also suggest this would be a good topic for the FAAS SIG meeting later today. Perhaps you both could attend or send a representative?)

Oberon00 · 2023-04-18T15:45:12Z

Unfortunately that meeting is very late in my timezone, I won't be able to attend. But I'm very likely to accept the outcome reached there, if you discuss it and other FaaS SIG members had opinions on the topic. 😃

tylerbenson · 2023-04-18T15:52:34Z

@Oberon00 understand. Perhaps you could summarize your concerns into a single paragraph that I can paste into the meeting agenda? (Or you're welcome to add it directly yourself.)

jsuereth · 2023-04-18T15:57:52Z

@Oberon00 My read of invocation_id is that it's a concept way too tightly tied to AWS lambda. It does not fit the runtime model of Cloud Run, and it's not something we would want to promote or encourage OTEL users to assume our instance ID is actually a viable "invocation_id" as defined.

Specifically:

faas.instance.id is described as The execution environment ID as a string, that will be potentially reused for other invocations to the same function/function version.
You claim: What is that own attribute? The term "instance" has many meanings, but a "faas.instance" is an instance of a FaaS execution environment, whatever that encompasses (even multiple "instances" of something else). According to the Google Documentation, a job may invoke multiple containers, so container IDs cannot be used in that attribute anyway (since then a single faas.invocation_id would span multiple faas.instances which is not how these concepts are supposed to be used)

You cannot just change this wording in your definition. The way I read the description, everything Mike has proposed is inline with the current definition. In fact given the potentially reused for other invocation, I'd argue your definition is strictly invalid here. We can use an id that changes.

Additionally, the term "execution environment" here is highly suspect. Perhaps in AWS there's some abstraction between your deployment and a running container. In cloud run that abstraction is NOT identified or called out as something you can look at. There's just (deletable) does this function name / job name exist, or there's a running container instance.

+1 to @tylerbenson comments. Right now I'd like an ordered list of your concerns with this PR so that they can be discussed and addressed.

tylerbenson · 2023-04-18T20:11:15Z

Quick summary from the SIG meeting:
GCR jobs have a "generation" version associated with them, but it isn't exposed in a way that is useful to the user. As such, we decided that it is better to not include it. (One potential reason to include it still would be to signal to the observability system that a code/config change precipitated a problem even though it is difficult to tie back to the specific version in the UI.)
We decided that the information that was going to version should be instead populated in some combination of faas. invocation_id or faas.instance instead. @damemi is going to provide more details on how this information is exposed to the user, and examples of what the values will look like in various scenarios.

Oberon00 · 2023-04-19T07:37:51Z

My concern is simple, and there is only a single one: A job execution ID sounds like a totally different concept from a function version.

Using invocation_id or instance were suggestions. I am also OK if you use another more fitting attribute, like gcp.cloud_run.job_execution_id.

damemi · 2023-04-25T14:42:32Z

Thanks for all the feedback so far. I do appreciate everyone's patience and help with this.

I pushed a new commit 5b58946 that reverts the originally proposed Job Run version definition. I added a clarification that the existing version definition only applies to Services.

I then added a new cloud provider resource directory and 2 new attributes for Cloud Run based on Josh's suggestion #3378 (comment):

gcp.cloud_run.job.execution
gcp.cloud_run.job.task_index

We will need to update our resource detectors to differentiate between Jobs and Services for Cloud Run. These detectors are stable so we can't change how Services are handled, but I think we can add Jobs as a subset without breaking the existing Service workflow.

Thanks again for all the thoughtful input on this, hopefully we can reach a resolution soon.

Main concerns addressed. Thank you!

Oberon00 · 2023-04-25T14:53:37Z

@tylerbenson @Aneurysm9 @jsuereth What do you think about reverting the renaming / redefinition of faas.execution to faas.invocation_id done in #3209, or adjusting the changes in such a way that they could also be used to store the Cloud Watch Job execution ID, so that we don't need a new attribute?
EDIT: Maybe a small clarification to the description of the invocation_id would be enough? (I don't think we want to distinguish request ID vs invocation ID vs execution ID on the generic faas level)

damemi · 2023-04-25T19:45:19Z

@Oberon00 faas.invocation/execution is a span attribute, but what we want is a resource attribute so would that also involve generalizing it to the resource level?

Oberon00 · 2023-04-25T21:28:48Z

We actually already have many FaaS attributes that can be set at either the Span or resource level (like faas.id). They are all "usually resource", so this would be the first "usually span" attribute explicitly being allowed in both positions, but I don't see a problem with it.

(Note that you will still want at least one FaaS attribute on your FaaS root span that is always on the span, otherwise you can't determine which of your spans represents the FaaS invocation, if everything is on the resource. E.g. faas.trigger might be used as such a "marker attribute" if you put everything else on the Resource (or you might duplicate it on Span+Resource))

jsuereth · 2023-05-01T15:08:13Z

@Oberon00 I'm still not certain invocation_id / execution_id make sense in my head vs. a trace_id.

I think it's worth having that discussion on a new bug/channel. Given this PR now introduces gcp specific attributes, we can always double-write IDs if we define an "exeuction_id" we're happy with.

I think fundamentally, the "model" used by the FAAS semconv group should be well documented so we can better understand how to align our solution against it. Happy to continue that conversation.

For now, we'd love to see this PR go through so Cloud Run Jobs have valid identities in OTEL in the meantime.

Oberon00 · 2023-05-02T08:27:52Z

I think fundamentally, the "model" used by the FAAS semconv group should be well documented so we can better understand how to align our solution against it. Happy to continue that conversation.

Since the model is not formally defined in all details, we can also define it in such a way that your concerns about "not fitting" attributes are resolved

EDIT: There is a bit of definition language at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/faas.md#difference-between-invocation-and-instance that should be taken into account/moved/changed when we do this.

specification/resource/semantic_conventions/cloud_provider/gcp/cloud_run.md

semantic_conventions/resource/cloud_provider/gcp/cloud_run.yaml

damemi · 2023-05-05T11:31:25Z

The link check failure looks unrelated to this change, is this good to merge?

reyang · 2023-05-09T04:18:30Z

FYI @damemi @jsuereth @reyang this is missing a changelog entry, we need to make sure the changelog entry is added in the new https://github.com/open-telemetry/semantic-conventions (to be created this week) repo.

Related to #3474 (review).

damemi · 2023-05-12T19:34:35Z

@reyang should the new repo have its own CHANGELOG.md just for semantic conventions?

reyang · 2023-05-12T19:40:34Z

@reyang should the new repo have its own CHANGELOG.md just for semantic conventions?

I think the new repo should have a single CHANGELOG.md file, if there isn't one, it should be created.

damemi · 2023-05-12T19:50:26Z

@reyang should the new repo have its own CHANGELOG.md just for semantic conventions?

I think the new repo should have a single CHANGELOG.md file, if there isn't one, it should be created.

sounds good, opened that PR here open-telemetry/semantic-conventions#18

damemi requested review from a team April 10, 2023 17:58

github-actions bot assigned carlosalberto Apr 10, 2023

damemi mentioned this pull request Apr 10, 2023

detectors/gcp: support Cloud Run jobs GoogleCloudPlatform/opentelemetry-operations-go#465

Merged

damemi force-pushed the gcp-cloud-run-service-version branch 2 times, most recently from 1d2542c to 5b37817 Compare April 10, 2023 18:12

punya approved these changes Apr 10, 2023

View reviewed changes

damemi force-pushed the gcp-cloud-run-service-version branch from 5b37817 to 3dbd919 Compare April 10, 2023 18:14

Oberon00 reviewed Apr 14, 2023

View reviewed changes

semantic_conventions/resource/faas.yaml Outdated Show resolved Hide resolved

jsuereth approved these changes Apr 17, 2023

View reviewed changes

semantic_conventions/resource/faas.yaml Outdated Show resolved Hide resolved

Oberon00 previously requested changes Apr 18, 2023

View reviewed changes

damemi added 2 commits April 25, 2023 14:31

Add detail for GCP Cloud Run Job version

38bf5ae

Revert Job version def + create resource attrs for Cloud Run

5b58946

damemi force-pushed the gcp-cloud-run-service-version branch from 36c082f to 5b58946 Compare April 25, 2023 14:31

tylerbenson approved these changes May 1, 2023

View reviewed changes

Oberon00 approved these changes May 2, 2023

View reviewed changes

specification/resource/semantic_conventions/cloud_provider/gcp/cloud_run.md Show resolved Hide resolved

semantic_conventions/resource/cloud_provider/gcp/cloud_run.yaml Show resolved Hide resolved

semantic_conventions/resource/cloud_provider/gcp/cloud_run.yaml Outdated Show resolved Hide resolved

feedback

92070cc

damemi force-pushed the gcp-cloud-run-service-version branch from f2d4075 to 92070cc Compare May 2, 2023 18:59

Merge branch 'main' into gcp-cloud-run-service-version

0f8111a

damemi changed the title ~~Add detail for GCP Cloud Run Job version~~ Add detail for GCP Cloud Run Job execution May 5, 2023

Merge branch 'main' into gcp-cloud-run-service-version

52805c5

Merge branch 'main' into gcp-cloud-run-service-version

c0f3f91

reyang added the area:semantic-conventions Related to semantic conventions label May 9, 2023

reyang merged commit ccac3a2 into open-telemetry:main May 9, 2023

damemi mentioned this pull request May 9, 2023

Add GCP Cloud Run resource attributes open-telemetry/opentelemetry-go#4079

Closed

damemi mentioned this pull request May 12, 2023

Add CHANGELOG.md open-telemetry/semantic-conventions#18

Merged

zchee mentioned this pull request Jun 12, 2023

Fix 'Description' section for GCP Cloud Run Job execution open-telemetry/semantic-conventions#102

Merged

damemi mentioned this pull request Sep 4, 2024

Update Cloud Run Job task_id to avoid high cardinality? GoogleCloudPlatform/opentelemetry-operations-go#874

Closed

carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024

Add detail for GCP Cloud Run Job execution (open-telemetry#3378)

bb18377

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add detail for GCP Cloud Run Job execution #3378

Add detail for GCP Cloud Run Job execution #3378

damemi commented Apr 10, 2023

damemi commented Apr 10, 2023

damemi commented Apr 14, 2023

Oberon00 left a comment •

edited

Loading

damemi commented Apr 18, 2023

Oberon00 commented Apr 18, 2023 •

edited

Loading

damemi commented Apr 18, 2023

Oberon00 commented Apr 18, 2023

Oberon00 commented Apr 18, 2023

Oberon00 commented Apr 18, 2023 •

edited

Loading

damemi commented Apr 18, 2023 •

edited

Loading

Oberon00 commented Apr 18, 2023 •

edited

Loading

damemi commented Apr 18, 2023

Oberon00 commented Apr 18, 2023 •

edited

Loading

tylerbenson commented Apr 18, 2023

Oberon00 commented Apr 18, 2023

tylerbenson commented Apr 18, 2023

jsuereth commented Apr 18, 2023 •

edited

Loading

tylerbenson commented Apr 18, 2023

Oberon00 commented Apr 19, 2023 •

edited

Loading

damemi commented Apr 25, 2023

Oberon00 commented Apr 25, 2023 •

edited

Loading

damemi commented Apr 25, 2023

Oberon00 commented Apr 25, 2023 •

edited

Loading

jsuereth commented May 1, 2023

Oberon00 commented May 2, 2023 •

edited

Loading

damemi commented May 5, 2023

reyang commented May 9, 2023

damemi commented May 12, 2023

reyang commented May 12, 2023

damemi commented May 12, 2023

Add detail for GCP Cloud Run Job execution #3378

Add detail for GCP Cloud Run Job execution #3378

Conversation

damemi commented Apr 10, 2023

Changes

damemi commented Apr 10, 2023

damemi commented Apr 14, 2023

Oberon00 left a comment • edited Loading

Choose a reason for hiding this comment

damemi commented Apr 18, 2023

Oberon00 commented Apr 18, 2023 • edited Loading

damemi commented Apr 18, 2023

Oberon00 commented Apr 18, 2023

Oberon00 commented Apr 18, 2023

Oberon00 commented Apr 18, 2023 • edited Loading

damemi commented Apr 18, 2023 • edited Loading

Oberon00 commented Apr 18, 2023 • edited Loading

damemi commented Apr 18, 2023

Oberon00 commented Apr 18, 2023 • edited Loading

tylerbenson commented Apr 18, 2023

Oberon00 commented Apr 18, 2023

tylerbenson commented Apr 18, 2023

jsuereth commented Apr 18, 2023 • edited Loading

tylerbenson commented Apr 18, 2023

Oberon00 commented Apr 19, 2023 • edited Loading

damemi commented Apr 25, 2023

Oberon00 commented Apr 25, 2023 • edited Loading

damemi commented Apr 25, 2023

Oberon00 commented Apr 25, 2023 • edited Loading

jsuereth commented May 1, 2023

Oberon00 commented May 2, 2023 • edited Loading

damemi commented May 5, 2023

reyang commented May 9, 2023

damemi commented May 12, 2023

reyang commented May 12, 2023

damemi commented May 12, 2023

Oberon00 left a comment •

edited

Loading

Oberon00 commented Apr 18, 2023 •

edited

Loading

Oberon00 commented Apr 18, 2023 •

edited

Loading

damemi commented Apr 18, 2023 •

edited

Loading

Oberon00 commented Apr 18, 2023 •

edited

Loading

Oberon00 commented Apr 18, 2023 •

edited

Loading

jsuereth commented Apr 18, 2023 •

edited

Loading

Oberon00 commented Apr 19, 2023 •

edited

Loading

Oberon00 commented Apr 25, 2023 •

edited

Loading

Oberon00 commented Apr 25, 2023 •

edited

Loading

Oberon00 commented May 2, 2023 •

edited

Loading