Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(plugin): OTEL exporting with translate #10332

Merged
merged 11 commits into from
Feb 23, 2023
Merged

Conversation

StarlightIbuki
Copy link
Contributor

@StarlightIbuki StarlightIbuki commented Feb 21, 2023

Summary

We represent trace spans with a universal format for all kinds of mainstream tracers and translate them to different formats when propagating headers and exporting traces. However, OTEL misses the handling of trace ID length.

Checklist

Issue reference

Fix KAG-652

CHANGELOG.md Outdated Show resolved Hide resolved
@flrgh
Copy link
Contributor

flrgh commented Feb 23, 2023

I've updated the PR with a couple fixes. I think it meets the correct requirements now, but there are some warts:

  1. We only have test coverage for the trace ID zero-pad case. There is no test for the trace ID truncate case, because an incoming trace ID that is too long will be discarded as invalid.
  2. The way we are copying the span table seems like it may have unintended consequences. Spans use table pools under the hood, so the PDK expects to manage their lifecycle. Copying spans outside of the PDK can therefore lead to situations where tables are orphaned and gc-ed instead of being returned to the pool as intended.

@dndx dndx requested a review from chronolaw February 23, 2023 02:55
@StarlightIbuki
Copy link
Contributor Author

StarlightIbuki commented Feb 23, 2023

I've updated the PR with a couple fixes. I think it meets the correct requirements now, but there are some warts:

  1. We only have test coverage for the trace ID zero-pad case. There is no test for the trace ID truncate case, because an incoming trace ID that is too long will be discarded as invalid.
  2. The way we are copying the span table seems like it may have unintended consequences. Spans use table pools under the hood, so the PDK expects to manage their lifecycle. Copying spans outside of the PDK can therefore lead to situations where tables are orphaned and gc-ed instead of being returned to the pool as intended.

Answering question 2: the copied span won't survive the request so it will be GC-ed, and its reference to the members will also be gone. So the only reference to those members should be from the original span, which will be returned to the pool. If the PDK decides to also reuse the members, it will still be fine as the table returned to the pool has reference to the members.

PS: table pool is just referencing returned tables to prevent them from GC-ed and has no special mechanism, thus we cannot "GC a table that is supposed to be returned to the pool". The worst situation is that we returned tables that are used somewhere else unaware of them being recycled which cannot happen here.

@StarlightIbuki
Copy link
Contributor Author

I've updated the PR with a couple fixes. I think it meets the correct requirements now, but there are some warts:

  1. We only have test coverage for the trace ID zero-pad case. There is no test for the trace ID truncate case, because an incoming trace ID that is too long will be discarded as invalid.
  2. The way we are copying the span table seems like it may have unintended consequences. Spans use table pools under the hood, so the PDK expects to manage their lifecycle. Copying spans outside of the PDK can therefore lead to situations where tables are orphaned and gc-ed instead of being returned to the pool as intended.

To answer question 1: I agree that we need a test for the truncating case. But the code you're referring to (handling to OT header) does not mean truncating doesn't work. We're translating traces, and a hypothetical trace format may have a valid trace ID larger than that.

@StarlightIbuki
Copy link
Contributor Author

However there's not a type of ID that is long enough to go through the truncating path, so we skip it for now.

StarlightIbuki and others added 7 commits February 23, 2023 13:14
We represent trace spans with a universal format for all kinds of mainstream tracers and translate them to different formats when propagating headers and exporting traces. However, OTEL misses the handling of trace ID length.

Fix KAG-652
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@chronolaw chronolaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do more code clean for it, but I think it is OK now.

@@ -72,9 +77,39 @@ local function transform_events(events)
return pb_events
end

-- this function is to translate the span to otlp span, but of universal span format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what "but of universal span format" means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original design to support arbitrary types of tracing plugins has not made it clear:

  1. We have an internal representation for span, and it should preserve what it received from downstream requests at best effort;
  2. when exporting the data to collectors, it needs to transform the span to meet the requirement with its field (DD, OTEL, etc), and then translate it into a format used for the reporting API (gRPC, JSONRPC, etc). This function does the first step but not the second one.

Comment on lines +102 to +104
span = translated

return span
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, not needed if we don't do a shallow copy.

Suggested change
span = translated
return span
return translated

new_id = NULL:rep(TRACE_ID_LEN - len) .. trace_id
end

local translated = shallow_copy(span)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a shallow_copy needed?

I would have expected this function to do something like:

local function adjust_trace_id_for_oltp(trace_id)

And the usage would be, when creating the root span:

root_span.trace_id = adjust_trace_id_for_oltp(trace_id)

And inside transform_span:

local pb_span = {
    trace_id = adjust_trace_id_for_oltp(span.trace_id),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't recommend using abstract names (like transform_span or translate_span) when the function is only doing one modification on one field. The tradeoff ("making the code easier to modify in the future if we find that we need to do more modifications in the future") is not worth the upfront cost ("making the code more difficult to read now"). Future us can always group several modifications into one if the need arises.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a shallow_copy needed?

The span passed in may also be used by other plugins. We need to preserve as much original information as possible, and only do the transformation for exporting/propagating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree naming is a problem, but transform_span is used as some kind of interface for tracing plugins (we can see a function with the same name from datadog-tracing).
I'm adding comments and renaming the internal function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the function is in the file "otlp.lua" so I think we do not need to suffix it with "_otlp".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the usage would be, when creating the root span:

We cannot just modify the root_span given the reason: https://github.com/Kong/kong/pull/10332/files#r1115411177

@kikito kikito merged commit 6fa55b3 into master Feb 23, 2023
@kikito kikito deleted the fix/otel-trace-id-len branch February 23, 2023 15:35
mashapedeployment pushed a commit that referenced this pull request Feb 23, 2023
Co-authored-by: Chrono <[email protected]>
Co-authored-by: Michael Martin <[email protected]>
(cherry picked from commit 6fa55b3)
@tyler-ball tyler-ball removed this from the 3.2.1 milestone Feb 23, 2023
hanshuebner pushed a commit that referenced this pull request Feb 24, 2023
Co-authored-by: Chrono <[email protected]>
Co-authored-by: Michael Martin <[email protected]>
(cherry picked from commit 6fa55b3)

Co-authored-by: Xumin <[email protected]>
Co-authored-by: Enrique García Cota <[email protected]>
This was referenced Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants