Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segment reference and reporting overhaul #50

Merged
merged 9 commits into from
May 14, 2021
Merged

Conversation

tom-pytel
Copy link
Contributor

@tom-pytel tom-pytel commented May 12, 2021

There are many things in this PR most of which were discussed with @wu-sheng in a separate channel, they are as follows:

  • Spans which are ignored due to suffix or endpoint name pattern match will propagate their ignored status downstream via dummy span and sending an empty "sw8" header. If such an empty header is received then a dummy context and span is created instead of a new actively traced segment. This is only a behavior change for this node agent and is compatible with other language agents, they will just act as they have always done.
  • If a span is created in code which is an async child of a span, which has already finished and been sent to the server, then it will now correctly create a new segment and reference the segment which created it and keep the parent-child relationship in the trace. Previously behavior was undefined mostly losing the new span.
  • The method of limiting the number of segments which are recorded has been changed from post-record check to pre-record check to avoid unnecessary instrumentation if the number of segments already exceeds the threshold and the data would be thrown away anyway. This also allows propagation of ignored status of segment downstream via mechanism described above to avoid broken "VNode" segments in trace view.
  • Setting logging functions to no-ops when logging level indicates since the logging module seems to be stringifying everything which is sent to it regardless of logging level set, which led to up to 2x performance hit in extreme case of very simple endpoints.
  • Search for top-level module now stops when it finds skywalking since an app may be run by a framework which imports it into its own node context and would wind up with the wrong top-level directory to import from.
  • "peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

@kezhenxu94 kezhenxu94 added core enhancement New feature or request labels May 12, 2021
@kezhenxu94 kezhenxu94 added this to the 0.3.0 milestone May 12, 2021
@wu-sheng
Copy link
Member

Setting logging functions to no-ops when logging level indicates since the logging module seems to be stringifying everything which is sent to it regardless of logging level set, which led to up to 2x performance hit in extreme case of very simple endpoints.

Personally, not about this PR, I think in the future, nodejs/python could adopt log service rather than span's log, which Java agent has done as log gRPC reporter. In there, you still could have all context and collaboration between tracing and logging, but as log could be sent as soon as possible separately, it has much better performance.
Of course, follow the log level makes sense. And should consider dynamic-configuration feature implementation in the agent to change the behavior dynamically.

@tom-pytel
Copy link
Contributor Author

Setting logging functions to no-ops when logging level indicates since the logging module seems to be stringifying everything which is sent to it regardless of logging level set, which led to up to 2x performance hit in extreme case of very simple endpoints.

Personally, not about this PR, I think in the future, nodejs/python could adopt log service rather than span's log, which Java agent has done as log gRPC reporter. In there, you still could have all context and collaboration between tracing and logging, but as log could be sent as soon as possible separately, it has much better performance.
Of course, follow the log level makes sense. And should consider dynamic-configuration feature implementation in the agent to change the behavior dynamically.

It seems to me like the current logging was just a quick initial draft to get something working, not sure if it was meant to last?

@kezhenxu94
Copy link
Member

The logging changes in this PR is all about the agent itself, not the instrumented services. They're totally different things.

@wu-sheng
Copy link
Member

It seems to me like the current logging was just a quick initial draft to get something working, not sure if it was meant to last?

Span's logs are something you want to inject dynamically in the plugin, real log protocol(https://github.com/apache/skywalking-data-collect-protocol/blob/master/logging/Logging.proto) should be used to collect user(target app)'s logs. Make sense?

@wu-sheng
Copy link
Member

The logging changes in this PR is all about the agent itself, not the instrumented services. They're totally different things.

OK, got it.

@wu-sheng
Copy link
Member

"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

If there is no peer, the sw8 header could not be assembled correctly. How do the test cases work?

@tom-pytel
Copy link
Contributor Author

tom-pytel commented May 12, 2021

"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

If there is no peer, the sw8 header could not be assembled correctly. How do the test cases work?

There is a peer it is just being set explicitly after the call rather than passing as parameter to newExitSpan() and being set in this function. A minor change to allow consolidating common functionality of newExitSpan, newEntrySpan and newLocalSpan.

@wu-sheng
Copy link
Member

"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

If there is no peer, the sw8 header could not be assembled correctly. How do the test cases work?

There is a peer it is just being set explicitly after the call rather than passing as parameter to newExitSpan() and being set in this function. A minor change to allow consolidating common functionality of newExitSpan, newEntrySpan and newLocalSpan.

This was designed from day one, because you want to make sure you have this, rather than facing no-peer in creating headers. Do we expect to have an error if it is really missed?

@wu-sheng
Copy link
Member

The method of limiting the number of segments

I don't know nodejs agent core. But please make sure all sampling mechanisms optional and could be assigned through agent#start

@tom-pytel
Copy link
Contributor Author

The method of limiting the number of segments

I don't know nodejs agent core. But please make sure all sampling mechanisms optional and could be assigned through agent#start

Which sampling mechanism are you referring to?

@tom-pytel
Copy link
Contributor Author

There is a peer it is just being set explicitly after the call rather than passing as parameter to newExitSpan() and being set in this function. A minor change to allow consolidating common functionality of newExitSpan, newEntrySpan and newLocalSpan.

This was designed from day one, because you want to make sure you have this, rather than facing no-peer in creating headers. Do we expect to have an error if it is really missed?

All the plugins set it, and the chance of not setting it as a programming error is the same as the chance of not setting any other parameter like layer or component ID. All of which are non-desirable cases obviously but we don't give those parameters any special treatment...

@wu-sheng
Copy link
Member

Which sampling mechanism are you referring to?

I mean, this is actually a sampling mechanism. Typically, it is called adoptive sampling. Because sampling bases on the volume of network and backend capabilities.

@tom-pytel
Copy link
Contributor Author

Which sampling mechanism are you referring to?

I mean, this is actually a sampling mechanism. Typically, it is called adoptive sampling. Because sampling bases on the volume of network and backend capabilities.

You mean the "throttling" or "limiting" mechanism I changed? Doesn't matter what its called, it is settable, it uses the same SW_AGENT_MAX_BUFFER_SIZE or config.maxBufferSize to set the sampling size as the previous method, so no change there.

@wu-sheng
Copy link
Member

Which sampling mechanism are you referring to?

I mean, this is actually a sampling mechanism. Typically, it is called adaptive sampling. Because sampling bases on the volume of network and backend capabilities.

You mean the "throttling" or "limiting" mechanism I changed? Doesn't matter what its called, it is settable, it uses the same SW_AGENT_MAX_BUFFER_SIZE or config.maxBufferSize to set the sampling size as the previous method, so no change there.

I think should have a setting called config.sampling=adaptive, then your codes work. Also, we could extend to have config-sampling=rate etc.

@tom-pytel
Copy link
Contributor Author

I think should have a setting called config.sampling=adaptive, then your codes work. Also, we could extend to have config-sampling=rate etc.

In this PR I only changed how the current sampling method is implemented, not how it is defined, but those are all ideas for future PR extensions. One idea might be a general rule system for this and the various "ignore" options which specifies which endpoints to record and which to exclude which might include sampling frequency and max counts. But again, that is for a future PR...

@wu-sheng
Copy link
Member

I think should have a setting called config.sampling=adaptive, then your codes work. Also, we could extend to have config-sampling=rate etc.

In this PR I only changed how the current sampling method is implemented, not how it is defined, but those are all ideas for future PR extensions. One idea might be a general rule system for this and the various "ignore" options which specifies which endpoints to record and which to exclude which might include sampling frequency and max counts. But again, that is for a future PR...

Could you update the document about this?

@tom-pytel
Copy link
Contributor Author

SW_AGENT_MAX_BUFFER_SIZE is already documented in the README and I didn't change how it behaves, just how it is implemented internally. Or do you mean another document? Though you are right and I will add about how the ignore option is propagated downstream since people might otherwise be confused.

@wu-sheng
Copy link
Member

Or do you mean another document? Though you are right and I will add about how the ignore option is propagated downstream since people might otherwise be confused.

Yes, I mean about how and when this works.

@tom-pytel
Copy link
Contributor Author

@kezhenxu94: I've added two small cleanups:

  • Moved ContextManager.spansDup() from the Context.new?Span() functions to the Context.start() function to be more symmetric with stop(), async() and resync() but also because the spansDup() needs to happen before the spans.push() and there is no guarantee the user will call start() just after the Context.new?Span() call.
  • Removed the Buffer class since without the size check it is just a wrapper for a JS array.

@kezhenxu94 kezhenxu94 merged commit 090072e into apache:master May 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants