[tracer] reduce memory usage on high-cardinality traces #247
lib/ddtrace/context.rb
Outdated
roots, marked_ids = partial_roots()
return nil unless roots

return unless roots
Is this a duplicate of line 144?
Thanks for spotting this, indeed, it is.
lib/ddtrace/context.rb
Outdated
def partial_roots
  return nil unless @current_span

  marked_ids = Hash[([@current_span.span_id] + @current_span.parent_ids).map { |id| [id, true] }]
Maybe a Set data structure would be more idiomatic here: http://ruby-doc.org/stdlib-2.4.2/libdoc/set/rdoc/Set.html
👍
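For reference, a minimal sketch of the Set suggestion. The span IDs below are made up and stand in for @current_span.span_id and @current_span.parent_ids; this is not the code from the PR.

```ruby
require 'set'

# Hypothetical values standing in for @current_span.span_id and
# @current_span.parent_ids.
current_id = 3
parent_ids = [2, 1]

# Original approach: a Hash whose values are never read, only its keys.
marked_hash = Hash[([current_id] + parent_ids).map { |id| [id, true] }]

# Suggested approach: a Set expresses "membership only" directly.
marked_ids = Set.new([current_id] + parent_ids)

# Both support the same filtering step.
ids = [1, 2, 3, 4, 5]
remaining = ids.reject { |id| marked_ids.include?(id) }
```

Lookup cost is the same in both cases, since Set is backed by a Hash; the gain is readability.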
lib/ddtrace/context.rb
Outdated
ids.reject! { |id| marked_ids.key? id }
ids.each do |id|
  if roots_spans.key?(id)
    unfinished[id] = true unless span.finished?
I think we can simplify this a little (and avoid unnecessary lookups) by doing:
roots_spans.delete(id) unless span.finished?
That way we can get rid of the unfinished hash and roots_spans.reject! { |id| unfinished.key? id } altogether. We will also skip some work for the other spans that belong to that subtree.
Will try this for sure, sounds good.
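A toy model of the idea, for illustration only. ToySpan and finished_subtrees are hypothetical names, and the real method tracks more state; a later comment in this thread notes that the suggestion did not carry over to the real code as-is.

```ruby
# Hypothetical stand-in for a span; the real Datadog::Span has more fields.
ToySpan = Struct.new(:span_id, :parent_ids, :finished) do
  def finished?
    finished
  end
end

# Group spans under each candidate root. A root is dropped as soon as one
# unfinished span is found in its subtree, so no separate "unfinished"
# collection and no final reject! pass are needed.
def finished_subtrees(spans, root_ids)
  roots_spans = Hash[root_ids.map { |id| [id, []] }]
  spans.each do |span|
    ([span.span_id] + span.parent_ids).each do |id|
      # Roots already dropped are skipped here, which saves work for the
      # remaining spans of that subtree.
      next unless roots_spans.key?(id)
      if span.finished?
        roots_spans[id] << span
      else
        roots_spans.delete(id)
      end
    end
  end
  roots_spans
end

spans = [
  ToySpan.new(10, [], true),
  ToySpan.new(11, [10], true),
  ToySpan.new(20, [], true),
  ToySpan.new(21, [20], false) # unfinished: root 20 must not be flushed
]
result = finished_subtrees(spans, [10, 20])
```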
lib/ddtrace/context.rb
Outdated
# Return a hash containing all sub traces which are candidates for
# a partial flush.
def partial_roots_spans
I'll do some whiteboarding later, but I suspect we can simplify this algorithm and improve its complexity a little if we build a graph and then search for all nodes reachable from the partial_roots, instead of the other way around (traversing the subtrees of all spans in search of the partial_roots). I might be wrong, but I think it's worth trying before we merge this.
Yes, possibly, this was just a 1st naive impl., it's fine to iterate on that obviously.
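A rough sketch of that graph-based direction, assuming each span knows only its direct parent. ToyNode and reachable_from are hypothetical names, not part of the library.

```ruby
require 'set'

# Hypothetical minimal span: its own ID plus the ID of its direct parent.
ToyNode = Struct.new(:span_id, :parent_id)

# Build a parent -> children index once, then walk down from each root.
# Each span is visited at most once per root, instead of scanning the
# ancestry of every span in the trace.
def reachable_from(spans, root_ids)
  children = Hash.new { |hash, key| hash[key] = [] }
  spans.each { |span| children[span.parent_id] << span.span_id }

  root_ids.each_with_object({}) do |root, result|
    seen = Set.new
    queue = [root]
    until queue.empty?
      id = queue.shift
      # Set#add? returns nil when the ID was already visited.
      next unless seen.add?(id)
      queue.concat(children[id])
    end
    result[root] = seen.to_a
  end
end

spans = [
  ToyNode.new(1, nil),
  ToyNode.new(2, 1), ToyNode.new(3, 1),
  ToyNode.new(4, 2), ToyNode.new(5, 3), ToyNode.new(6, 5)
]
subtrees = reachable_from(spans, [2, 3])
```

Building the index is linear in the number of spans, and each traversal only touches the spans under its own root.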
end

# Iterate on each span within the trace. This is thread safe.
def each_span
👍
lib/ddtrace/tracer.rb
Outdated
@@ -289,8 +292,15 @@ def record(context)
context = context.context if context.is_a?(Datadog::Span)
return if context.nil?
trace, sampled = context.get
ready = !trace.nil? && !trace.empty? && sampled
write(trace) if ready
if sampled
I'm probably missing something, but it seems to me that if the current trace has an unfinished Span, Context#check_finished_spans will return false and Context#get will return [nil, nil]. Doesn't that prevent the partial flush?
test/context_flush_test.rb
Outdated
public :partial_roots
public :partial_roots_spans
public :partial_flush
I tend to think of private methods as implementation details. If we rely on them for testing, then we're testing the implementation more than the "behavior" provided, which makes this code almost impossible to refactor in the future. I think it would be better if we could use something like a FauxWriter and inspect whether the buffer contains something like [partial_trace1, partial_trace2, partial_trace3] (that is, testing the "side-effect" of the ContextFlush). I completely understand that we might be short on time now, but I think that's a topic worth discussing at some point :)
Hmm, I will give it a try anyway. Now that we have this each_partial_trace entry point, which I think is fine to keep public, I could rewrite part of the tests to rely on it without changing the whole world.
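To illustrate the testing style discussed here, a small self-contained sketch. RecordingWriter and ToyFlusher are hypothetical stand-ins (the former only in the spirit of the FauxWriter mentioned above); only the each_partial_trace name comes from this thread.

```ruby
# Hypothetical in-memory writer: it only records what it is asked to write.
class RecordingWriter
  attr_reader :traces

  def initialize
    @traces = []
  end

  def write(trace)
    @traces << trace
  end
end

# Hypothetical flusher with one public entry point and private helpers.
class ToyFlusher
  def each_partial_trace(spans_by_root)
    complete_subtrees(spans_by_root).each { |trace| yield trace }
  end

  private

  # Implementation detail: free to change without touching the test below.
  def complete_subtrees(spans_by_root)
    spans_by_root.values.reject(&:empty?)
  end
end

# The test drives the public method and inspects the side effect on the
# writer, instead of making private methods public.
writer = RecordingWriter.new
input = { a: [1, 2], b: [], c: [3] }
ToyFlusher.new.each_partial_trace(input) { |trace| writer.write(trace) }
```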
lib/ddtrace/context_flush.rb
Outdated
return nil unless roots

roots_spans = Hash[roots.map { |id| [id, []] }]
unfinished = Set.new
Just curious: that suggestion I gave about the unfinished variable didn't work?
I'm reading this again... and... no, but the code here does not work either. Some code paths are obviously not triggered by any tests yet. Fixing.
context = tracer.call_context
tracer.configure(min_spans_before_partial_flush: context.max_length,
                 max_spans_before_partial_flush: context.max_length,
As of today, the context max_length and the context_flush boundaries are the same, but that need not always be the case. In the future, the hard limit could be higher, and this test has no reason to fail then: it only checks that when the hard and soft limits are the same, the hard limit prevails.
# by default, soft and hard limits are the same
DEFAULT_MAX_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
# by default, never do a partial flush
DEFAULT_MIN_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
Set this to 10 to prevent anything smaller than 10 spans from being flushed, even partially.
# It performs memory flushes when required.
class ContextFlush
  # by default, soft and hard limits are the same
  DEFAULT_MAX_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
Set this to 1000 to have a partial flush triggered for anything bigger than 1000 spans.
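One way to read the two thresholds together, as a sketch. The helper name partial_flush_decision and the :maybe outcome are assumptions made for illustration; the comments above only pin down the behavior below the minimum and above the maximum.

```ruby
# Hypothetical helper; only the two thresholds come from the discussion.
def partial_flush_decision(span_count, min_spans:, max_spans:)
  # Below the soft limit: never flush, not even partially.
  return :never if span_count < min_spans
  # Above the hard limit: a partial flush is triggered.
  return :always if span_count > max_spans
  # In between: left to other criteria that are not shown in this thread.
  :maybe
end

small  = partial_flush_decision(5,    min_spans: 10, max_spans: 1000)
medium = partial_flush_decision(500,  min_spans: 10, max_spans: 1000)
large  = partial_flush_decision(1001, min_spans: 10, max_spans: 1000)
```

With both defaults equal to Datadog::Context::DEFAULT_MAX_LENGTH, a context that never grows past that length would never reach the :always branch in this model, which matches the "never do a partial flush" default.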
lib/ddtrace/context_flush.rb
Outdated
# We need to reject by span ID and not by value, because a span
# value may be altered (typical example: it's finished by some other thread)
# since we lock only the context, not all the spans which belong to it.
context.delete_span_if { |span| flushed_ids.include? span.span_id }
This seems dangerous in the case that writing fails. If I'm understanding it correctly, we first get all partial traces, deleting them permanently from the context, and then iterate over each trace via each_partial_trace. But when each_partial_trace calls write(trace) and that fails (say there's a blip in connectivity), an exception will be thrown and the deleted spans won't actually be flushed, with no means of recovering them.
Could we do this after the yield on line 124? Or within the tracer, as part of the write operation?
Although the point technically stands, the write operation is only writing to a buffer right now, and in the case of the SyncWriter, partial flushing doesn't apply. The odds of it failing are thus really low. So we probably don't have to change this.
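For the record, the ordering proposed in the review comment could look roughly like this. ToyContext, FlushedSpan and flush_then_delete are hypothetical; only delete_span_if mirrors a name from the diff.

```ruby
# Hypothetical context holding a flat list of spans.
class ToyContext
  attr_reader :spans

  def initialize(spans)
    @spans = spans
  end

  def delete_span_if
    @spans.reject! { |span| yield span }
  end
end

FlushedSpan = Struct.new(:span_id)

# Yield the trace to the caller (the write) first, and delete the spans
# from the context only if the block returned without raising.
def flush_then_delete(context, trace)
  yield trace
  flushed_ids = trace.map(&:span_id)
  context.delete_span_if { |span| flushed_ids.include?(span.span_id) }
end

context = ToyContext.new([FlushedSpan.new(1), FlushedSpan.new(2), FlushedSpan.new(3)])
trace = context.spans.first(2)

begin
  flush_then_delete(context, trace) { |_trace| raise IOError, 'write failed' }
rescue IOError
  # The spans are still in the context, so the flush can be retried later.
end
count_after_failure = context.spans.length

flush_then_delete(context, trace) { |_trace| :written }
count_after_success = context.spans.length
```

The trade-off is that spans stay in memory until a write succeeds, which is exactly what the reply above considers unnecessary while write only appends to a buffer.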
For big traces (typically long-running traces with one enclosing span and many sub-spans, possibly several thousand), the library could keep everything in memory while waiting for a hypothetical flush. This patch partially flushes consistent parts of traces so that they don't fill up the RAM.
…_flush or configuration options to enable.
I would only change Context methods and I think we're fine. With this change we're:
- adding an upper bound to the Context size (based on count and not on the size, but it's good enough for now)
- adding an experimental feature that, if not activated, leaves the tracer behaving like before
@@ -798,6 +798,7 @@ Available options are:
- ``env``: set the environment. Rails users may set it to ``Rails.env`` to use their application settings.
- ``tags``: set global tags that should be applied to all spans. Defaults to an empty hash
- ``log``: defines a custom logger.
- ``partial_flush``: set to ``true`` to enable partial trace flushing (for long running traces.) Disabled by default. *Experimental.*
Let's keep this experimental for some releases. Then we may improve it, make it fully supported, or replace with a new behavior.
lib/ddtrace/context.rb
Outdated
@@ -161,6 +182,50 @@ def attach_sampling_priority
)
end

# Return the start time of the root span, or nil if there are no spans or this is undefined.
def start_time
Can we make these methods "private", or at least use a naming convention that discourages people from using them? Since this is internal machinery, I don't want developers to use an API that could be removed later. If we use __start_time, let's also state in the comment that this is an internal API.
All good to me! Thank you very much!