
[tracer] reduce memory usage on high-cardinality traces #247

Merged
merged 4 commits into 0.12-dev from christian/limittracesize
Apr 6, 2018

Conversation

@ufoot (Contributor) commented Nov 9, 2017

For big traces (typically long-running traces with one enclosing span and many sub-spans, possibly several thousand), the library could keep everything in memory waiting for a hypothetical flush.

This patch partially flushes consistent parts of traces so that they don't fill up RAM.
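The mechanism can be sketched in plain Ruby. This is an illustration only: `SpanBuffer` and its soft limit are hypothetical names, and it simplifies the real patch by treating every finished span as flushable, whereas the actual code only flushes consistent sub-trees.

```ruby
# Minimal sketch of partial flushing (hypothetical SpanBuffer, not the
# library's actual class; the real patch only flushes consistent sub-trees).
Span = Struct.new(:id, :parent_id, :finished)

class SpanBuffer
  def initialize(min_spans_before_partial_flush: 5)
    @spans = []
    @min = min_spans_before_partial_flush
  end

  def add(span)
    @spans << span
  end

  # Once the buffer crosses the soft limit, hand finished spans to the
  # caller and keep only unfinished ones (e.g. the enclosing root) in memory.
  def partial_flush
    return [] if @spans.size < @min
    flushed, remaining = @spans.partition(&:finished)
    @spans = remaining
    flushed
  end
end
```

In this sketch the unfinished enclosing span stays buffered while its many finished children can be handed to the writer early, instead of all of them accumulating until the root finishes.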

@ufoot ufoot added bug Involves a bug core Involves Datadog core libraries labels Nov 9, 2017
@ufoot ufoot added this to the 0.9.2 milestone Nov 9, 2017
@ufoot ufoot requested a review from palazzem November 9, 2017 21:12
@palazzem palazzem changed the title [tracer] fix memory leak on high-cardinality traces [tracer] fix memory usage on high-cardinality traces Nov 10, 2017
@ufoot ufoot changed the title [tracer] fix memory usage on high-cardinality traces [tracer] reduce memory usage on high-cardinality traces Nov 10, 2017
@ufoot ufoot force-pushed the christian/limittracesize branch 3 times, most recently from a9552bc to 43cb1ac Compare November 10, 2017 15:09
roots, marked_ids = partial_roots()
return nil unless roots

return unless roots
Contributor:

Is this a duplicate of line 144?

ufoot (Contributor, Author):

Thanks for spotting this, indeed, it is.

@ufoot ufoot modified the milestones: 0.9.2, 0.10.0 Nov 14, 2017
@ufoot ufoot requested a review from p-lambert November 14, 2017 13:50
def partial_roots
return nil unless @current_span

marked_ids = Hash[([@current_span.span_id] + @current_span.parent_ids).map { |id| [id, true] }]
Member:

Maybe a Set data structure would be more idiomatic here: http://ruby-doc.org/stdlib-2.4.2/libdoc/set/rdoc/Set.html

ufoot (Contributor, Author):

👍
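The difference the suggestion points at, in isolation: a Hash whose values are all `true` works as a membership lookup, but Ruby's stdlib `Set` expresses the same intent more directly (variable values here are illustrative):

```ruby
require 'set'

span_id    = 42
parent_ids = [7, 13] # illustrative IDs

# Hash-of-true idiom from the original diff:
marked_hash = Hash[([span_id] + parent_ids).map { |id| [id, true] }]

# Equivalent, more idiomatic Set:
marked_ids = Set.new([span_id] + parent_ids)

marked_hash.key?(7)     # membership test via Hash
marked_ids.include?(7)  # membership test via Set
```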

ids.reject! { |id| marked_ids.key? id }
ids.each do |id|
if roots_spans.key?(id)
unfinished[id] = true unless span.finished?
@p-lambert (Member) commented Nov 14, 2017:

I think we can simplify this (and avoid unnecessary lookups) a little bit by doing:

root_spans.delete(id) unless span.finished?

That way we can get rid of the unfinished hash and roots_spans.reject! { |id| unfinished.key? id } altogether. We will also skip some work for spans that belong to that subtree.

ufoot (Contributor, Author):

Will try this for sure, sounds good.
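Stripped down to the essentials, the suggested simplification looks like this (the data and hash shapes are hypothetical stand-ins, not the actual context internals):

```ruby
# Hypothetical data: two candidate partial roots, one with an unfinished span.
roots_spans = { 1 => [], 2 => [] }
spans = [
  { span_id: 1, finished: true },
  { span_id: 2, finished: false }
]

spans.each do |span|
  id = span[:span_id]
  next unless roots_spans.key?(id)
  # Suggested form: drop the candidate root as soon as one of its spans
  # turns out to be unfinished, instead of tracking it in a separate
  # `unfinished` hash and rejecting it in a final pass.
  roots_spans.delete(id) unless span[:finished]
end
```

Only root 1 survives as a flush candidate; the separate bookkeeping hash and the trailing `reject!` both disappear.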


# Return a hash containing all sub traces which are candidates for
# a partial flush.
def partial_roots_spans
@p-lambert (Member) commented Nov 14, 2017:

I'll do some whiteboarding later, but I suspect we can simplify and improve the complexity of this algorithm a little bit if we build a graph and then search for all reachable nodes from the partial_roots, instead of the other way around (traversing the subtrees of all spans in search of the partial_roots). I might be wrong, but I think it's worth trying before we merge this.

ufoot (Contributor, Author):

Yes, possibly; this was just a first, naive implementation, and it's obviously fine to iterate on it.
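The reviewer's alternative could be sketched like this (purely illustrative, not the PR's code): build a parent-to-children adjacency map once, then collect everything reachable from the partial roots, rather than walking every span's subtree looking for roots.

```ruby
# Sketch: collect all spans reachable from the partial roots via a
# parent_id -> children adjacency map (illustrative span hashes).
def reachable_from(roots, spans)
  children = Hash.new { |h, k| h[k] = [] }
  spans.each { |s| children[s[:parent_id]] << s }

  result = []
  queue = roots.flat_map { |r| children[r] }
  until queue.empty?
    span = queue.shift
    result << span
    queue.concat(children[span[:id]])
  end
  result
end
```

Each span is visited at most once, so the traversal is linear in the number of spans plus edges.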

end

# Iterate on each span within the trace. This is thread safe.
def each_span
Member:

👍

@@ -289,8 +292,15 @@ def record(context)
context = context.context if context.is_a?(Datadog::Span)
return if context.nil?
trace, sampled = context.get
ready = !trace.nil? && !trace.empty? && sampled
write(trace) if ready
if sampled
Member:

I'm probably missing something, but it seems to me that if the current trace has an unfinished Span, Context#check_finished_spans will return false and Context#get will return [nil, nil]. Doesn't that prevent the partial flush?


public :partial_roots
public :partial_roots_spans
public :partial_flush
Member:

I tend to think of private methods as implementation details. If we rely on them for testing, then we're testing more the implementation than the "behavior" provided, making this code almost impossible to refactor in the future. I think it would be better if we could use something like a FauxWriter and inspect if the buffer contains something like [partial_trace1, partial_trace2, partial_trace3] (that is, testing the "side-effect" of the ContextFlush). I completely understand that we might be short on time now, but I think that's a topic worth discussing at some point :)

ufoot (Contributor, Author):

Mmm, I'll give it a try anyway. Now that we have this each_partial_trace entry point, which I think is fine to keep public, I could rewrite part of the tests to rely on it without changing the whole world.
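The behavioral style of test the reviewer describes might look roughly like this (FauxWriter is shown here as a hypothetical stand-in; the actual test setup in the repository differs):

```ruby
# Hypothetical test double that records whatever is flushed to it, so tests
# can assert on the ContextFlush side effects instead of private methods.
class FauxWriter
  attr_reader :traces

  def initialize
    @traces = []
  end

  def write(trace)
    @traces << trace
  end
end

writer = FauxWriter.new
# ... in a real test: run a tracer wired to this writer and trigger
# partial flushes; here we simulate the resulting write calls ...
writer.write([:partial_trace1])
writer.write([:partial_trace2])
```

The assertion then targets `writer.traces`, i.e. the observable behavior, so the private helpers stay free to change.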

return nil unless roots

roots_spans = Hash[roots.map { |id| [id, []] }]
unfinished = Set.new
Member:

Just curious: that suggestion I gave about the unfinished variable didn't work?

ufoot (Contributor, Author):

I'm reading this again... and... no, but the code here does not work either. Some code paths are obviously not triggered by any tests yet. Fixing.

@ufoot ufoot force-pushed the christian/limittracesize branch 4 times, most recently from 8bf7b1d to 53cad3f Compare November 17, 2017 14:43

context = tracer.call_context
tracer.configure(min_spans_before_partial_flush: context.max_length,
max_spans_before_partial_flush: context.max_length,
ufoot (Contributor, Author):

As of today, the context max_length and the context_flush boundaries are the same, but it does not have to stay that way. In the future, the hard limit could be higher; this test has no reason to fail then, as it only checks that when the hard and soft limits are the same, the hard limit prevails.

@palazzem palazzem removed this from the 0.10.0 milestone Nov 23, 2017
# by default, soft and hard limits are the same
DEFAULT_MAX_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
# by default, never do a partial flush
DEFAULT_MIN_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
ufoot (Contributor, Author):

Set this to 10 to prevent anything smaller than 10 spans from being flushed, even partially.

# It performs memory flushes when required.
class ContextFlush
# by default, soft and hard limits are the same
DEFAULT_MAX_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
ufoot (Contributor, Author):

Set this to 1000 to have a partial flush triggered for anything bigger than 1000 spans.
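Put together, the two thresholds behave roughly like this (a sketch of the intent described in the comments above, not the actual ContextFlush implementation; the numbers are the examples given):

```ruby
# Sketch of the soft/hard threshold semantics (illustrative values).
MIN_SPANS_BEFORE_PARTIAL_FLUSH = 10   # soft limit: below this, never flush
MAX_SPANS = 1000                      # hard limit: context size cap

def partial_flush_allowed?(span_count)
  span_count >= MIN_SPANS_BEFORE_PARTIAL_FLUSH
end

def over_hard_limit?(span_count)
  span_count > MAX_SPANS
end
```

With the defaults, both constants equal the context's max length, so partial flushing never kicks in before the hard limit and the tracer behaves as before.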

@palazzem palazzem added the do-not-merge/WIP Not ready for merge label Feb 7, 2018
@palazzem palazzem requested a review from delner March 9, 2018 14:34
@palazzem palazzem added this to the 0.12.0 milestone Mar 9, 2018
@delner delner changed the base branch from master to 0.12-dev March 12, 2018 19:10
# We need to reject by span ID and not by value, because a span
# value may be altered (typical example: it's finished by some other thread)
# since we lock only the context, not all the spans which belong to it.
context.delete_span_if { |span| flushed_ids.include? span.span_id }
@delner (Contributor) commented Mar 12, 2018:

This seems dangerous in the case that writing fails. If I'm understanding it correctly, first we get all partial traces, deleting them permanently from the context, then iterate on each span via each_partial_trace. But when each_partial_trace calls write(trace), if that fails (say there's a blip in connectivity), then an exception will be thrown, and the deleted spans won't actually be flushed (with no means of recovering them.)

Could we do this after the yield on line 124? Or within the tracer as a part of the write operation?

@delner (Contributor) commented Mar 30, 2018:

Although the point technically stands, the write operation is only writing to a buffer right now, and in the case of the SyncWriter, partial flushing doesn't apply. The odds of it failing are thus really low. So we probably don't have to change this.

@palazzem palazzem removed the do-not-merge/WIP Not ready for merge label Apr 5, 2018
ufoot and others added 2 commits April 5, 2018 17:35
For big traces (typically long-running traces with one enclosing span and many sub-spans, possibly several thousand) the library could keep everything in memory waiting for a hypothetical flush.

This patch partially flushes consistent parts of traces, so that they don't fill up RAM.
@palazzem (Contributor) left a comment:

I would only change Context methods and I think we're fine. With this change we're:

  • adding an upper bound to the Context size (based on span count rather than actual memory size, but it's good enough for now)
  • adding an experimental feature that, if not activated, leaves the tracer behaving exactly as before

@@ -798,6 +798,7 @@ Available options are:
- ``env``: set the environment. Rails users may set it to ``Rails.env`` to use their application settings.
- ``tags``: set global tags that should be applied to all spans. Defaults to an empty hash
- ``log``: defines a custom logger.
- ``partial_flush``: set to ``true`` to enable partial trace flushing (for long-running traces). Disabled by default. *Experimental.*
Contributor:

Let's keep this experimental for a few releases. Then we may improve it, make it fully supported, or replace it with a new behavior.
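Based on the options list in the docs diff above, opting in would look something like this (the ``partial_flush`` option name comes from the diff; the exact ``configure`` call shape is assumed for this 0.12-era release line):

```ruby
require 'ddtrace'

# Enable experimental partial trace flushing for long-running traces
# (option name from the docs diff; configure signature assumed).
Datadog.tracer.configure(partial_flush: true)
```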

@@ -161,6 +182,50 @@ def attach_sampling_priority
)
end

# Return the start time of the root span, or nil if there are no spans or this is undefined.
def start_time
Contributor:

Can we make these methods "private", or at least use a convention to discourage people from using them? Since this is internal machinery, I don't want developers relying on an API that could be removed later. If we use the __start_time convention, let's also note in the comment that this is an internal API.

@palazzem (Contributor) left a comment:

All good to me! Thank you very much!

@delner delner merged commit b3a85cd into 0.12-dev Apr 6, 2018
@delner delner deleted the christian/limittracesize branch April 6, 2018 21:53