Track and emit segment loading rate for HttpLoadQueuePeon on Coordinator #16691
Conversation
Nice! Looks good to me. I left a few small suggestions and questions (mostly out of curiosity).
responses.stream()
         .filter(response -> response.getStatus().getState() == SegmentChangeStatus.State.SUCCESS)
         .map(DataSegmentChangeResponse::getRequest)
Minor optimization: to minimize the time spent inside the lock, we can avoid iterating over the `responses` again by computing the `loadSize` in the caller itself, where we are already switching based on the response state.
What do you think?
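For concreteness, a minimal sketch of the suggested refactor, assuming a hypothetical caller loop that already switches on the response state (the structure here is illustrative, not the actual `HttpLoadQueuePeon` code):

```java
long loadSize = 0;
for (DataSegmentChangeResponse response : responses) {
  switch (response.getStatus().getState()) {
    case SUCCESS:
      // Accumulate loaded bytes here, so the rate-tracker update inside the
      // lock no longer needs a second pass over the responses.
      if (response.getRequest() instanceof SegmentChangeRequestLoad) {
        loadSize += ((SegmentChangeRequestLoad) response.getRequest()).getSegment().getSize();
      }
      // ... existing SUCCESS handling ...
      break;
    default:
      // ... existing handling for other states ...
      break;
  }
}
```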
Sure, we can make that change.
responses.stream()
         .filter(response -> response.getStatus().getState() == SegmentChangeStatus.State.SUCCESS)
         .map(DataSegmentChangeResponse::getRequest)
         .filter(req -> req instanceof SegmentChangeRequestLoad)
Is it also worth tracking the drop rate separately?
I did consider this, but it is not really going to be useful because:
- DROP actions rarely spend time in the queue. This is because DROPs are always prioritized over LOADs and are executed immediately.
- For a DROP action, rate in kbps is not very meaningful since the size of the segment doesn't exactly play a role in the time taken to delete the file from disk. I guess we would have to track drops per second, but that doesn't have any physical significance.
Makes sense, thanks for the explanation
.put("segmentsToLoadSize", loadSize) | ||
.put("segmentsToDropSize", dropSize) | ||
.put("expectedLoadTimeMillis", expectedLoadTimeMillis) | ||
.build(); | ||
} |
Curious why some of these stats are only available in the `simple` mode. I see with `full`, the response being returned currently is less flexible to add new things in this code...
Actually, the `full` mode is currently broken. It just returns the entire `LoadQueuePeon` object in the response. `CuratorLoadQueuePeon` had some `@JsonProperty` fields that were serialized out into the response. But `HttpLoadQueuePeon` has never had any `@JsonProperty` fields since it was first written, so we just get an empty object in the response of the API `/druid/v1/coordinator/loadQueue?full`.
Since no one has ever reported this issue, I assume no one is using it (the web-console certainly doesn't use it).
I will create a separate PR to either fix it or just get rid of it completely.
  updatedTotal.increment(-evictedHead.bytes, -evictedHead.millisElapsed);
}

if (updatedTotal.bytes > 0 && updatedTotal.millisElapsed > 0) {
Redundant check that will always be true, given we check for the input `bytes` and `millisElapsed` in line 46.
public synchronized void updateProgress(long bytes, long millisElapsed)
{
  if (bytes >= 0 && millisElapsed > 0) {
Should this be `bytes > 0`?
Yeah, I think that would be fair.
Although, I am not sure about tombstone segments. They would have size zero, but it would still take non-zero time to load them. I guess in that case it would make sense to account for the time in the overall window average.
Let me know what you think.
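To make that intent concrete, a minimal sketch of keeping the `bytes >= 0` guard so tombstones still count toward the window (method shape taken from the snippet above; the body comment is illustrative):

```java
public synchronized void updateProgress(long bytes, long millisElapsed)
{
  // Keep bytes >= 0 (rather than > 0) so that zero-byte tombstone segments
  // still contribute their elapsed load time to the window average.
  if (bytes >= 0 && millisElapsed > 0) {
    // ... accumulate (bytes, millisElapsed) into the current batch ...
  }
}
```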
docs/operations/metrics.md
@@ -325,6 +325,7 @@ These metrics are for the Druid Coordinator and are reset each time the Coordinator
|`segment/dropSkipped/count`|Number of segments that could not be dropped from any server.|`dataSource`, `tier`, `description`|Varies|
|`segment/loadQueue/size`|Size in bytes of segments to load.|`server`|Varies|
|`segment/loadQueue/count`|Number of segments to load.|`server`|Varies|
|`segment/loading/rateKbps`|Moving average rate of segment loading on a server in kbps.|`server`|Varies|
Looks like spellcheck is failing because of `kbps`. The spelling seems valid, so we can ignore it by adding it to the `.spelling` file.
Should we also document the window constraints (window size and bytes) in the user-facing docs? I think mentioning it will help users better interpret this moving average metric.
Co-authored-by: Abhishek Radhakrishnan <[email protected]>
@abhishekrb19, just realized that there is actually an error in the logic to compute the time taken to load. 😛 Imagine the following scenario: …

So total time measured by the Coordinator = …

I am trying to figure out how to fix this. Will update once I have a solution.

Update: There is already a …

Update: Logic is fixed, need to test it out a little more. Once that is done, I will update the description and documentation here.
@abhishekrb19, I have updated the docs and description in the PR.
Could you please share the motivation behind these changes?
Thanks for the feedback, @AmatyaAvadhanula!

Multiple loading threads: if a server has a single loading thread, the time taken by individual loads can be added up …

That said, if there is only one loading thread on the server (which is often the case, as seen in the formula below), …

Let me know what you think, @abhishekrb19, @AmatyaAvadhanula. (Even the naive logic can be improved in the future to account for …)

Segment loads are not tied to a coordinator cycle. "Coordinator cycle" or "coordinator run" simply refers to a single invocation of a duty …
Reverted to simpler logic that works well for single threads.
@kfaraz, thanks for the updates, I haven't looked into the latest changes yet. But a quick question based on the comments:
Does this mean that the simpler logic now only accounts for the default case of a single loading thread (…)?
Yes, @abhishekrb19.
Currently, there is no way for the coordinator to know the number of loading threads being used by a historical.
Got it. Yeah, making it an experimental feature makes sense. We can also call out the caveat about multiple loading threads in the docs. I will try to look at the latest updates soon!
@kfaraz, apologies for the delay in getting back.
The docs recommend having at least 16 vCPUs for data servers, so there will be at least 2 loading threads by default in production clusters. As to how much overlap there is between the time spent by loading threads, I'm not sure. Here are a few exploratory thoughts/ideas to simplify and track this more accurately:
I think one downside to this is that the rate computed on the historicals won't account for the end-to-end time (e.g., time spent in the queue, etc). If that is significant, we can perhaps track a metric separately on the coordinator?
Please let me know what you think.
This reverts commit a131073.
@abhishekrb19, based on our offline discussion, I have updated the PR.
Looks good overall to me 👍. Thanks @kfaraz!
/**
 * Total stats for the whole window. This includes the total from the current batch as well.
 */
private final AtomicReference<Entry> windowTotal = new AtomicReference<>(null);
nit: they're equivalent:

- private final AtomicReference<Entry> windowTotal = new AtomicReference<>(null);
+ private final AtomicReference<Entry> windowTotal = new AtomicReference<>();
public class LoadingRateTracker
{
  public static final int MOVING_AVERAGE_WINDOW_SIZE = 10;
  public static final long MIN_ENTRY_SIZE_BYTES = 1 << 30;
Add a comment that this is 1 GB? Or maybe initialize it as `1024 * 1024 * 1024`, which is a bit more straightforward.
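For reference, the two spellings denote the same value, so this is purely a readability choice:

```java
// 1 << 30 == 1024 * 1024 * 1024 == 1_073_741_824 bytes == 1 GiB
public static final long MIN_ENTRY_SIZE_BYTES = 1024 * 1024 * 1024;
```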
@@ -173,7 +175,7 @@ private void doSegmentManagement()
while (newRequests.size() < batchSize && queuedSegmentIterator.hasNext()) {
`currentTimeMillis` above can be removed as it is unused now.
* The loading rate is computed as a moving average of the last
* {@link #MOVING_AVERAGE_WINDOW_SIZE} segment batches (or more if any batch was
* smaller than {@link #MIN_ENTRY_SIZE_BYTES}). A batch is defined as a set of
* segments added to the load queue together.
I think a few more things should be called out in this class-level javadoc, specifically around the usage of functions.

- It would be helpful to document what a batch is more concretely in terms of this specific implementation. For example, something like: multiple updates invoked between a `markBatchLoadingStarted()` and a `markBatchLoadingFinished()` call constitute a batch (see the sketch after this list).
- Clarify the difference between `reset()` and `markBatchLoadingStopped()`, or when these should be used; this can just be method-level javadocs if you will. For instance, in `HttpLoadQueuePeon`, I expected to see a `reset()` to reset the tracker's state once there are no more segments to be loaded, but we explicitly call `markBatchLoadingStopped()`. Also, what do you think about renaming `reset()` to `stop()`?
  }
}

public void reset()
Should we call this `stop()` to avoid any confusion with `markBatchLoadingFinished()`? `stop()` would also align with how it's used in the `LoadQueuePeon`. At the very least, a javadoc here will be helpful.
/**
 * Total stats for the whole window. This includes the total from the current batch as well.
 */
private final AtomicReference<Entry> windowTotal = new AtomicReference<>(null);
Is `windowTotal` an atomic reference because it's also used by the API `getMovingAverageLoadRateKbps()`? As the javadoc notes, all the other state keeping is thread safe via `HttpLoadQueuePeon`.

Asking because `LoadingRateTracker` is marked `@NotThreadSafe`, so it would be useful to add a comment for `windowTotal` to explain why there is a synchronization primitive, to avoid confusion.
Thanks a lot for the feedback, @abhishekrb19! 😃
Design
The loading rate is computed as a moving average of at least the last 10 GiB of successful segment loads.
To account for multiple loading threads on a server, we use the concept of a batch to track load times.
A batch is a set of segments added by the coordinator to the load queue of a server in one go.
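In terms of the window totals, the emitted rate reduces to a simple ratio (a sketch; `windowTotalBytes` and `windowTotalMillis` are illustrative names):

```java
// 1 byte = 8 bits, and bits per millisecond equal kilobits per second,
// so the moving average over the window reduces to:
long rateKbps = (8 * windowTotalBytes) / windowTotalMillis;
```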
Changes
- Add `LoadingRateTracker`, which computes a moving average load rate based on the last few GBs of successful segment loads.
- Emit metric `segment/loading/rateKbps` from the Coordinator. In the future, we may also consider emitting this metric from the historicals themselves.
- Add `expectedLoadTimeMillis` to the response of the API `/druid/coordinator/v1/loadQueue?simple`.
Testing done
Release note
- Emit new metric `segment/loading/rateKbps` from the Coordinator, which tracks the current segment loading rate of a server. Dimensions: `server`.
- Add `expectedLoadTimeMillis` to the response of the API `/druid/coordinator/v1/loadQueue?simple`.
Future work
The `expectedLoadTimeMillis` can be used to show the expected load time on the web-console. This PR does not include web-console changes; the screenshot below is only for reference.
This PR has: