-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmarks: Re-factored benchmark infra #7602
Conversation
Major re-factoring to use a dual-channel strategy for executing benchmarks. Uses cluster metadata for managing lifecycle events, but transport channel to send benchmark definitions and results between master and executor nodes. This commit also adds: - Wildcard requests for pause/resume/abort. - Fixes transport request ACTION string to conform to security naming conventions. - Fixed bug that incorrectly calculated total requested iterations in cases where summary computation was called more than once. - Rename field 'total_completed_queries' for better readability. - Clear values for slowest when re-calculating summary results. - Simplified use of barriers and semaphores in testing logic to allow suspending benchmark execution for testing various states. - Handle cases where executor nodes drop from the cluster during execution.
Re-construct benchmark search requests to use the request headers and context from the originating request. This is to ensure we execute with the proper credentials.
} | ||
|
||
return newSearchRequests; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor refactoring in response to PR criticism - Subclass AbstractComponent instead of AbstractLifecycleComponent - Move utility method out of AbstractBenchmarkService
The mapping of competitor names to semaphores never changes. It is better to make it an immutable map, which is what this commit does.
Changed comment about propagating headers and context to be more general.
Adding explanatory comments for lock usage in BenchmarkExecutor. Minor readability improvements to conditional logic.
When computing the slowest requests, use the average time instead of the maximum time.
|
||
final InternalCoordinatorState ics = new InternalCoordinatorState(request, nodeIds, listener); | ||
|
||
ics.onReady = new OnReadyStateChangeListener(ics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a really hard time to understand all these *StateChangeListeners
and I think they should all be folded into the InternalCoordinatorState
It looks a lot like C where you pass a struct into functions to work around the lack of OO concepts. I think it complicates this class a lot, can you please move it out?
I also don't understand the synchronisation on the ChangeListeners since they should not be accessed concurrently? If you wanna protect the state it should be syncing on the state instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest commit changes this pattern per your comments. Would you take a look and let me know if the changes are in-line with your thinking?
Moved *StateChangeListener methods into the per-benchmark State class. Renamed InternalCoordinatorState to State. Added BatchedResponder<T> for handling wildcard responses. Moved log() method to toString() in BenchmarkMetaData.
Make the State class private, not protected.
Replace a big if block with an EnumSet.contains().
Simplified concurrency controls for BenchmarkState. Removed locks from BenchmarkExecutor and BenchmarkState and add concurrency control to BenchmarkExecutorService
"documentation": "http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html", | ||
"methods": ["PUT"], | ||
"url": { | ||
"path": "/_bench", | ||
"path": "/_bench/_submit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since interfaces changed, it would be great to update docs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
final int numMeasurements = settings.multiplier() * searchRequests.size(); | ||
final long[] timeBuckets = new long[numMeasurements]; | ||
final long[] docBuckets = new long[numMeasurements]; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like timeBuckets
and docBuckets
can be combined into some sort of stats
class and a lot of bucket manipulation code sprinkled through BenchmarkExecutor can be nicely encapsulated there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved these into a Measurements class and cleaned up the code to not pass around arrays.
When all the executor nodes for a running benchmark drop out of the cluster, we need to force the benchmark to fail and return to the caller. Also fixes a bug that could sometimes cause listener.onResponse() to be called twice
A slight variation of the same failing scenario:
|
builder.field(Fields.TOTAL_QUERIES, totalQueries); | ||
builder.field(Fields.CONCURRENCY, concurrency); | ||
builder.field(Fields.MULTIPLIER, multiplier); | ||
builder.field(Fields.AVG_WARMUP_TIME, avgWarmupTime); | ||
builder.field(Fields.AVG_WARMUP_TIME, Double.valueOf(formatter.format(avgWarmupTime))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to allow clients to reduce precision if needed? I realize that the json output looks much more readable this way, but it just seems strange and arbitrary to reduce precision to 2 decimal points in the serialization code.
Refactored state management in the executor. All state is now managed in the executor service class.
Add Measurements class instead of passing around arrays of long everywhere.
Moved all state management into State class. Previously it was scattered throughout the executor service.
if (entry.nodeStateMap().get(nodeId()) == BenchmarkMetaData.Entry.NodeState.COMPLETED && | ||
entry.nodeStateMap().get(nodeId()) == BenchmarkMetaData.Entry.NodeState.ABORTED && | ||
entry.nodeStateMap().get(nodeId()) == BenchmarkMetaData.Entry.NodeState.FAILED) { | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible?
Changed API endpoints to not use ‘_’ in the path twice. Old endpoint format was: ‘/_bench/_{action}’ New format is: ‘/_bench/{action}’
Detect cases where the master node changed. In such cases we abort the benchmark, clean out the cluster state, and send a failure back to the caller.
Waiting for #6914, marked as stalled |
This PR is already hard to merge due to recent changes, and this won't get better until #6914 is in. Closing: we will have to do things from scratch again anyway. |
Major re-factoring to use a dual-channel strategy for executing
benchmarks. Uses cluster metadata for managing lifecycle events, but
transport channel to send benchmark definitions and results between
master and executor nodes.
This commit also adds:
conventions.
cases where summary computation was called more than once.
suspending benchmark execution for testing various states.
execution.