Benchmarks: Re-factored benchmark infra #7602

Closed. aleph-zero wants to merge 17 commits from the feature/bench branch.
Conversation

aleph-zero (Contributor) commented:

Major re-factoring to use a dual-channel strategy for executing
benchmarks: cluster metadata manages lifecycle events, while the
transport channel carries benchmark definitions and results between
master and executor nodes. (A rough sketch follows the list below.)

This commit also includes:

  • Wildcard requests for pause/resume/abort.
  • A fix to the transport request ACTION string to conform to security
    naming conventions.
  • A fix for a bug that incorrectly calculated total requested iterations
    when summary computation was called more than once.
  • A rename of the field 'total_completed_queries' for better readability.
  • Clearing of the slowest-request values when re-calculating summary
    results.
  • Simplified use of barriers and semaphores in the testing logic, allowing
    benchmark execution to be suspended to test various states.
  • Handling of cases where executor nodes drop from the cluster during
    execution.
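As a rough illustration of the dual-channel split described above (a sketch
only; every name below is hypothetical, not one of the PR's actual classes):

import java.util.List;

// Channel 1: a small lifecycle entry published through cluster metadata,
// so every node observes state transitions.
enum BenchmarkLifecycle { INITIALIZING, RUNNING, PAUSED, COMPLETED, ABORTED, FAILED }

final class BenchmarkLifecycleEntry {               // carried in cluster state
    final String benchmarkId;
    final BenchmarkLifecycle state;

    BenchmarkLifecycleEntry(String benchmarkId, BenchmarkLifecycle state) {
        this.benchmarkId = benchmarkId;
        this.state = state;
    }
}

// Channel 2: the bulky payload (benchmark definition, per-node results)
// sent point-to-point over the transport layer between the master and
// the executor nodes.
final class BenchmarkDefinitionRequest {
    final String benchmarkId;
    final List<String> competitorDefinitions;       // serialized query definitions

    BenchmarkDefinitionRequest(String benchmarkId, List<String> competitorDefinitions) {
        this.benchmarkId = benchmarkId;
        this.competitorDefinitions = competitorDefinitions;
    }
}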

Andrew Selden added 2 commits September 4, 2014 15:00
- Major re-factoring to use a dual-channel strategy for executing
  benchmarks (full commit message as in the PR description above).
- Re-construct benchmark search requests to use the request headers and
  context from the originating request, to ensure we execute with the
  proper credentials.
}

return newSearchRequests;
}
aleph-zero (author) commented:
@uboness @javanna Could you comment as to whether this is the correct way to pass the original request headers and context through to sub-searches?
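For context, a hedged sketch of the pattern being asked about; the copy
helper below is an assumption about the 1.x request API, not a verified
method name, and `searchRequests`/`originatingRequest` are assumed in scope:

// Rebuild each search request so sub-searches inherit the headers and
// context of the originating benchmark request; otherwise they would
// execute without the caller's credentials.
List<SearchRequest> newSearchRequests = new ArrayList<>();
for (SearchRequest original : searchRequests) {
    SearchRequest copy = new SearchRequest(original.indices());
    copy.source(original.source());
    copy.copyContextAndHeadersFrom(originatingRequest);   // assumed helper
    newSearchRequests.add(copy);
}
return newSearchRequests;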

Andrew Selden added 5 commits September 4, 2014 16:53
- Minor refactoring in response to PR criticism: subclass AbstractComponent
  instead of AbstractLifecycleComponent, and move a utility method out of
  AbstractBenchmarkService.
- Make the mapping of competitor names to semaphores an immutable map,
  since it never changes.
- Changed the comment about propagating headers and context to be more
  general.
- Added explanatory comments for lock usage in BenchmarkExecutor, plus
  minor readability improvements to conditional logic.
- When computing the slowest requests, use the average time instead of the
  maximum time.

final InternalCoordinatorState ics = new InternalCoordinatorState(request, nodeIds, listener);

ics.onReady = new OnReadyStateChangeListener(ics);
Contributor commented:
I had a really hard time understanding all these *StateChangeListeners, and
I think they should all be folded into the InternalCoordinatorState. It
looks a lot like C, where you pass a struct into functions to work around
the lack of OO concepts. I think it complicates this class a lot; can you
please move it out?

I also don't understand the synchronisation on the ChangeListeners, since
they should not be accessed concurrently. If you want to protect the state,
shouldn't you synchronize on the state instead?
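Roughly the shape being suggested, as a sketch with hypothetical names:

// Fold the transition logic into the per-benchmark state object itself,
// rather than passing the state into free-standing *StateChangeListener
// objects; any synchronization then naturally lives on the state.
final class State {
    private final BenchmarkStartRequest request;    // hypothetical request type

    State(BenchmarkStartRequest request) {
        this.request = request;
    }

    synchronized void onReady() {
        // INITIALIZING -> RUNNING: publish the transition via cluster metadata
    }

    synchronized void onComplete() {
        // RUNNING -> COMPLETED: collect node results and notify the listener
    }
}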

aleph-zero (author) replied:
Latest commit changes this pattern per your comments. Would you take a look and let me know if the changes are in line with your thinking?

Andrew Selden added 4 commits September 18, 2014 16:13
- Moved *StateChangeListener methods into the per-benchmark State class;
  renamed InternalCoordinatorState to State.
- Added BatchedResponder<T> for handling wildcard responses.
- Moved the log() method to toString() in BenchmarkMetaData; made the State
  class private, not protected; replaced a big if block with an
  EnumSet.contains() (see the sketch after this list).
- Simplified concurrency controls for BenchmarkState: removed locks from
  BenchmarkExecutor and BenchmarkState and added concurrency control to
  BenchmarkExecutorService.
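The EnumSet change amounts to something like the following fragment, using
the NodeState constants that appear later in this thread (a sketch, with
java.util.EnumSet assumed imported):

// One terminal-state set replaces a chain of equality checks.
private static final EnumSet<BenchmarkMetaData.Entry.NodeState> TERMINAL_STATES =
        EnumSet.of(BenchmarkMetaData.Entry.NodeState.COMPLETED,
                   BenchmarkMetaData.Entry.NodeState.ABORTED,
                   BenchmarkMetaData.Entry.NodeState.FAILED);

if (TERMINAL_STATES.contains(entry.nodeStateMap().get(nodeId()))) {
    break;
}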
"documentation": "http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html",
"methods": ["PUT"],
"url": {
"path": "/_bench",
"path": "/_bench/_submit",
Contributor commented:
Since interfaces changed, it would be great to update docs as well.

aleph-zero (author) replied:
Fixed.


final int numMeasurements = settings.multiplier() * searchRequests.size();
final long[] timeBuckets = new long[numMeasurements];
final long[] docBuckets = new long[numMeasurements];

Contributor commented:
It feels like timeBuckets and docBuckets can be combined into some sort of stats class and a lot of bucket manipulation code sprinkled through BenchmarkExecutor can be nicely encapsulated there.

aleph-zero (author) replied:
Moved these into a Measurements class and cleaned up the code to not pass around arrays.
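A plausible shape for that class (hypothetical sketch; the PR's actual
Measurements API may differ):

// Encapsulates the former long[] timeBuckets / long[] docBuckets pair so
// callers record and summarize measurements without touching raw arrays.
final class Measurements {
    private final long[] times;   // per-iteration latency, in milliseconds
    private final long[] docs;    // per-iteration hit counts

    Measurements(int numMeasurements) {
        this.times = new long[numMeasurements];
        this.docs = new long[numMeasurements];
    }

    void record(int iteration, long tookMillis, long totalHits) {
        times[iteration] = tookMillis;
        docs[iteration] = totalHits;
    }

    long totalTimeMillis() {
        long sum = 0;
        for (long t : times) {
            sum += t;
        }
        return sum;
    }
}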

When all the executor nodes for a running benchmark drop out of the
cluster, we need to force the benchmark to fail and return to the
caller.

Also fixes a bug that could sometimes cause listener.onResponse() to be
called twice.
imotov (Contributor) commented Sep 24, 2014:

A slight variation of the same failing scenario:

1. Start a node with -Des.node.bench=false as a master on port 9200.
2. Start a node with -Des.node.bench=true as a master on port 9201.
3. Start a benchmark by sending it to port 9201:

curl -XPUT 'localhost:9201/_bench/_submit?pretty=true' -d '{
    "name": "my_benchmark",
    "competitors": [ {
        "name": "my_competitor",
        "requests": [ {
            "query": {
                "match": { "_all": "a*" }
            }
        } ]
    } ]
}'

4. While this benchmark is running, kill the node running on port 9200
   using kill -9.
5. Run curl 'localhost:9201/_bench/_status?pretty' and notice that
   my_benchmark is now stuck with COMPLETED status in the cluster state.
6. Run curl -XPOST 'localhost:9201/_bench/_abort/my_benchmark?pretty'. It
   never returns, and in the log files on the server you can see a message
   like this:
org.elasticsearch.ElasticsearchIllegalStateException: benchmark [my_benchmark]: missing internal state
    at org.elasticsearch.action.benchmark.BenchmarkCoordinatorService$5.onResponse(BenchmarkCoordinatorService.java:355)
    at org.elasticsearch.action.benchmark.BenchmarkCoordinatorService$5.onResponse(BenchmarkCoordinatorService.java:341)
    at org.elasticsearch.action.benchmark.BenchmarkStateManager$UpdateTask.clusterStateProcessed(BenchmarkStateManager.java:172)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:466)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

builder.field(Fields.TOTAL_QUERIES, totalQueries);
builder.field(Fields.CONCURRENCY, concurrency);
builder.field(Fields.MULTIPLIER, multiplier);
builder.field(Fields.AVG_WARMUP_TIME, avgWarmupTime);
builder.field(Fields.AVG_WARMUP_TIME, Double.valueOf(formatter.format(avgWarmupTime)));
Contributor commented:
Would it make more sense to allow clients to reduce precision if needed? I realize that the json output looks much more readable this way, but it just seems strange and arbitrary to reduce precision to 2 decimal points in the serialization code.
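One alternative that keeps full precision on the wire, sketched below; the
extra field name and the reuse of the ?human flag convention are
assumptions, not the PR's actual behavior:

// Always emit the raw double so clients keep full precision; add a
// rounded, human-readable sibling field only when explicitly requested.
builder.field(Fields.AVG_WARMUP_TIME, avgWarmupTime);
if (params.paramAsBoolean("human", false)) {
    builder.field("avg_warmup_time_pretty", formatter.format(avgWarmupTime));
}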

Andrew Selden added 3 commits September 24, 2014 16:22
- Refactored state management in the executor; all state is now managed in
  the executor service class.
- Added a Measurements class instead of passing arrays of long around
  everywhere.
- Moved all state management into the State class; previously it was
  scattered throughout the executor service.
if (entry.nodeStateMap().get(nodeId()) == BenchmarkMetaData.Entry.NodeState.COMPLETED &&
entry.nodeStateMap().get(nodeId()) == BenchmarkMetaData.Entry.NodeState.ABORTED &&
entry.nodeStateMap().get(nodeId()) == BenchmarkMetaData.Entry.NodeState.FAILED) {
break;
Contributor commented:
Is this possible?
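As written, the condition can never be true, since a node's state is a
single value; presumably a disjunction was intended, along these lines:

// A node's state is one value, so requiring it to equal COMPLETED,
// ABORTED, and FAILED simultaneously always fails. The intended check
// is presumably:
BenchmarkMetaData.Entry.NodeState state = entry.nodeStateMap().get(nodeId());
if (state == BenchmarkMetaData.Entry.NodeState.COMPLETED ||
    state == BenchmarkMetaData.Entry.NodeState.ABORTED ||
    state == BenchmarkMetaData.Entry.NodeState.FAILED) {
    break;
}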

Andrew Selden added 2 commits October 7, 2014 13:11
- Changed API endpoints to not use '_' in the path twice. The old endpoint
  format was '/_bench/_{action}'; the new format is '/_bench/{action}'.
- Detect cases where the master node changed. In such cases we abort the
  benchmark, clean out the cluster state, and send a failure back to the
  caller.
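A minimal sketch of such detection, assuming the 1.x ClusterStateListener
API; the PR's actual hook may differ:

// Watch cluster-state changes: if the elected master changed while a
// benchmark is running, abort it, remove its cluster-state entry, and
// fail the caller's listener.
clusterService.add(new ClusterStateListener() {
    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        if (event.nodesDelta().masterNodeChanged()) {
            // abort running benchmarks, clean out their metadata entries,
            // and send a failure back to the original caller
        }
    }
});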
javanna (Member) commented Mar 20, 2015:

Waiting for #6914, marked as stalled.

jpountz (Contributor) commented Sep 11, 2015:

This PR is already hard to merge due to recent changes, and this won't get better until #6914 is in. Closing: we will have to do things from scratch again anyway.

jpountz closed this Sep 11, 2015.
jasontedor deleted the feature/bench branch August 12, 2016 11:15.