Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ALLUXIO-2182] Added a metrics master for aggregating the client metrics #7251

Merged
merged 29 commits into from
May 18, 2018

Conversation

yupeng9
Copy link
Contributor

@yupeng9 yupeng9 commented May 7, 2018

https://alluxio.atlassian.net/browse/ALLUXIO-2182

Each client sends over the metrics to the master in its heartbeat. And the new metrics master will aggregate the metrics defined by the aggregators.

This PR adds a client-master heartbeat to send the client-side metrics info. More test coverage improvement will be added once we agree on the architecture.

@yupeng9 yupeng9 requested review from calvinjia and aaudiber May 7, 2018 21:13
@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

Copy link
Contributor

@aaudiber aaudiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level review:

  • We might lose metrics from short-lived clients because they exit before sending the final heartbeat to master. Do you think it makes sense to send one last metric heartbeat before letting the JVM exit?
  • It could be useful if to make the client ID configurable so that users can use the same ID across their application and then aggregate to see only metrics for their application.
  • Could we add a metric to track how many master threads are being consumed by client metric heartbeats?

@yupeng9
Copy link
Contributor Author

yupeng9 commented May 7, 2018

#1 We could add a shutdown hook for the last heartbeat. However, I feel this is a bit overengineered, as the metrics, imo, might not necessary to be exact but more best efforts. Losing some metrics in 1-2s might not be too problematic.
#2 We can have it as a configurable value, but before we see some real use cases or requests, I feel it is not a must-have feature to be in this framework PR.
#3 We could. But I want to ramp up metrics in future PRs, and this one is to set up the foundation.

@calvinjia
Copy link
Contributor

@yupeng9 What's the overhead of the client to master heartbeat, could this be a problem in environments with 1000s of clients?

@aaudiber
Copy link
Contributor

aaudiber commented May 7, 2018

  1. If the client runs for shorter than the heartbeat interval, it will send no metrics. I think that could be problematic, especially if we need to increase the heartbeat interval to reduce pressure on the master.
  2. It seems like a very low-hanging fruit that's much more useful than per-jvm randomly generated ids. We can do it in a different PR if you prefer, but the ID used to identify clients seems like it fits in this PR as a part of the foundation/framework PR for client metrics.
  3. Ok, future PR is fine

@yupeng9
Copy link
Contributor Author

yupeng9 commented May 8, 2018

@aaudiber
hmm, for 1, I'm not that concerned, because the number of metrics that can happen within 1-2s should not be large. But I'm okay with adding it if you insist.

For 2, I prefer a separate PR for adding this optional client id.

@yupeng9
Copy link
Contributor Author

yupeng9 commented May 8, 2018

@calvinjia that's why I make the frequency of heartbeat configurable. The metrics collection, IMO, is not necessary to be too frequent given its nature. For 1 thousand service handling every a few seconds should be very doable. However, it does take connections to serve the heartbeat request, so increasing the size of thrift thread pool might be needed.

@yupeng9
Copy link
Contributor Author

yupeng9 commented May 8, 2018

@calvinjia in addition, I can put this feature behind a flag, so for large-scale applications, metrics can be turned off if it affects performance.

@calvinjia
Copy link
Contributor

@yupeng9 Have you verified there are threadpool issues when running with a large number of clients and does increasing the threadpool solve the issue without any side effects? I think it is important to have a solution which works for larger clusters (as opposed to just turning it off).

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/19531/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@@ -133,6 +149,9 @@ public static FileSystemContext create(Subject subject, MasterInquireClient mast
*/
private FileSystemContext(Subject subject) {
mParentSubject = subject;
mExecutorService = Executors.newFixedThreadPool(1,
ThreadFactoryUtils.build("metrics-master-heartbeat-%d", true));
mId = UUID.randomUUID().toString().replace("-", "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an IdUtils method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* If the task fails to heartbeat to the master, it will destroy its old master client and recreate
* it before retrying.
*/
@NotThreadSafe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we access this from multiple threads despite it being not thread safe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a heartbeat executor, and it's only invoked once per JVM?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also access it in the shutdown hook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true.

*/
public ClientMasterSync(MetricsMasterClient masterClient) {
mMasterClient = Preconditions.checkNotNull(masterClient, "masterClient");
mHeartbeatTimeoutMs = (int) Configuration.getMs(PropertyKey.WORKER_BLOCK_HEARTBEAT_TIMEOUT_MS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout value seems unrelated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @return the unique id of the context
*/
public String getId() {
return mId;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to determine the ID of a process, how will users be able to map an ID to a process after seeing the metrics be abnormal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but there does not seem a good approach commonly agreed on: https://stackoverflow.com/questions/35842/how-can-a-java-program-get-its-own-process-id?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

We could log the ID after it's created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging sounds good for now.

throw new RuntimeException("Master heartbeat timeout exceeded: " + mHeartbeatTimeoutMs);
}
LOG.error("Master heartbeat timeout exceeded: " + mHeartbeatTimeoutMs);
System.exit(-1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to system exit if the metrics heartbeat times out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and made it best effort to send the metrics info.

@@ -40,6 +40,7 @@
public static final String WORKER_FILESYSTEM_MASTER_SYNC = "Worker FileSystemMaster Sync";
public static final String WORKER_PIN_LIST_SYNC = "Worker Pin List Sync";
public static final String WORKER_SPACE_RESERVER = "Worker Space Reserver";
public static final String METRICS_MASTER_SYNC = "Metrics Master Sync";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps name this similar to the others Master ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public final class ClientMetrics {
/**
* Total number of bytes short-circuit read from local storage.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/** ... */ format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

4: i64 value
3: string instanceId
4: string name
5: i64 value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is backward incompatible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric is added in this branch, so it's not public yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks

public void clientHeartbeat(String clientId, String hostname, List<Metric> metrics) {
if (metrics.isEmpty()) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this logic in putClientMetrics, since it should be able to handle an empty set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @param hostname the client hostname
* @param metrics client-side metrics
*/
void clientHeartbeat(String clientId, String hostname, List<Metric> metrics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since clientId is in metrics, we could make this one API. If we keep multiple APIs, could we name them symmetrically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? This API only handles heartbeat from the client, while the worker heartbeats are handled in the BlockMaster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the clientHeartbeat and putWorkerMetrics methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@Alluxio Alluxio deleted a comment from AmplabJenkins May 18, 2018
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • Valid pull request title: PASS
  • Contains link to JIRA ticket: PASS
  • Commits associated with Github account: PASS

All checks passed!

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • Valid pull request title: PASS
  • Contains link to JIRA ticket: PASS
  • Commits associated with Github account: PASS

All checks passed!

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/19695/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/19690/

Failed Tests: 17

org.alluxio:alluxio-tests: 17


Test FAILed.

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • Valid pull request title: PASS
  • Contains link to JIRA ticket: PASS
  • Commits associated with Github account: PASS

All checks passed!

Copy link
Contributor

@aaudiber aaudiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/19696/

Failed Tests: 17

org.alluxio:alluxio-tests: 17


Test FAILed.

@yupeng9 yupeng9 merged commit 7360bb0 into Alluxio:metrics-enhancement May 18, 2018
@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/19697/
Test PASSed.

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • Valid pull request title: PASS
  • Contains link to JIRA ticket: PASS
  • Commits associated with Github account: PASS

All checks passed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants