-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ALLUXIO-2182] Added a metrics master for aggregating the client metrics #7251
Conversation
Merged build finished. Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level review:
- We might lose metrics from short-lived clients because they exit before sending the final heartbeat to master. Do you think it makes sense to send one last metric heartbeat before letting the JVM exit?
- It could be useful if to make the client ID configurable so that users can use the same ID across their application and then aggregate to see only metrics for their application.
- Could we add a metric to track how many master threads are being consumed by client metric heartbeats?
#1 We could add a shutdown hook for the last heartbeat. However, I feel this is a bit overengineered, as the metrics, imo, might not necessary to be exact but more best efforts. Losing some metrics in 1-2s might not be too problematic. |
@yupeng9 What's the overhead of the client to master heartbeat, could this be a problem in environments with 1000s of clients? |
|
@aaudiber For 2, I prefer a separate PR for adding this optional client id. |
@calvinjia that's why I make the frequency of heartbeat configurable. The metrics collection, IMO, is not necessary to be too frequent given its nature. For 1 thousand service handling every a few seconds should be very doable. However, it does take connections to serve the heartbeat request, so increasing the size of thrift thread pool might be needed. |
@calvinjia in addition, I can put this feature behind a flag, so for large-scale applications, metrics can be turned off if it affects performance. |
@yupeng9 Have you verified there are threadpool issues when running with a large number of clients and does increasing the threadpool solve the issue without any side effects? I think it is important to have a solution which works for larger clusters (as opposed to just turning it off). |
Merged build finished. Test FAILed. |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test FAILed. |
@@ -133,6 +149,9 @@ public static FileSystemContext create(Subject subject, MasterInquireClient mast | |||
*/ | |||
private FileSystemContext(Subject subject) { | |||
mParentSubject = subject; | |||
mExecutorService = Executors.newFixedThreadPool(1, | |||
ThreadFactoryUtils.build("metrics-master-heartbeat-%d", true)); | |||
mId = UUID.randomUUID().toString().replace("-", ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be an IdUtils method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* If the task fails to heartbeat to the master, it will destroy its old master client and recreate | ||
* it before retrying. | ||
*/ | ||
@NotThreadSafe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we access this from multiple threads despite it being not thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a heartbeat executor, and it's only invoked once per JVM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we also access it in the shutdown hook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true.
*/ | ||
public ClientMasterSync(MetricsMasterClient masterClient) { | ||
mMasterClient = Preconditions.checkNotNull(masterClient, "masterClient"); | ||
mHeartbeatTimeoutMs = (int) Configuration.getMs(PropertyKey.WORKER_BLOCK_HEARTBEAT_TIMEOUT_MS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout value seems unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @return the unique id of the context | ||
*/ | ||
public String getId() { | ||
return mId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to determine the ID of a process, how will users be able to map an ID to a process after seeing the metrics be abnormal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this, but there does not seem a good approach commonly agreed on: https://stackoverflow.com/questions/35842/how-can-a-java-program-get-its-own-process-id?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
We could log the ID after it's created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging sounds good for now.
throw new RuntimeException("Master heartbeat timeout exceeded: " + mHeartbeatTimeoutMs); | ||
} | ||
LOG.error("Master heartbeat timeout exceeded: " + mHeartbeatTimeoutMs); | ||
System.exit(-1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to system exit if the metrics heartbeat times out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed and made it best effort to send the metrics info.
@@ -40,6 +40,7 @@ | |||
public static final String WORKER_FILESYSTEM_MASTER_SYNC = "Worker FileSystemMaster Sync"; | |||
public static final String WORKER_PIN_LIST_SYNC = "Worker Pin List Sync"; | |||
public static final String WORKER_SPACE_RESERVER = "Worker Space Reserver"; | |||
public static final String METRICS_MASTER_SYNC = "Metrics Master Sync"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps name this similar to the others Master ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public final class ClientMetrics { | ||
/** | ||
* Total number of bytes short-circuit read from local storage. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** ... */
format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
4: i64 value | ||
3: string instanceId | ||
4: string name | ||
5: i64 value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is backward incompatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metric is added in this branch, so it's not public yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks
public void clientHeartbeat(String clientId, String hostname, List<Metric> metrics) { | ||
if (metrics.isEmpty()) { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do this logic in putClientMetrics
, since it should be able to handle an empty set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @param hostname the client hostname | ||
* @param metrics client-side metrics | ||
*/ | ||
void clientHeartbeat(String clientId, String hostname, List<Metric> metrics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since clientId
is in metrics
, we could make this one API. If we keep multiple APIs, could we name them symmetrically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? This API only handles heartbeat from the client, while the worker heartbeats are handled in the BlockMaster
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the clientHeartbeat
and putWorkerMetrics
methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
Merged build finished. Test FAILed. |
Merged build finished. Test FAILed. |
Merged build finished. Test FAILed. |
Automated checks report:
All checks passed! |
Automated checks report:
All checks passed! |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test FAILed. |
Automated checks report:
All checks passed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merged build finished. Test FAILed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Automated checks report:
All checks passed! |
https://alluxio.atlassian.net/browse/ALLUXIO-2182
Each client sends over the metrics to the master in its heartbeat. And the new metrics master will aggregate the metrics defined by the aggregators.
This PR adds a client-master heartbeat to send the client-side metrics info. More test coverage improvement will be added once we agree on the architecture.