PDP 45 Pravega Healthcheck
This PDP (Pravega Design Proposal) proposes a design for Pravega HealthCheck. It covers the feature requirements, the main considerations and concerns behind the design, the system architecture, the Java and REST APIs of the framework, typical integration and usage of HealthCheck at different levels, and the implementations of some HealthAspects.
The Readiness Check is invoked periodically to determine whether the target service instance should start receiving requests. If the Readiness Check fails, the service instance is not killed; instead, the request routing mechanism stops sending service requests to the instance.
The Health Check is invoked periodically to determine whether the target service instance is functioning as expected. If the Health Check fails, the service instance will be killed by a management process, such as Kubernetes or a system operator.
HealthInfos are usually collected from various components, hence certain aggregation rules are needed to determine the health of each Health Aspect and the entire service instance.
- Supports Health (Liveness) Check
- Supports Readiness Check
- Returned HealthInfo should contain both specific status for machine reading and details for human reading
- The design should reflect the layered notion of Pravega Health: from the individual unit (e.g. a segment container), to the aspect (e.g. all the segment containers in a SegmentStore instance form an aspect), to the individual service instance (e.g. one SegmentStore instance), up to the service level (e.g. all the SegmentStore instances forming the SegmentStore service)
- REST API client interface
- PULL mode - HealthCheck will only be invoked upon request from client
- HealthInfo will be cached to reduce the consumption of system resources
- HealthInfo - Object that stores a HealthCheck result, which is comprised of a status code and details
- HealthUnit - The smallest system component that provides HealthInfo
- HealthAspect - A HealthAspect is comprised of zero, one or more HealthUnits with the same health concern, e.g. all Segment Containers form a SegmentContainer HealthAspect. The Metrics HealthAspect may contain zero HealthUnits if metrics are turned off. The System HealthAspect is formed by one and only one HealthUnit
- Aspect Level Aggregation - A HealthAspect must have an aggregation rule defined to aggregate the potentially multiple HealthInfos received and determine the overall healthiness of the aspect. E.g. for the SegmentContainer HealthAspect we could apply a majority rule to determine the healthiness of Segment Containers as an aspect
- Instance Level Aggregation - The rule that determines the overall healthiness of the service instance from the HealthInfo returned by all HealthAspects
- HealthRegistry - A container that holds references to all the HealthUnits. Upon a HealthCheck request, the registry polls all the HealthUnits for HealthInfo, then aggregates the HealthInfos on the Aspect and Instance levels in order to return the final HealthInfo to the HealthCheck client
- Those system components with the ability to provide HealthInfo could create one or more HealthUnit objects and store them inside the component. The component needs to register HealthUnit objects upon component's initialization and unregister the HealthUnit objects upon the component's closure.
- Each HealthUnit object created must specify which HealthAspect it is coming from
- Upon HealthCheck request, HealthRegistry polls all the registered HealthUnits to retrieve HealthInfo
- After receiving all the available HealthInfo, Aggregation Rules are applied on HealthAspect level and the instance level to get the final HealthInfo
- A REST interface is provided to clients
- Each HealthCheck is also a metrics event, so users can view the HealthCheck history and the HealthInfo distribution in a metrics backend, such as Grafana
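The poll-then-aggregate flow described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `Status`, `Aspect`, `Unit` and `PollingSketch` are simplified, hypothetical stand-ins for the PDP's types, the aspect rule is assumed to be a strict majority, and the instance rule (every aspect must be healthy) is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, simplified stand-ins for HealthInfo.Status, HealthAspect and HealthUnit.
enum Status { HEALTH, UNHEALTH, UNKNOWN }

enum Aspect { SYSTEM, SEGMENT_CONTAINER }

final class Unit {
    final String id;
    final Aspect aspect;
    final Supplier<Status> supplier;

    Unit(String id, Aspect aspect, Supplier<Status> supplier) {
        this.id = id;
        this.aspect = aspect;
        this.supplier = supplier;
    }
}

final class PollingSketch {
    // Step 1: poll every registered unit and group the results by aspect.
    // A supplier that throws is recorded as UNKNOWN instead of failing the whole check.
    static Map<Aspect, List<Status>> poll(List<Unit> units) {
        Map<Aspect, List<Status>> byAspect = new EnumMap<>(Aspect.class);
        for (Unit unit : units) {
            Status status;
            try {
                status = unit.supplier.get();
            } catch (RuntimeException e) {
                status = Status.UNKNOWN;
            }
            byAspect.computeIfAbsent(unit.aspect, a -> new ArrayList<>()).add(status);
        }
        return byAspect;
    }

    // Step 2 (assumed rule): an aspect is healthy when a strict majority of its units is healthy.
    static Status aggregateAspect(List<Status> statuses) {
        long healthy = statuses.stream().filter(s -> s == Status.HEALTH).count();
        return healthy * 2 > statuses.size() ? Status.HEALTH : Status.UNHEALTH;
    }

    // Step 3 (assumed rule): the instance is healthy only when every polled aspect is healthy.
    static Status aggregateInstance(Map<Aspect, List<Status>> byAspect) {
        for (List<Status> statuses : byAspect.values()) {
            if (aggregateAspect(statuses) != Status.HEALTH) {
                return Status.UNHEALTH;
            }
        }
        return Status.HEALTH;
    }

    public static void main(String[] args) {
        List<Unit> units = List.of(
                new Unit("system", Aspect.SYSTEM, () -> Status.HEALTH),
                new Unit("container-0", Aspect.SEGMENT_CONTAINER, () -> Status.HEALTH),
                new Unit("container-1", Aspect.SEGMENT_CONTAINER, () -> Status.HEALTH),
                new Unit("container-2", Aspect.SEGMENT_CONTAINER, () -> Status.UNHEALTH));
        System.out.println(aggregateInstance(poll(units)));
    }
}
```

Grouping by aspect before aggregating keeps the two levels independent, so a different rule per aspect can be swapped in without touching the instance-level step.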
Each HealthUnit only holds a HealthInfo Supplier lambda for returning HealthInfo. It doesn't hold other resources for HealthCheck purpose.
The HealthInfo Supplier should be implemented in a light and non-blocking way, using information immediately available to the component as much as possible.
HealthRegistry holds weak references to the registered HealthUnits, so if a HealthUnit becomes eligible for garbage collection, it is also removed automatically from HealthRegistry to prevent memory leaks.
HealthUnits are polled periodically on a separate thread, placing no burden on the system components that hold them.
The final HealthInfo will also be cached, so HealthCheck is essentially throttled. By default we could set the interval to 10 seconds.
Given the measures and considerations above, the HealthCheck process should be lightweight and non-blocking, and should use minimal memory and CPU resources.
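The caching and throttling idea can be illustrated with a small time-based cache around a supplier. This is a sketch only: `CachedSupplier` is a hypothetical helper that does not appear in the proposal, it is generic rather than tied to `HealthInfo`, and the injectable clock exists purely to make the behaviour easy to test.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of the proposed API: recomputes the value
// at most once per ttlMillis and serves the cached value in between.
final class CachedSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlMillis;
    private final LongSupplier clock; // returns the current time in milliseconds
    private T cached;
    private long expiresAt;
    private boolean loaded;

    CachedSupplier(Supplier<T> delegate, long ttlMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now >= expiresAt) {
            // Cache miss or expired entry: ask the underlying supplier again.
            cached = delegate.get();
            expiresAt = now + ttlMillis;
            loaded = true;
        }
        return cached;
    }

    public static void main(String[] args) {
        // 10 seconds mirrors the default interval suggested in the text.
        CachedSupplier<String> health =
                new CachedSupplier<>(() -> "HEALTH", 10_000L, System::currentTimeMillis);
        System.out.println(health.get());
    }
}
```

With this shape, repeated HealthCheck requests inside the interval never reach the HealthUnits, which is what makes the check effectively throttled.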
import lombok.Data;

@Data
public class HealthInfo {
    public enum Status {
        /* The result of the health-check is considered healthy */
        HEALTH,
        /* The result of the health-check is considered unhealthy */
        UNHEALTH,
        /* The result of the health-check is unknown, due to a time-out, an interruption or another exception */
        UNKNOWN
    }

    /* The status of the health-check */
    private final Status status;

    /* The details of the health-check */
    private final String details;
}
import com.google.common.base.Preconditions;
import java.util.Collection;
import java.util.Optional;
import java.util.function.Function;

/**
 * HealthAspect is Pravega's notion of health-check on top of individual HealthUnits.
 *
 * For a highly distributed system such as Pravega, the failure of one or more components is completely expected,
 * sometimes even by design, so in addition to the health-check of each individual HealthUnit,
 * we have to aggregate all the health-check results from the aspect to determine the healthiness of the aspect.
 *
 * For example, during a scaling-up period, we may see some Segment Containers being shut down while others
 * are being created. We have to collect the HealthInfo from all the Segment Containers (HealthUnits) in order
 * to determine the healthiness of the overall Segment Container aspect.
 */
public enum HealthAspect {
    SYSTEM("System", HealthInfoAggregationRules::singleOrNone),
    CONTROLLER("Controller", HealthInfoAggregationRules::majority),
    SEGMENT_CONTAINER("Segment Container", HealthInfoAggregationRules::majority),
    CACHE("Cache Manager", HealthInfoAggregationRules::oneVeto),
    LONG_TERM_STORAGE("Long Term Storage", HealthInfoAggregationRules::singleOrNone),
    METRICS("Metrics", HealthInfoAggregationRules::singleOrNone);

    private final String name;
    private final Function<Collection<HealthInfo>, Optional<HealthInfo>> aspectAggregationRule;

    /**
     * @param name - the name of the aspect
     * @param aspectAggregationRule - the rule to determine aspect level healthiness
     */
    HealthAspect(String name, Function<Collection<HealthInfo>, Optional<HealthInfo>> aspectAggregationRule) {
        Preconditions.checkArgument(aspectAggregationRule != null, "Aspect Aggregation Rule cannot be null");
        this.name = name;
        this.aspectAggregationRule = aspectAggregationRule;
    }

    /**
     * Get the HealthAspect name.
     *
     * @return the HealthAspect name
     */
    public String getName() {
        return this.name;
    }

    /**
     * Get the rule for the aggregation of all the HealthInfo from the HealthAspect.
     *
     * @return the Function to aggregate all HealthInfo from the HealthAspect
     */
    public Function<Collection<HealthInfo>, Optional<HealthInfo>> getAspectAggregationRule() {
        return this.aspectAggregationRule;
    }
}
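The enum above refers to `HealthInfoAggregationRules`, whose methods are not spelled out in this proposal. The following is one possible reading of the three rule names, offered as a sketch: the exact semantics (what an empty collection yields, how `UNKNOWN` is counted, what happens when more than one unit is supplied to `singleOrNone`) are assumptions, and `HealthInfo` is re-declared here without Lombok only so the example is self-contained.

```java
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Minimal stand-in for the HealthInfo class above, written without Lombok.
final class HealthInfo {
    enum Status { HEALTH, UNHEALTH, UNKNOWN }

    final Status status;
    final String details;

    HealthInfo(Status status, String details) {
        this.status = status;
        this.details = details;
    }
}

// Assumed semantics for the rule names used by HealthAspect.
final class HealthInfoAggregationRules {
    private static long count(Collection<HealthInfo> infos, HealthInfo.Status wanted) {
        return infos.stream().filter(info -> info.status == wanted).count();
    }

    // No unit: no opinion. One unit: pass it through. More than one: flagged as UNKNOWN.
    static Optional<HealthInfo> singleOrNone(Collection<HealthInfo> infos) {
        if (infos.isEmpty()) {
            return Optional.empty();
        }
        if (infos.size() == 1) {
            return Optional.of(infos.iterator().next());
        }
        return Optional.of(new HealthInfo(HealthInfo.Status.UNKNOWN,
                "expected at most one HealthUnit, got " + infos.size()));
    }

    // Healthy when strictly more than half of the units report HEALTH.
    static Optional<HealthInfo> majority(Collection<HealthInfo> infos) {
        if (infos.isEmpty()) {
            return Optional.empty();
        }
        long healthy = count(infos, HealthInfo.Status.HEALTH);
        HealthInfo.Status status = healthy * 2 > infos.size()
                ? HealthInfo.Status.HEALTH : HealthInfo.Status.UNHEALTH;
        return Optional.of(new HealthInfo(status,
                healthy + " healthy, " + (infos.size() - healthy) + " not healthy"));
    }

    // A single UNHEALTH unit vetoes the aspect; UNKNOWN units do not veto in this sketch.
    static Optional<HealthInfo> oneVeto(Collection<HealthInfo> infos) {
        if (infos.isEmpty()) {
            return Optional.empty();
        }
        long unhealthy = count(infos, HealthInfo.Status.UNHEALTH);
        HealthInfo.Status status = unhealthy > 0
                ? HealthInfo.Status.UNHEALTH : HealthInfo.Status.HEALTH;
        return Optional.of(new HealthInfo(status, unhealthy + " unhealthy"));
    }

    public static void main(String[] args) {
        HealthInfo ok = new HealthInfo(HealthInfo.Status.HEALTH, "ok");
        HealthInfo bad = new HealthInfo(HealthInfo.Status.UNHEALTH, "bad");
        System.out.println(majority(List.of(ok, ok, bad)).get().details);
        System.out.println(oneVeto(List.of(ok, ok, bad)).get().status);
    }
}
```

Returning `Optional.empty()` for an aspect with no units lets the instance-level rule skip aspects that are switched off, such as Metrics when metrics are disabled.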
import java.util.function.Supplier;
import lombok.Data;

@Data
public class HealthUnit {
    /**
     * Id to uniquely identify the HealthUnit within the aspect it belongs to.
     * Usually this id can be derived from an existing id, such as the id of the hosting component.
     */
    final String healthUnitId;

    /**
     * The HealthAspect this HealthUnit belongs to.
     */
    final HealthAspect healthAspect;

    /**
     * Supplier of the hosting component's HealthInfo upon a health-check request.
     */
    final Supplier<HealthInfo> healthInfoSupplier;
}
/**
 * The interface of the container that holds HealthUnit references. It must provide the ability to
 * register and unregister HealthUnits.
 */
public interface HealthRegistry {
    /**
     * Register a HealthUnit.
     *
     * @param unit HealthUnit object
     */
    void registerHealthUnit(HealthUnit unit);

    /**
     * Unregister a HealthUnit.
     *
     * @param unit HealthUnit object
     */
    void unregisterHealthUnit(HealthUnit unit);
}
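A registry that holds only weak references to its units, as described earlier, could be built on `java.util.WeakHashMap`. The sketch below is one hypothetical way to do that: `WeakHealthRegistry` and `SimpleUnit` are illustrative names rather than part of the proposal, and the unit is reduced to an id plus a supplier of a status string so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.WeakHashMap;
import java.util.function.Supplier;

// Simplified stand-in for HealthUnit. It keeps identity-based equals/hashCode.
final class SimpleUnit {
    final String id;
    final Supplier<String> supplier;

    SimpleUnit(String id, Supplier<String> supplier) {
        this.id = id;
        this.supplier = supplier;
    }
}

// Hypothetical implementation of the register/unregister contract.
final class WeakHealthRegistry {
    // WeakHashMap keys are weakly referenced: once a unit is no longer strongly
    // reachable elsewhere, the garbage collector may drop it from this set.
    private final Set<SimpleUnit> units = Collections.synchronizedSet(
            Collections.newSetFromMap(new WeakHashMap<SimpleUnit, Boolean>()));

    void registerHealthUnit(SimpleUnit unit) {
        units.add(unit);
    }

    void unregisterHealthUnit(SimpleUnit unit) {
        units.remove(unit);
    }

    // Take a snapshot under the lock, then call the suppliers outside of it.
    List<String> poll() {
        List<SimpleUnit> snapshot;
        synchronized (units) {
            snapshot = new ArrayList<>(units);
        }
        List<String> results = new ArrayList<>();
        for (SimpleUnit unit : snapshot) {
            results.add(unit.id + "=" + unit.supplier.get());
        }
        return results;
    }

    public static void main(String[] args) {
        WeakHealthRegistry registry = new WeakHealthRegistry();
        SimpleUnit unit = new SimpleUnit("container-0", () -> "HEALTH");
        registry.registerHealthUnit(unit);
        System.out.println(registry.poll());
        registry.unregisterHealthUnit(unit);
    }
}
```

Explicit unregistration on close remains the primary path; the weak references act only as a safety net for components that are dropped without being closed, and the timing of that cleanup depends on the garbage collector.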
Request | Response |
---|---|
/health | {"health": 0} or {"health": -1} |
/ready | {"ready": 0} or {"ready": -1} |
/healthDetails | {"health status": -1, "SegmentContainerAspect": "5 healthy, 1 unhealthy", "SystemAspect": "memory 12G/16G", "Long Term Storage Aspect": "ECS, storage full", "Cache Aspect": "throttling at 10s", "Operation Log": "Unknown"} |
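To make the `/health` and `/ready` rows concrete, here is a rough sketch that serves those bodies with the JDK's built-in `com.sun.net.httpserver.HttpServer`. Several things are assumptions rather than part of this proposal: the class name `HealthEndpointSketch`, the use of this particular HTTP server, reporting every non-healthy outcome as `-1`, and the choice of HTTP status codes (200 when the check passes, 503 otherwise).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

// Hypothetical wiring of the two simple endpoints from the table above.
final class HealthEndpointSketch {
    // Renders {"health": 0} or {"health": -1} (and the same for "ready").
    static String body(String key, boolean ok) {
        return "{\"" + key + "\": " + (ok ? 0 : -1) + "}";
    }

    // Starts a server on the given port; the caller stops it with server.stop(0).
    static HttpServer start(int port, BooleanSupplier healthy, BooleanSupplier ready)
            throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        register(server, "/health", "health", healthy);
        register(server, "/ready", "ready", ready);
        server.start();
        return server;
    }

    // Note: HttpServer matches contexts by path prefix, so a real implementation
    // would register "/healthDetails" as its own context as well.
    private static void register(HttpServer server, String path, String key,
                                 BooleanSupplier check) {
        server.createContext(path, exchange -> {
            boolean ok = check.getAsBoolean();
            byte[] bytes = body(key, ok).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            // Assumed mapping: 200 when the check passes, 503 otherwise.
            exchange.sendResponseHeaders(ok ? 200 : 503, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
    }
}
```

A caller could start it with, for example, `HealthEndpointSketch.start(10080, () -> true, () -> true)`, where the two suppliers would in practice read the cached instance-level result.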
public class SampleSystemComponent implements AutoCloseable {
    final String componentId;
    final HealthRegistry healthRegistry;
    final HealthUnit systemHealthUnit;
    final HealthUnit segmentContainerHealthUnit;

    public SampleSystemComponent(String id, HealthRegistry healthRegistry) {
        this.componentId = id;
        this.healthRegistry = healthRegistry;
        // The component creates and keeps its own HealthUnits.
        this.systemHealthUnit = new HealthUnit(this.componentId, HealthAspect.SYSTEM, () -> new HealthInfo(...));
        this.segmentContainerHealthUnit = new HealthUnit(this.componentId, HealthAspect.SEGMENT_CONTAINER, () -> new HealthInfo(...));
        // Register the HealthUnits when the component is initialized.
        this.healthRegistry.registerHealthUnit(this.systemHealthUnit);
        this.healthRegistry.registerHealthUnit(this.segmentContainerHealthUnit);
    }

    @Override
    public void close() {
        // Unregister the HealthUnits when the component is closed.
        this.healthRegistry.unregisterHealthUnit(this.systemHealthUnit);
        this.healthRegistry.unregisterHealthUnit(this.segmentContainerHealthUnit);
    }
}
Access Level | Query Example | Response Example | Use Cases |
---|---|---|---|
Local | curl http://localhost:10080/health | {"health": 0} or {"health": -1} | Fundamental Healthcheck for SegmentStore |
Local | curl http://localhost:10080/ready | {"ready": 0} | Fundamental Readiness Check for SegmentStore |
Local | curl http://localhost:10090/healthDetails | {"health": -1, "details": "No Active SegmentContainer"} | Fundamental Healthcheck for Controller |
K8S pod | curl -v http://10.100.200.125:10080/health; curl -v http://10.100.200.125:10090/ready | Troubleshooting inside K8S | |
Operator | {LivenessProbe: exec: Command: curl -v /health ReadinessProbe: Exec: Command: curl -v /ready} | Operator exposes Pravega Liveness and Readiness check to K8S | |
Service (CLI) | health -[Segmentstore\|Controller\|All] | {"health": 0, "details": "5 stores healthy, 1 store unhealthy"} | Service level healthcheck aggregation |
Database (Influxdb) | SELECT health-check-events from PravegaMetricsStore ... | History and distribution of health information available now | |
Metrics (Grafana) | Metrics backend User Interface | Integration with BK/ZK metrics possible now | |