Service-level Indicators (SLIs) are the measurements used to evaluate performance against a goal. An SLI is a direct measurement of a service’s behaviour and helps us and our users evaluate whether the system has been running within its SLO. The metrics captured as SLIs for Firehose are described below.
- Type Details
- Overview
- Pods Health
- Kafka Consumer Details
- Error
- Memory
- Garbage Collection
- Retry
- HTTP Sink
- Filter
- Blob Sink
- Bigquery Sink
Collection of all the generic configurations in a Firehose.
- The type of sink of the Firehose. It could be 'log', 'HTTP', 'DB', 'redis', 'influx' or 'Elasticsearch'
- The team that owns the given Firehose.
- The proto class used for creating the Firehose
- The stream where the input topic is read from
The most important Firehose metrics, which together give an overview of its current state.
- The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
- Sum of all messages received from Kafka per pod.
- Messages sent successfully to the sink per batch per pod.
- Messages failed to be pushed into the sink per batch per pod. In case of HTTP sink, if status code is not in retry codes configured, the records will be dropped.
- In case of HTTP sink, when status code is not in retry codes configured, the records are dropped. This metric captures the dropped messages count.
- 99th percentile (p99) of the batch size distribution for pulled and pushed messages per pod.
- Latency introduced by Firehose (time before sending to sink - time when reading from Kafka). Note: It could be high if the response time of the sink is higher as subsequent batches could be delayed.
- Time difference between Kafka ingestion and sending to sink (Time before sending to sink - Time of Kafka ingestion)
- Different percentiles of the sink response time.
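As a rough illustration of how the lag and latency figures above relate to their inputs, the sketch below computes them from hypothetical values. The function names and the offset dictionaries are assumptions made for this example and are not Firehose APIs; lag is taken here as the end offset minus the consumed offset of each partition.

```python
from typing import Dict


def max_partition_lag(end_offsets: Dict[int, int], consumed_offsets: Dict[int, int]) -> int:
    """Largest per-partition lag in records (hypothetical helper)."""
    return max(
        (
            max(end - consumed_offsets.get(partition, 0), 0)
            for partition, end in end_offsets.items()
        ),
        default=0,  # no partitions assigned means no lag to report
    )


def firehose_latency_ms(read_from_kafka_ms: int, before_sink_ms: int) -> int:
    """Time before sending to sink minus time when reading from Kafka."""
    return before_sink_ms - read_from_kafka_ms


def lifetime_ms(kafka_ingestion_ms: int, before_sink_ms: int) -> int:
    """Time before sending to sink minus time of Kafka ingestion."""
    return before_sink_ms - kafka_ingestion_ms


if __name__ == "__main__":
    print(max_partition_lag({0: 120, 1: 300}, {0: 100, 1: 295}))  # 20
    print(firehose_latency_ms(1000, 1250))  # 250
    print(lifetime_ms(900, 1250))  # 350
```

A lag value that keeps growing between samples, rather than a single high reading, is the signal to watch for.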
Since Firehose runs on Kubernetes, this section gives health details for each pod.
- JVM Uptime of each pod.
- Returns the "recent CPU usage" for the Java Virtual Machine process. This value is a double in the [0.0,1.0] interval. A value of 0.0 means that none of the CPUs were running threads from the JVM process during the recent period of time observed, while a value of 1.0 means that all CPUs were actively running threads from the JVM 100% of the time during the recent period being observed. Threads from the JVM include the application threads as well as the JVM internal threads. All values between 0.0 and 1.0 are possible, depending on the activities going on in the JVM process and the whole system. If the Java Virtual Machine recent CPU usage is not available, the method returns a negative value.
- Returns the CPU time used by the process on which the Java virtual machine is running. The returned value is of nanoseconds precision but not necessarily nanoseconds accuracy.
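The two CPU metrics above are related: a utilization figure in the [0.0, 1.0] range can be approximated from two samples of the cumulative process CPU time. The sketch below is only an approximation under that assumption, not how the JVM computes its own value; `cpu_utilization` and its parameters are hypothetical names.

```python
def cpu_utilization(
    cpu_time_start_ns: int,
    cpu_time_end_ns: int,
    wall_start_ns: int,
    wall_end_ns: int,
    cpu_count: int,
) -> float:
    """Approximate share of total CPU capacity used between two samples."""
    elapsed_ns = wall_end_ns - wall_start_ns
    if elapsed_ns <= 0 or cpu_count <= 0:
        # Mirror the "negative value means not available" convention.
        return -1.0
    used_ns = cpu_time_end_ns - cpu_time_start_ns
    # Clamp to [0.0, 1.0] because CPU time has nanosecond precision,
    # not necessarily nanosecond accuracy.
    return min(max(used_ns / (elapsed_ns * cpu_count), 0.0), 1.0)


if __name__ == "__main__":
    # 0.5 s of CPU time over 1 s of wall time on 2 CPUs
    print(cpu_utilization(0, 500_000_000, 0, 1_000_000_000, 2))  # 0.25
```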
Listing some of the Kafka consumer metrics here.
- Consumer Group Metrics: The number of partitions currently assigned to this consumer (per pod).
- Global Request Metrics: The average number of requests sent per second per pod.
- Topic-level Fetch Metrics: The average number of records consumed per second for a specific topic per pod.
- Topic-level Fetch Metrics: The average number of bytes consumed per second per pod.
- Fetch Metrics: The number of fetch requests per second per pod.
- Fetch Metrics: The max time taken for a fetch request per pod.
- Fetch Metrics: The average time taken for a fetch request per pod.
- Fetch Metrics: The average number of bytes fetched per request per pod.
- Fetch Metrics: The max number of bytes fetched per request per pod.
- Consumer Group Metrics: The number of commit calls per second per pod.
- Global Connection Metrics: The current number of active connections per pod.
- Global Connection Metrics: New connections established per second in the window per pod.
- Global Connection Metrics: Connections closed per second in the window per pod.
- Global Request Metrics: The average number of outgoing bytes sent per second to all servers per pod.
- Average time spent between poll per pod.
- Max time spent between poll per pod.
- Consumer Group Metrics: The number of group syncs per second per pod. Group synchronization is the second and last phase of the rebalance protocol. Similar to join-rate, a large value indicates group instability.
- The average number of network operations (reads or writes) on all connections per second per pod.
- Rate of rebalances for the consumer.
- Consumer Group Metrics: The average time taken for a commit request per pod.
- Consumer Group Metrics: The max time taken for a commit request per pod.
- Average Rebalance Latency for the consumer per pod.
- Max Rebalance Latency for the consumer per pod.
This gives insight into the critical and non-critical exceptions that have occurred in Firehose.
- Count of all exceptions raised by the pods that can restart Firehose.
- Count of all exceptions raised by Firehose that will not restart it; Firehose keeps retrying.
Details on memory used by the Firehose for different tasks.
- Details of heap memory usage:
  - Max: The amount of memory that can be used for memory management.
  - Used: The amount of memory currently in use.
- Details of non-heap memory usage:
  - Max: The amount of memory that can be used for memory management.
  - Used: The amount of memory currently in use.
- For a garbage-collected memory pool, the amount of used memory includes the memory occupied by all objects in the pool including both reachable and unreachable objects. This is for all the names in the type: MemoryPool.
- Peak GC memory usage.
- Total GC memory usage.
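The Max and Used values above are often easier to read as a ratio. A minimal sketch, assuming both values are reported in bytes and that an undefined maximum is reported as a non-positive number (the JVM uses -1 for this); `memory_utilization` is a hypothetical helper name.

```python
from typing import Optional


def memory_utilization(used_bytes: int, max_bytes: int) -> Optional[float]:
    """Fraction of the maximum in use, or None when the maximum is undefined."""
    if max_bytes <= 0:
        # An undefined maximum is common for non-heap memory.
        return None
    return used_bytes / max_bytes


if __name__ == "__main__":
    print(memory_utilization(512, 1024))  # 0.5
    print(memory_utilization(100, -1))  # None
```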
All JVM Garbage Collection Details.
- The total number of collections that have occurred per pod. Rather than showing the absolute value we are showing the difference to see the rate of change more easily.
- The approximate accumulated collection elapsed time in milliseconds per pod. Rather than showing the absolute value we are showing the difference to see the rate of change more easily.
- Thread counts per pod:
  - daemonThreadCount: Returns the current number of live daemon threads.
  - peakThreadCount: Returns the peak live thread count since the Java virtual machine started or the peak was reset.
  - threadCount: Returns the current number of live threads, including both daemon and non-daemon threads.
- Class loading per pod:
  - loadedClass: Displays the number of classes currently loaded in the Java virtual machine.
  - unloadedClass: Displays the total number of classes unloaded since the Java virtual machine started execution.
- The code cache memory usage in the memory pools at the end of a GC per pod.
- The compressed class space memory usage in the memory pools at the end of a GC per pod.
- The metaspace memory usage in the memory pools at the end of a GC per pod.
- The eden space memory usage in the memory pools at the end of a GC per pod.
- The survivor space memory usage in the memory pools at the end of a GC per pod.
- The tenured space memory usage in the memory pools at the end of a GC per pod.
File Descriptor
- Number of file descriptors per pod:
  - Open: The current number of open file descriptors.
  - Max: The maximum allowed, based on configuration.
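The collection count and collection time metrics above are shown as differences rather than absolute values. A minimal sketch of that transformation, assuming the raw values are cumulative counters sampled at regular intervals and that a drop in value means the counter was reset (for example after a pod restart); `counter_deltas` is a hypothetical helper, not part of Firehose.

```python
from typing import List


def counter_deltas(samples: List[int]) -> List[int]:
    """Turn cumulative counter samples into per-interval differences."""
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            # The counter restarted from zero, so the new value is the delta.
            deltas.append(current)
    return deltas


if __name__ == "__main__":
    print(counter_deltas([10, 12, 15, 3, 4]))  # [2, 3, 3, 1]
```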
If you have configured retries, these metrics give some insight into them.
- Request retries per min per pod.
- Time spent per pod backing off.
HTTP Sink response code details.
- Total number of 2xx responses received by Firehose from the HTTP service.
- Total number of 4xx responses received by Firehose from the HTTP service.
- Total number of 5xx responses received by Firehose from the HTTP service.
- Total number of requests for which Firehose received no response from the HTTP service.
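The four counters above amount to grouping each request outcome into a bucket. The sketch below shows one way such a grouping could work; the bucket labels, the use of `None` for a missing response, and the function names are assumptions for illustration and do not reflect the actual Firehose implementation.

```python
from collections import Counter
from typing import Iterable, Optional


def response_bucket(status_code: Optional[int]) -> str:
    """Map an HTTP status code (or None for no response) to a bucket label."""
    if status_code is None:
        return "no_response"
    if 200 <= status_code < 300:
        return "2xx"
    if 400 <= status_code < 500:
        return "4xx"
    if 500 <= status_code < 600:
        return "5xx"
    return "other"  # 1xx and 3xx are not covered by the metrics above


def tally_responses(status_codes: Iterable[Optional[int]]) -> Counter:
    """Count how many outcomes fall into each bucket."""
    return Counter(response_bucket(code) for code in status_codes)


if __name__ == "__main__":
    print(dict(tally_responses([200, 201, 404, 503, None])))
```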
Since Firehose supports filtration based on some data, these metrics give some information related to that.
- Type of filter in the Firehose. It is one of "none", "key" or "message".
- Sum of all the messages filtered because of the filter condition per pod.
A gauge of the total number of local files currently open.
Total number of local files that have been closed and are ready to be uploaded, excluding local files that were closed prematurely due to a consumer restart.
Duration of local file closing.
Total number of records written to all files that have been closed and are ready to be uploaded.
Size of the file in bytes.
Total number of files successfully uploaded.
Duration of the file upload.
Total size of the uploaded files in bytes.
Total number of records inside files that were successfully uploaded to blob storage.