[Improvement][Metrics] Apply micrometer naming convention all current metrics (apache#10432)
EricGao888 committed Jun 19, 2022
1 parent 25a3192 commit a45b9e6
Showing 6 changed files with 144 additions and 143 deletions.
247 changes: 124 additions & 123 deletions docs/docs/en/guide/metrics/metrics.md
@@ -1,18 +1,22 @@
# Introduction

Apache DolphinScheduler has export some metrics to monitor the system. We use micrometer for the exporter facade, and
the default exporter is prometheus, more exporter is coming soon.
Apache DolphinScheduler exports metrics for system observability. We use [micrometer](https://micrometer.io/) as application metrics facade.
Currently, we only support `Prometheus Exporter` but more are coming soon.

## Quick Start
## Quick Start

You can add the following config in master/worker/alert/api's yaml file to open the metrics exporter.
- Standalone mode

## Configuration

You can add the following config in master/worker/alert/api's yaml file to enable the metrics exporter.

```yaml
metrics:
enabled: true
```
Once you open the metrics exporter, you can access the metrics by the url: `http://ip:port/actuator/prometheus`
Once you enable the metrics exporter, you can access the metrics by the url: `http://ip:port/actuator/prometheus`

The exporter port is the `server.port` defined in application.yaml, e.g: master: `server.port: 5679`, worker: `server.port: 1235`, alert: `server.port: 50053`, api: `server.port: 12345`.
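
As a small illustration of the port mapping above, the following Java sketch assembles the metrics URL for each component. The class `MetricsEndpointSketch` and its `endpoint` method are hypothetical helpers and not part of DolphinScheduler; the ports are only the `server.port` examples quoted above and may differ in your `application.yaml`.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of DolphinScheduler.
final class MetricsEndpointSketch {

    // Example ports copied from the text above; the real values come from
    // the server.port setting in each server's application.yaml.
    private static final Map<String, Integer> EXAMPLE_PORTS = Map.of(
            "master", 5679,
            "worker", 1235,
            "alert", 50053,
            "api", 12345);

    private MetricsEndpointSketch() {
    }

    // Builds http://<host>:<port>/actuator/prometheus for a known component.
    static String endpoint(String host, String component) {
        Integer port = EXAMPLE_PORTS.get(component);
        if (port == null) {
            throw new IllegalArgumentException("Unknown component: " + component);
        }
        return "http://" + host + ":" + port + "/actuator/prometheus";
    }
}
```

For instance, calling `MetricsEndpointSketch.endpoint("localhost", "master")` builds the URL for a master server running locally with the example port.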

@@ -34,121 +38,118 @@ Then you can access the grafana by the url: `http://localhost/3001`
![image.png](../../../../img/metrics/metrics-worker.png)
![image.png](../../../../img/metrics/metrics-datasource.png)

## Master Metrics

Master metrics are exported by the DolphinScheduler master server.

### System Metrics

* dolphinscheduler_master_overload_count: Indicates the number of times the master has been overloaded.
* dolphinscheduler_master_consume_command_count: Indicates the number of commands has consumed.

### Process Metrics

* dolphinscheduler_create_command_count: Indicates the number of command has been inserted.
* dolphinscheduler_process_instance_submit_count: Indicates the number of process has been submitted.
* dolphinscheduler_process_instance_running_gauge: Indicates the number of process are running now.
* dolphinscheduler_process_instance_timeout_count: Indicates the number of process has been timeout.
* dolphinscheduler_process_instance_finish_count: Indicates the number of process has been finished, include success or
failure.
* dolphinscheduler_process_instance_success_count: Indicates the number of process has been successful.
* dolphinscheduler_process_instance_stop_count: Indicates the number of process has been stopped.
* dolphinscheduler_process_instance_failover_count: Indicates the number of process has been failed over.

### Task Metrics

* dolphinscheduler_task_timeout_count: Indicates the number of tasks has been timeout.
* dolphinscheduler_task_finish_count: Indicates the number of tasks has been finished, include success or failure.
* dolphinscheduler_task_success_count: Indicates the number of tasks has been successful.
* dolphinscheduler_task_timeout_count: Indicates the number of tasks has been timeout.
* dolphinscheduler_task_retry_count: Indicates the number of tasks has been retry.
* dolphinscheduler_task_failover_count: Indicates the number of tasks has been failover.
* dolphinscheduler_task_dispatch_count: Indicates the number of tasks has been dispatched to worker.
* dolphinscheduler_task_dispatch_failed_count: Indicates the number of tasks dispatched failed, if dispatched failed
will retry.
* dolphinscheduler_task_dispatch_error_count: Indicates the number of tasks dispatched error, if dispatched error, means
there are exception occur.

## Worker Metrics

Worker metrics are exported by the DolphinScheduler worker server.

### System Metrics

* dolphinscheduler_worker_overload_count: Indicates the number of times the worker has been overloaded.
* dolphinscheduler_worker_submit_queue_is_full_count: Indicates the number of times the worker's submit queue has been
full.

### Task Metrics

* dolphinscheduler_task_execute_count: Indicates the number of times a task has been executed, it contains a tag -
`task_type`.
* dolphinscheduler_task_execution_count: Indicates the total number of task has been executed.
* dolphinscheduler_task_execution_timer: Indicates the time spent executing tasks.

## Default System Metrics

In each server, there are some default metrics related to the system instance.

### Database Metrics

* hikaricp_connections_creation_seconds_max: Connection creation time max.
* hikaricp_connections_creation_seconds_count: Connection creation time count.
* hikaricp_connections_creation_seconds_sum: Connection creation time sum.
* hikaricp_connections_acquire_seconds_max: Connection acquire time max.
* hikaricp_connections_acquire_seconds_count: Connection acquire time count.
* hikaricp_connections_acquire_seconds_sum: Connection acquire time sum.
* hikaricp_connections_usage_seconds_max: Connection usage max.
* hikaricp_connections_usage_seconds_count: Connection usage time count.
* hikaricp_connections_usage_seconds_sum: Connection usage time sum.
* hikaricp_connections_max: Max connections.
* hikaricp_connections_min Min connections
* hikaricp_connections_active: Active connections.
* hikaricp_connections_idle: Idle connections.
* hikaricp_connections_pending: Pending connections.
* hikaricp_connections_timeout_total: Timeout connections.
* hikaricp_connections: Total connections
* jdbc_connections_max: Maximum number of active connections that can be allocated at the same time.
* jdbc_connections_min: Minimum number of idle connections in the pool.
* jdbc_connections_idle: Number of established but idle connections.
* jdbc_connections_active: Current number of active connections that have been allocated from the data source.

### JVM Metrics

* jvm_buffer_total_capacity_bytes: An estimate of the total capacity of the buffers in this pool.
* jvm_buffer_count_buffers: An estimate of the number of buffers in the pool.
* jvm_buffer_memory_used_bytes: An estimate of the memory that the Java virtual machine is using for this buffer pool.
* jvm_memory_committed_bytes: The amount of memory in bytes that is committed for the Java virtual machine to use.
* jvm_memory_max_bytes: The maximum amount of memory in bytes that can be used for memory management.
* jvm_memory_used_bytes: The amount of used memory.
* jvm_threads_peak_threads: The peak live thread count since the Java virtual machine started or peak was reset.
* jvm_threads_states_threads: The current number of threads having NEW state.
* jvm_gc_memory_allocated_bytes_total: Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next.
* jvm_gc_max_data_size_bytes: Max size of long-lived heap memory pool.
* jvm_gc_pause_seconds_count: Time spent count in GC pause.
* jvm_gc_pause_seconds_sum: Time spent sum in GC pause.
* jvm_gc_pause_seconds_max: Time spent max in GC pause.
* jvm_gc_live_data_size_bytes: Size of long-lived heap memory pool after reclamation.
* jvm_gc_memory_promoted_bytes_total: Count of positive increases in the size of the old generation memory pool before GC to after GC.
* jvm_classes_loaded_classes: The number of classes that are currently loaded in the Java virtual machine.
* jvm_threads_live_threads: The current number of live threads including both daemon and non-daemon threads.
* jvm_threads_daemon_threads: The current number of live daemon threads.
* jvm_classes_unloaded_classes_total: The total number of classes unloaded since the Java virtual machine has started execution.
* process_cpu_usage: The "recent cpu usage" for the Java Virtual Machine process.
* process_start_time_seconds: Start time of the process since unix epoch.
* process_uptime_seconds: The uptime of the Java virtual machine.


## Other Metrics
* jetty_threads_config_max: The maximum number of threads in the pool.
* jetty_threads_config_min: The minimum number of threads in the pool.
* jetty_threads_current: The total number of threads in the pool.
* jetty_threads_idle: The number of idle threads in the pool.
* jetty_threads_busy: The number of busy threads in the pool.
* jetty_threads_jobs: Number of jobs queued waiting for a thread.
* process_files_max_files: The maximum file descriptor count.
* process_files_open_files: The open file descriptor count.
* system_cpu_usage: The "recent cpu usage" for the whole system.
* system_cpu_count: The number of processors available to the Java virtual machine.
* system_load_average_1m: The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time.
* logback_events_total: Number of level events that made it to the logs
## Name Mapping

### Prometheus

## Metrics List

- We categorize metrics by dolphin scheduler components such as `master server`, `worker server`, `api server` and `alert server`.
- Although task / workflow related metrics exported by `master server` and `worker server`, we categorize them separately for users to find them more conveniently.

### Task Related Metrics

- ds.task.timeout.count: (counter) the number of timeout tasks
- ds.task.finish.count: (counter) the number of finished tasks, both succeeded and failed included
- ds.task.success.count: (counter) the number of successful tasks
- ds.task.retry.count: (counter) the number of retried tasks
- ds.task.failover.count: (counter) the number of task fail-overs
- ds.task.dispatch.count: (counter) the number of tasks dispatched to worker
- ds.task.dispatch.failure.count: (counter) the number of tasks failed to dispatch, retry failure included
- ds.task.dispatch.error.count: (counter) the number of task dispatch errors
- ds.task.execution.count.by.type: (counter) the number of task executions grouped by tag `task_type`
- ds.task.running: (gauge) the number of running tasks
- ds.task.execution.count: (histogram) the number of executed tasks
- ds.task.execution.duration: (histogram) duration of task executions


### Workflow Related Metrics

- ds.workflow.create.command.count: (counter) the number of commands created and inserted by workflows
- ds.workflow.instance.submit.count: (counter) the number of submitted workflow instances
- ds.workflow.instance.running: (gauge) the number of running workflow instances
- ds.workflow.instance.timeout.count: (counter) the number of timeout workflow instances
- ds.workflow.instance.finish.count: (counter) indicates the number of finished workflow instances, both successes and failures included
- ds.workflow.instance.success.count: (counter) the number of successful workflow instances
- ds.workflow.instance.stop.count: (counter) the number of stopped workflow instances
- ds.workflow.instance.failover.count: (counter) the number of workflow instance fail-overs

### Master Server Metrics

- ds.master.overload.count: the number of times the master overloaded
- ds.master.consume.command.count: the number of commands consumed by master

### Worker Server Metrics

- ds.worker.overload.count: the number of times the worker overloaded
- ds.worker.full.submit.queue.count: the number of times the worker's submit queue being full


### Api Server Metrics

### Alert Server Related

In each server, there are some default system-level metrics related to `database connection`, `JVM`, etc. We list them below for your reference:

### Database Related Metrics (Default)

- hikaricp_connections_creation_seconds_max: Connection creation time max.
- hikaricp_connections_creation_seconds_count: Connection creation time count.
- hikaricp_connections_creation_seconds_sum: Connection creation time sum.
- hikaricp_connections_acquire_seconds_max: Connection acquire time max.
- hikaricp_connections_acquire_seconds_count: Connection acquire time count.
- hikaricp_connections_acquire_seconds_sum: Connection acquire time sum.
- hikaricp_connections_usage_seconds_max: Connection usage max.
- hikaricp_connections_usage_seconds_count: Connection usage time count.
- hikaricp_connections_usage_seconds_sum: Connection usage time sum.
- hikaricp_connections_max: Max connections.
- hikaricp_connections_min Min connections
- hikaricp_connections_active: Active connections.
- hikaricp_connections_idle: Idle connections.
- hikaricp_connections_pending: Pending connections.
- hikaricp_connections_timeout_total: Timeout connections.
- hikaricp_connections: Total connections
- jdbc_connections_max: Maximum number of active connections that can be allocated at the same time.
- jdbc_connections_min: Minimum number of idle connections in the pool.
- jdbc_connections_idle: Number of established but idle connections.
- jdbc_connections_active: Current number of active connections that have been allocated from the data source.

### JVM Related Metrics (Default)

- jvm_buffer_total_capacity_bytes: An estimate of the total capacity of the buffers in this pool.
- jvm_buffer_count_buffers: An estimate of the number of buffers in the pool.
- jvm_buffer_memory_used_bytes: An estimate of the memory that the Java virtual machine is using for this buffer pool.
- jvm_memory_committed_bytes: The amount of memory in bytes that is committed for the Java virtual machine to use.
- jvm_memory_max_bytes: The maximum amount of memory in bytes that can be used for memory management.
- jvm_memory_used_bytes: The amount of used memory.
- jvm_threads_peak_threads: The peak live thread count since the Java virtual machine started or peak was reset.
- jvm_threads_states_threads: The current number of threads having NEW state.
- jvm_gc_memory_allocated_bytes_total: Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next.
- jvm_gc_max_data_size_bytes: Max size of long-lived heap memory pool.
- jvm_gc_pause_seconds_count: Time spent count in GC pause.
- jvm_gc_pause_seconds_sum: Time spent sum in GC pause.
- jvm_gc_pause_seconds_max: Time spent max in GC pause.
- jvm_gc_live_data_size_bytes: Size of long-lived heap memory pool after reclamation.
- jvm_gc_memory_promoted_bytes_total: Count of positive increases in the size of the old generation memory pool before GC to after GC.
- jvm_classes_loaded_classes: The number of classes that are currently loaded in the Java virtual machine.
- jvm_threads_live_threads: The current number of live threads including both daemon and non-daemon threads.
- jvm_threads_daemon_threads: The current number of live daemon threads.
- jvm_classes_unloaded_classes_total: The total number of classes unloaded since the Java virtual machine has started execution.
- process_cpu_usage: The "recent cpu usage" for the Java Virtual Machine process.
- process_start_time_seconds: Start time of the process since unix epoch.
- process_uptime_seconds: The uptime of the Java virtual machine.

### Others (Default)

- jetty_threads_config_max: The maximum number of threads in the pool.
- jetty_threads_config_min: The minimum number of threads in the pool.
- jetty_threads_current: The total number of threads in the pool.
- jetty_threads_idle: The number of idle threads in the pool.
- jetty_threads_busy: The number of busy threads in the pool.
- jetty_threads_jobs: Number of jobs queued waiting for a thread.
- process_files_max_files: The maximum file descriptor count.
- process_files_open_files: The open file descriptor count.
- system_cpu_usage: The "recent cpu usage" for the whole system.
- system_cpu_count: The number of processors available to the Java virtual machine.
- system_load_average_1m: The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time.
- logback_events_total: Number of level events that made it to the logs
@@ -30,15 +30,15 @@ private MasterServerMetrics() {
* Used to measure the master server is overload.
*/
private static final Counter MASTER_OVERLOAD_COUNTER =
Counter.builder("dolphinscheduler_master_overload_count")
Counter.builder("ds.master.overload.count")
.description("Master server overload count")
.register(Metrics.globalRegistry);

/**
* Used to measure the number of process command consumed by master.
*/
private static final Counter MASTER_CONSUME_COMMAND_COUNTER =
Counter.builder("dolphinscheduler_master_consume_command_count")
Counter.builder("ds.master.consume.command.count")
.description("Master server consume command count")
.register(Metrics.globalRegistry);

@@ -30,22 +30,22 @@ private ProcessInstanceMetrics() {
}

private static final Counter PROCESS_INSTANCE_SUBMIT_COUNTER =
Counter.builder("dolphinscheduler_process_instance_submit_count")
Counter.builder("ds.workflow.instance.submit.count")
.description("Process instance submit total count")
.register(Metrics.globalRegistry);

private static final Counter PROCESS_INSTANCE_TIMEOUT_COUNTER =
Counter.builder("dolphinscheduler_process_instance_timeout_count")
Counter.builder("ds.workflow.instance.timeout.count")
.description("Process instance timeout total count")
.register(Metrics.globalRegistry);

private static final Counter PROCESS_INSTANCE_FINISH_COUNTER =
Counter.builder("dolphinscheduler_process_instance_finish_count")
Counter.builder("ds.workflow.instance.finish.count")
.description("Process instance finish total count")
.register(Metrics.globalRegistry);

private static final Counter PROCESS_INSTANCE_SUCCESS_COUNTER =
Counter.builder("dolphinscheduler_process_instance_success_count")
Counter.builder("ds.workflow.instance.success.count")
.description("Process instance success total count")
.register(Metrics.globalRegistry);

@@ -55,17 +55,17 @@ private ProcessInstanceMetrics() {
.register(Metrics.globalRegistry);

private static final Counter PROCESS_INSTANCE_STOP_COUNTER =
Counter.builder("dolphinscheduler_process_instance_stop_count")
Counter.builder("ds.workflow.instance.stop.count")
.description("Process instance stop total count")
.register(Metrics.globalRegistry);

private static final Counter PROCESS_INSTANCE_FAILOVER_COUNTER =
Counter.builder("dolphinscheduler_process_instance_failover_count")
Counter.builder("ds.workflow.instance.failover.count")
.description("Process instance failover total count")
.register(Metrics.globalRegistry);

public static synchronized void registerProcessInstanceRunningGauge(Supplier<Number> function) {
Gauge.builder("dolphinscheduler_process_instance_running_gauge", function)
Gauge.builder("ds.workflow.instance.running", function)
.description("The current running process instance count")
.register(Metrics.globalRegistry);
}
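
The renamed meters above use dot-separated names, which a Prometheus registry rewrites when it exports them. The sketch below is a simplified approximation of that rewriting, assuming dots become underscores and counters gain a `_total` suffix. The class `PrometheusNameSketch` is hypothetical and is not Micrometer's implementation; the names actually exported depend on the Micrometer version and registry configuration.

```java
// Simplified approximation for illustration only; not Micrometer's actual code.
final class PrometheusNameSketch {

    private PrometheusNameSketch() {
    }

    // Converts a dot-separated meter name to a snake_case name.
    // Assumption: counters get a "_total" suffix unless they already end with it.
    static String toPrometheusName(String meterName, boolean isCounter) {
        String snake = meterName.replace('.', '_');
        if (isCounter && !snake.endsWith("_total")) {
            snake += "_total";
        }
        return snake;
    }
}
```

Under these assumptions, the counter `ds.master.overload.count` maps to `ds_master_overload_count_total`, and the gauge `ds.workflow.instance.running` maps to `ds_workflow_instance_running`.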