[CELEBORN-1007] Improve JVM metrics naming and add ThreadStates metrics #1939

Closed
wants to merge 3 commits into from

Conversation

onebox-li
Contributor

@onebox-li onebox-li commented Sep 27, 2023

What changes were proposed in this pull request?

We use codahale metrics to expose JVM metrics. Without a prefix the metric names are unclear, and it is hard to build a Grafana template for them because the collector name or pool name is embedded in the metric name rather than exposed as a label.

This PR adds a jvm prefix to the metric names, removes the collector/pool name from the name, and exposes that name as a label where needed.
It also adds ThreadStates metrics.

Why are the changes needed?

Make JVM metrics easier to understand and easier to build a Grafana template for.

Does this PR introduce any user-facing change?

Yes. The JVM metric names are changed, and thread state metrics are newly exposed.

Examples of the changes:
For GarbageCollectorMetricSet, G1-Old-Generation.time -> jvm.gc.time{name="G1-Old-Generation"}
For MemoryUsageGaugeSet, total.init -> jvm.total.init ; pools.Metaspace.usage -> jvm.memory.pool.usage{name="Metaspace"}
For BufferPoolMetricSet, direct.count -> jvm.direct.count
For ThreadStatesGaugeSet, new metrics such as jvm.thread.count are added.
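
The renaming rule can be sketched in a few lines. This is an illustration in Python, not the code in JVMSource.scala: the function names and the `kind` tag are hypothetical, and the target names follow the G1 sample output in this description (for example `metrics_jvm_memory_pool_usage_Value`).

```python
import re


def rename_jvm_metric(kind: str, name: str):
    """Map an unprefixed codahale metric name to (prefixed name, labels).

    `kind` is a hypothetical tag for the originating metric set:
    "gc", "memory", "buffer" or "thread".
    """
    if kind == "gc":
        # "<collector>.<count|time>" -> jvm.gc.<field>{name="<collector>"}
        collector, field = name.rsplit(".", 1)
        return f"jvm.gc.{field}", {"name": collector}
    if kind == "memory":
        if name.startswith("pools."):
            # "pools.<pool>.<field>" -> jvm.memory.pool.<field>{name="<pool>"}
            pool, field = name[len("pools."):].rsplit(".", 1)
            return f"jvm.memory.pool.{field}", {"name": pool}
        return f"jvm.{name}", {}
    if kind == "buffer":
        return f"jvm.{name}", {}
    if kind == "thread":
        return f"jvm.thread.{name}", {}
    raise ValueError(f"unknown metric set kind: {kind}")


def prometheus_name(metric: str) -> str:
    """Render a dotted gauge name the way the sample output shows it."""
    # Assumption: every character outside [A-Za-z0-9_] becomes "_".
    return "metrics_" + re.sub(r"[^A-Za-z0-9_]", "_", metric) + "_Value"
```

Because the collector or pool ends up in the `name` label, a single series name such as `metrics_jvm_gc_time_Value` covers every collector, which is what makes a generic dashboard query possible.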

With G1, the JVM metrics now exposed are:
metrics_jvm_gc_time_Value{name="G1-Old-Generation",role="Worker"} 0 1695731141588
metrics_jvm_gc_count_Value{name="G1-Young-Generation",role="Worker"} 2 1695731141588
metrics_jvm_gc_time_Value{name="G1-Young-Generation",role="Worker"} 74 1695731141588
metrics_jvm_gc_count_Value{name="G1-Old-Generation",role="Worker"} 0 1695731141588

metrics_jvm_heap_committed_Value{role="Worker"} 2109734912 1695731141588
metrics_jvm_non_heap_used_Value{role="Worker"} 47700056 1695731141588
metrics_jvm_heap_used_Value{role="Worker"} 82801184 1695731141588
metrics_jvm_total_committed_Value{role="Worker"} 2160263168 1695731141588
metrics_jvm_total_init_Value{role="Worker"} 2112290816 1695731141588
metrics_jvm_non_heap_max_Value{role="Worker"} -1 1695731141588
metrics_jvm_heap_usage_Value{role="Worker"} 0.009639326483011246 1695731141588
metrics_jvm_total_used_Value{role="Worker"} 130502480 1695731141589
metrics_jvm_heap_init_Value{role="Worker"} 2109734912 1695731141589
metrics_jvm_non_heap_committed_Value{role="Worker"} 50528256 1695731141589
metrics_jvm_non_heap_init_Value{role="Worker"} 2555904 1695731141589
metrics_jvm_non_heap_usage_Value{role="Worker"} -4.7701296E7 1695731141589
metrics_jvm_heap_max_Value{role="Worker"} 8589934592 1695731141589
metrics_jvm_total_max_Value{role="Worker"} 8589934591 1695731141589
metrics_jvm_memory_pool_used_Value{name="Code-Cache",role="Worker"} 10314368 1695731141588
metrics_jvm_memory_pool_committed_Value{name="Code-Cache",role="Worker"} 10944512 1695731141588
metrics_jvm_memory_pool_init_Value{name="G1-Eden-Space",role="Worker"} 111149056 1695731141588
metrics_jvm_memory_pool_max_Value{name="G1-Old-Gen",role="Worker"} 8589934592 1695731141588
metrics_jvm_memory_pool_used_after_gc_Value{name="G1-Survivor-Space",role="Worker"} 14680064 1695731141588
metrics_jvm_memory_pool_used_Value{name="Compressed-Class-Space",role="Worker"} 4440192 1695731141588
metrics_jvm_memory_pool_usage_Value{name="Metaspace",role="Worker"} 0.9449504192610433 1695731141588
metrics_jvm_memory_pool_max_Value{name="Metaspace",role="Worker"} -1 1695731141588
metrics_jvm_memory_pool_init_Value{name="G1-Survivor-Space",role="Worker"} 0 1695731141588
metrics_jvm_memory_pool_committed_Value{name="G1-Old-Gen",role="Worker"} 1998585856 1695731141588
metrics_jvm_memory_pool_committed_Value{name="G1-Survivor-Space",role="Worker"} 14680064 1695731141588
metrics_jvm_memory_pool_committed_Value{name="G1-Eden-Space",role="Worker"} 96468992 1695731141588
metrics_jvm_memory_pool_max_Value{name="G1-Survivor-Space",role="Worker"} -1 1695731141588
metrics_jvm_memory_pool_usage_Value{name="Compressed-Class-Space",role="Worker"} 0.004135251045227051 1695731141588
metrics_jvm_memory_pool_usage_Value{name="G1-Survivor-Space",role="Worker"} 1.0 1695731141588
metrics_jvm_memory_pool_max_Value{name="Code-Cache",role="Worker"} 251658240 1695731141588
metrics_jvm_memory_pool_init_Value{name="Compressed-Class-Space",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_usage_Value{name="G1-Eden-Space",role="Worker"} 0.34782608695652173 1695731141589
metrics_jvm_memory_pool_init_Value{name="Metaspace",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_max_Value{name="G1-Eden-Space",role="Worker"} -1 1695731141589
metrics_jvm_memory_pool_usage_Value{name="Code-Cache",role="Worker"} 0.04098917643229167 1695731141589
metrics_jvm_memory_pool_used_after_gc_Value{name="G1-Eden-Space",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_init_Value{name="Code-Cache",role="Worker"} 2555904 1695731141589
metrics_jvm_memory_pool_used_Value{name="G1-Survivor-Space",role="Worker"} 14680064 1695731141589
metrics_jvm_memory_pool_committed_Value{name="Compressed-Class-Space",role="Worker"} 4718592 1695731141589
metrics_jvm_memory_pool_used_Value{name="G1-Eden-Space",role="Worker"} 33554432 1695731141589
metrics_jvm_memory_pool_used_Value{name="G1-Old-Gen",role="Worker"} 34566688 1695731141589
metrics_jvm_memory_pool_usage_Value{name="G1-Old-Gen",role="Worker"} 0.004024092108011246 1695731141589
metrics_jvm_memory_pool_used_after_gc_Value{name="G1-Old-Gen",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_committed_Value{name="Metaspace",role="Worker"} 34865152 1695731141589
metrics_jvm_memory_pool_init_Value{name="G1-Old-Gen",role="Worker"} 1998585856 1695731141589
metrics_jvm_memory_pool_used_Value{name="Metaspace",role="Worker"} 32945840 1695731141589
metrics_jvm_memory_pool_max_Value{name="Compressed-Class-Space",role="Worker"} 1073741824 1695731141589

metrics_jvm_direct_count_Value{role="Worker"} 8 1695731141589
metrics_jvm_direct_capacity_Value{role="Worker"} 1036 1695731141589
metrics_jvm_direct_used_Value{role="Worker"} 1037 1695731141589
metrics_jvm_mapped_used_Value{role="Worker"} 0 1695731141589
metrics_jvm_mapped_capacity_Value{role="Worker"} 0 1695731141589
metrics_jvm_mapped_count_Value{role="Worker"} 0 1695731141589

metrics_jvm_thread_timed_waiting_count_Value{role="Worker"} 23 1695731141589
metrics_jvm_thread_deadlock_count_Value{role="Worker"} 0 1695731141589
metrics_jvm_thread_count_Value{role="Worker"} 78 1695731141589
metrics_jvm_thread_waiting_count_Value{role="Worker"} 45 1695731141589
metrics_jvm_thread_daemon_count_Value{role="Worker"} 75 1695731141589
metrics_jvm_thread_new_count_Value{role="Worker"} 0 1695731141589
metrics_jvm_thread_blocked_count_Value{role="Worker"} 0 1695731141590
metrics_jvm_thread_deadlocks_Value{role="Worker"} [] 1695731141590
metrics_jvm_thread_runnable_count_Value{role="Worker"} 10 1695731141590
metrics_jvm_thread_terminated_count_Value{role="Worker"} 0 1695731141590
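
To illustrate why labels are easier to work with than names, here is a minimal Python sketch that parses lines in the format shown above and collects one metric across all pools. The regular expressions and helper names are assumptions made for this example; they handle only the simple `name{labels} value timestamp` shape seen in this output, not the full Prometheus text format.

```python
import re

# Matches: <metric>{<labels>} <value> <timestamp-ms>
LINE = re.compile(
    r'^(?P<metric>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}'
    r'\s+(?P<value>\S+)\s+(?P<ts>\d+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line: str):
    """Split one exposition line into (metric, labels, raw value, timestamp)."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    labels = dict(LABEL.findall(m.group("labels")))
    return m.group("metric"), labels, m.group("value"), int(m.group("ts"))


def by_pool(lines, metric: str) -> dict:
    """Collect the value of `metric` for every series that has a `name` label."""
    result = {}
    for line in lines:
        name, labels, value, _ = parse_line(line)
        if name == metric and "name" in labels:
            result[labels["name"]] = float(value)
    return result


sample = [
    'metrics_jvm_memory_pool_used_Value{name="Code-Cache",role="Worker"} 10314368 1695731141588',
    'metrics_jvm_memory_pool_used_Value{name="Metaspace",role="Worker"} 32945840 1695731141589',
    'metrics_jvm_memory_pool_max_Value{name="Metaspace",role="Worker"} -1 1695731141588',
    'metrics_jvm_heap_used_Value{role="Worker"} 82801184 1695731141588',
]
pool_used = by_pool(sample, "metrics_jvm_memory_pool_used_Value")
```

With the pool in the metric name instead, the same lookup would need one hard-coded metric name per pool and per garbage collector.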

How was this patch tested?

Unit tests and cluster tests with G1, PS-Scavenge/PS-MarkSweep, and ParNew/CMS.

@onebox-li
Contributor Author

If this PR is approved, I will later add a simple Grafana template for CELEBORN-688 based on this change.

@codecov

codecov bot commented Sep 27, 2023

Codecov Report

Merging #1939 (ab044ae) into main (aa9dfd0) will increase coverage by 0.01%.
Report is 1 commits behind head on main.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main    #1939      +/-   ##
==========================================
+ Coverage   46.60%   46.60%   +0.01%     
==========================================
  Files         164      164              
  Lines       10293    10325      +32     
  Branches      938      943       +5     
==========================================
+ Hits         4796     4811      +15     
- Misses       5184     5203      +19     
+ Partials      313      311       -2     
Files Coverage Δ
...che/celeborn/common/metrics/source/JVMSource.scala 0.00% <0.00%> (ø)

... and 2 files with indirect coverage changes


Contributor

@waitinfuture waitinfuture left a comment

LGTM, thanks for the refinement! cc @AngersZhuuuu, you may be interested in this PR because the metric names are changed, but I think the new names are more reasonable. Merging to main/0.3.

waitinfuture pushed a commit that referenced this pull request Sep 28, 2023
Closes #1939 from onebox-li/improve-jvm-metrics.

Authored-by: onebox-li <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit b4dfc09)
Signed-off-by: zky.zhoukeyong <[email protected]>
Contributor

@AngersZhuuuu AngersZhuuuu left a comment

Late LGTM

@onebox-li onebox-li deleted the improve-jvm-metrics branch September 28, 2023 02:39
pan3793 pushed a commit that referenced this pull request Sep 28, 2023
@pan3793
Member

pan3793 commented Sep 28, 2023

@waitinfuture this is a breaking change of sorts in branch-0.3; should a migration guide be provided?

pan3793 added a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

Mention metrics name change in Migration Guide

### Why are the changes needed?

#1939

### Does this PR introduce _any_ user-facing change?

Yes, docs updated.

### How was this patch tested?

Review.

Closes #1950 from pan3793/CELEBORN-1007-followup.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
pan3793 added a commit that referenced this pull request Sep 28, 2023
(cherry picked from commit 84ef527)
Signed-off-by: Cheng Pan <[email protected]>
@FMX
Contributor

FMX commented Oct 12, 2023

@onebox-li Hi, there is an issue after this PR. Prometheus stops ingesting the metrics output and reports an error:

ts=2023-10-11T10:30:57.088Z caller=scrape.go:1399 level=debug component="scrape manager" scrape_pool=RSS target=http://core-1-1:9096/metrics/prometheus msg="Append failed" err="strconv.ParseFloat: parsing \"[]\": invalid syntax while parsing: \"metrics_jvm_thread_deadlocks_Value{role=\\\"Worker\\\"} []\""
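
The failure comes down to the sample value: every value in a scrape has to parse as a float, and `[]` does not. A minimal Python sketch of the check follows, together with one possible guard that skips non-numeric gauges. Python's `float()` stands in for the scraper's parser here, and the helper names and the guard are assumptions for illustration, not the fix the project applied.

```python
from numbers import Number


def is_scrapable(rendered: str) -> bool:
    """Return True if a rendered sample value parses as a float."""
    try:
        float(rendered)
    except ValueError:
        return False
    return True


def numeric_only(gauges: dict) -> dict:
    """Keep only gauges whose value is a number (booleans excluded)."""
    return {
        name: value
        for name, value in gauges.items()
        if isinstance(value, Number) and not isinstance(value, bool)
    }


# Hypothetical snapshot: the deadlocks gauge holds a collection, not a number,
# which is what the "[]" in the error message suggests.
gauges = {
    "jvm.thread.count": 78,
    "jvm.thread.deadlock.count": 0,
    "jvm.thread.deadlocks": [],
}
safe = numeric_only(gauges)
```

The numeric `jvm.thread.deadlock.count` gauge still carries the deadlock signal, so dropping the collection-valued gauge from the Prometheus output loses little.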

waitinfuture pushed a commit that referenced this pull request Oct 13, 2023
### What changes were proposed in this pull request?
Currently there is no Grafana template for JVM metrics, either in this project or on Grafana Labs, so this PR adds one.
It is based on the change in #1939.
The template uses two variables (`instance` and `pool`).
The layout is divided into 5 rows.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/732cff90-463c-47b5-89b8-fa8dbbf33b1e)

With G1, the panels look like this:
![image](https://github.com/apache/incubator-celeborn/assets/19429353/919b7e9e-f86a-4341-a004-7f0394e1d8b2)

The JVM Memory Pools row uses repeated panels, which are automatically replicated for each value of the `pool` variable.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/3bdf7a3c-d4e0-42ea-bbe0-012da55a61d1)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/8feaf9b7-156d-453e-8188-40a0399ea516)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/cba4b61c-7d66-4893-9f07-6157c64869bd)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/09b473ef-434c-4fd0-aa4b-084f7588a4f7)

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, this dashboard is based on changes in #1939

### How was this patch tested?
Cluster test

Closes #1964 from onebox-li/add-jvm-dashboard.

Authored-by: onebox-li <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
waitinfuture pushed a commit that referenced this pull request Oct 13, 2023
(cherry picked from commit 2b79692)
Signed-off-by: zky.zhoukeyong <[email protected]>
gotikkoxq added a commit to gotikkoxq/celeborn that referenced this pull request Aug 26, 2024