Fixes #1182: Refactor Prometheus metrics reporting (ISSUE-1182 part 2)
 o Implement dynamic registration of each alloc_pool metric
 o Refactor http-libwebsockets.c /metrics implementation
 o Update unit tests
kgiusti committed Sep 19, 2023
1 parent 0d9a19a commit 6860923
Showing 6 changed files with 613 additions and 265 deletions.
111 changes: 111 additions & 0 deletions docs/notes/prometheus.adoc
@@ -0,0 +1,111 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License
////
= Monitoring Router Metrics Via Prometheus
The router can be configured to enable metrics scraping via
Prometheus. Metrics are provided via an HTTP service running in the
router. A snapshot of the metrics can be obtained by issuing an HTTP
GET request to the service for the */metrics* URL path.
== Configuration
Prometheus support is disabled by default. To enable access to the
metrics, an HTTP service must be configured on the router. This is done by
specifying an *io.skupper.router.listener* entry in the router
configuration (or via in-band management). The listener entry must
provide:
* The host IP address/name
* The TCP port number
* The _http_ attribute set to True
For example, the following listener entry enables an HTTP server
listening on localhost port 22976.
  listener {
      port: 22976
      http: True
      host: localhost
      saslMechanisms: ANONYMOUS
      idleTimeoutSeconds: 120
      authenticatePeer: no
      role: normal
  }
The Prometheus server must also be configured to scrape the
router. This requires adding a job for the router to the
*scrape_configs* section of the Prometheus server's configuration.
An example job configuration for the above example listener could be:
  scrape_configs:
    - job_name: skupper-router
      metrics_path: /metrics
      static_configs:
        - targets:
            - localhost:22976
== Metrics
The metrics provided by the router are intended for use by developers
to aid fault monitoring and debugging. Therefore the metrics content
may change between releases as features are added or removed.
=== Heap Allocation Metrics
A subset of the router metrics are concerned with the router's heap
memory utilization. The router uses a cache to manage instances of
data objects that have been allocated from the heap. This cache avoids
the overhead of allocating and freeing frequently used data objects
from the system's heap.
See alloc_pool.c for implementation details.
The cache is a pool of data objects that have been allocated from the
heap for use by the router. Each data type has its own dedicated
cache. When the router needs an instance of a data type it will
first attempt to claim an object from that type's cache. If the cache is
empty, the router will instead allocate a batch of data objects from
the system heap. It will reserve one data object instance from the
batch for immediate use and place the remaining objects into the cache. When
the router no longer needs a particular instance of a data object, the
instance is placed back into the cache and can be re-used at a later time.
Given this implementation, a particular instance of a data object may
be either:
* in the cache (in standby - available for use when needed)
* or currently in use by the router.
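The claim/release cycle described above can be sketched as a small free-list pool. This is an illustrative sketch only, not the code in alloc_pool.c: the names `pool_t`, `pool_alloc`, `pool_free` and the `BATCH_SIZE` value are hypothetical, and the real allocator adds locking and other details that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH_SIZE 8 /* hypothetical batch size, not the router's actual value */

/* Hypothetical free-list link; it overlays an object while the object is cached. */
typedef struct pool_item {
    struct pool_item *next;
} pool_item_t;

typedef struct {
    size_t       obj_size;  /* size of one object, at least sizeof(pool_item_t) */
    pool_item_t *free_list; /* objects in the cache (standby) */
    size_t       allocated; /* objects currently allocated from the heap */
    size_t       cached;    /* objects currently in the cache */
} pool_t;

void pool_init(pool_t *pool, size_t obj_size)
{
    pool->obj_size  = obj_size < sizeof(pool_item_t) ? sizeof(pool_item_t) : obj_size;
    pool->free_list = NULL;
    pool->allocated = 0;
    pool->cached    = 0;
}

/* Claim an object from the cache. If the cache is empty, allocate a batch
 * from the heap first; one object of the batch is then handed to the caller
 * and the rest stay in the cache. */
void *pool_alloc(pool_t *pool)
{
    if (!pool->free_list) {
        for (int i = 0; i < BATCH_SIZE; i++) {
            pool_item_t *item = malloc(pool->obj_size);
            if (!item)
                break;
            item->next      = pool->free_list;
            pool->free_list = item;
            pool->allocated++;
            pool->cached++;
        }
        if (!pool->free_list)
            return NULL; /* heap exhausted */
    }

    pool_item_t *item = pool->free_list;
    pool->free_list   = item->next;
    pool->cached--;
    return item;
}

/* Return an object to the cache so it can be re-used later. */
void pool_free(pool_t *pool, void *obj)
{
    pool_item_t *item = obj;
    item->next      = pool->free_list;
    pool->free_list = item;
    pool->cached++;
    assert(pool->cached <= pool->allocated);
}

/* Objects that are neither cached nor freed are in use. */
size_t pool_in_use(const pool_t *pool)
{
    return pool->allocated - pool->cached;
}
```

In this sketch the first `pool_alloc()` call on an empty pool leaves `BATCH_SIZE - 1` objects in the cache and one object in use. A later `pool_free()` moves that object back to the cache without returning it to the heap.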
Each data type will have a set of 4 metrics associated with it:
* allocated: total number of objects that are currently allocated from the heap
* in_use: total objects currently being used by the router
* cached: total objects in the cache
* memory_use_bytes: the sum of all memory allocated from the heap for the given data type
These metrics adhere to the following relationships:
* allocated = in_use + cached
* memory_use_bytes = (sizeof(<type>) * allocated)
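As a worked illustration of these relationships, the sketch below derives the four values for one data type and renders them as plain `name value` lines of the kind accepted by the Prometheus text exposition format. The type `pool_metrics_t`, both functions, and the `<type>_<metric>` naming are assumptions made for this example; they are not taken from the router's /metrics implementation, whose actual metric names may differ.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the four per-type metrics. */
typedef struct {
    uint64_t allocated;
    uint64_t in_use;
    uint64_t cached;
    uint64_t memory_use_bytes;
} pool_metrics_t;

/* Derive the dependent values from the two independent counters:
 *   allocated        = in_use + cached
 *   memory_use_bytes = sizeof(<type>) * allocated */
pool_metrics_t pool_metrics_compute(uint64_t in_use, uint64_t cached, size_t type_size)
{
    pool_metrics_t m;
    m.in_use           = in_use;
    m.cached           = cached;
    m.allocated        = in_use + cached;
    m.memory_use_bytes = (uint64_t) type_size * m.allocated;
    return m;
}

/* Write one "name value" line per metric into buf.
 * Returns the snprintf() result: the number of characters needed,
 * which is >= len if the output was truncated. */
int pool_metrics_format(char *buf, size_t len, const char *type_name, const pool_metrics_t *m)
{
    assert(buf && type_name && m);
    return snprintf(buf, len,
                    "%s_allocated %" PRIu64 "\n"
                    "%s_in_use %" PRIu64 "\n"
                    "%s_cached %" PRIu64 "\n"
                    "%s_memory_use_bytes %" PRIu64 "\n",
                    type_name, m->allocated,
                    type_name, m->in_use,
                    type_name, m->cached,
                    type_name, m->memory_use_bytes);
}
```

For example, a 64-byte type with 3 objects in use and 5 cached gives allocated = 8 and memory_use_bytes = 512.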
4 changes: 2 additions & 2 deletions include/qpid/dispatch/alloc_pool.h
@@ -122,7 +122,7 @@ static inline void *qd_alloc_deref_safe_ptr(const qd_alloc_safe_ptr_t *sp)
*/
void qd_alloc_desc_init(const char *name, qd_alloc_type_desc_t *desc, size_t size, const size_t *additional_size,
const qd_alloc_config_t *config);
-qd_alloc_stats_t qd_alloc_desc_stats(qd_alloc_type_desc_t *desc);
+qd_alloc_stats_t qd_alloc_desc_stats(const qd_alloc_type_desc_t *desc); // thread safe
// clang-format off
#define ALLOC_DEFINE_CONFIG(T,S,A,C) \
qd_alloc_type_desc_t __desc_##T __attribute__((aligned(64))); \
@@ -152,5 +152,5 @@ qd_alloc_stats_t qd_alloc_desc_stats(qd_alloc_type_desc_t *desc);
void qd_alloc_initialize(void);
void qd_alloc_debug_dump(const char *file);
void qd_alloc_finalize(void);

size_t qd_alloc_type_size(const qd_alloc_type_desc_t *desc); // thread safe
#endif
22 changes: 19 additions & 3 deletions src/alloc_pool.c
@@ -24,6 +24,7 @@
#include "config.h"
#include "entity.h"
#include "entity_cache.h"
#include "http.h"
#include "qd_asan_interface.h"

#include "qpid/dispatch/alloc.h"
@@ -500,9 +501,17 @@ void qd_alloc_initialize(void)
desc->debug = (void *) items;
#endif

// cycle the lock to flush the initialized desc before handing it off to other threads (avoids a spurious tsan
// error)

sys_mutex_lock(&desc->lock);
sys_mutex_unlock(&desc->lock);

// now add the descriptor to the management entity database
// and telemetry metrics

qd_entity_cache_add(QD_ALLOCATOR_TYPE, desc);
qd_http_add_alloc_metric(desc->type_name, desc);
}

#ifdef QD_MEMORY_DEBUG
@@ -545,6 +554,7 @@ void qd_alloc_finalize(void)

for (qd_alloc_type_desc_t *desc = DEQ_HEAD(desc_list); desc; desc = DEQ_NEXT(desc)) {
qd_entity_cache_remove(QD_ALLOCATOR_TYPE, desc);
qd_http_remove_alloc_metric(desc->type_name);

//
// Reclaim the items on the global free pool
@@ -672,15 +682,21 @@ QD_EXPORT qd_error_t qd_entity_refresh_allocator(qd_entity_t* entity, void *impl
return qd_error_code();
}

-qd_alloc_stats_t qd_alloc_desc_stats(qd_alloc_type_desc_t *desc)
+qd_alloc_stats_t qd_alloc_desc_stats(const qd_alloc_type_desc_t *desc)
 {
-    sys_mutex_lock(&desc->lock);
+    sys_mutex_t *lock = (sys_mutex_t *) &desc->lock; // cast away const
+    sys_mutex_lock(lock);
     qd_alloc_stats_t stats = desc->stats;
-    sys_mutex_unlock(&desc->lock);
+    sys_mutex_unlock(lock);

return stats;
}

size_t qd_alloc_type_size(const qd_alloc_type_desc_t *desc)
{
return desc->total_size;
}

void qd_alloc_debug_dump(const char *file) {
debug_dump = file ? strdup(file) : 0;
}