
In multi-worker mode, calling reset causes other workers' lookup tables to diverge from key_index #168

Closed
Ashion opened this issue Mar 8, 2024 · 6 comments · Fixed by #171

Ashion commented Mar 8, 2024

As described in the title.

Question

Steps:

  1. register a gauge metric "A" in init_worker
  2. call metric "A":set(1, {"v1", "v2"}) to set the metric value
  3. call metric "A":reset() in one worker, e.g. worker 1910
  4. in a different worker, e.g. worker 1911, call metric "A":set(2, {"v1", "v2"}) to set the metric value, then call metric_data

Expected Result:

metric_data returns metric "A".

Actual Result:

nothing is returned.
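
A minimal Lua sketch of these steps, assuming the library's documented API (the shared dict name, metric name and worker IDs are illustrative):

  -- init_worker_by_lua_block, runs in every worker:
  prometheus = require("prometheus").init("prometheus_metrics")
  metric_a = prometheus:gauge("A", "example gauge", {"l1", "l2"})

  -- any worker:
  metric_a:set(1, {"v1", "v2"})

  -- worker 1910:
  metric_a:reset()

  -- worker 1911 (its local lookup table still remembers the full metric name):
  metric_a:set(2, {"v1", "v2"})
  local rows = prometheus:metric_data()  -- expected to contain "A", actually empty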

Reason

  1. reset clears the calling worker's local lookup table and resets key_index in shared memory, but other workers' local lookup tables are untouched
  2. when another worker later sets the metric, it finds the full_name in its local lookup table and therefore never re-adds it to key_index; metric_data builds its key list from key_index, which is empty after the reset
  3. as a result, metric_data in the other workers returns nothing

Solution

Maybe full_name could be added to key_index in lookup_or_create(self, label_values) after line 414, like below:

  if full_name then
    -- re-add the key so it shows up in key_index even after another worker's reset
    local err = self._key_index:add(full_name, ERR_MSG_LRU_EVICTION)
    if err then
      return nil, err
    end
    return full_name
  end

knyar commented Mar 11, 2024

Thank you for debugging this issue, and for proposing a fix. This is a good example of how various performance optimizations added to this library over time (e.g. the key index) make cross-worker coordination tricky, especially when metrics are deleted.

I believe your proposed fix would work; however, adding a key to the key index on the hot path will result in KeyIndex:sync being called every time any metric is used (set or incremented). Syncing the key index involves two shared dict lookups even if there is nothing to be synced – I suspect for many users of this library this will not be acceptable from a performance point of view.

There might be a few other ways we could approach this:

  1. Invalidate each worker's key lookup table every time a metric is deleted.
  2. Make del and reset irreversible: once deleted, the same metric can no longer be used until the process is restarted.
  3. Make usage of key index optional, and mutually exclusive with metrics deletion.

I'd love to hear feedback on these options from users of this library, as well as other ideas we could explore here.

Option 1 seems pretty appealing. While it will result in temporary performance degradation every time a key is deleted, it would not change the stable-state performance or restrict available functionality. Perhaps a simple approach would be to add a timer to each worker that will reset the key lookup table every time the delete_count counter of the key index is incremented. This will still be racy with setting metric values: if another worker sets a value for a key that has just been deleted, it could still use the key lookup table. But I suspect in most cases previously-deleted metrics that "reappear" would be used more than once, and will eventually be re-added to the key index after the lookup tables have been flushed.
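
A rough sketch of what that per-worker timer could look like (the counter key name, the dict name and the flush helper are assumptions made for illustration, not the library's actual internals):

  -- Hypothetical sketch of Option 1: each worker periodically compares a shared
  -- deletion counter against the value it last saw, and flushes its local key
  -- lookup tables whenever the counter has moved.
  local last_seen_delete_count = 0

  local function check_deletes(premature)
    if premature then return end
    local dict = ngx.shared.prometheus_metrics                -- dict name is illustrative
    local count = dict:get("__key_index_delete_count") or 0   -- key name is an assumption
    if count ~= last_seen_delete_count then
      last_seen_delete_count = count
      flush_local_lookup_tables()  -- hypothetical helper that clears each metric's lookup table
    end
  end

  -- started from init_worker_by_lua; the interval is illustrative
  ngx.timer.every(1, check_deletes)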

It would also be helpful to add a test for this. Not sure if there's a way to do this purely in Lua, or if we'll need to expand the integration test that runs a real Nginx instance in a container.


Ashion commented Mar 13, 2024

I'm sorry that I only considered how to fix the problem and did not realize that KeyIndex:sync runs while keys are being added or removed.
The problem is how to keep the local lookup and the local key_index consistent with the key_index in shared memory.

Option 1, as you mentioned, does seem pretty appealing, but it needs something like a timer, an event, or another component, which may increase the complexity of the module, and it can still go wrong in some extreme conditions.

The purpose of key_index is to avoid using ngx.shared.dict:get_keys, which blocks all workers and therefore shouldn't be used with a large number of keys.

Maybe adding a function that checks whether the metric still exists in shared memory would work, but I'm not sure.
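
For what it's worth, such a check might look roughly like this (the function and field names are hypothetical and only meant to illustrate the idea):

  -- Hypothetical sketch: before trusting the worker-local lookup table, confirm
  -- that the full metric name is still present in the shared dictionary.
  local function still_exists_in_shared_dict(self, full_name)
    -- self._dict is assumed to be the ngx.shared dict that stores metric values
    return self._dict:get(full_name) ~= nil
  end

Note that this still adds a shared dict lookup on the hot path, so it has a cost concern similar to the one raised against the original proposal.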

knyar commented Mar 14, 2024

I've expanded the existing integration test to support running multiple tests in parallel, and added a metric reset test that can be used to reproduce this reliably: https://github.com/knyar/nginx-lua-prometheus/blob/test_reset/integration/test_reset.go


Ashion commented Apr 11, 2024

I also found that reset causes performance issues that grow with the number of workers and metrics:
for gauges, we reset before collecting, as in https://github.com/Kong/kong/blob/master/kong/plugins/prometheus/exporter.lua#L363
For example, with 1000 metrics:
1. worker #1 handles a collect: it deletes the old 1000 metrics from shared memory and adds 1000 new ones
2. worker #2 then handles a collect: it first syncs the 2000 key index changes (deletions plus additions) made by worker #1, then deletes its old 1000 metrics and adds 1000 new ones
3. worker #3 handles a collect: it syncs the 4000 changes made by workers #1 and #2, then deletes its old 1000 metrics and adds 1000 new ones
4. by the time worker #n handles a collect, it has to sync the (n-1)*2000 changes made by workers #1 through #n-1 before deleting its old 1000 metrics and adding 1000 new ones
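
For context, the "reset before collect" pattern being described looks roughly like this (a simplified sketch, not Kong's actual code; gather_current_values is a placeholder):

  -- Simplified sketch of resetting a gauge on every scrape:
  local function on_scrape()
    shared_gauge:reset()                            -- deletes every existing time series
    for _, v in ipairs(gather_current_values()) do  -- placeholder for collection-time data
      shared_gauge:set(v.value, v.labels)           -- re-creates every series on each scrape
    end
    prometheus:collect()
  end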


Ashion commented Apr 11, 2024


knyar commented Apr 11, 2024

> for gauges, we reset before collecting, as in https://github.com/Kong/kong/blob/master/kong/plugins/prometheus/exporter.lua#L363

Deleting time series (or resetting a metric completely) is quite expensive, because a bunch of worker-local state needs to be flushed or re-synced. Doing it on every collection is wild.

If you are exporting gauges that are only set at collection time, you probably don't need a metric library at all; you can just ngx.print the metrics in Prometheus format right after calling prometheus:collect(). Kong's LOCAL_STORAGE achieves something similar by using a worker-local table to store values, but I am not quite sure why keeping values around is even useful, given that the next metric collection request can hit another worker.
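
A rough sketch of that alternative inside the metrics endpoint handler (the gauge name and the check_upstream helper are made up for illustration):

  -- In the /metrics content handler:
  prometheus:collect()  -- emit everything tracked by the library as usual

  -- Collection-time gauges can then be printed directly in Prometheus text
  -- format, with no library bookkeeping and therefore no reset needed:
  ngx.print("# TYPE upstream_reachable gauge\n")
  ngx.print(string.format("upstream_reachable %d\n", check_upstream() and 1 or 0))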

A better solution here would be to implement support for callback metrics - I've created #170 to track that separately. Let's keep this issue focused on the specific bug with metric resets.
