Reduce resource consumption when using file discovery with same file in multiple probes #634

benner · 2021-07-21T19:05:52Z

Reduce resource consumption when using file discovery with same file in multiple probes

Setup:
Multiple HTTP probes with same file_targets {file_path} (generated from consul-template) and different filter{} to for each probe to select related targets (similar approach as with run_on {}), .textpb file is about 220 MiB in size with more than 1M resource{}s, cloudprober.cfg has more than 500 probes.

Problem:
During start all probes launch file.refresh() and it kills almost any machine. Looking at code - all probes simultaneous opens same file and do .Unmarshal() in same time. This is memory and CPU heavy operation. Later on everything normalized due random delays and GC.

Description of what you want to happen and what problem will it solve for you.

I tried few different ways like putting global file mutex which serializes all processing but then it takes ages to gather targets for all probes. Doing similar randomization as in refresh loop will help too but same as with serial processing it will take ages. I also tried similar approach as described here: https://firas.dev.sy/post/go/2020/making-timed-maps-golang/ but simpler and with extended API with .LoadOrStore() as in sync.Map for reserving computation (file reading to avoid thundering herd problem), two maps - for reservation and another one for caching Unmarshal()'ed resources (before filtering). I got some promising results but don't have working/publishable code yet.

Some numbers I've measured (no concurrency):

loading file - almost instant
.Unmarshal() - ~50s
filter targets for probe - ~1s

BTW, I have everything working with all probes and targets when I put everything statically in main configuration file.

The text was updated successfully, but these errors were encountered:

manugarg · 2021-07-21T19:24:25Z

@benner If all probes are using the same targets you can consider using shared_targets.

Sorry, it's not documented, but here is what you can do:

Add the following config stanza at the config file level:

shared_targets {
  name: "my_file_targets"

  targets {
    file_targets {
      ..
    }
  }
}

and, then at the probe level:

probe {
  ...
  targets {
    shared_targets: "my_file_targets"
  }
}

This way we create only one target structure in memory .. and hence only one file refresh loop.

manugarg · 2021-07-21T19:28:20Z

Ah, but you're using different filters. Can these filters be a regex on target names?

targets {
  shared_targets: "my_file_targets"
  regex: ".."
}

benner · 2021-07-21T19:37:19Z

I saw shared_targets or regex filtering nut in my case all probes have different targets. Now I'm using label filters.

benner · 2021-07-21T19:45:39Z

Example of cfg:

resource {                                                                              
name: "<target>"                                                                  
labels {                                                                                
    key: "probe"                                                                        
    value: "<probe name>-<HTTPS|HTTP>"                                        
  }                                                                                     
}

        file_targets {                                                                  
          file_path: "/tmp/targets.textpb"               
          filter {                                                                      
            key: "labels.probe",                                                        
            value: "<probe name>-HTTP"                                  
          }                                                                             
        }

manugarg · 2021-07-21T19:46:56Z

I see. I guess rds_targets semantics would have been better for file_targets as well. RDS server could run in the same process (all transparent to the probes or configuration), and all probes could just rds_targets.

From configuration standpoint, it will look like this:

probe {
  ...
  targets{
    rds_target {                                                                  
       resource_path: "file://<file_name>"
          filter {                                                                      
             key: "labels.probe",                                                        
             value: "<probe name>-HTTP"                                  
          }                                                                             
        }
      }
   }
...
}

...

rds_server {
  provider {
    id: "file"
    file_config {
      file_path: ....
    }
  }
}

Does that make sense at all? I can try implementing it. It may take a few days to get it reviewed and submitted though.

benner · 2021-07-21T19:48:41Z

I will look into that

manugarg · 2021-07-21T20:05:19Z

@benner, fyi: I am also looking at adding the "file" provider to RDS. I'll try to create a PR soonish, so that you can test with it if you want.

manugarg · 2021-07-22T06:43:00Z

I still need to get changed reviewed internally, but here is the proposed change: #635 (branch rds_file).

With this change you can do the following:

Add the following stanza to your cloudprober.cfg:

rds_server {
  provider {
    file_config {
      file_path: "/tmp/targets.json" 
      re_eval_sec: 60
    }
  }

Then specify targets in probes like this:

probe {
  ...
  targets{
    rds_target {                                                                  
       resource_path: "file://"
       filter {                                                                      
         key: "labels.probe",                                                        
         value: "<probe name>-HTTP"                                  
       }                                                                             
    }
  }
  ...
}

If you'd like to test, you can download pre-built binaries from here: linux, all.

benner · 2021-07-23T14:15:42Z

I tried rds_target and it works as expected or I wanted to achieve. What I found missing or did not found how to change is rds.ClientConf.re_eval_sec which has default of 30s (with this freshness Cloudprober grows to ~40 GiB instead of 14 GiB in static version).

From what I see rds.ClientConf.re_eval_sec is not reachable in configuration file:

% rg ClientConf --glob '*.proto'
rds/client/proto/config.proto
18:// ClientConf represents resource discovery service (RDS) based targets.
20:message ClientConf {

targets/proto/targets.proto
22:  optional rds.ClientConf.ServerOptions rds_server_options = 1;
129:  optional rds.ClientConf.ServerOptions rds_server_options = 4;

targets/lameduck/proto/config.proto
43:  optional rds.ClientConf.ServerOptions rds_server_options = 6;

Only inner ServerOptions can be configured.

manugarg · 2021-07-23T17:32:33Z

Good point regarding not being to control client side re_eval_sec through rds_targets. We can fix it, but I am not sure if it will help a lot. Client side refreshing causes data copy between server and client, but is filtered data.

To change refresh interval on the server side (where the file is actually reloaded), you can modify your rds_server config:

rds_server {
  provider {
    file_config {
      file_path: "/tmp/targets.json" 
      re_eval_sec: 60  <-- here
    }
}

I am wondering if 14GB vs 40GB difference is only due to temporary objects in the memory or something else is going on. At the time of refresh, new objects are created, and old objects should automatically be garbage collected.

I've thought about implementing some improvements to RDS protocol to track if resources have changed and signal to the client in case refresh is not needed, but I've largely avoided that as it may make things more complex and may introduce new failure modes.

Was it 14GB with file_targets or targets directly embedded in the config file?

benner · 2021-07-23T18:33:46Z

I am wondering if 14GB vs 40GB difference is only due to temporary objects in the memory or something else is going on.

I think it is directly related. I actually did some heap dumps (pporf) during RSS growth and looked at .svg.

Attaching one when RSS was 5G (less probes). I don't have svg now when it was 40G but situation was the same but numbers was bigger.

Next week I'll redo the test with hands changed and recompiled rds.ClientConf.re_eval_sec.

Ether way, situation with rds files solution is like 5-6 times better than it was before :-)

benner · 2021-07-23T18:45:15Z

Was it 14GB with file_targets or targets directly embedded in the config file?

Yes

manugarg · 2021-07-23T21:32:54Z

Thanks for sharing the heap profile. Few observations:

Large amount of memory is being used while parsing the textproto file - 60% of the total (earlier you were using JSON right?).
13% is going in runtime.malg -- that's basically for goroutines. I think these must be HTTP probe goroutines.
10% + 3.54% is being used for running probes. Interestingly ListEndpoints is using only 1.64% of the heap.
fileLister.ListResources is using 8.4% of the heap space.

Thoughts:

Parsing textproto is obviously the most expensive operation -- it seems that it uses a lot of intermediate storage for doing the TextProto->Proto conversion. One thing that can improve this usage is if we somehow parse text-proto one resource at a time and re-use intermediate data structures. Not sure if it's possible to do it in textproto (I can think of how we can do something like this in JSON, using json.RawMessage to do lazy decoding). If we can do it, it will improve the memory usage, but may increase compute time a little bit, though hopefully not by much, because behind the scene, TextProto->Proto decoder is also parsing one message at a time.
Goroutine heap usage and probes' heap usage is largely unavoidable.
We can likely improve fileLister.ListResources's performance. Currently we are allocating an array of size as big as the total number of resources (to avoid doing too many allocs) but that may be a wrong strategy. We may start with a min size, let's say 10 or even 100, and then let append handle the growth of the slice.

I'll look at implementing 1 (at least for JSON) & 3.

benner · 2021-07-24T06:18:46Z

I was always using TextProto

benner · 2021-07-27T09:38:06Z

Tested full configuration with default 30s and with 60min

rds/client/client.go:
+	// reEvalInterval := time.Duration(client.c.GetReEvalSec()) * time.Second
+	reEvalInterval := 30 * 60 * time.Second

with 60 min re_eval_sec it works as expected. RSS grows almost the same as with static configuration.

For default 30s interval main reason for memory usage is big number of probes. In my setups it means RDS server receives around 3 requests per second. Result: all CPUs are busy (20 logical CPUs) and memory grows to ~40G.

30s ~900 probes:

60 min ~900 probes:

One drawback with longer re_eval_sec is that all probing can be delayed up to same interval for actual probing but I can live with that because whole point of that is to keep Cloudprober without restarting :-)

To avoid breaking existing users, continuing providing "file_targets" (directly), but implement them using RDS file targets provider. Main reasoning behind this RDS provider is to allow multiple probes to use the same targets file without each probe having to load the same file again and again. Ref: #634 PiperOrigin-RevId: 387196102

To avoid breaking existing users, continuing providing "file_targets" (directly), but implement them using RDS file targets provider. Main reasoning behind this RDS provider is to allow multiple probes to use the same targets file without each probe having load the same file again and again. Ref: #634 PiperOrigin-RevId: 387196102

manugarg · 2021-07-27T22:09:45Z

Interesting. Thanks again for the heap profiles @benner!

So ListResources should get better with the change that I just committed in the main branch: #640. There was inefficient resource allocation in the previous change.

Another improvement I am planning to make is to reload file "only if" it has been modified since the last load. With that you won't have to worry about re_eval_sec so much.

manugarg · 2021-07-28T03:30:14Z

Even though resources allocation efficiency mentioned in the previous update will help quite a bit, I think #641 will have the most impact for a scenario like yours.

Ref: #634 PiperOrigin-RevId: 387390152

Ref: #634. PiperOrigin-RevId: 387655415

manugarg · 2021-07-29T21:28:07Z

With all the changes so far, I think you should see substantial improvement in performance (much better than the default case earlier). You can even keep the re_eval_sec very low in RDS file provider config -- it won't reload file unless it has changed.

Next up is #641. I'm working on it.

If request contains if_modified_since field, use that field to decide whether to send all the resources in the response or not. We always returns last_modified in the response. This timestamp corresponds to when lister's cache was refreshed from the actual file. Ref: #641 & #634 PiperOrigin-RevId: 388502695

manugarg · 2021-08-03T19:05:05Z

@benner, I think we should be good here now. You should be able to use lower re_eval_sec (30s) without any performance impact. Until you update the targets file, no new large data allocation should happen.

Let me know how it goes.

benner · 2021-08-06T12:36:20Z

Thank you. Tested for few hours. Everything seems even better.

P.S. I tested on my laptop with full config but without actual HTTP requests. After vocation I will test on real system.

manugarg · 2021-08-06T17:53:26Z

Awesome. Great to hear. Thanks for checking!

benner · 2021-09-22T10:10:23Z

Tested in semi-prod. Now it has more that 2M targets. Some stats (from Cloudprober exported metrics):

manugarg · 2021-09-22T17:51:00Z

@benner Awesome, great to see.

benner added the enhancement label Jul 21, 2021

manugarg mentioned this issue Jul 27, 2021

Add an RDS provider for file-based targets. #639

Closed

manugarg mentioned this issue Jul 27, 2021

Add an RDS provider for file-based targets. #640

Merged

manugarg mentioned this issue Jul 28, 2021

Implement caching option in RDS protocol #641

Closed

manugarg added a commit that referenced this issue Jul 28, 2021

RDS file provider: Minor improvements to resource allocation.

9655a0b

Ref: #634 PiperOrigin-RevId: 387390152

manugarg mentioned this issue Jul 28, 2021

RDS file provider: Minor improvements to resource allocation. #642

Merged

manugarg added a commit that referenced this issue Jul 28, 2021

RDS file provider: Minor improvements to resource allocation. (#642)

b2710ef

Ref: #634 PiperOrigin-RevId: 387390152

manugarg mentioned this issue Jul 29, 2021

RDS file provider: Reload file only if it has changed. #643

Merged

manugarg added a commit that referenced this issue Jul 29, 2021

RDS file provider: Reload file only if it has changed. (#643)

cdccef4

Ref: #634. PiperOrigin-RevId: 387655415

manugarg added this to the v0.11.3 milestone Jul 30, 2021

manugarg mentioned this issue Aug 3, 2021

Add RDS-cache support to file provider. #649

Merged

manugarg closed this as completed Aug 3, 2021

manugarg reopened this Aug 4, 2021

manugarg self-assigned this Aug 5, 2021

benner closed this as completed Aug 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce resource consumption when using file discovery with same file in multiple probes #634

Reduce resource consumption when using file discovery with same file in multiple probes #634

benner commented Jul 21, 2021

manugarg commented Jul 21, 2021 •

edited

Loading

manugarg commented Jul 21, 2021

benner commented Jul 21, 2021

benner commented Jul 21, 2021

manugarg commented Jul 21, 2021 •

edited

Loading

benner commented Jul 21, 2021

manugarg commented Jul 21, 2021

manugarg commented Jul 22, 2021

benner commented Jul 23, 2021

manugarg commented Jul 23, 2021

benner commented Jul 23, 2021

benner commented Jul 23, 2021

manugarg commented Jul 23, 2021 •

edited

Loading

benner commented Jul 24, 2021

benner commented Jul 27, 2021

manugarg commented Jul 27, 2021

manugarg commented Jul 28, 2021 •

edited

Loading

manugarg commented Jul 29, 2021

manugarg commented Aug 3, 2021

benner commented Aug 6, 2021

manugarg commented Aug 6, 2021

benner commented Sep 22, 2021

manugarg commented Sep 22, 2021

Reduce resource consumption when using file discovery with same file in multiple probes #634

Reduce resource consumption when using file discovery with same file in multiple probes #634

Comments

benner commented Jul 21, 2021

manugarg commented Jul 21, 2021 • edited Loading

manugarg commented Jul 21, 2021

benner commented Jul 21, 2021

benner commented Jul 21, 2021

manugarg commented Jul 21, 2021 • edited Loading

benner commented Jul 21, 2021

manugarg commented Jul 21, 2021

manugarg commented Jul 22, 2021

benner commented Jul 23, 2021

manugarg commented Jul 23, 2021

benner commented Jul 23, 2021

benner commented Jul 23, 2021

manugarg commented Jul 23, 2021 • edited Loading

benner commented Jul 24, 2021

benner commented Jul 27, 2021

manugarg commented Jul 27, 2021

manugarg commented Jul 28, 2021 • edited Loading

manugarg commented Jul 29, 2021

manugarg commented Aug 3, 2021

benner commented Aug 6, 2021

manugarg commented Aug 6, 2021

benner commented Sep 22, 2021

manugarg commented Sep 22, 2021

manugarg commented Jul 21, 2021 •

edited

Loading

manugarg commented Jul 21, 2021 •

edited

Loading

manugarg commented Jul 23, 2021 •

edited

Loading

manugarg commented Jul 28, 2021 •

edited

Loading