ResourceExhausted error when creating many HTTPPipelines and HTTPServer paths #413

samutamm · 2021-12-16T04:57:16Z

Describe the bug
When creating 1000 dummy HTTPPipelines and one HTTPServer with one backend rule for each pipeline, Easegress server logs the following error: 2021-12-16T03:40:40.8Z ERROR statussynccontroller/statussynccontroller.go:217 sync status failed: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2274303 vs. 2097152)

To Reproduce
Steps to reproduce the behavior:

1. Generate 1000 identical HTTPPipeline configurations with unique names (name: pipeline-$i) by running the second script provided below: bash generate_pipeline.sh > pipelines.yaml
2. Generate one HTTPServer configuration with 1000 rules by running the first script provided below: bash generate_server.sh > httpserver.yaml
3. Start Easegress server: bin/easegress-server
4. Create the pipelines: bin/egctl object create -f pipelines.yaml
5. Create the HTTPServer: bin/egctl object create -f httpserver.yaml
6. Easegress server fails in the middle of object creation.

Expected behavior
Easegress-server should not fail when creating (many) objects.

Version
1.4.0

Configuration

Easegress Configuration
Default parameters.
HTTP server configuration
The following bash script generates the HTTPServer configuration:

#!/bin/bash
echo "
kind: HTTPServer
name: server-demo
port: 10080
keepAlive: true
https: false
maxConnections: 10240
rules:
  - paths:
"
for i in {0..1000..1}
   do
      echo "    - pathPrefix: /pipeline$i
      backend: pipeline-$i"
done

Pipeline Configuration
The following bash script generates the pipeline configurations:

#!/bin/bash
for i in {0..1000..1}
   do
      echo "name: pipeline-$i
kind: HTTPPipeline
flow:
  - filter: proxy
filters:
  - kind: Proxy
    name: proxy
    mainPool:
      loadBalance:
        policy: roundRobin
      servers:
      - url: http://172.20.2.14:9095
      - url: http://172.20.2.160:9095
---"
done

Logs
This is the output of easegress-server when the error happens:

2021-12-16T04:45:22.857Z	INFO	trafficcontroller/trafficcontroller.go:424	create http pipeline default/pipeline-977
2021-12-16T04:45:22.858Z	INFO	trafficcontroller/trafficcontroller.go:424	create http pipeline default/pipeline-989
2021-12-16T04:45:22.858Z	INFO	trafficcontroller/trafficcontroller.go:424	create http pipeline default/pipeline-980
2021-12-16T04:45:22.858Z	INFO	trafficcontroller/trafficcontroller.go:424	create http pipeline default/pipeline-966
2021-12-16T04:45:22.859Z	INFO	trafficcontroller/trafficcontroller.go:424	create http pipeline default/pipeline-974
2021-12-16T04:46:05.821Z	ERROR	statussynccontroller/statussynccontroller.go:217	sync status failed: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2274327 vs. 2097152)
2021-12-16T04:46:10.775Z	ERROR	statussynccontroller/statussynccontroller.go:217	sync status failed: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2274327 vs. 2097152)
2021-12-16T04:46:15.795Z	ERROR	statussynccontroller/statussynccontroller.go:217

OS and Hardware

OS: Ubuntu 20.04
CPU: Intel(R) Xeon(R)
Memory: 15GB


suchen-sci · 2021-12-16T08:13:04Z

It seems that the etcd client's default maximum send message size is 2.0 MiB (2097152 bytes), and we are sending messages larger than that. According to https://github.com/etcd-io/etcd/blob/e2d67f2e3bfa6f72178e26557bb22cc1482c418c/client/v3/config.go, MaxCallSendMsgSize is the client-side request send limit in bytes. If 0, it defaults to 2.0 MiB (2 * 1024 * 1024). Make sure that "MaxCallSendMsgSize" < server-side default send/recv limit ("--max-request-bytes" flag to etcd or "embed.Config.MaxRequestBytes").

https://etcd.io/docs/v3.1/upgrades/upgrade_3_3/ has an example:

// client writes exceeding "MaxCallSendMsgSize" will be rejected from client-side
_, err = cli.Put(ctx, "foo", strings.Repeat("a", 5*1024*1024))
err.Error() == "rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (5242890 vs. 2097152)"

I am not yet sure whether this limit applies to a whole transaction or to a single operation (I think it applies to a single operation).

One way to solve this problem is to add an option to allow users to change "MaxCallSendMsgSize" for etcd client. And etcd recommendedMaxRequestBytes is 10 MiB (10 * 1024 * 1024).

Any idea here?
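To put the numbers in this comment in perspective, here is a minimal Go sketch of the arithmetic. The constant names and the fits helper are illustrative assumptions, not Easegress or etcd identifiers; only the byte values come from the reported log line and the etcd documentation quoted above.

```go
package main

import "fmt"

const (
	mib              = 1024 * 1024
	defaultSendLimit = 2 * mib  // etcd client default quoted above: 2097152
	recommendedMax   = 10 * mib // recommended maximum quoted above: 10485760
)

// fits reports whether a message of size bytes passes a send limit.
// Assumption: only messages strictly larger than the limit are rejected,
// as the wording "larger than max" in the error suggests.
func fits(size, limit int) bool {
	return size <= limit
}

func main() {
	const observed = 2274303 // message size from the reported error

	fmt.Println("over default limit by:", observed-defaultSendLimit, "bytes")
	fmt.Println("fits in 2 MiB: ", fits(observed, defaultSendLimit))
	fmt.Println("fits in 10 MiB:", fits(observed, recommendedMax))
}
```

The reported message exceeds the default limit by 177151 bytes (about 8%), so a 10 MiB limit would leave ample headroom for this reproduction.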

samutamm · 2021-12-17T05:39:08Z

> One way to solve this problem is to add an option to allow users to change "MaxCallSendMsgSize" for etcd client. And etcd recommendedMaxRequestBytes is 10 MiB (10 * 1024 * 1024).

This is a good idea. That way, downstream applications and end users can choose an appropriate value for MaxCallSendMsgSize, depending on how many objects and filters they have.

Would cluster.maxMessageSize in the server configuration be a good place and name for this setting?
For example:

name: machine-1
cluster-name: easegress-cluster
cluster-role: primary
...
cluster:
  listen-peer-urls:
   - http://HOST1:2380
  listen-client-urls:
   - http://HOST1:2379
  advertise-client-urls:
   - http://HOST1:2379
  initial-advertise-peer-urls:
   - http://HOST1:2380
  initial-cluster:
   - machine-1: http://HOST1:2380
  maxMessageSize: 10MiB  # <-- new line
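
If a human-readable value such as 10MiB were accepted here, it would have to be converted to bytes before being handed to the etcd client. The following is a minimal sketch under that assumption; parseSize and the set of accepted suffixes are hypothetical and do not describe an existing Easegress option format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize converts strings such as "10MiB" or "2097152" into a byte count.
// Hypothetical helper: values without a suffix are treated as plain bytes.
func parseSize(s string) (int, error) {
	s = strings.TrimSpace(s)
	units := []struct {
		suffix string
		factor int
	}{
		{"KiB", 1 << 10},
		{"MiB", 1 << 20},
		{"GiB", 1 << 30},
	}

	factor := 1
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			factor = u.factor
			s = strings.TrimSpace(strings.TrimSuffix(s, u.suffix))
			break
		}
	}

	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("invalid size %q: %w", s, err)
	}
	if n <= 0 {
		return 0, fmt.Errorf("size must be positive, got %d", n)
	}
	return n * factor, nil
}

func main() {
	for _, in := range []string{"10MiB", "2097152", "ten"} {
		n, err := parseSize(in)
		fmt.Println(in, n, err)
	}
}
```

Accepting a plain byte count as well keeps the option compatible with the integer that MaxCallSendMsgSize ultimately expects.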

localvar · 2021-12-17T06:07:17Z

I don't think it is easy for end users to choose an appropriate value for MaxCallSendMsgSize up front.
And for an Easegress cluster deployed in K8s, if the user chooses a bad value at first:

what will the process be to update the configuration to a good one? Will it be very difficult, given that primary Easegress nodes are a StatefulSet?

My idea (I'm not sure whether it is possible yet) is: could we detect this error at runtime, recreate the etcd client with a larger MaxCallSendMsgSize, and then resend the message?

suchen-sci · 2021-12-17T06:52:11Z

This happens really rarely (only with YAML files of thousands of lines). So, when it happens, can we create a temporary etcd client with MaxCallSendMsgSize=10MiB (which allows about 3500+ pipelines in this case) and delete this temporary client when sending is done? Maybe this can simplify the implementation and reduce potential bugs... Is this a good idea?

samutamm · 2021-12-17T07:22:53Z

> what will the process be to update the configuration to a good one? Will it be very difficult, given that primary Easegress nodes are a StatefulSet?

It's not especially difficult; there seem to be a few ways to patch a StatefulSet. But I agree that it's difficult to choose a good initial value.

> This happens really rarely (only with YAML files of thousands of lines). So, when it happens, can we create a temporary etcd client with MaxCallSendMsgSize=10MiB (which allows about 3500+ pipelines in this case) and delete this temporary client when sending is done? Maybe this can simplify the implementation and reduce potential bugs... Is this a good idea?

I have a fancier idea: if we reach the etcd message limit, multiply the limit by a constant (1.25, for example) and try again with the higher limit. Repeat this until this specific ResourceExhausted error no longer occurs.

For example, the user creates the first N objects and the combined object status message size reaches the first limit (2 MiB). We try again with 2.5 MiB, and again, until a limit of about 5 MiB works, because all Easegress objects have been created (for now) and the object status message is only 4 MiB.

Of course, setting the limit very high (10 MiB) right away, or after the first ResourceExhausted error, is easier to implement.
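
The retry scheme described in this comment can be sketched in Go as follows. This is only an illustration of the arithmetic, not Easegress code: growLimit and its parameters are hypothetical names, the 10 MiB ceiling is an assumption taken from the recommended maximum mentioned earlier in the thread, and recreating the etcd client and resending the message are left out.

```go
package main

import "fmt"

const mib = 1024 * 1024

// growLimit multiplies limit by 1.25 (as the integer ratio 5/4) until a
// message of msgSize bytes fits, and reports how many retries were needed.
// It gives up once the next step would exceed maxLimit.
func growLimit(limit, msgSize, maxLimit int) (newLimit, retries int, err error) {
	for msgSize > limit {
		next := limit * 5 / 4
		if next <= limit {
			next = limit + 1 // guard against tiny limits that would never grow
		}
		if next > maxLimit {
			return limit, retries, fmt.Errorf(
				"message of %d bytes does not fit under the %d byte ceiling",
				msgSize, maxLimit)
		}
		limit = next
		retries++
	}
	return limit, retries, nil
}

func main() {
	// The example from the comment: start at 2 MiB, message is 4 MiB.
	limit, retries, err := growLimit(2*mib, 4*mib, 10*mib)
	fmt.Println(limit, retries, err)
}
```

Starting from 2 MiB, the limit steps through 2621440, 3276800 and 4096000 bytes before reaching 5120000 bytes (roughly 4.9 MiB, which the comment rounds to 5 MiB) on the fourth retry.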

localvar · 2021-12-17T08:09:52Z

I like the idea of increasing the size limit at runtime. Is it possible to get the desired limit from the error message (grepping the error message is not a good way because the format may change)? If yes, then we don't need to repeat the process and try again.

But I'm also fine with adding a new configuration item if automatically increasing the limit is too difficult to implement, since we can patch a StatefulSet.

suchen-sci · 2021-12-17T08:18:41Z

How about we calculate the size of the message when the etcd client returns an error? If it exceeds 2.0 MiB, we can create a new etcd client based on the message size and then resend (if the message size is bigger than 10 MiB, we can log an error). But first we need to figure out whether MaxCallSendMsgSize limits the size of a single etcd operation or of a whole transaction (the sum of all operations in a single transaction).
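
A minimal Go sketch of this proposal is shown below. It covers only the limit calculation; limitFor, the rounding to whole MiB, and the constant names are assumptions made for illustration, and msgSize is assumed to be the size of the encoded request rather than of the raw value.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	mib          = 1024 * 1024
	defaultLimit = 2 * mib  // etcd client default send limit
	hardCap      = 10 * mib // above this, only log an error (per the proposal)
)

var errTooLarge = errors.New("message exceeds the 10 MiB cap")

// limitFor returns a send limit that fits a message of msgSize bytes:
// the default when it already fits, otherwise msgSize rounded up to the
// next whole MiB. Messages above hardCap are rejected.
func limitFor(msgSize int) (int, error) {
	if msgSize > hardCap {
		return 0, fmt.Errorf("%w: %d bytes", errTooLarge, msgSize)
	}
	if msgSize <= defaultLimit {
		return defaultLimit, nil
	}
	return (msgSize + mib - 1) / mib * mib, nil
}

func main() {
	for _, size := range []int{1000, 2274303, 11 * mib} {
		limit, err := limitFor(size)
		fmt.Println(size, limit, err)
	}
}
```

The encoded message is slightly larger than the stored value: in the etcd example quoted earlier in the thread, a 5242880-byte value produced a 5242890-byte message. That is why the sketch expects the encoded size as input.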

xxx7xxxx · 2021-12-17T08:24:33Z

Dynamically changing the argument seems too complex to me. Why must we do that instead of providing a bigger default message size?

suchen-sci · 2021-12-17T08:29:45Z

Yeah, maybe we are making this problem too complex. If changing MaxCallSendMsgSize to 10 MiB won't cause problems for other components, just updating MaxCallSendMsgSize to 10 MiB in the Easegress etcd config will fix it. lol

localvar · 2021-12-17T08:47:21Z

Changing parameters dynamically is more user friendly. Even if we add an option, people may raise the same issue later because it is a low-level configuration.
But I agree with you that we should just enlarge the limit statically if doing it dynamically is too difficult and complex.

xxx7xxxx · 2021-12-17T08:59:07Z

When a single message exceeds 10 MiB, that is no longer just a question of dynamic or static adjustment but an architecture-level problem (and we should be happy if it happens, since it means Easegress is running a lot of things at a very large scale).

localvar · 2021-12-17T09:25:19Z

Ok, then I'm fine with a larger default value.

samutamm mentioned this issue Dec 20, 2021

Expose max-sync-message-size in options #419

Merged

localvar closed this as completed in #419 Dec 21, 2021

localvar mentioned this issue Feb 22, 2022

report ETCD metrics #526

Closed

This was referenced Mar 11, 2022

Reduce ETCD memory usage when having large number of HTTPPipelines #541

Closed

Split Sync statuses to smaller objects in etcd #542

Merged
