docs: add an FAQ item for apisix high latency due to etcd
Signed-off-by: hansedong <[email protected]>
hansedong committed Sep 14, 2022
1 parent 2a44ba6 commit 9205729
Showing 2 changed files with 105 additions and 0 deletions.
53 changes: 53 additions & 0 deletions docs/en/latest/FAQ.md
@@ -626,6 +626,59 @@ This method only detects whether the APISIX data plane is alive or not. It does

:::

## What causes high APISIX latency related to ETCD, and how can it be fixed?

ETCD is the data storage component of APISIX, so its stability directly affects the stability of APISIX.

In practice, if APISIX connects to ETCD over HTTPS using a certificate, the following two high-latency problems may occur when querying or writing data:

1. Querying or writing data through the APISIX Admin API is slow.
2. In monitoring scenarios, Prometheus times out when scraping the metrics API of the APISIX data plane.

These latency problems seriously affect the service stability of APISIX. They occur mainly because ETCD provides two modes of operation, HTTP (HTTPS) and gRPC, and APISIX uses the HTTP (HTTPS) protocol to operate ETCD.
In this scenario, ETCD has a bug related to HTTP/2: when ETCD is operated over HTTPS (HTTP is not affected), the number of concurrent HTTP/2 streams is capped at Golang's default of `250`. Therefore, when there are many APISIX data plane nodes, once the number of concurrent requests from all APISIX nodes to ETCD exceeds this limit, the APISIX API responds very slowly.

In Golang, the default limit on concurrent HTTP/2 streams (`defaultMaxStreams`) is `250`, as the following code shows:

```go
// Excerpt from the server side of Go's http2 package.
package http2

import "time"

const (
	prefaceTimeout         = 10 * time.Second
	firstSettingsTimeout   = 2 * time.Second // should be in-flight with preface anyway
	handlerChunkWriteSize  = 4 << 10
	defaultMaxStreams      = 250 // TODO: make this 100 as the GFE seems to?
	maxQueuedControlFrames = 10000
)
```
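
Why hitting a stream cap shows up as latency rather than as an error can be sketched with a toy model. The snippet below is only an illustration, not ETCD or `http2` code: it uses a buffered channel as a counting semaphore, and the names `canServe` and `waitBudget` are hypothetical. It assumes that long-lived requests hold their streams indefinitely, so a new request has to wait once all `250` slots are taken.

```go
package main

import (
	"fmt"
	"time"
)

// waitBudget is how long a short request waits for a free stream in this toy model.
const waitBudget = 20 * time.Millisecond

// canServe reports whether one more request obtains a stream within waitBudget
// while `watchers` long-lived requests already hold streams on a connection
// capped at maxStreams concurrent streams. Hypothetical helper for illustration.
func canServe(maxStreams, watchers int) bool {
	// Buffered channel used as a counting semaphore: one slot per stream.
	slots := make(chan struct{}, maxStreams)

	// Long-lived requests take slots and never give them back.
	for i := 0; i < watchers && i < maxStreams; i++ {
		slots <- struct{}{}
	}

	select {
	case slots <- struct{}{}: // a stream was free
		<-slots
		return true
	case <-time.After(waitBudget): // every stream is taken, so the request stalls
		return false
	}
}

func main() {
	const maxStreams = 250 // the default quoted above
	for _, watchers := range []int{100, 249, 250, 400} {
		fmt.Printf("watchers=%d served=%v\n", watchers, canServe(maxStreams, watchers))
	}
}
```

In this model, requests are served as long as fewer than `maxStreams` slots are held, and stall as soon as the cap is reached, which matches the slow Admin API and Prometheus timeouts described above.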

At present, the ETCD project maintains two main release branches, `3.4` and `3.5`.
On the `3.4` branch, the recently released `3.4.20` fixes this issue.
On the `3.5` branch, version `3.5.5` has been in preparation for a long time but has not been released yet (as of 2022-09-13). So, if you are using an ETCD version earlier than `3.5.5` that does not include the fix, there are several ways to solve this problem:

1. Change the communication method between APISIX and ETCD from HTTPS to HTTP.
2. Downgrade ETCD to `3.4.20`.
3. Clone the ETCD source code and compile the `release-3.5` branch directly (this branch already contains the fix for the HTTP/2 problem, but no new version has been released from it yet).
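
To decide whether any of these workarounds is needed, the version thresholds above can be turned into a small check. This is a minimal sketch, assuming plain `major.minor.patch` version strings (optionally prefixed with `v`); the function name `hasStreamLimitFix` is hypothetical, and it only covers the `3.4` and `3.5` branches discussed here, reporting anything else as unknown.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hasStreamLimitFix reports whether an ETCD version such as "3.5.4" or "v3.4.20"
// is at or above the fixed release on its branch, per the text above:
// 3.4.20 on the 3.4 branch and 3.5.5 on the 3.5 branch.
// known is false for other branches and for input that cannot be parsed
// (including pre-release suffixes such as "-rc.1").
func hasStreamLimitFix(version string) (fixed, known bool) {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")
	parts := strings.SplitN(v, ".", 3)
	if len(parts) != 3 {
		return false, false
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return false, false
		}
		nums[i] = n
	}
	if nums[0] != 3 {
		return false, false
	}
	switch nums[1] {
	case 4:
		return nums[2] >= 20, true
	case 5:
		return nums[2] >= 5, true
	default:
		return false, false
	}
}

func main() {
	for _, v := range []string{"3.4.19", "3.4.20", "3.5.4", "3.5.5", "3.3.27"} {
		fixed, known := hasStreamLimitFix(v)
		fmt.Printf("%s fixed=%v known=%v\n", v, fixed, known)
	}
}
```

The version string itself would typically come from `etcd --version` or from the ETCD cluster status. How it is obtained is left out of the sketch.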

ETCD can be recompiled as follows:

```shell
git clone https://github.com/etcd-io/etcd.git
cd etcd
git checkout release-3.5
make GOOS=linux GOARCH=amd64
```

The compiled binary is in the `bin` directory. Replace the ETCD binary in your server environment with it, and then restart ETCD.

For related issues and PRs, see:

- https://github.com/etcd-io/etcd/issues/14185
- https://github.com/apache/apisix/issues/7078
- https://github.com/apache/apisix/issues/7353
- https://github.com/etcd-io/etcd/pull/14169

## Where can I find more answers?

You can find more answers on:
52 changes: 52 additions & 0 deletions docs/zh/latest/FAQ.md
@@ -627,6 +627,58 @@ curl http://127.0.0.1:9180/apisix/admin/routes/health-info \

:::

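## What are the high-latency problems between APISIX and ETCD, and how can they be fixed?

ETCD is the data storage component of APISIX, so its stability directly affects the stability of APISIX. In practice, if APISIX connects to ETCD over HTTPS using a certificate, the following two high-latency problems may occur when querying or writing data:

1. Querying or writing data through the APISIX Admin API is slow.
2. In the monitoring system, Prometheus times out when scraping the metrics API of the APISIX data plane.

These latency problems seriously affect the service stability of APISIX. They occur mainly because ETCD provides two modes of operation, HTTP (HTTPS) and gRPC, and APISIX operates ETCD over the HTTP (HTTPS) protocol.
In this scenario, ETCD has a bug related to HTTP/2: when ETCD is operated over HTTPS (HTTP is not affected), the number of concurrent HTTP/2 streams is capped at Golang's default of `250`.
Therefore, when there are many APISIX data plane nodes, once the number of concurrent requests from all APISIX nodes to ETCD exceeds this limit, the APISIX API responds very slowly.

In Golang, the default limit on concurrent HTTP/2 streams (`defaultMaxStreams`) is `250`, as the following code shows: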

```go
// Excerpt from the server side of Go's http2 package.
package http2

import "time"

const (
	prefaceTimeout         = 10 * time.Second
	firstSettingsTimeout   = 2 * time.Second // should be in-flight with preface anyway
	handlerChunkWriteSize  = 4 << 10
	defaultMaxStreams      = 250 // TODO: make this 100 as the GFE seems to?
	maxQueuedControlFrames = 10000
)
```

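At present, the ETCD project maintains two main release branches, `3.4` and `3.5`.
On the `3.4` branch, the recently released `3.4.20` fixes this issue.
On the `3.5` branch, version `3.5.5` has been in preparation for a long time but has not been released yet (as of 2022-09-13). So, if you are using an ETCD version earlier than `3.5.5` that does not include the fix, there are several ways to solve this problem:

1. Change the communication method between APISIX and ETCD from HTTPS to HTTP.
2. Downgrade ETCD to `3.4.20`.
3. Clone the ETCD source code and compile the `release-3.5` branch directly (this branch already contains the fix, but no new version has been released from it yet).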

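ETCD can be recompiled as follows:

```shell
git clone https://github.com/etcd-io/etcd.git
cd etcd
git checkout release-3.5
make GOOS=linux GOARCH=amd64
```

The compiled binary is in the `bin` directory. Replace the ETCD binary in your server environment with it, and then restart ETCD.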

For related issues and PRs, see:

- https://github.com/etcd-io/etcd/issues/14185
- https://github.com/apache/apisix/issues/7078
- https://github.com/apache/apisix/issues/7353
- https://github.com/etcd-io/etcd/pull/14169

## Where can I get more help if I run into problems while using APISIX?

- [Apache APISIX Slack Channel](/docs/general/join/#加入-slack-频道): after joining, select the channel-apisix channel to ask questions about APISIX.