
wordpress container restarts #138

Open
vmpjdc opened this issue Sep 27, 2023 · 6 comments

@vmpjdc (Contributor) commented Sep 27, 2023

Bug Description

We get frequent alerts due to the wordpress container restarting. This was addressed by #135 and an upgrade to r46 of the charm, but either this didn't solve the problem or a new one has arisen.

To Reproduce

Deploy the charm.
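
For completeness, a minimal deployment sketch (the channel, database charm, and namespace below are assumptions, not our exact production setup):

# Hypothetical reproduction; channel and relation names are assumptions.
juju deploy wordpress-k8s --channel latest/stable
juju deploy mysql-k8s --trust
juju integrate wordpress-k8s mysql-k8s:database
# Then watch the RESTARTS column climb over time:
kubectl get pods -n <model-namespace> -w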

Environment

prod-is-external-kubernetes@is-bastion-ps5

Relevant log output

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl logs -n prod-admin-insights-ubuntu-k8s wordpress-k8s-0 wordpress -p | tail -n 20
2023-09-27T01:21:43.634Z [wordpress] 10.86.80.0 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:43.916Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:44.054Z [wordpress] 10.86.64.1 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:44.915Z [wordpress] 10.86.93.0 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.652Z [wordpress] 10.86.64.1 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.828Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.953Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:44 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:46.165Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:44 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:47.107Z [wordpress] 127.0.0.1 - - [27/Sep/2023:01:21:47 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T01:21:47.297Z [wordpress] 10.86.73.0 - - [27/Sep/2023:01:21:45 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:50.450Z [wordpress] 127.0.0.1 - - [27/Sep/2023:01:21:50 +0000] "GET /index.php HTTP/1.1" 301 217 "-" "Go-http-client/1.1"
2023-09-27T01:21:51.625Z [wordpress] localhost - - [27/Sep/2023:01:21:51 +0000] "GET /server-status/?auto HTTP/1.1" 200 753 "-" "Go-http-client/1.1"
2023-09-27T01:21:53.037Z [pebble] Check "wordpress-ready" failure 3 (threshold 3): Get "http://localhost/": context deadline exceeded
2023-09-27T01:21:53.037Z [pebble] Check "wordpress-ready" failure threshold 3 hit, triggering action
2023-09-27T01:21:55.757Z [wordpress] [Wed Sep 27 01:21:55.757389 2023] [php7:notice] [pid 333] [client 10.86.73.0:34576] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777770', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:55.757Z [wordpress] [Wed Sep 27 01:21:55.757649 2023] [php7:notice] [pid 339] [client 10.86.80.0:56874] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777771', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:55.762Z [wordpress] [Wed Sep 27 01:21:55.761759 2023] [php7:notice] [pid 340] [client 10.86.80.0:56882] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777771', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:57.395Z [pebble] Exiting on terminated signal.
2023-09-27T01:21:57.399Z [pebble] Stopping all running services.
2023-09-27T01:21:57.525Z [pebble] Service "wordpress" stopped
prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl logs -n prod-admin-insights-ubuntu-k8s wordpress-k8s-1 wordpress -p | tail -n 20
2023-09-27T00:56:09.429Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 15465 "-" "haproxy/httpchk"
2023-09-27T00:56:09.561Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:55:49 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.597Z [wordpress] localhost - - [27/Sep/2023:00:56:09 +0000] "GET /server-status/?auto HTTP/1.1" 200 765 "-" "Go-http-client/1.1"
2023-09-27T00:56:09.636Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.638Z [wordpress] 10.86.73.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.749Z [wordpress] 10.86.11.1 - - [27/Sep/2023:00:56:04 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.954Z [wordpress] 10.86.80.0 - - [27/Sep/2023:00:56:04 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:10.081Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:10 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T00:56:10.175Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:05 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:11.090Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:11 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T00:56:11.327Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:11.499Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:10 +0000] "GET /index.php HTTP/1.1" 301 217 "-" "Go-http-client/1.1"
2023-09-27T00:56:11.573Z [wordpress] 10.86.73.0 - - [27/Sep/2023:00:56:06 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:12.160Z [wordpress] 10.86.11.1 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:12.507Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:13.214Z [pebble] Check "wordpress-ready" failure 3 (threshold 3): Get "http://localhost/": context deadline exceeded
2023-09-27T00:56:13.214Z [pebble] Check "wordpress-ready" failure threshold 3 hit, triggering action
2023-09-27T00:56:13.513Z [pebble] Exiting on terminated signal.
2023-09-27T00:56:13.518Z [pebble] Stopping all running services.
2023-09-27T00:56:13.678Z [pebble] Service "wordpress" stopped
prod-is-external-kubernetes@is-bastion-ps5:~$

Additional context

No response

@weiiwang01 (Collaborator) commented:

Based on the log, it appears that the resource-centre plugin is causing a deadlock in the database. Could you please forward this information to the maintainer of resource-centre for their input as well?
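
For reference, the database side's view of the most recent deadlock can usually be pulled with something like this (host and credentials are placeholders):

# Print InnoDB's latest detected deadlock; requires the PROCESS privilege.
mysql -h <db-host> -u <admin-user> -p -e 'SHOW ENGINE INNODB STATUS\G' \
  | sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS/p'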

@vmpjdc (Contributor, Author) commented Sep 27, 2023

I believe the plugin is owned by the web and design team, and I've started a thread here: https://chat.canonical.com/canonical/pl/k7quc4dqipnhjgstu5xkz1mmty

I think I've taken this as far as I reasonably can: IS doesn't own the deployment or the site being deployed, and the cloud and the k8s cluster, which we do own, appear to be working correctly.

I've silenced this for a week in Alertmanager, so from my perspective there's no rush to progress this.
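
Roughly the silence I created, for the record (the alert name is from memory and may not be exact):

# Sketch: one-week silence via amtool; matcher values are assumptions.
amtool silence add alertname=KubePodRestarting \
  namespace=prod-admin-insights-ubuntu-k8s \
  --comment="wordpress container restarts, issue #138" --duration=7d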

@weiiwang01 (Collaborator) commented:

Thanks, we will follow up with the web team if this happens again.

@ben-ballot commented:

FYI, we have this alert constantly.

We limited the alert so that it fires only when the container restarts 3 times within 10 minutes (which means the liveness probe failed often enough to trigger 3 restarts).
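
Roughly the expression we alert on, checked ad hoc here (the Prometheus address and label names are assumptions):

# kube-state-metrics restart counter over a 10-minute window.
curl -sG http://<prometheus-host>:9090/api/v1/query --data-urlencode \
  'query=increase(kube_pod_container_status_restarts_total{container="wordpress"}[10m]) >= 3'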

It happened twice today, with logs similar to those described in the original bug.

I can see the liveness check is rather "aggressive":

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s describe pods wordpress-k8s-1 | grep Liveness
    Liveness:       http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Liveness:       http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1

Indeed, the charm's Pebble layer check (check 0) results in this:

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s get statefulset wordpress-k8s -o json | jq '.spec | .template| .spec | .containers | .[].livenessProbe'
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38812,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38813,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}

This is probably a bit too aggressive: timeoutSeconds would be better at 3 or 5 seconds, and increasing periodSeconds accordingly seems like a good idea. So I would probably set "timeout: 3" or "timeout: 5" and "period: 10", or make this configurable.

Also, the "threshold" setting is misleading: it apparently only affects the "successThreshold", not the "failureThreshold" (undefined above), which defaults to 3. All in all, I feel the aggressiveness of the check is probably what is causing the failures.

I'm going to silence it until Friday so you have time to look at it.

@weiiwang01 (Collaborator) commented:

> FYI, we have this alert constantly. […] All in all, I feel the aggressiveness of the check is probably what is causing the failures.

The health checks in the k8s charms are controlled by Pebble, and the check parameters on the Kubernetes side are actually for the Pebble server itself. Therefore, the small failure threshold and timeout apply to Pebble's health API requests, not to the WordPress health check requests. The actual health check parameters for WordPress are defined as you mentioned here, with a timeout of 5 seconds and a failure threshold of 3 (the default).
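
For example, the effective check configuration and its current status can be inspected from Pebble inside the workload container, along these lines (the socket and binary paths are the usual sidecar-charm defaults and may differ):

# Show the combined Pebble plan, including the wordpress-ready check.
juju ssh --container wordpress wordpress-k8s/0 \
  PEBBLE_SOCKET=/charm/container/pebble.socket /charm/bin/pebble plan
# List check status and failure counts.
juju ssh --container wordpress wordpress-k8s/0 \
  PEBBLE_SOCKET=/charm/container/pebble.socket /charm/bin/pebble checks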

Do you have any monitoring information that you can share with us? For example, there's a request duration Prometheus metric which can indicate whether the WordPress server is running slowly, and perhaps there are WordPress Apache logs related to the failure in Loki?
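
For instance, an ad-hoc query along these lines (the metric comes from the Apache exporter; the Prometheus address and labels are placeholders):

# Request rate per pod from the charm's Apache exporter metrics.
curl -sG http://<prometheus-host>:9090/api/v1/query --data-urlencode \
  'query=rate(apache_accesses_total{juju_application="wordpress-k8s"}[5m])'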

@ben-ballot commented:

Here is an extract from a failure that happened today and the relevant Apache logs around it:
