
wordpress container restarts #138

Open
vmpjdc opened this issue Sep 27, 2023 · 6 comments

@vmpjdc (Contributor) commented Sep 27, 2023

Bug Description

We get frequent alerts due to the wordpress container restarting. This was addressed by #135 and an upgrade to r46 of the charm, but either this didn't solve the problem or a new one has arisen.

To Reproduce

Deploy the charm.
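
For completeness, a minimal deployment sketch (the channel, database charm, and namespace below are assumptions, not our exact production setup):

# Hypothetical reproduction; channel and relation names are assumptions.
juju deploy wordpress-k8s --channel latest/stable
juju deploy mysql-k8s --trust
juju integrate wordpress-k8s mysql-k8s:database
# Then watch the RESTARTS column climb over time:
kubectl get pods -n <model-namespace> -w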

Environment

prod-is-external-kubernetes@is-bastion-ps5

Relevant log output

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl logs -n prod-admin-insights-ubuntu-k8s wordpress-k8s-0 wordpress -p | tail -n 20
2023-09-27T01:21:43.634Z [wordpress] 10.86.80.0 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:43.916Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:44.054Z [wordpress] 10.86.64.1 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:44.915Z [wordpress] 10.86.93.0 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.652Z [wordpress] 10.86.64.1 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.828Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.953Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:44 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:46.165Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:44 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:47.107Z [wordpress] 127.0.0.1 - - [27/Sep/2023:01:21:47 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T01:21:47.297Z [wordpress] 10.86.73.0 - - [27/Sep/2023:01:21:45 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:50.450Z [wordpress] 127.0.0.1 - - [27/Sep/2023:01:21:50 +0000] "GET /index.php HTTP/1.1" 301 217 "-" "Go-http-client/1.1"
2023-09-27T01:21:51.625Z [wordpress] localhost - - [27/Sep/2023:01:21:51 +0000] "GET /server-status/?auto HTTP/1.1" 200 753 "-" "Go-http-client/1.1"
2023-09-27T01:21:53.037Z [pebble] Check "wordpress-ready" failure 3 (threshold 3): Get "http://localhost/": context deadline exceeded
2023-09-27T01:21:53.037Z [pebble] Check "wordpress-ready" failure threshold 3 hit, triggering action
2023-09-27T01:21:55.757Z [wordpress] [Wed Sep 27 01:21:55.757389 2023] [php7:notice] [pid 333] [client 10.86.73.0:34576] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777770', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:55.757Z [wordpress] [Wed Sep 27 01:21:55.757649 2023] [php7:notice] [pid 339] [client 10.86.80.0:56874] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777771', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:55.762Z [wordpress] [Wed Sep 27 01:21:55.761759 2023] [php7:notice] [pid 340] [client 10.86.80.0:56882] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777771', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:57.395Z [pebble] Exiting on terminated signal.
2023-09-27T01:21:57.399Z [pebble] Stopping all running services.
2023-09-27T01:21:57.525Z [pebble] Service "wordpress" stopped
prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl logs -n prod-admin-insights-ubuntu-k8s wordpress-k8s-1 wordpress -p | tail -n 20
2023-09-27T00:56:09.429Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 15465 "-" "haproxy/httpchk"
2023-09-27T00:56:09.561Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:55:49 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.597Z [wordpress] localhost - - [27/Sep/2023:00:56:09 +0000] "GET /server-status/?auto HTTP/1.1" 200 765 "-" "Go-http-client/1.1"
2023-09-27T00:56:09.636Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.638Z [wordpress] 10.86.73.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.749Z [wordpress] 10.86.11.1 - - [27/Sep/2023:00:56:04 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.954Z [wordpress] 10.86.80.0 - - [27/Sep/2023:00:56:04 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:10.081Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:10 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T00:56:10.175Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:05 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:11.090Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:11 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T00:56:11.327Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:11.499Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:10 +0000] "GET /index.php HTTP/1.1" 301 217 "-" "Go-http-client/1.1"
2023-09-27T00:56:11.573Z [wordpress] 10.86.73.0 - - [27/Sep/2023:00:56:06 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:12.160Z [wordpress] 10.86.11.1 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:12.507Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:13.214Z [pebble] Check "wordpress-ready" failure 3 (threshold 3): Get "http://localhost/": context deadline exceeded
2023-09-27T00:56:13.214Z [pebble] Check "wordpress-ready" failure threshold 3 hit, triggering action
2023-09-27T00:56:13.513Z [pebble] Exiting on terminated signal.
2023-09-27T00:56:13.518Z [pebble] Stopping all running services.
2023-09-27T00:56:13.678Z [pebble] Service "wordpress" stopped
prod-is-external-kubernetes@is-bastion-ps5:~$

Additional context

No response

@weiiwang01 (Collaborator) commented:

Based on the log, it appears that the resource-centre plugin is causing a deadlock in the database. Could you please forward this information to the maintainer of resource-centre for their input as well?
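
For reference, the database side's view of the most recent deadlock can usually be pulled with something like this (host and credentials are placeholders):

# Print InnoDB's latest detected deadlock; requires the PROCESS privilege.
mysql -h <db-host> -u <admin-user> -p -e 'SHOW ENGINE INNODB STATUS\G' \
  | sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS/p'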

@vmpjdc (Contributor, Author) commented Sep 27, 2023

I believe the plugin is owned by the web and design team, and I've started a thread here: https://chat.canonical.com/canonical/pl/k7quc4dqipnhjgstu5xkz1mmty

I think I've taken this as far as I reasonably can: IS doesn't own the deployment or the site being deployed, and the cloud and the k8s cluster, which we do own, appear to be working correctly.

I've silenced this for a week in Alertmanager, so from my perspective there's no rush to progress this.
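
Roughly the silence I created, for the record (the alert name is from memory and may not be exact):

# Sketch: one-week silence via amtool; matcher values are assumptions.
amtool silence add alertname=KubePodRestarting \
  namespace=prod-admin-insights-ubuntu-k8s \
  --comment="wordpress container restarts, issue #138" --duration=7d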

@weiiwang01 (Collaborator) commented:

Thanks, we will follow up with the web team if this happens again.

@ben-ballot commented:

FYI, we have this alert constantly.

We limited the alert so that it fires only when the container restarts 3 times within 10 minutes (which means the liveness probe failed often enough to trigger 3 restarts).
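
Roughly the expression we alert on, checked ad hoc here (the Prometheus address and label names are assumptions):

# kube-state-metrics restart counter over a 10-minute window.
curl -sG http://<prometheus-host>:9090/api/v1/query --data-urlencode \
  'query=increase(kube_pod_container_status_restarts_total{container="wordpress"}[10m]) >= 3'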

It happened twice today, with logs similar to those described in the original bug.

I can see the liveness check is rather "aggressive":

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s describe pods wordpress-k8s-1 | grep Liveness
    Liveness:       http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Liveness:       http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1

Indeed, the charm's Pebble layer check (check 0) results in this:

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s get statefulset wordpress-k8s -o json | jq '.spec | .template| .spec | .containers | .[].livenessProbe'
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38812,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38813,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}

This is probably a bit too aggressive: timeoutSeconds would be better at 3 or 5 seconds, and increasing periodSeconds accordingly seems like a good idea. So I would probably set "timeout: 3" or "timeout: 5" and "period: 10", or make this configurable.

Also, the "threshold" setting is misleading: it apparently only affects the "successThreshold", not the "failureThreshold" (undefined above), which defaults to 3. All in all, I feel the aggressiveness of the check is probably what is causing the failures.

I'm going to silence it until Friday so you have time to look at it.

@weiiwang01 (Collaborator) commented:

> FYI, we have this alert constantly. […] All in all, I feel the aggressiveness of the check is probably what is causing the failures.

The health checks in the k8s charms are controlled by Pebble, and the check parameters on the Kubernetes side are actually for the Pebble server itself. Therefore, the small failure threshold and timeout apply to Pebble's health API requests, not to the WordPress health check requests. The actual health check parameters for WordPress are defined as you mentioned here, with a timeout of 5 seconds and a failure threshold of 3 (the default).
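
For example, the effective check configuration and its current status can be inspected from Pebble inside the workload container, along these lines (the socket and binary paths are the usual sidecar-charm defaults and may differ):

# Show the combined Pebble plan, including the wordpress-ready check.
juju ssh --container wordpress wordpress-k8s/0 \
  PEBBLE_SOCKET=/charm/container/pebble.socket /charm/bin/pebble plan
# List check status and failure counts.
juju ssh --container wordpress wordpress-k8s/0 \
  PEBBLE_SOCKET=/charm/container/pebble.socket /charm/bin/pebble checks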

Do you have any monitoring information that you can share with us? For example, there's a request duration Prometheus metric which can indicate whether the WordPress server is running slowly, and perhaps there are WordPress Apache logs related to the failure in Loki?
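
For instance, an ad-hoc query along these lines (the metric comes from the Apache exporter; the Prometheus address and labels are placeholders):

# Request rate per pod from the charm's Apache exporter metrics.
curl -sG http://<prometheus-host>:9090/api/v1/query --data-urlencode \
  'query=rate(apache_accesses_total{juju_application="wordpress-k8s"}[5m])'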

@ben-ballot commented:

Here is an extract from a failure that happened today and the relevant Apache logs around it:
