Wait for mounts before starting components + add auto-restart on failure #638

morgan-patou · 2023-07-12T11:07:11Z

As a best practice, the four standard paths mentioned in the docs will most likely each sit on a dedicated mount/filesystem.

At installation time this is fine, since the filesystems (if any are used) should already be mounted. On an OS restart, however, the services are currently configured with After=syslog.socket network.target. With that configuration, the OS can try to start the Alfresco components before the filesystem is even available. In that case the start fails (status=203/EXEC) and the component does not come up at all. Here is an example of logs showing that the filesystem for /opt/alfresco/ is mounted after ActiveMQ tries to start:

[root@mop-alf-ce-tool ~]# grep After /etc/systemd/system/activemq.service
After=syslog.socket network.target
[root@mop-alf-ce-tool ~]#
[root@mop-alf-ce-tool ~]# journalctl -xe | grep -E "activemq|opt-alfresco"
-- Subject: Unit activemq.service has begun start-up
-- Unit activemq.service has begun starting up.
Jul 12 09:23:41 mop-alf-ce-tool systemd[904]: activemq.service: Failed to execute command: No such file or directory
Jul 12 09:23:41 mop-alf-ce-tool systemd[904]: activemq.service: Failed at step EXEC spawning /opt/alfresco/activemq.sh: No such file or directory
-- Subject: Process /opt/alfresco/activemq.sh could not be executed
-- The process /opt/alfresco/activemq.sh could not be executed and failed.
Jul 12 09:23:41 mop-alf-ce-tool systemd[1]: activemq.service: Control process exited, code=exited status=203
Jul 12 09:23:41 mop-alf-ce-tool systemd[1]: activemq.service: Failed with result 'exit-code'.
-- The unit activemq.service has entered the 'failed' state with result 'exit-code'.
-- Subject: Unit activemq.service has failed
-- Unit activemq.service has failed.
-- Subject: Unit opt-alfresco.mount has begun start-up
-- Unit opt-alfresco.mount has begun starting up.
[root@mop-alf-ce-tool ~]#

Since the filesystem is mounted after ActiveMQ tries to start, it fails with No such file or directory, and because there is no restart by default, ActiveMQ stays down. I hit this issue on both activemq.service and alfresco-content.service, but it can happen on most of the services, depending on where they are installed and the startup order the OS picks.

Therefore, I believe it would be better to make sure that the local and remote filesystems are all properly mounted before trying to start the different Alfresco components. This can be achieved by adding local-fs.target remote-fs.target to the After= line of the different services (After=syslog.socket network.target local-fs.target remote-fs.target).
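As a minimal sketch of the change described above, the [Unit] section of activemq.service would look something like the fragment below. The Description= value is copied from the systemctl status output later in this thread, and the rest of the unit is omitted. The commented-out RequiresMountsFor= line is an alternative directive that is not part of this PR; it is shown only for comparison.

```ini
[Unit]
Description=Apache ActiveMQ - Alfresco instance
# Order the service after local and remote filesystems are mounted,
# so that /opt/alfresco/activemq.sh exists when ExecStart runs.
After=syslog.socket network.target local-fs.target remote-fs.target
# Alternative (not used in this PR): depend on the mount for one specific path.
# RequiresMountsFor=/opt/alfresco
```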

As part of this PR, I also added automatic restart of the processes on-failure (cf. the definition of the systemd service). This is a more debatable option, since the OS will try to restart most of the components even if you shut them down manually on purpose (without using the systemctl stop xxx command), because their exit status is usually not 0 and I didn't configure SuccessExitStatus. This means that people should only use the service to stop/start the different components (which, from my point of view, is how it should always be done anyway). It adds recovery in case of OOM or other kills, which I consider necessary.

Please let me know if you would like to slightly change the configuration and/or remove the restart option I added.
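The restart behaviour described above can be sketched as the following [Service] fragment. Only the on-failure restart policy comes from the description; the ExecStart and ExecStop values are copied from the systemctl status output shown further down, and every other line of the real unit is omitted.

```ini
[Service]
ExecStart=/opt/alfresco/activemq.sh start
ExecStop=/opt/alfresco/activemq.sh stop
# Restart on a non-zero exit code, an unclean signal, or a timeout.
# A stop requested through "systemctl stop" does not trigger a restart.
Restart=on-failure
# SuccessExitStatus= is left unset, so any non-zero exit code counts as a failure.
```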


gionn

See the comments about services whose main process is a Java process, and its exit status behaviour when receiving a SIGTERM from systemd: https://serverfault.com/a/695863/72778

Overall, waiting for the filesystems is indeed a must-have.

roles/trouter/templates/alfresco-transform-router.service

roles/transformers/templates/alfresco-tengine-aio.service

roles/sfs/templates/alfresco-shared-fs.service

morgan-patou · 2023-07-12T13:39:44Z

Yes, Java will exit with 143, but if we add the SuccessExitStatus configuration, we might miss some unexpected terminations of the processes. If I'm not mistaken, an OS OOM kill might first send a SIGTERM to the Java process before the SIGKILL. I would therefore assume that Java can exit with 143 even when it was terminated by the OS OOM killer, and in that case we would want the process to restart automatically. I might be wrong, but that's basically why I didn't set SuccessExitStatus.
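For comparison, the option discussed here but deliberately not applied in this PR would look roughly like the fragment below. It is only a sketch of the SuccessExitStatus approach; the ExecStart value is copied from the tengine-aio status output in this comment, and the other lines of the unit are omitted.

```ini
[Service]
ExecStart=/opt/alfresco/ats-ate-aio.sh
Restart=on-failure
# Treat exit code 143 (128 + SIGTERM) as a clean exit.
# With this set, a manual "kill <pid>" that makes Java exit with 143
# would no longer count as a failure and would not trigger a restart.
SuccessExitStatus=143
```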

With the proposed PR as of now:

ActiveMQ manual shutdown

[root@mop-alf-ce-tool ~]# systemctl status activemq
● activemq.service - Apache ActiveMQ - Alfresco instance
   Loaded: loaded (/etc/systemd/system/activemq.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 11:49:33 CEST; 3h 32min ago
  Process: 1089 ExecStart=/opt/alfresco/activemq.sh start (code=exited, status=0/SUCCESS)
 Main PID: 1198 (java)
[root@mop-alf-ce-tool ~]# 
[alfresco@mop-alf-ce-tool ~]$ /opt/alfresco/activemq.sh stop
...
Connecting to pid: 1198
.Stopping broker: localhost
.. FINISHED
[alfresco@mop-alf-ce-tool ~]$
[root@mop-alf-ce-tool ~]# systemctl status activemq
● activemq.service - Apache ActiveMQ - Alfresco instance
   Loaded: loaded (/etc/systemd/system/activemq.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2023-07-12 15:22:06 CEST; 7s ago
  Process: 7013 ExecStop=/opt/alfresco/activemq.sh stop (code=exited, status=1/FAILURE)
  Process: 1089 ExecStart=/opt/alfresco/activemq.sh start (code=exited, status=0/SUCCESS)
 Main PID: 1198 (code=exited, status=0/SUCCESS)

tengine-aio manual kill

[root@mop-alf-ce-tool ~]# systemctl status alfresco-tengine-aio
● alfresco-tengine-aio.service - Alfresco Transform Service - AIO Transform Engine
   Loaded: loaded (/etc/systemd/system/alfresco-tengine-aio.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 11:49:31 CEST; 3h 36min ago
 Main PID: 1088 (java)
[root@mop-alf-ce-tool ~]#
[root@mop-alf-ce-tool ~]# kill 1088
[root@mop-alf-ce-tool ~]# 
[root@mop-alf-ce-tool ~]# systemctl status alfresco-tengine-aio
● alfresco-tengine-aio.service - Alfresco Transform Service - AIO Transform Engine
   Loaded: loaded (/etc/systemd/system/alfresco-tengine-aio.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2023-07-12 15:27:33 CEST; 1s ago
  Process: 1088 ExecStart=/opt/alfresco/ats-ate-aio.sh (code=exited, status=143)
 Main PID: 1088 (code=exited, status=143)

Solr (alfresco-search) also ends with a status of 1/FAILURE if it is shut down using /opt/alfresco/search-services-2.0.6.1/solr.sh stop.
ACS Tomcat doesn't stop in time if you use the /opt/alfresco/tomcat.sh stop command. Because it uses the default parameters, the command returns too soon (timeout after 5s), saying that Tomcat wasn't stopped in time. Even with a longer timeout parameter such as 60s, it doesn't work for me: when I run the command manually, it always says that Tomcat wasn't stopped (it was, but the process is still there).

[alfresco@mop-alf-ce-repo ~]$ /opt/alfresco/tomcat.sh stop
Using CATALINA_BASE:   /etc/opt/alfresco/tomcat
Using CATALINA_HOME:   /opt/apache-tomcat-9.0.59
...
Tomcat did not stop in time.
PID file was not removed.
To aid diagnostics a thread dump has been written to standard out.
[alfresco@mop-alf-ce-repo ~]$

Since the Java process remains, the OS service doesn't detect any failure either, but Alfresco isn't reachable anymore. Not sure why, but with the 30 -force parameters it does a kill and the OS service restarts it...

alxgomz · 2023-07-12T14:52:06Z

Reading the systemd unit documentation, I would argue that using the simple unit type is actually plain wrong.
It sounds to me like we should instead be using exec or forking (and make sure the wrapper scripts use the exec command).
Wdyt? Do you read that as I do?

Even better, I would argue that none of these wrapper scripts are really necessary, as they mostly set environment vars, which a unit can easily do... So in the end the best approach may be to get rid of them, stick to the simple unit type and have ExecStart really start the service instead of calling a wrapper shell.

Though I realise this is more work and can understand if that's not something you want to touch, @morgan-patou. It also has implications for several issues we have had open for a while about removing the infamous setenv.sh.
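A rough sketch of what such a wrapper-free unit could look like is shown below. Everything specific in it is an assumption made for illustration: the alfresco user is inferred from the shell prompts earlier in the thread, and the JAVA_OPTS value, the java binary path and the jar path are hypothetical placeholders, not values taken from the playbook.

```ini
[Service]
Type=simple
# Assumed service account, inferred from the shell prompts in this thread.
User=alfresco
# Environment variable previously exported by the wrapper script (hypothetical value).
Environment=JAVA_OPTS=-Xmx512m
# Start the JVM directly instead of a wrapper shell; both paths are hypothetical.
ExecStart=/usr/bin/java $JAVA_OPTS -jar /opt/alfresco/example-service.jar
Restart=on-failure
```

With ExecStart launching the JVM itself, the main PID tracked by systemd is the Java process rather than a shell, which is the behaviour the simple type expects.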

alxgomz · 2023-07-12T14:52:49Z

btw, @morgan-patou, that kind of feedback is invaluable to us! Thanks a lot!

alxgomz

The fs part is ok. I'm not sure about the unit type but that's a different topic.

roles/sfs/templates/alfresco-shared-fs.service

roles/transformers/templates/alfresco-tengine-aio.service

alxgomz

Approving as it fixes the reported issue, and further changes to the unit types are actually a separate enhancement.

alxgomz · 2023-07-12T15:37:31Z

enterprise checks rolling 🛼 : https://github.com/Alfresco/alfresco-ansible-deployment/actions/runs/5533601854

gionn · 2023-08-17T09:22:18Z

Thanks for this contribution as well!

Wait for local & remote FS before starting components + add auto-restart on failure (ab18f37)

morgan-patou mentioned this pull request Jul 12, 2023

502 Bad Gateway after installing with ansible and restarting machine #417 (Open)

gionn requested changes Jul 12, 2023

gionn requested a review from alxgomz July 12, 2023 13:02

alxgomz requested changes Jul 12, 2023

alxgomz self-requested a review July 12, 2023 15:33

alxgomz approved these changes Jul 12, 2023

gionn approved these changes Aug 17, 2023

gionn merged commit ed6a9ba into Alfresco:master Aug 17, 2023

morgan-patou deleted the service-wait-fs branch November 15, 2023 08:52
