Wait for mounts before starting components + add auto-restart on failure #638
See the comments about services whose main process is a Java process, and its exit status when it receives a SIGTERM from systemd: https://serverfault.com/a/695863/72778
Overall, waiting for the filesystems is indeed a must-have.
Yes, Java will exit with 143, but if we add the … With the proposed PR as of now:
[root@mop-alf-ce-tool ~]# systemctl status activemq
● activemq.service - Apache ActiveMQ - Alfresco instance
Loaded: loaded (/etc/systemd/system/activemq.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2023-07-12 11:49:33 CEST; 3h 32min ago
Process: 1089 ExecStart=/opt/alfresco/activemq.sh start (code=exited, status=0/SUCCESS)
Main PID: 1198 (java)
[root@mop-alf-ce-tool ~]#
[alfresco@mop-alf-ce-tool ~]$ /opt/alfresco/activemq.sh stop
...
Connecting to pid: 1198
.Stopping broker: localhost
.. FINISHED
[alfresco@mop-alf-ce-tool ~]$
[root@mop-alf-ce-tool ~]# systemctl status activemq
● activemq.service - Apache ActiveMQ - Alfresco instance
Loaded: loaded (/etc/systemd/system/activemq.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2023-07-12 15:22:06 CEST; 7s ago
Process: 7013 ExecStop=/opt/alfresco/activemq.sh stop (code=exited, status=1/FAILURE)
Process: 1089 ExecStart=/opt/alfresco/activemq.sh start (code=exited, status=0/SUCCESS)
Main PID: 1198 (code=exited, status=0/SUCCESS)
[root@mop-alf-ce-tool ~]# systemctl status alfresco-tengine-aio
● alfresco-tengine-aio.service - Alfresco Transform Service - AIO Transform Engine
Loaded: loaded (/etc/systemd/system/alfresco-tengine-aio.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2023-07-12 11:49:31 CEST; 3h 36min ago
Main PID: 1088 (java)
[root@mop-alf-ce-tool ~]#
[root@mop-alf-ce-tool ~]# kill 1088
[root@mop-alf-ce-tool ~]#
[root@mop-alf-ce-tool ~]# systemctl status alfresco-tengine-aio
● alfresco-tengine-aio.service - Alfresco Transform Service - AIO Transform Engine
Loaded: loaded (/etc/systemd/system/alfresco-tengine-aio.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2023-07-12 15:27:33 CEST; 1s ago
Process: 1088 ExecStart=/opt/alfresco/ats-ate-aio.sh (code=exited, status=143)
Main PID: 1088 (code=exited, status=143)
[alfresco@mop-alf-ce-repo ~]$ /opt/alfresco/tomcat.sh stop
Using CATALINA_BASE: /etc/opt/alfresco/tomcat
Using CATALINA_HOME: /opt/apache-tomcat-9.0.59
...
Tomcat did not stop in time.
PID file was not removed.
To aid diagnostics a thread dump has been written to standard out.
[alfresco@mop-alf-ce-repo ~]$
Since the Java process remains, the OS service doesn't detect any failure either, but Alfresco isn't reachable anymore. Not sure why, but with the
Reading the systemd unit documentation, I would argue that using Even better, I would argue that none of these wrapper scripts are really necessary, as they mostly set environment variables, which a unit can easily do... So in the end the best approach may be to get rid of them, stick to Though I realise this is more work and can understand if that's not something you want to touch, @morgan-patou. It also has implications for several issues we have had open for a while regarding removing the infamous
btw, @morgan-patou, that kind of feedback is invaluable to us! Thanks a lot!
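To illustrate the idea of dropping the wrapper scripts and letting the unit set the environment itself, here is a minimal sketch of what such a unit could look like for Tomcat. The `CATALINA_BASE` and `CATALINA_HOME` values are copied from the `tomcat.sh stop` output earlier in this thread; the `alfresco` user, the description, and the choice of `catalina.sh run` as the foreground entry point are assumptions for illustration, not what the playbook currently generates.

```ini
# Hypothetical unit sketch: not the unit shipped by this project.
[Unit]
Description=Alfresco Content Services (sketch without wrapper script)
After=syslog.socket network.target local-fs.target remote-fs.target

[Service]
# Assumed service account, based on the shell prompts shown above
User=alfresco
# Variables that a wrapper script would otherwise export
Environment=CATALINA_BASE=/etc/opt/alfresco/tomcat
Environment=CATALINA_HOME=/opt/apache-tomcat-9.0.59
# "run" keeps Tomcat in the foreground, so systemd tracks the JVM as the main PID
ExecStart=/opt/apache-tomcat-9.0.59/bin/catalina.sh run
# A JVM stopped by SIGTERM exits with 128 + 15 = 143; treat that as a clean stop
SuccessExitStatus=143
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With `SuccessExitStatus=143` set, a plain `kill <pid>` of the JVM would count as a clean exit, so `Restart=on-failure` would not bring the service back in that case, while an OOM kill or a crash still would.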
The fs part is ok. I'm not sure about the unit type but that's a different topic.
Approving as it fixes the reported issue; further changes to the unit types are actually a separate enhancement.
enterprise checks rolling 🛼 : https://github.com/Alfresco/alfresco-ansible-deployment/actions/runs/5533601854
Thanks for this contribution as well!
As a best practice, the 4 standard paths mentioned in the doc will most probably be using a dedicated mount/filesystem.

At installation, this is fine since the filesystems should already be mounted (if any are used). However, after an OS restart, the different services are currently configured to start `After=syslog.socket network.target`. With this configuration, the OS can try to start the Alfresco components before the filesystem is even available. In such cases, the start obviously fails (`status=203/EXEC`) and the component does not start at all. Example of logs showing that the filesystem for `/opt/alfresco/` is mounted after ActiveMQ tries to start up:

Since the filesystem is mounted after ActiveMQ tries to start, it fails with `No such file or directory`, and because there is no restart by default, ActiveMQ isn't running. I faced this issue on both `activemq.service` and `alfresco-content.service`, but it can happen on most of them, depending on where they are installed and the order the OS chooses for startup.

Therefore, I believe it would be better to make sure that the local and remote filesystems are all properly mounted before trying to start the different Alfresco components. This can be achieved by adding `local-fs.target remote-fs.target` to the definition of the different services (`After=syslog.socket network.target local-fs.target remote-fs.target`).

As part of this PR, I also added the automatic restart of the processes `on-failure` (c.f. definition of the systemd service). This is more of a debatable option, since the OS will try to restart most of the components even if you shut them down manually on purpose (without using the `systemctl stop xxx` command), because the exit status is usually not 0 and I didn't configure `SuccessExitStatus`. This means that people should only use the service to stop/start the different components (which, from my point of view, should always be how it's done...). It adds recovery in case of OOM or other kills, which is necessary from my point of view.

Please let me know if you would like to slightly change the configuration and/or remove the restart option I added.
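For reference, the two changes described above can be sketched as a systemd drop-in. This is only an illustration: the drop-in path is hypothetical, and the PR itself changes the generated unit files rather than adding drop-ins.

```ini
# Hypothetical location: /etc/systemd/system/activemq.service.d/override.conf
[Unit]
# Start only after local and remote filesystems are mounted
After=syslog.socket network.target local-fs.target remote-fs.target

[Service]
# Restart on non-zero exit codes, unclean signals (e.g. OOM kill) and timeouts.
# SuccessExitStatus is deliberately left unset, so a JVM killed manually
# (exit status 143) is also restarted.
Restart=on-failure
```

After editing unit files or drop-ins, `systemctl daemon-reload` is needed for systemd to pick up the change.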