Wait for mounts before starting components + add auto-restart on failure #638

Merged: 1 commit, Aug 17, 2023

morgan-patou
Contributor

As a best practice, the 4 standard paths mentioned in the doc will most probably be using a dedicated mount/filesystem.

At installation time this is fine, since the filesystems should already be mounted (if any are used). After an OS restart, however, the services are currently configured to start After=syslog.socket network.target. With this configuration, the OS may try to start the Alfresco components before the filesystem is even available. In that case the start fails (status=203/EXEC) and the component does not come up at all. Example logs showing that the filesystem for /opt/alfresco/ is mounted after ActiveMQ tries to start:

[root@mop-alf-ce-tool ~]# grep After /etc/systemd/system/activemq.service
After=syslog.socket network.target
[root@mop-alf-ce-tool ~]#
[root@mop-alf-ce-tool ~]# journalctl -xe | grep -E "activemq|opt-alfresco"
-- Subject: Unit activemq.service has begun start-up
-- Unit activemq.service has begun starting up.
Jul 12 09:23:41 mop-alf-ce-tool systemd[904]: activemq.service: Failed to execute command: No such file or directory
Jul 12 09:23:41 mop-alf-ce-tool systemd[904]: activemq.service: Failed at step EXEC spawning /opt/alfresco/activemq.sh: No such file or directory
-- Subject: Process /opt/alfresco/activemq.sh could not be executed
-- The process /opt/alfresco/activemq.sh could not be executed and failed.
Jul 12 09:23:41 mop-alf-ce-tool systemd[1]: activemq.service: Control process exited, code=exited status=203
Jul 12 09:23:41 mop-alf-ce-tool systemd[1]: activemq.service: Failed with result 'exit-code'.
-- The unit activemq.service has entered the 'failed' state with result 'exit-code'.
-- Subject: Unit activemq.service has failed
-- Unit activemq.service has failed.
-- Subject: Unit opt-alfresco.mount has begun start-up
-- Unit opt-alfresco.mount has begun starting up.
[root@mop-alf-ce-tool ~]#

Since the filesystem is mounted after ActiveMQ tries to start, the start fails with No such file or directory, and because there is no restart by default, ActiveMQ stays down. I hit this issue on both activemq.service and alfresco-content.service, but it can happen on most of the services, depending on where they are installed and the startup order the OS chooses.

Therefore, I believe it would be better to make sure that the local and remote filesystems are all properly mounted before trying to start the different Alfresco components. This can be achieved by adding local-fs.target remote-fs.target to the definition of the different services (After=syslog.socket network.target local-fs.target remote-fs.target).
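A minimal sketch of that ordering change, written here as a drop-in override for one service. The drop-in path and file name are assumptions for illustration only; the After= value is the one given in this description.

```ini
# /etc/systemd/system/activemq.service.d/wait-for-mounts.conf
# (hypothetical drop-in location, used here only as an example)
[Unit]
# Start only after local and remote filesystems have been mounted,
# in addition to the existing syslog.socket and network.target ordering.
After=syslog.socket network.target local-fs.target remote-fs.target
```

After a `systemctl daemon-reload`, `systemctl show -p After activemq.service` should list both targets.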

As part of this PR, I also added an automatic restart of the processes on failure (cf. the definition of the systemd services). This option is more debatable: the OS will try to restart most of the components even when you shut them down manually on purpose (without using the systemctl stop xxx command), because their exit status is usually not 0 (and I didn't configure SuccessExitStatus). This means that people should only use the service to stop/start the different components (which, from my point of view, is how it should always be done...). It adds recovery in case of an OOM kill or other kills, which I consider necessary.
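For reference, the restart behaviour described above comes down to a couple of [Service] directives. This is a sketch rather than the exact content of the PR; in particular the RestartSec value is an assumption.

```ini
[Service]
# Restart when the process exits with a non-zero code, is killed by a
# signal such as SIGKILL, or hits a timeout.
# A clean exit (code 0) or an explicit `systemctl stop` does not trigger a restart.
Restart=on-failure
# Assumed delay between restart attempts; adjust per component.
RestartSec=10
```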

Please let me know if you would like to slightly change the configuration and/or remove the restart option I added.

Member

gionn left a comment

See the comments about services whose main process is a Java process, and how its exit status behaves when it receives a SIGTERM from systemd: https://serverfault.com/a/695863/72778

Overall, waiting for the filesystems is indeed a must-have.

gionn requested a review from alxgomz on July 12, 2023, 13:02
Contributor Author

morgan-patou commented Jul 12, 2023

Yes, Java will exit with 143, but if we add the SuccessExitStatus configuration, we might miss some unexpected terminations of the processes. If I'm not wrong, an OS OOM kill might first send a SIGTERM to the Java process before the SIGKILL. I would therefore assume that Java can exit with 143 even when it was terminated by the OS OOM killer, and in that case we would want the process to restart automatically. I might be wrong, but that's basically why I didn't set SuccessExitStatus.
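For context, the setting under discussion would look roughly like the fragment below. It is shown only to make the trade-off concrete; it is not part of the PR.

```ini
[Service]
Restart=on-failure
# Not set in this PR. With this line, a JVM exiting with 143 (128 + SIGTERM)
# is treated as a clean exit, so Restart=on-failure no longer restarts it,
# whatever sent the SIGTERM.
SuccessExitStatus=143
```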

With the proposed PR as of now:

  • ActiveMQ manual shutdown
[root@mop-alf-ce-tool ~]# systemctl status activemq
● activemq.service - Apache ActiveMQ - Alfresco instance
   Loaded: loaded (/etc/systemd/system/activemq.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 11:49:33 CEST; 3h 32min ago
  Process: 1089 ExecStart=/opt/alfresco/activemq.sh start (code=exited, status=0/SUCCESS)
 Main PID: 1198 (java)
[root@mop-alf-ce-tool ~]# 
[alfresco@mop-alf-ce-tool ~]$ /opt/alfresco/activemq.sh stop
...
Connecting to pid: 1198
.Stopping broker: localhost
.. FINISHED
[alfresco@mop-alf-ce-tool ~]$
[root@mop-alf-ce-tool ~]# systemctl status activemq
● activemq.service - Apache ActiveMQ - Alfresco instance
   Loaded: loaded (/etc/systemd/system/activemq.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2023-07-12 15:22:06 CEST; 7s ago
  Process: 7013 ExecStop=/opt/alfresco/activemq.sh stop (code=exited, status=1/FAILURE)
  Process: 1089 ExecStart=/opt/alfresco/activemq.sh start (code=exited, status=0/SUCCESS)
 Main PID: 1198 (code=exited, status=0/SUCCESS)
  • tengine-aio manual kill
[root@mop-alf-ce-tool ~]# systemctl status alfresco-tengine-aio
● alfresco-tengine-aio.service - Alfresco Transform Service - AIO Transform Engine
   Loaded: loaded (/etc/systemd/system/alfresco-tengine-aio.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 11:49:31 CEST; 3h 36min ago
 Main PID: 1088 (java)
[root@mop-alf-ce-tool ~]#
[root@mop-alf-ce-tool ~]# kill 1088
[root@mop-alf-ce-tool ~]# 
[root@mop-alf-ce-tool ~]# systemctl status alfresco-tengine-aio
● alfresco-tengine-aio.service - Alfresco Transform Service - AIO Transform Engine
   Loaded: loaded (/etc/systemd/system/alfresco-tengine-aio.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2023-07-12 15:27:33 CEST; 1s ago
  Process: 1088 ExecStart=/opt/alfresco/ats-ate-aio.sh (code=exited, status=143)
 Main PID: 1088 (code=exited, status=143)
  • Solr (alfresco-search) also exits with status 1/FAILURE if it is shut down using /opt/alfresco/search-services-2.0.6.1/solr.sh stop

  • ACS Tomcat doesn't stop in time if you use the /opt/alfresco/tomcat.sh stop command. Because it uses the default parameters, the command returns too soon (timeout after 5s) and reports that Tomcat wasn't stopped in time. Even with a longer timeout parameter such as 60s, it doesn't work for me: when I run the command manually, it always reports that Tomcat wasn't stopped (it was, but the process is still there)

[alfresco@mop-alf-ce-repo ~]$ /opt/alfresco/tomcat.sh stop
Using CATALINA_BASE:   /etc/opt/alfresco/tomcat
Using CATALINA_HOME:   /opt/apache-tomcat-9.0.59
...
Tomcat did not stop in time.
PID file was not removed.
To aid diagnostics a thread dump has been written to standard out.
[alfresco@mop-alf-ce-repo ~]$

Since the Java process remains, the OS service doesn't detect any failure either, but Alfresco isn't reachable anymore. Not sure why, but with the 30 -force parameters it does a kill and the OS service restarts it...

Contributor

alxgomz commented Jul 12, 2023

Reading the systemd unit documentation, I would argue that using the simple unit type is actually plain wrong.
It sounds to me like we should instead be using exec or forking (and make sure the wrapper scripts use the exec command).
Wdyt? Do you read it the way I do?

Even better, I would argue that none of these wrapper scripts are really necessary, as they mostly set environment variables, which a unit can easily do... So in the end the best approach may be to get rid of them, stick to the simple unit type and have ExecStart really start the service instead of calling a wrapper shell script.

Though I realise this is more work and can understand if that's not something you want to touch, @morgan-patou. It also has implications for several issues we have had open for a while about removing the infamous setenv.sh
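To make the suggestion above concrete, a unit without a wrapper script could look something like the sketch below. Every path, variable and JVM option in it is hypothetical and only stands in for whatever the wrapper scripts currently set.

```ini
[Service]
Type=simple
# Hypothetical values replacing what a wrapper script or setenv.sh would export.
Environment=JAVA_HOME=/usr/lib/jvm/java-17
Environment="JAVA_OPTS=-Xms512m -Xmx512m"
# ExecStart runs the JVM directly, so systemd tracks the real main process.
# An unbraced $JAVA_OPTS is expanded by systemd and split into separate arguments.
ExecStart=/usr/lib/jvm/java-17/bin/java $JAVA_OPTS -jar /opt/alfresco/example-service.jar
```

With the JVM as the main process, the exit status systemd sees is that of the JVM itself rather than that of a shell script.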

Contributor

alxgomz commented Jul 12, 2023

btw, @morgan-patou, that kind of feedback is invaluable to us! Thanks a lot!

Contributor

alxgomz left a comment

The fs part is OK. I'm not sure about the unit type, but that's a different topic.

alxgomz self-requested a review on July 12, 2023, 15:33
Contributor

alxgomz left a comment

Approving, as it fixes the reported issue, and further changes to the unit types are actually a separate enhancement.

Contributor

alxgomz commented Jul 12, 2023

Member

gionn commented Aug 17, 2023

Thanks for this contribution as well!

gionn merged commit ed6a9ba into Alfresco:master on Aug 17, 2023
morgan-patou deleted the service-wait-fs branch on November 15, 2023, 08:52