
Migrate the Installation assistant test pipelines to GHA #20

Closed
7 of 8 tasks
teddytpc1 opened this issue Aug 7, 2024 · 20 comments · Fixed by #40, #46 or #60
Labels: level/subtask (Subtask issue), type/enhancement (Enhancement issue)

teddytpc1 (Member) commented Aug 7, 2024

Objective
wazuh/wazuh-packages#2904

Description

Because of the Wazuh packages redesign tier 2 objective, we need to migrate the Wazuh installation assistant test pipelines (Test_unattended_distributed, Test_unattended_distributed_cases, Test_unattended and Test_unattended_tier) to GHA.

Tasks

  • Adapt pytests
  • Migrate all the Wazuh installation assistant test pipelines from Jenkins to GHA:
    • Test_unattended_distributed
    • Test_unattended_distributed_cases
    • Test_unattended
    • Test_unattended_tier
  • Modify the name of the GHA from unattended_installer to installation_assistant, if applicable
  • Validate that the GHAs work as expected

Related

@teddytpc1 teddytpc1 changed the title MPV - Migrate the Installation assistant test pipelines to GHA Migrate the Installation assistant test pipelines to GHA Aug 8, 2024
@teddytpc1 teddytpc1 added type/enhancement Enhancement issue level/subtask Subtask issue labels Aug 8, 2024
@wazuhci wazuhci moved this to Backlog in Release 4.10.0 Aug 8, 2024
@c-bordon c-bordon assigned c-bordon and davidcr01 and unassigned c-bordon Aug 9, 2024
@wazuhci wazuhci moved this from Backlog to In progress in Release 4.10.0 Aug 23, 2024
davidcr01 (Contributor) commented Aug 23, 2024

Update Report

Investigation

I have been investigating the mentioned Jenkins pipelines, and we decided to migrate the Test_unattended, Test_unattended_tier and Test_unattended_distributed pipelines.

Regarding the rest of the tests:

  • Test_unattended_distributed_cases: The scope of this pipeline is to test the distributed installation with the installation assistant, but some components are installed in one machine and some components in another machine. For example:
    • Instance A = indexer and manager. Instance B = dashboard.
    • Instance A = manager. Instance B = indexer and dashboard.
    • Instance A = manager and dashboard. Instance B = indexer.
      We think that these cases are already tested and reproduced in the Test_unattended_distributed pipeline, so we discarded migrating it.

davidcr01 (Contributor) commented
Update Report

  • Header of Test_installation_assistant.yml (replacing the Test_unattended Jenkins pipeline).
  • Header of Test_installation_assistant_tier.yml (replacing the Test_unattended_tier Jenkins pipeline).
  • Header of Test_installation_assistant_distributed.yml (replacing the Test_unattended_distributed Jenkins pipeline).

@wazuhci wazuhci moved this from In progress to Blocked in Release 4.10.0 Aug 26, 2024
davidcr01 (Contributor) commented Aug 27, 2024

Update Report

Test installation assistant workflow

  • The first steps of this workflow have been developed and tested:
    • Input parameters visualization
    • Composite name variable management
    • Ansible installation
    • Allocation module clone and installation
    • Allocation instance provision and deletion

These steps have been tested here: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10576987236/job/29303972255

Currently adding the necessary steps to execute the provision.yml playbook. This playbook has been modified:

  • Deleted the unattended references.
  • Removed the Change wazuh rpm revision to generic one, Change wazuh deb revision to generic one, Change indexer rpm revision to generic one, Change indexer deb revision to generic one, Change dashboard rpm revision to generic one and Change dashboard deb revision to generic one tasks, because the related variables do not exist in the installation assistant script.

Problem with CentOS 8 🔴

Caution

It seems that the CentOS 8 allocator VM does not have Python installed, which is necessary to execute Ansible playbooks. We need to determine whether to add the CentOS 8 AMI that was used in the old Jenkins pipeline, or to update the specified CentOS 8 AMI to another one with Python installed.

The reported error is the following:

fatal: [ec2-3-95-210-126.compute-1.amazonaws.com]: FAILED! => {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "OpenSSH_8.9p1 Ubuntu-3ubuntu0.10, OpenSSL 3.0.2 15 Mar 2022\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files\r\ndebug1: /etc/ssh/ssh_config line 21: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: mux_client_request_session: master session id: 2\r\nShared connection to ec2-3-95-210-126.compute-1.amazonaws.com closed.\r\n", "module_stdout": "/bin/sh: /usr/bin/python3: No such file or directory\r\n", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}

Problem with CentOS 8 - Solved 🟢

After some time debugging, I found a way to install Python on the remote machine (CentOS 8) before executing the playbook. The commit with the changes is: eae9a6c

The workaround is:

  • Set gather_facts: no. This prevents Ansible from searching for the Python interpreter at the beginning of the playbook.
  • Then, use pre_tasks with the raw module to run the installation before anything else, over plain SSH and without Python.
  • After that, gather the facts with an extra task. This is compulsory because the playbooks use variables such as ansible_os_family, which are set after gathering the facts.

Important

This way, there is no need to set the Python interpreter before executing the playbook or to change the CentOS 8 Allocator AMI, and it is a very useful approach for installing packages before running the playbook.
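A minimal sketch of this pattern (illustrative only, not the exact commit; it assumes dnf is available on the CentOS 8 target):

- hosts: all
  become: true
  gather_facts: no          # skip initial fact gathering, which requires Python on the target
  pre_tasks:
    - name: Install Python over plain SSH (raw runs without Python on the target)
      raw: dnf install -y python3

    - name: Gather facts now that Python is available
      setup: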

The changes were tested in the following runs:

@wazuhci wazuhci moved this from Blocked to In progress in Release 4.10.0 Aug 27, 2024
@davidcr01 davidcr01 linked a pull request Aug 29, 2024 that will close this issue
davidcr01 (Contributor) commented
Update Report

⚠️ I encountered a severe problem related to the Ansible version and the usage of Ansible modules. Many of the Ansible playbooks needed to complete the GHA use Ansible modules, such as the YUM and APT modules.

The problem was that, on Amazon Linux 2, the package manager could not be detected, so the YUM module could not be used, because internally it relies on the package manager variable (a variable that defines which package manager the system uses):

fatal: [ec2-44-212-66-219.compute-1.amazonaws.com]: FAILED! => {"changed": false, "msg": "Could not find a matching action for the \"unknown\" package manager."}

Caution

The YUM and APT modules were not the only ones affected. Some Ansible variables, such as ansible_os_family and ansible_os_distribution, could not be used either. This is a huge problem in this case because the majority of the playbooks are prepared to be executed on many systems, and this is controlled with these kinds of variables.

After many failed attempts to solve this, the provisional solution I came up with was installing a specific version of ansible-core. The problem was caused by the removal of the previously deprecated YUM module and YUM action: https://github.com/ansible/ansible/blob/stable-2.17/changelogs/CHANGELOG-v2.17.rst#removed-features-previously-deprecated. The approach, then, was to change the Ansible installation in the GH runner. See this commit.

✔️ Now, Ansible is installed using pip, a tool recommended by the official documentation:

- name: Install Ansible
  run: sudo apt-get update && sudo apt-get install -y python3 && python3 -m pip install --user ansible-core==2.16

With this change, the two use cases (running in Ubuntu 22 and running in AL2) succeeded:

Warning

We are aware that installing a specific version of ansible-core is not the best solution, because we become highly dependent on that version. However, due to the migration's urgency, we will take this approach as a provisional solution until we find a better one; alternatively, we may rework these GHAs in 5.0.

davidcr01 (Contributor) commented
On hold due to: #44

@wazuhci wazuhci moved this from In progress to On hold in Release 4.10.0 Aug 30, 2024
davidcr01 (Contributor) commented Sep 2, 2024

Update Report

Problem with Allocator VM deletion

I encountered a problem when setting the final conditional of the Allocator VM deletion. The original task was:

- name: Delete allocated VM
  if: always() && steps.allocator_instance.outcome == 'success' && inputs.DESTROY == 'true'
  run: python3 wazuh-automation/deployability/modules/allocation/main.py --action delete --track-output /tmp/allocator_instance/track.yml

This task was never executed. After many hours of debugging, I found that the inputs.DESTROY == 'true' condition was the problem:

  • In the workflow, the DESTROY input is a boolean parameter.
  • When a boolean parameter is evaluated against a string, the result is NaN.
  • Therefore, the condition was always false.

✔️ Replacing inputs.DESTROY == 'true' with inputs.DESTROY == true solved the problem.
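The corrected task, with the boolean compared unquoted:

- name: Delete allocated VM
  if: always() && steps.allocator_instance.outcome == 'success' && inputs.DESTROY == true
  run: python3 wazuh-automation/deployability/modules/allocation/main.py --action delete --track-output /tmp/allocator_instance/track.yml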

Here I have tested the condition against the possible scenarios that occurred to me:

Cancellation cases:

Next steps

Now, I have to work on the following:

  • Many systems, such as Ubuntu, RHEL and CentOS, fail during the AIO playbook execution when running the installation assistant. The failure is not immediate; the connection is closed while the Wazuh components are being installed:
TASK [Install assistant installer] *********************************************
task path: /home/runner/work/wazuh-installation-assistant/wazuh-installation-assistant/.github/workflows/ansible-playbooks/aio.yml:27
<ec2-3-83-65-190.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<ec2-3-83-65-190.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2200 -o 'IdentityFile="/tmp/allocator_instance/gha_10664819701_assistant_test-5903/gha_10664819701_assistant_test-key-2194"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o 'ControlPath="/home/runner/.ansible/cp/285abfb711"' ec2-3-83-65-190.compute-1.amazonaws.com '/bin/sh -c '"'"'echo ~ubuntu && sleep 0'"'"''
<ec2-3-83-65-190.compute-1.amazonaws.com> (0, b'/home/ubuntu\n', b'')
<ec2-3-83-65-190.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<ec2-3-83-65-190.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2200 -o 'IdentityFile="/tmp/allocator_instance/gha_10664819701_assistant_test-5903/gha_10664819701_assistant_test-key-2194"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o 'ControlPath="/home/runner/.ansible/cp/285abfb711"' ec2-3-83-65-190.compute-1.amazonaws.com '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /home/ubuntu/.ansible/tmp `"&& mkdir "` echo /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821 `" && echo ansible-tmp-1725270564.1284177-2850-24144868290821="` echo /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821 `" ) && sleep 0'"'"''
<ec2-3-83-65-190.compute-1.amazonaws.com> (0, b'ansible-tmp-1725270564.1284177-2850-24144868290821=/home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821\n', b'')
Using module file /home/runner/.local/lib/python3.10/site-packages/ansible/modules/command.py
<ec2-3-83-65-190.compute-1.amazonaws.com> PUT /home/runner/.ansible/tmp/ansible-local-2821wja4gasf/tmpl5cix_ik TO /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821/AnsiballZ_command.py
<ec2-3-83-65-190.compute-1.amazonaws.com> SSH: EXEC sftp -b - -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2200 -o 'IdentityFile="/tmp/allocator_instance/gha_10664819701_assistant_test-5903/gha_10664819701_assistant_test-key-2194"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o 'ControlPath="/home/runner/.ansible/cp/285abfb711"' '[ec2-3-83-65-190.compute-1.amazonaws.com]'
<ec2-3-83-65-190.compute-1.amazonaws.com> (0, b'sftp> put /home/runner/.ansible/tmp/ansible-local-2821wja4gasf/tmpl5cix_ik /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821/AnsiballZ_command.py\n', b'')
<ec2-3-83-65-190.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<ec2-3-83-65-190.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2200 -o 'IdentityFile="/tmp/allocator_instance/gha_10664819701_assistant_test-5903/gha_10664819701_assistant_test-key-2194"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o 'ControlPath="/home/runner/.ansible/cp/285abfb711"' ec2-3-83-65-190.compute-1.amazonaws.com '/bin/sh -c '"'"'chmod u+x /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821/ /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821/AnsiballZ_command.py && sleep 0'"'"''
<ec2-3-83-65-190.compute-1.amazonaws.com> (0, b'', b'')
<ec2-3-83-65-190.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<ec2-3-83-65-190.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2200 -o 'IdentityFile="/tmp/allocator_instance/gha_10664819701_assistant_test-5903/gha_10664819701_assistant_test-key-2194"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o 'ControlPath="/home/runner/.ansible/cp/285abfb711"' -tt ec2-3-83-65-190.compute-1.amazonaws.com '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-wscnucwrcojkmeiydxxxyssdwcaapjnp ; /usr/bin/python3 /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821/AnsiballZ_command.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<ec2-3-83-65-190.compute-1.amazonaws.com> (255, b'', b'Shared connection to ec2-3-83-65-190.compute-1.amazonaws.com closed.\r\n')
<ec2-3-83-65-190.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<ec2-3-83-65-190.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2200 -o 'IdentityFile="/tmp/allocator_instance/gha_10664819701_assistant_test-5903/gha_10664819701_assistant_test-key-2194"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o 'ControlPath="/home/runner/.ansible/cp/285abfb711"' ec2-3-83-65-190.compute-1.amazonaws.com '/bin/sh -c '"'"'rm -f -r /home/ubuntu/.ansible/tmp/ansible-tmp-1725270564.1284177-2850-24144868290821/ > /dev/null 2>&1 && sleep 0'"'"''
<ec2-3-83-65-190.compute-1.amazonaws.com> (0, b'', b'')
fatal: [ec2-3-83-65-190.compute-1.amazonaws.com]: UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: Shared connection to ec2-3-83-65-190.compute-1.amazonaws.com closed.",
    "unreachable": true
}
  • Develop logic to upload the Allocator VM information as an artifact, in order to be able to inspect the system when the Allocator VM is not deleted.

davidcr01 (Contributor) commented
Update Report

Progress

Allocator VM upload

I have developed the upload artifact logic with the following code:

- name: Compress Allocator directory
  id: compress_allocator_files
  if: always() && steps.allocator_instance.outcome == 'success' && inputs.DESTROY == false
  run: |
    zip -P "${{ secrets.ZIP_ARTIFACTS_PASSWORD }}" -r $ALLOCATOR_PATH.zip $ALLOCATOR_PATH
  
- name: Upload Allocator directory as artifact
  if: always() && steps.compress_allocator_files.outcome == 'success' && inputs.DESTROY == false
  uses: actions/upload-artifact@v4
  with:
    name: allocator-instance
    path: ${{ env.ALLOCATOR_PATH }}.zip

The new development was tested:

SSH connection problem

The SSH connection problem was solved by modifying the Ansible playbook. It seems that, if the playbook takes a considerable time to execute, Ansible closes the connection. This was solved with the following code:

- name: Install assistant installer
  command: "bash {{ script_name }} -a -v"
  args:
    chdir: "{{ script_path }}"
  register: install_results
  async: 500
  poll: 5

The new development was tested on the OSs where the playbooks were failing:

davidcr01 (Contributor) commented
Update Report

The current state of the migration of the Test_unattended pipeline is the following:

  • Ubuntu 16 fails with the following error:
TASK [Gather facts] ************************************************************
fatal: [ec2-18-204-213-215.compute-1.amazonaws.com]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"}, "changed": false, "msg": "ansible-core requires a minimum of Python2 version 2.7 or Python3 version 3.6. Current version: 3.5.2 (default, Jul 10 2019, 11:58:48) [GCC 5.4.0 20160609]"}
  • Ubuntu 20 fails with the following error:
TASK [Set up Python 3.9 repository] ********************************************
fatal: [ec2-44-212-63-189.compute-1.amazonaws.com]: FAILED! => {"changed": false, "msg": "failed to fetch PPA information, error was: Connection failure: The read operation timed out"}

@wazuhci wazuhci moved this from On hold to In progress in Release 4.10.0 Sep 2, 2024
davidcr01 (Contributor) commented Sep 3, 2024

Update Report

Ubuntu 20 problem ✔️

As Python 3.9 was only being installed on Ubuntu Jammy (22.04) while the repository to install Python 3.9 was being added on every Ubuntu distribution, I changed the conditional so that the repository is only added on Ubuntu Jammy. These tasks were grouped in a block. Now, the GHA passes successfully: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10678513202/job/29595709184
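A minimal sketch of the grouped block (hypothetical; the actual repository variable and task names may differ from the real commit):

- name: Install Python 3.9 on Ubuntu Jammy
  block:
    - name: Set up Python 3.9 repository
      apt_repository:
        repo: "{{ python39_repository }}"   # hypothetical variable holding the PPA
    - name: Install Python 3.9
      apt:
        name: python3.9
        update_cache: yes
  when: ansible_distribution_release == 'jammy'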

Ubuntu 18 problem ✔️

Ubuntu 18 (Bionic) presented another problem related to pip installation:

"ERROR: This script does not work on Python 3.6. The minimum supported Python version is 3.8. Please use https://bootstrap.pypa.io/pip/3.6/get-pip.py instead.", "stdout_lines": ["ERROR: This script does not work on Python 3.6. The minimum supported Python version is 3.8. Please use https://bootstrap.pypa.io/pip/3.6/get-pip.py instead."

Then, I tried changing the link as the error suggested, and the result was another error:

root@ip-172-31-85-170:/home/ubuntu# curl  https://bootstrap.pypa.io/pip/3.6/get-pip.py | python3 -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2108k  100 2108k    0     0  39.6M      0 --:--:-- --:--:-- --:--:-- 39.6M
Traceback (most recent call last):
  File "<stdin>", line 27079, in <module>
  File "<stdin>", line 137, in main
  File "<stdin>", line 113, in bootstrap
  File "<stdin>", line 94, in monkeypatch_for_cert
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/commands/__init__.py", line 9, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/cli/base_command.py", line 13, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/cli/cmdoptions.py", line 23, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/cli/parser.py", line 12, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/configuration.py", line 26, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/utils/logging.py", line 13, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/utils/misc.py", line 40, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/locations/__init__.py", line 14, in <module>
  File "/tmp/tmphpdgbd10/pip.zip/pip/_internal/locations/_distutils.py", line 9, in <module>
ModuleNotFoundError: No module named 'distutils.cmd'

Installing the python3-pip package with the package manager solved the problem. A new task was added to the provision playbook. Now, the GHA passes successfully: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10678513202/job/29595709184
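A sketch of the added task, assuming the apt module and a Bionic-only condition (the exact condition used in the playbook is not shown here):

- name: Install pip with the package manager
  apt:
    name: python3-pip
    update_cache: yes
  when: ansible_distribution_release == 'bionic'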

Ubuntu 16 problem ✔️

In Ubuntu 16, Python 3.6 is needed. Previously, we could use the deadsnakes repository, but it removed support for Ubuntu 16: deadsnakes/issues#195. To continue, I came up with two possible solutions:

  1. Compile Python 3.6: needs more dependencies and takes a significant time to compile. The tested commands were the following:
sudo apt install build-essential checkinstall
sudo apt install libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
wget https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tar.xz
tar xvf Python-3.6.0.tar.xz
cd Python-3.6.0/
./configure
sudo make altinstall
  2. Add a third-party repository, similar to the deadsnakes repository, and install Python 3.6 with the package manager. This option was chosen due to its simplicity.
add-apt-repository -y ppa:jblgf0/python
apt-get update
apt-get install -y python3.6

Also, it was necessary to create a link to python3.6 to avoid setting the interpreter in the workflow, to install python3-apt, and to create symlinks to some shared libraries: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10683445779/job/29611531143
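A sketch of that extra interpreter setup (the exact shared-library symlinks are not listed in the report):

sudo ln -sf /usr/bin/python3.6 /usr/bin/python3   # avoid setting the interpreter in the workflow
sudo apt-get install -y python3-apt               # needed by the apt-related Ansible modules
# plus symlinks for some shared libraries; see the linked run for the exact targets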

davidcr01 (Contributor) commented
Update Report - Tier assistant test

I started to develop the tier of the assistant test. This workflow will launch several runs of the assistant test workflow: #46

Some things to highlight:

  • The input parameters to choose the operating systems to test have been changed to a single string parameter that accepts a list of numbers separated by commas. This is due to the maximum limit of 10 parameters in the workflow form. To avoid errors when writing the operating system names, they have been encoded: instead of writing the names, the numbers associated with them are written (see the sketch after this list).
  • Several padding characters had to be introduced in the description of the operating systems input parameter so that the encoding is easy to read, since the description parameter does not support any formatting (we tried with \n and with <br>).
  • The blank spaces had to be removed from the operating system names, since they caused problems when resolving the operating systems associated with the numbers. This has impacted the Test_installation_assistant.yml workflow.
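A sketch of how such an encoded input could be expanded in the workflow (the number-to-OS mapping shown here is hypothetical):

# SYSTEMS="1,3" would select Ubuntu22 and CentOS8 with this mapping
declare -A OS_MAP=( [1]="Ubuntu22" [2]="Ubuntu20" [3]="CentOS8" )
for num in ${SYSTEMS//,/ }; do
  echo "Selected OS: ${OS_MAP[$num]}"
done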

In a first test, the workflow executed correctly, parsing the operating systems and launching different runs of the assistant workflow: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10696903344/job/29653020167

This execution has launched the following executions:

Note

The assistant test workflows have failed because they use a branch based on 4.10.0, and there are no 4.10.0 packages to install yet.

Now working on making the tier wait for the child runs.

davidcr01 (Contributor) commented
After talking with the team leaders, we are going to change the approach. We want to test whether it is possible to unify Test_unattended and Test_unattended_tier into a single workflow. This workflow would launch several jobs, each representing a selected OS. This avoids developing the tier workflow, which is somewhat difficult because of the management of the running Test_unattended executions and the multiple GitHub API calls.

davidcr01 (Contributor) commented
Update Report

The Test_installation_assistant.yml workflow has been modified to launch multiple jobs depending on the selected OSs. This way, we can unify the test and its tier in one single workflow. This has been achieved by using the matrix strategy:

jobs:
  run-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false    # If a job fails, the rest of jobs will not be cancelled
      matrix:
        system: ${{ fromJson(inputs.SYSTEMS) }}

Then, the steps below are executed in different jobs, where each job runs the test with one of the selected OSs.
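For example, an input such as SYSTEMS='["Ubuntu22", "CentOS8"]' expands into two parallel run-test jobs, one per matrix.system value, and fail-fast: false keeps the remaining jobs running when one of them fails.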

davidcr01 (Contributor) commented
Update Report - Distributed

Currently working on the distributed pipeline migration.

davidcr01 (Contributor) commented
Update Report - Distributed

Jenkins pipeline explanation

We have decided to rework the distributed pipeline. The Jenkins pipeline performed the following installation:

  • It provisioned 10 instances, one instance per available OS.
  • It created a big inventory and a big config.yml config file with every instance.
  • On each instance, it installed the Wazuh indexer, manager and dashboard.
  • When installing the Wazuh manager, one of the instances (CentOS7) was the master node of the Wazuh server cluster, and the rest of the instances installed worker nodes.

This is a weird scenario: the distributed installation is checked, but we do not think it is a realistic scenario to test. Instead, we want to rework this test, taking into account that this would be done in 5.0.0. We are also doing this to avoid the development effort and time it would take to migrate the old Jenkins pipeline exactly, when it does not seem very practical.

Possible workaround

The proposed solution is to rework this pipeline. If we want to test the distributed installation using the assistant, we will perform the steps given in the Wazuh indexer, Wazuh server and Wazuh dashboard documentation for this kind of installation. The infrastructure would be the following:

  • For each OS selected in the pipeline:
    • Three different instances will be deployed using the allocator.
    • The three instances will have the same OS installed.
    • In one of them, a Wazuh indexer node and Wazuh manager master node will be installed.
    • In one of them, a Wazuh indexer node and Wazuh manager worker node will be installed.
    • In one of them, a Wazuh indexer node, Wazuh manager worker node and the Wazuh dashboard will be installed.

Important

This scenario does not test the connectivity between instances of different OSs. We should review if we also want to test this scenario or not.

Thus, the workflow will perform the following tasks:

  • General provision of the runner: repository cloning, Ansible installation, Allocator setup...
  • Provision three instances using the allocator.
  • Get the private IPs of the three instances. This will be done using the AWS CLI and the describe-instances option (see the sketch after this list).
  • Build an inventory with the information of the three instances. The inventory should have a variable, such as node_type, to identify which node will install the manager master node or the manager worker node.
  • Generate a config.yml file with the instances information and generate the certificates using the assistant.
  • Copy the certificates to the three instances.
  • Perform the Wazuh indexer, Wazuh manager, and Wazuh dashboard installation conditionally (controlled by the inventory), including the Wazuh indexer cluster initialization.
  • Test the installation with the Python test.
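A sketch of the private-IP lookup mentioned in the list above (the tag filter is hypothetical; the real workflow may identify the instances differently):

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=gha_${GITHUB_RUN_ID}_${TEST_NAME}_*" \
  --query 'Reservations[].Instances[].PrivateIpAddress' \
  --output text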

teddytpc1 (Member, Author) commented
@davidcr01.

This scenario does not test the connectivity between instances of different OSs. We should review if we also want to test this scenario or not.

This scenario will not be tested for now.

davidcr01 (Contributor) commented Sep 10, 2024

Update Report - Distributed

The tasks I have performed are the following:

  • General provision of the runner: repository cloning, Ansible installation, Allocator setup...
  • Provision three instances using the allocator.
  • Build an inventory with the information of the three instances. The inventory should have a variable, such as node_type, to identify which node will install the manager master node or the manager worker node.
  • Get the private IPs of the three instances. This will be done using the AWS CLI and the describe-instances option.
  • Generate a config.yml file with the instance information and generate the certificates using the assistant.
  • Copy the certificates to the three instances.
  • Perform the Wazuh indexer, Wazuh manager, and Wazuh dashboard installation conditionally (controlled by the inventory), including the Wazuh indexer cluster initialization.
  • Test the installation using the Python test.

Allocating instances ✔️

Related to instance allocation and destruction, these tasks have been parallelized to provision and destroy the machines simultaneously. This makes the workflow spend less time on these tasks.

# Provision instance in parallel
(
python3 wazuh-automation/deployability/modules/allocation/main.py \
  --action create --provider aws --size large \
  --composite-name ${{ env.COMPOSITE_NAME }} \
  --working-dir $ALLOCATOR_PATH --track-output $ALLOCATOR_PATH/track_${instance_name}.yml \
  --inventory-output $ALLOCATOR_PATH/inventory_${instance_name}.yml \
  --instance-name gha_${{ github.run_id }}_${{ env.TEST_NAME }}_${instance_name} --label-team devops --label-termination-date 1d
  ...
) &
done
# Wait for all provisioning tasks to complete

wait
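Each provisioning command runs in a background subshell (the trailing &), and the final wait blocks until every background job has finished, so the workflow continues only once all instances are up.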

Note

This strategy can be used in other GHAs where we need to deploy several instances using the allocator.

Generating the certificates ✔️

The certificate generation was done using a Jinja2 template, similar to the old Jenkins pipeline, due to the complexity of building the config.yml file in bash while accounting for the different machines.

The Jinja2 template:

nodes:
  # Wazuh indexer nodes
  indexer:
{% for indexer in groups['indexers'] %}
    - name: {{ hostvars[indexer]['inventory_hostname'] }}
      ip: "{{ hostvars[indexer]['private_ip'] }}"
{% endfor %}
  server:
{% for manager in groups['managers'] %}
    - name: {{ hostvars[manager]['inventory_hostname'] }}
      ip: "{{ hostvars[manager]['private_ip'] }}"
      node_type: "{{ hostvars[manager]['manager_type'] }}"
{% endfor %}
  dashboard:
{% for dashboard in groups['dashboards'] %}
    - name: {{ hostvars[dashboard]['inventory_hostname'] }}
      ip: "{{ hostvars[dashboard]['private_ip'] }}"
{% endfor %}
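A minimal sketch of an inventory this template would consume (hypothetical host names and IPs; the group and variable names are taken from the template):

indexers:
  hosts:
    instance1: { private_ip: 10.0.0.1 }
    instance2: { private_ip: 10.0.0.2 }
    instance3: { private_ip: 10.0.0.3 }
managers:
  hosts:
    instance1: { private_ip: 10.0.0.1, manager_type: master }
    instance2: { private_ip: 10.0.0.2, manager_type: worker }
    instance3: { private_ip: 10.0.0.3, manager_type: worker }
dashboards:
  hosts:
    instance3: { private_ip: 10.0.0.3 }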

Wazuh server installation ⛏️

Now working on the Wazuh server installation on the several nodes. I found the following problem:

curl -k -s -X POST -u wazuh-wui:wazuh-wui 'https://127.0.0.1:55000/security/user/authenticate/run_as?raw=true' -d '{\"user_name\":\"wzread\"}' -H content-type:application/json\n+ TOKEN='{\"title\": \"Wazuh Cluster Error\", \"detail\": \"Worker node is not connected to master\", \"remediation\": \"Check the cluster.log located at WAZUH_HOME/logs/cluster.log file to see if there are connection errors. Restart the `wazuh-manager` service.\", \"error\": 3023

It seems that the Wazuh manager worker nodes are failing the manager check. At first, I thought this problem was related to the simultaneous installation of the Wazuh manager, which would lead a worker to connect to a not-yet-installed Wazuh manager master node. After debugging this behavior, I verified that this was not the problem.

Important

To check this, I installed the Wazuh manager master node in one playbook and the Wazuh manager worker nodes in another playbook (executed long after), and the result was the same. Evidence: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10794298716/job/29938263624

I have asked @wazuh/devel-pyserver and @wazuh/devel-cppserver about the Wazuh API check, specifically whether it is necessary to perform an API call to every Wazuh manager node or only to the Wazuh manager master node.

@wazuhci wazuhci moved this from In progress to Blocked in Release 4.10.0 Sep 11, 2024
davidcr01 (Contributor) commented
Update Report - Distributed

Wazuh server installation ⛏️

There was a problem in this stage, reported in #51. The related PR aims to fix the manager check service in the distributed deployment.

With the PR, the Wazuh server installation could be done simultaneously on every node (master and workers). However, we want to replicate the most common scenario, which is installing the Wazuh manager nodes sequentially, as specified in the documentation:

If you want a Wazuh server multi-node cluster, repeat this process on every Wazuh server node.

Now working on a new logic: the Wazuh worker nodes should wait for the Wazuh master node to be installed before starting their own installation.

@wazuhci wazuhci moved this from Blocked to In progress in Release 4.10.0 Sep 12, 2024
davidcr01 (Contributor) commented Sep 13, 2024

Update Report - Distributed

Wazuh server installation ✔️

I finally developed a logic in which the Wazuh manager worker nodes wait for the Wazuh manager master node to be installed:

- name: Install Wazuh server on master
  block:
    - name: Install Wazuh server (Master)
      command: "bash {{ tmp_path }}/wazuh-install.sh -ws {{ inventory_hostname }} -v"
      register: wazuh
    
    - name: Save Wazuh installation log (Master)
      blockinfile:
        marker: ""
        path: "{{ test_dir }}/{{ test_name }}_{{ inventory_hostname }}.log"
        block: |
          {{ wazuh.stderr }}
          --------------------------------
          {{ wazuh.stdout }}
  when: hostvars[inventory_hostname].manager_type == 'master'
  
- name: Install Wazuh server on worker nodes
  block:
    - name: Wait for Wazuh master to be ready on port {{ check_port }}
      wait_for:
        host: "{{ master_ip }}"
        port: "{{ check_port }}"
        delay: "{{ delay }}"
        timeout: 300
      async: 500
      poll: 5

    - name: Install Wazuh server (Workers)
      command: "bash {{ tmp_path }}/wazuh-install.sh -ws {{ inventory_hostname }} -v"
      register: wazuh
  when: hostvars[inventory_hostname].manager_type == 'worker'

Here, the workers poll the master node's API (port 55000), and when the port is open and ready, the worker nodes start their installation. Successful GHA here: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10833326420/job/30059832055

Wazuh dashboard installation ✔️

The Wazuh dashboard installation follows the same logic as the indexer installation. Successful GHA here: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10833846856/job/30061601201

Python tests execution ✔️

The distributed_test.yml playbook has been adapted to the infrastructure deployed in the distributed test. The Python test playbook is executed successfully: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10849996386/job/30110397969

Extra

  • The output mode of the Ansible playbooks has been modified for better human readability. The strategy is to install the community.general collection in the runner with the ansible-galaxy collection install community.general command and to set the ANSIBLE_STDOUT_CALLBACK variable in every playbook execution, as sketched below.
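A sketch of the two pieces (the yaml callback from community.general is one option; the report does not name the exact callback used):

ansible-galaxy collection install community.general
ANSIBLE_STDOUT_CALLBACK=community.general.yaml ansible-playbook playbook.yml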

davidcr01 (Contributor) commented Sep 16, 2024

Update Report - Distributed

Random remoted error

The development of the pipeline is finished and it is working as expected: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10881208924

However, it seems that some jobs are failing because the Python test is detecting a cluster error in some tests. The error is wazuh-remoted: ERROR: Unable to connect to socket 'queue/db/wdb'. For example: https://github.com/wazuh/wazuh-installation-assistant/actions/runs/10880747600/job/30188237777

fatal: [worker1]: FAILED! => changed=true 
  cmd:
  - python3
  - -m
  - pytest
  - --tb=long
  - test_installation_assistant.py
  - -v
  - -m
  - wazuh or indexer or indexer_cluster
  delta: '0:00:03.545409'
  end: '2024-09-16 09:18:06.362740'
  msg: non-zero return code
  rc: 1
  start: '2024-09-16 09:18:02.817331'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    ============================= test session starts ==============================
    platform linux -- Python 3.9.20, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /usr/bin/python3
    cachedir: .pytest_cache
    rootdir: /tmp/test/tests/install, configfile: pytest.ini
    collecting ... collected 22 items / 5 deselected / 17 selected
  
    test_installation_assistant.py::test_check_wazuh_manager_authd PASSED    [  5%]
    test_installation_assistant.py::test_check_wazuh_manager_db PASSED       [ 11%]
    test_installation_assistant.py::test_check_wazuh_manager_execd PASSED    [ 17%]
    test_installation_assistant.py::test_check_wazuh_manager_analysisd PASSED [ 23%]
    test_installation_assistant.py::test_check_wazuh_manager_syscheckd PASSED [ 29%]
    test_installation_assistant.py::test_check_wazuh_manager_remoted PASSED  [ 35%]
    test_installation_assistant.py::test_check_wazuh_manager_logcollec PASSED [ 41%]
    test_installation_assistant.py::test_check_wazuh_manager_monitord PASSED [ 47%]
    test_installation_assistant.py::test_check_wazuh_manager_modulesd PASSED [ 52%]
    test_installation_assistant.py::test_check_wazuh_manager_apid PASSED     [ 58%]
    test_installation_assistant.py::test_check_filebeat_process PASSED       [ 64%]
    test_installation_assistant.py::test_check_indexer_process PASSED        [ 70%]
    test_installation_assistant.py::test_check_indexer_cluster_status_not_red PASSED [ 76%]
    test_installation_assistant.py::test_check_indexer_cluster_status_not_yellow PASSED [ 82%]
    test_installation_assistant.py::test_check_wazuh_api_status PASSED       [ 88%]
    test_installation_assistant.py::test_check_log_errors FAILED             [ 94%]
    test_installation_assistant.py::test_check_alerts PASSED                 [100%]
  
    =================================== FAILURES ===================================
    ____________________________ test_check_log_errors _____________________________
  
        @pytest.mark.wazuh
        def test_check_log_errors():
            found_error = False
            exceptions = [
                'WARNING: Cluster error detected',
                'agent-upgrade: ERROR: (8123): There has been an error executing the request in the tasks manager.',
                "ERROR: Could not send message through the cluster after '10' attempts"
  
            ]
  
            with open('/var/ossec/logs/ossec.log', 'r') as f:
                for line in f.readlines():
                    if 'ERROR' in line:
                        if not any(exception in line for exception in exceptions):
                            found_error = True
                            break
    >       assert found_error == False, line
    E       AssertionError: 2024/09/16 09:14:54 wazuh-remoted: ERROR: Unable to connect to socket 'queue/db/wdb'.
    E
    E       assert True == False
    E         +True
    E         -False

I have observed the following:

  • The error is not related to the system, as sometimes the error is generated and sometimes not (in the same system)
  • The error is not related to the worker or master node. Sometimes the error is generated in the master node, and sometimes in the worker nodes.
  • I tried to reproduce this manually without success.
  • The error seems to be random

Now I'm trying to get a test run in the failure state and enter the environment to investigate the logs and the configuration.

Investigation

After reproducing the error with the GHA, I could access the logs and configuration of the Wazuh manager node. It seems that everything is correct, and the logs are the following:

2024/09/16 10:49:42 wazuh-remoted: INFO: Started (pid: 57643). Listening on port 1514/TCP (secure).
2024/09/16 10:49:42 wazuh-remoted: INFO: (1410): Reading authentication keys file.
2024/09/16 10:50:00 wazuh-remoted: INFO: (1225): SIGNAL [(15)-(Terminated)] Received. Exit Cleaning...
2024/09/16 10:50:10 wazuh-remoted: INFO: Started (pid: 59627). Listening on port 1514/TCP (secure).
2024/09/16 10:50:10 wazuh-remoted: INFO: Cannot find 'queue/db/wdb'. Waiting 1 seconds to reconnect.
2024/09/16 10:50:11 wazuh-remoted: INFO: Cannot find 'queue/db/wdb'. Waiting 2 seconds to reconnect.
2024/09/16 10:50:13 wazuh-remoted: INFO: Cannot find 'queue/db/wdb'. Waiting 3 seconds to reconnect.
2024/09/16 10:50:16 wazuh-remoted: ERROR: Unable to connect to socket 'queue/db/wdb'.
2024/09/16 10:50:16 wazuh-remoted: ERROR: Unable to connect to socket 'queue/db/wdb'.
2024/09/16 10:50:16 wazuh-remoted: ERROR: Error querying Wazuh DB to get agent's groups.
2024/09/16 10:50:16 wazuh-remoted: INFO: (1410): Reading authentication keys file.
2024/09/16 10:53:00 wazuh-remoted: INFO: (1225): SIGNAL [(15)-(Terminated)] Received. Exit Cleaning...
2024/09/16 10:53:11 wazuh-remoted: INFO: Started (pid: 65284). Listening on port 1514/TCP (secure).
2024/09/16 10:53:11 wazuh-remoted: INFO: (1410): Reading authentication keys file.

Although the error is reported, the cluster seems to be working correctly.

Tip

After asking the cppserver team, it seems that the error is known and the following PR was merged into 4.9.1: wazuh/wazuh#25598. Note that the error is generated after the reconnection attempts are exhausted, so increasing the number of attempts may solve the problem. We can conclude that the error is not related to the test/deployment.

@davidcr01 davidcr01 linked a pull request Sep 16, 2024 that will close this issue
@wazuhci wazuhci moved this from In progress to Pending review in Release 4.10.0 Sep 16, 2024
@wazuhci wazuhci moved this from Pending review to Done in Release 4.10.0 Sep 18, 2024