Install Lustre Client fails on VT1 #85

Closed
gmarchand opened this issue Nov 29, 2023 · 4 comments

Comments

@gmarchand

gmarchand commented Nov 29, 2023

Description: The Lustre client cannot be installed on the AMD Xilinx Video SDK AMI with ECS support for VT1 Instances (AL2), although the same user data works on the Amazon ECS-Optimized Amazon Linux 2 (AL2) x86_64 AMI.

AMI used: AMD Xilinx Video SDK AMI with ECS support for VT1 Instances (AL2):
https://aws.amazon.com/marketplace/pp/prodview-phvk6d4mq3hh6

User data used by the EC2 launch template:

#!/bin/bash -ex

exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

uname -r

fsx_dnsname=%DNS_NAME%
fsx_mountname=%MOUNT_NAME%
fsx_mountpoint=%MOUNT_POINT%

amazon-linux-extras install -y lustre2.10
mkdir -p "$fsx_mountpoint"
mount -t lustre -o relatime,flock ${fsx_dnsname}@tcp:/${fsx_mountname} ${fsx_mountpoint}
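
Because the script runs with `-e`, a single failed `amazon-linux-extras install` aborts the whole user data. A minimal sketch, assuming the failure is transient (the thread does not establish that), wraps the install in a retry helper. The `retry` function, its attempt count, and its delay are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script):
# run a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  # "until" keeps re-running the command until it exits with status 0.
  until "$@"; do
    if (( n >= attempts )); then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    n=$(( n + 1 ))
  done
}
```

In the user data above, the install line would then read `retry 5 10 amazon-linux-extras install -y lustre2.10`. A retry only helps with temporary conditions such as a busy yum lock or a short network outage; it does not fix a repository that is permanently unreachable.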

System logs:

[  120.666982] cloud-init[23528]: + exec
[  120.667287] cloud-init[23528]: ++ tee /var/log/user-data.log
[  120.668318] cloud-init[23528]: ++ logger -t user-data -s
<13>Nov 29 11:58:07 user-data: + uname -r
<13>Nov 29 11:58:07 user-data: 4.14.305-227.531.amzn2.x86_64
<13>Nov 29 11:58:07 user-data: + fsx_dnsname=fs-xxx.fsx.eu-west-1.amazonaws.com
<13>Nov 29 11:58:07 user-data: + fsx_mountname=xxx
<13>Nov 29 11:58:07 user-data: + fsx_mountpoint=/fsx-lustre
<13>Nov 29 11:58:07 user-data: + amazon-linux-extras install -y lustre2.10
<13>Nov 29 11:58:09 user-data: Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper
<13>Nov 29 11:58:09 user-data: Existing lock /var/run/yum.pid: another copy is running as pid 23715.
<13>Nov 29 11:58:09 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:09 user-data:   The other application is: yum
<13>Nov 29 11:58:09 user-data:     Memory : 221 M RSS (437 MB VSZ)
<13>Nov 29 11:58:09 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:01 ago
<13>Nov 29 11:58:09 user-data:     State  : Running, pid: 23715
<13>Nov 29 11:58:11 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:11 user-data:   The other application is: yum
<13>Nov 29 11:58:11 user-data:     Memory : 334 M RSS (550 MB VSZ)
<13>Nov 29 11:58:11 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:03 ago
<13>Nov 29 11:58:11 user-data:     State  : Running, pid: 23715
<13>Nov 29 11:58:13 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:13 user-data:   The other application is: yum
<13>Nov 29 11:58:13 user-data:     Memory : 349 M RSS (566 MB VSZ)
<13>Nov 29 11:58:13 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:05 ago
<13>Nov 29 11:58:13 user-data:     State  : Running, pid: 23715
<13>Nov 29 11:58:15 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:15 user-data:   The other application is: yum
<13>Nov 29 11:58:15 user-data:     Memory : 350 M RSS (566 MB VSZ)
<13>Nov 29 11:58:15 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:07 ago
<13>Nov 29 11:58:15 user-data:     State  : Running, pid: 23715
[  OK  ] Started Dynamically Generate Message Of The Day.
<13>Nov 29 11:58:17 user-data: Cleaning repos: amzn2-core amzn2extra-docker amzn2extra-ecs amzn2extra-epel
<13>Nov 29 11:58:17 user-data:               : amzn2extra-lustre2.10 epel
<13>Nov 29 11:58:17 user-data: 34 metadata files removed
<13>Nov 29 11:58:17 user-data: 12 sqlite files removed
<13>Nov 29 11:58:17 user-data: 0 metadata files removed
<13>Nov 29 11:58:17 user-data: Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:  One of the configured repositories failed (Unknown),
<13>Nov 29 11:58:52 user-data:  and yum doesn't have enough cached data to continue. At this point the only
<13>Nov 29 11:58:52 user-data:  safe thing yum can do is fail. There are a few ways to work "fix" this:
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      1. Contact the upstream for the repository and get them to fix the problem.
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      2. Reconfigure the baseurl/etc. for the repository, to point to a working
<13>Nov 29 11:58:52 user-data:         upstream. This is most often useful if you are using a newer
<13>Nov 29 11:58:52 user-data:         distribution release than is supported by the repository (and the
<13>Nov 29 11:58:52 user-data:         packages for the previous distribution release still work).
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      3. Run the command with the repository temporarily disabled
<13>Nov 29 11:58:52 user-data:             yum --disablerepo=<repoid> ...
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      4. Disable the repository permanently, so yum won't use it by default. Yum
<13>Nov 29 11:58:52 user-data:         will then just ignore the repository until you permanently enable it
<13>Nov 29 11:58:52 user-data:         again or use --enablerepo for temporary usage:
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:             yum-config-manager --disable <repoid>
<13>Nov 29 11:58:52 user-data:         or
<13>Nov 29 11:58:52 user-data:             subscription-manager repos --disable=<repoid>
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      5. Configure the failing repository to be skipped, if it is unavailable.
<13>Nov 29 11:58:52 user-data:         Note that yum will try to contact the repo. when it runs most commands,
<13>Nov 29 11:58:52 user-data:         so will have to try and fail each time (and thus. yum will be be much
<13>Nov 29 11:58:52 user-data:         slower). If it is a very temporary problem though, this is often a nice
<13>Nov 29 11:58:52 user-data:         compromise:
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:             yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data: Cannot retrieve metalink for repository: epel/x86_64. Please verify its path and try again
<13>Nov 29 11:58:52 user-data: Installation failed. Check that you have permissions to install.
<13>Nov 29 11:58:52 user-data: Installing lustre-client
[  165.815024] cloud-init[23528]: Nov 29 11:58:52 cloud-init[23528]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-003 [13]
[  165.834238] cloud-init[23528]: Nov 29 11:58:52 cloud-init[23528]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  165.837475] cloud-init[23528]: Nov 29 11:58:52 cloud-init[23528]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
ci-info: no authorized ssh keys fingerprints found for user ec2-user.

The same user data works well with this AMI: Amazon ECS-Optimized Amazon Linux 2 (AL2) x86_64 AMI
https://aws.amazon.com/marketplace/pp/prodview-do6i4ripwbhs2?sr=0-1&ref_=beagle&applicationId=AWSMPContessa
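
The log above fails on `Cannot retrieve metalink for repository: epel/x86_64`, and yum's own message lists `skip_if_unavailable` as one possible workaround. The following is a rough sketch, not a confirmed fix: it only reads a log and prints the command yum suggests, without running it. The function names `failing_repo_id` and `suggest_skip_command` are hypothetical, and the parsing assumes the message format shown in this log.

```shell
#!/bin/bash
# Hypothetical helper: print the id of the first repository that yum
# reported as unreachable, reading a log from stdin.
failing_repo_id() {
  sed -n 's/.*Cannot retrieve metalink for repository: \([^/ ]*\).*/\1/p' \
    | sed 's/\.$//' \
    | head -n 1
}

# Dry run: print (do not execute) the skip_if_unavailable command
# from yum's error message, filled in with the failing repository id.
suggest_skip_command() {
  local repo
  repo=$(failing_repo_id)
  if [ -z "$repo" ]; then
    echo "no failing repository found in log" >&2
    return 1
  fi
  echo "yum-config-manager --save --setopt=${repo}.skip_if_unavailable=true"
}
```

For example, `suggest_skip_command < /var/log/user-data.log` would read the log file that the user data writes. Whether skipping EPEL lets the Lustre install proceed on this AMI is not established in this thread; the author later resolved the problem by upgrading the ECS AMI.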

@NastoohX
Collaborator

Hi,
Sorry for the late reply. From the provided logs, I cannot tie the installation failure to the VT1 AMI. To check whether our packages are the cause, remove the SDK packages on a non-mission-critical system, as per https://xilinx.github.io/video-sdk/v3.0/getting_started_on_vt1.html#installing-the-sdk-on-an-existing-ami, Step 3. Once the SDK is removed, retry your original installation. If it succeeds, re-install the SDK by following the same link. If it still fails, please provide the relevant logs.
Cheers,

@NastoohX
Collaborator

NastoohX commented Feb 5, 2024

Hi,
Closing this ticket due to inactivity. Feel free to reopen if needed.
Cheers,

NastoohX closed this as completed Feb 5, 2024

@gmarchand
Author

Hello @NastoohX, I found the issue: the ECS AMI needs to be upgraded. Here is the reference: aws/amazon-ecs-ami#191

@hifarhanali

Any updates?
