AZP/TEST: Add MAD tests #9735

Merged: 1 commit, Apr 15, 2024
112 changes: 112 additions & 0 deletions buildlib/pr/mad_tests.yml

I would like to see separate setup and build stages instead of having them hidden inside a test stage.

Contributor Author
Done

@@ -0,0 +1,112 @@
jobs:
  - job: SetupServer
    displayName: Setup Server
    pool:
      name: MLNX
      demands: mad_server
    workspace:
      clean: outputs
    steps:
      - checkout: self
        clean: true
        fetchDepth: 100
        retryCountOnTaskFailure: 5
      - task: Bash@3
        name: Set_Vars
        inputs:
          targetType: "inline"
          script: |
            source ./buildlib/tools/test_mad.sh
            set_vars
        displayName: Set Vars
      - bash: |
          source ./buildlib/tools/test_mad.sh
          build_ucx_in_docker
          docker_run_srv
        displayName: Setup Server

  - job: SetupClient
    displayName: Setup Client
    pool:
      name: MLNX
      demands: mad_client
    workspace:
      clean: outputs
    steps:
      - checkout: self
        clean: true
        fetchDepth: 100
        retryCountOnTaskFailure: 5
      - bash: |
          source ./buildlib/tools/test_mad.sh
          build_ucx
        displayName: Setup Client

  - job: TestLid
    dependsOn:
      - SetupServer
      - SetupClient
    displayName: Test Lid
    timeoutInMinutes: 10
    pool:
      name: MLNX
      demands: mad_client
    variables:
      LID: $[ dependencies.SetupServer.outputs['Set_Vars.LID'] ]
      HCA: $[ dependencies.SetupServer.outputs['Set_Vars.HCA'] ]
    steps:
      - checkout: none
      - bash: |
          source ./buildlib/tools/test_mad.sh
          run_mad_test lid:$(LID)
        env:
          HCA: $(HCA)
        displayName: Test LID

  - job: ServerRestart
    dependsOn: TestLid
    displayName: Server Restart
    pool:
      name: MLNX
      demands: mad_server
    steps:
      - checkout: none
      - bash: |
          source ./buildlib/tools/test_mad.sh
          docker_run_srv
        displayName: Server Restart

  - job: TestGuid
    dependsOn:
      - SetupServer
      - ServerRestart
    displayName: Test Guid
    timeoutInMinutes: 10
    pool:
      name: MLNX
      demands: mad_client
    variables:
      GUID: $[ dependencies.SetupServer.outputs['Set_Vars.GUID'] ]
      HCA: $[ dependencies.SetupServer.outputs['Set_Vars.HCA'] ]
    steps:
      - checkout: none
      - bash: |
          source ./buildlib/tools/test_mad.sh
          run_mad_test guid:$(GUID)
        env:
          HCA: $(HCA)
        displayName: Test GUID

  - job: ServerStop
    dependsOn: TestGuid
    displayName: Server Stop
    condition: always()
    pool:
      name: MLNX
      demands: mad_server
    steps:
      - checkout: none
      - bash: |
          source ./buildlib/tools/test_mad.sh
          docker_stop_srv
        displayName: Server Stop
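
Note on the variable hand-off in this file: TestLid and TestGuid read LID, GUID and HCA through `dependencies.SetupServer.outputs['Set_Vars.<name>']`. That lookup relies on the producing step having `name: Set_Vars` and on the script emitting each value with `isOutput=true`. A minimal sketch of the producing side, assuming a hypothetical helper `set_output_var` (not part of this PR) and illustrative values:

```shell
#!/bin/bash
# Hypothetical helper, not part of this PR: print the Azure Pipelines logging
# command that publishes NAME=VALUE as an output variable of the current step.
set_output_var() {
    local name="$1" value="$2"
    echo "##vso[task.setvariable variable=${name};isOutput=true]${value}"
}

# Illustrative values only; the real script derives them from ibstat output.
set_output_var LID 5
set_output_var HCA mlx5_0:1
```

The agent picks these lines up from the step's standard output, so the helper only has to print them; a dependent job must also list the producing job under `dependsOn` for the `dependencies.…` expression to resolve.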
9 changes: 9 additions & 0 deletions buildlib/pr/main.yml
@@ -255,6 +255,15 @@ stages:
          long_test: $(long_test)
          test_static: $(test_static)

  - stage: ucx_perftest_mad_rte
    dependsOn: [Static_check]
    displayName: ucx_perftest over MAD RTE
    lockBehavior: sequential
    variables:
      - group: concurrency_lock
    jobs:
      - template: mad_tests.yml

  - stage: WireCompat
    dependsOn: [Static_check]
    jobs:
81 changes: 81 additions & 0 deletions buildlib/tools/test_mad.sh
@@ -0,0 +1,81 @@
#!/bin/bash
set -exE -o pipefail

IMAGE="rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel8.2/builder:mofed-5.0-1.0.0.0"

if [ -z "$BUILD_SOURCESDIRECTORY" ]; then
    echo "Not running in Azure"
    exit 1
fi
cd "$BUILD_SOURCESDIRECTORY"

build_ucx() {
    ./autogen.sh
    ./contrib/configure-release \
        --prefix="$PWD"/install \
        --with-mad \
        --without-valgrind \
        --without-go \
        --without-java
    make -s -j"$(nproc)"
    make install
}

build_ucx_in_docker() {
    docker run --rm \
        --name ucx_build_"$BUILD_BUILDID" \
        -e BUILD_SOURCESDIRECTORY="$BUILD_SOURCESDIRECTORY" \
        -v "$PWD":"$PWD" -w "$PWD" \
        -v /hpc/local:/hpc/local \
        $IMAGE \
        bash -c "source ./buildlib/tools/test_mad.sh && build_ucx"

    sudo chown -R swx-azure-svc:ecryptfs "$PWD"
}

docker_run_srv() {
    HCA=$(detect_hca)
    docker_stop_srv
    docker run --rm \
        --detach \
        --net=host \
        --name ucx_perftest_"$BUILD_BUILDID" \
        -e BUILD_SOURCESDIRECTORY="$BUILD_SOURCESDIRECTORY" \
        -v "$PWD":"$PWD" -w "$PWD" \
        -v /hpc/local:/hpc/local \
        --ulimit memlock=-1:-1 --device=/dev/infiniband/ \
        $IMAGE \
        bash -c "${PWD}/install/bin/ucx_perftest -K ${HCA}"
}

docker_stop_srv() {
    docker stop ucx_perftest_"$BUILD_BUILDID" || true

Stopping is not enough; we should remove the container as well. We don't want dangling containers left around all the time.

Contributor Author
We have a daily cleanup of Docker resources.

So we will have many stopped containers every day until they are cleaned? This is bad practice: each run should do its best to clean up after itself. The daily cleanup is an extra safety net, not a replacement.

Contributor Author
Check out the --rm flag.

Contributor Author
Containers started with --rm get removed once they are stopped.
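
Note on this thread: both `docker run` invocations in the script pass `--rm`, so Docker removes each container when it exits and `docker stop` alone should not leave stopped containers behind. A sketch of how a run could check that for itself, assuming a hypothetical helper `stop_and_check_removed` (not part of this PR); it needs a Docker daemon to do anything useful:

```shell
#!/bin/bash
# Hypothetical helper, not part of this PR: stop a container by name and
# report whether Docker still knows a container with that name afterwards.
stop_and_check_removed() {
    local name="$1"
    # Ignore failures, e.g. when the container is not running at all.
    docker stop "$name" >/dev/null 2>&1 || true
    # 'docker container inspect' fails when no such container exists.
    # Assumption: removal triggered by --rm is done by the daemon and may lag
    # the stop slightly, so a real check might want to retry briefly.
    if docker container inspect "$name" >/dev/null 2>&1; then
        echo "leftover"
    else
        echo "removed"
    fi
}
```

For the server container in this PR the call would be `stop_and_check_removed ucx_perftest_"$BUILD_BUILDID"`. A container started without `--rm` would report `leftover` and need an explicit `docker rm`.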

}

set_vars() {
    set +x
    HCA=$(detect_hca)
    # Replace ':' with space for 'ibstat' format
    HCA_DEV=${HCA/:/ }
    # shellcheck disable=SC2086
    LID=$(ibstat $HCA_DEV | grep Base | awk '{print $NF}')
    # shellcheck disable=SC2086

Why disable this? The fix is only adding double quotes around $HCA_DEV.

Contributor Author
Adding double quotes breaks the functionality.
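
Note on this thread: the likely reason is that `HCA_DEV` holds the device and the port as one space-separated string, and `ibstat` takes them as two separate arguments, so the unquoted expansion depends on word splitting. Quoting it passes a single argument instead. A small sketch of the difference, plus an array-based alternative that would satisfy SC2086 without the directive; `count_args` is a hypothetical stand-in for `ibstat`, and `mlx5_0:1` is an illustrative value:

```shell
#!/bin/bash
# Hypothetical stand-in for ibstat: prints how many arguments it received.
count_args() { echo "$#"; }

HCA="mlx5_0:1"          # illustrative value in "device:port" form
HCA_DEV=${HCA/:/ }      # same expansion as in set_vars, gives "mlx5_0 1"

# shellcheck disable=SC2086
count_args $HCA_DEV     # unquoted: split into two arguments, prints 2
count_args "$HCA_DEV"   # quoted: one argument "mlx5_0 1", prints 1

# Alternative: split once into an array, then expand the array quoted.
IFS=' ' read -r -a hca_args <<< "$HCA_DEV"
count_args "${hca_args[@]}"   # prints 2
```

The array form keeps the two-argument call while making the intended split explicit; whether that is worth changing in the script is a judgment call for the maintainers.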

    GUID=$(ibstat $HCA_DEV | grep GUID | awk '{print $NF}')
    echo "##vso[task.setvariable variable=LID;isOutput=true]$LID"
    echo "##vso[task.setvariable variable=GUID;isOutput=true]$GUID"
    echo "##vso[task.setvariable variable=HCA;isOutput=true]$HCA"
    echo "LID: $LID"
    echo "GUID: $GUID"
    echo "HCA: $HCA"
}

run_mad_test() {
    local ib_address="$1"
    sudo chmod 777 /dev/infiniband/umad*
    "$PWD"/install/bin/ucx_perftest -t tag_bw -e -K "$HCA" -e "$ib_address"
}

detect_hca() {
    # Detect first active HCA port
    ibv_devinfo | awk '/hca_id:/ {hca=$2} /port:/ {port=$2} /PORT_ACTIVE/ {print hca ":" port; exit}'
}
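
Note on `detect_hca`: the awk program remembers the most recent `hca_id:` and `port:` values and prints them as `device:port` at the first line containing `PORT_ACTIVE`, then exits. A sketch that runs the same program on canned text so it can be tried without InfiniBand hardware. The function names are hypothetical, and the sample is abbreviated, made-up text in the general shape of `ibv_devinfo` output:

```shell
#!/bin/bash
# Same awk program as detect_hca, but reading from stdin instead of running
# ibv_devinfo. The function name is hypothetical.
detect_hca_from_stdin() {
    awk '/hca_id:/ {hca=$2} /port:/ {port=$2} /PORT_ACTIVE/ {print hca ":" port; exit}'
}

# Illustrative sample only: device names and port states are invented.
sample_devinfo() {
    cat <<'EOF'
hca_id: mlx5_0
    port:   1
        state:      PORT_DOWN (1)
hca_id: mlx5_1
    port:   1
        state:      PORT_DOWN (1)
    port:   2
        state:      PORT_ACTIVE (4)
EOF
}

sample_devinfo | detect_hca_from_stdin   # prints mlx5_1:2
```

Ports that are down are skipped because only the `PORT_ACTIVE` line triggers the print, and the `exit` stops at the first active port even if later ones are active too.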