Modifying smoke test to add more advanced validation as requested #1124

Merged
atalman merged 3 commits into pytorch:main from modifying_smoke_test on Sep 20, 2022

Contributor

@atalman atalman commented Sep 7, 2022

Modifying smoke test to add more advanced validation as requested
Fixes: pytorch/pytorch#83519

Please note: the .github/workflows/validate-linux-binaries.yml changes shall be reverted prior to committing
Images in this PR are taken from the torchvision repo

PyTorch smoke tests validate that:
* the torch module can be imported
* a 3x3 convolution works on the available devices (CPU/GPU)

TorchVision smoke tests validate that:
* the torchvision module can be imported
* torchvision can be used to decode JPEG and PNG images
* resnet50 can classify an image of a dog as a dog

Contributor

@malfet malfet left a comment

For all tests, validate that decoding works (i.e. the returned values are not None and have the expected properties).
Do not assume that the test is executed as python foo.py; use Path(__file__).parent to get the path to the folder where the test is located.

Also, why do you need to change the builder checkout to atalman/builder?
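
The first two requests in this review might look like the following in a test file. This is a minimal sketch, not the code that was merged: `assets_dir`, `validate_decoded`, the `assets` folder name, and the expected channel count are all hypothetical, and the validator only assumes that a decoded image exposes `ndim` and `shape` in a channels-first layout.

```python
from pathlib import Path
from types import SimpleNamespace


def assets_dir(test_file: str) -> Path:
    """Folder holding test images, resolved relative to the test file itself.

    Pass __file__ from the test module so the result does not depend on the
    current working directory. The "assets" name is an assumption.
    """
    return Path(test_file).resolve().parent / "assets"


def validate_decoded(img, expected_channels: int = 3) -> None:
    """Fail loudly unless img looks like a decoded C x H x W image."""
    if img is None:
        raise RuntimeError("Decoder returned None")
    if img.ndim != 3:
        raise RuntimeError(f"Expected a 3-dimensional image, got ndim={img.ndim}")
    channels, height, width = img.shape
    if channels != expected_channels or height <= 0 or width <= 0:
        raise RuntimeError(f"Unexpected image shape: {tuple(img.shape)}")


# Stand-in for a decoded image, so the sketch runs without torchvision.
fake_image = SimpleNamespace(ndim=3, shape=(3, 224, 224))
validate_decoded(fake_image)
```

In the test module itself the call would be `assets_dir(__file__)`, and the object passed to `validate_decoded` would be whatever the image decoder returns.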

test/smoke_test/smoke_test.py (outdated)
x = torch.randn(1, 3, 24, 24).cuda()
with torch.cuda.amp.autocast():
    out = conv(x)
print(out.sum())
Contributor

Why? Wouldn't it be better to do something like:

if out.sum() == 42:
    raise RuntimeError("Should not happen")

Contributor Author

@atalman atalman Sep 9, 2022

I am not sure I understand.

conv = nn.Conv2d(3, 3, 3).cuda()
x = torch.randn(1, 3, 24, 24).cuda()
with torch.cuda.amp.autocast():
    out = conv(x)
print(out.sum())

This has a different value every time.

Can you please provide more details on what is required? Maybe a link to a tutorial or an example test file?
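
One way to read the reviewer's suggestion: the sum changes from run to run because both the layer weights and the input are random, so a smoke test can assert on properties of the output rather than print its value. The following is a minimal sketch under that assumption, not the code that was merged. `conv_output_size` and `smoke_test_conv2d` are hypothetical names, checking shape and finiteness is an assumed acceptance criterion, and `torch.autocast` stands in for the `torch.cuda.amp.autocast` call in the snippet.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def smoke_test_conv2d(device: str = "cpu") -> None:
    """Run a 3x3 convolution and check properties of the result, not its value."""
    import torch  # imported lazily so the helper above works without torch

    conv = torch.nn.Conv2d(3, 3, 3).to(device)
    x = torch.randn(1, 3, 24, 24, device=device)
    if device == "cuda":
        with torch.autocast(device_type="cuda"):
            out = conv(x)
    else:
        out = conv(x)

    # A 24x24 input and a 3x3 kernel without padding give a 22x22 output.
    side = conv_output_size(24, kernel=3)
    if tuple(out.shape) != (1, 3, side, side):
        raise RuntimeError(f"Unexpected conv output shape on {device}: {tuple(out.shape)}")
    if not torch.isfinite(out).all():
        raise RuntimeError(f"Conv output on {device} contains NaN or Inf")
```

A caller would run `smoke_test_conv2d("cpu")` unconditionally and `smoke_test_conv2d("cuda")` only when `torch.cuda.is_available()` returns true.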

Contributor

malfet commented Sep 8, 2022

Also, please be consistent in using single vs. double quotes in the file

Contributor Author

atalman commented Sep 9, 2022

For all tests, validate that decoding works (i.e. the returned values are not None and have the expected properties). Do not assume that the test is executed as python foo.py; use Path(__file__).parent to get the path to the folder where the test is located.

Also, why do you need to change the builder checkout to atalman/builder?

Done, removed these changes.

atalman requested a review from malfet on September 9, 2022 20:45
atalman force-pushed the modifying_smoke_test branch 2 times, most recently from aea85ae to d7c4a8a, on September 10, 2022 00:31
Contributor

@malfet malfet left a comment

LGTM, though I left a few nits.

Also, let's stick with the Black style for strings (https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings), i.e. use double quotes by default.

atalman force-pushed the modifying_smoke_test branch from 6374fb9 to b5923db on September 20, 2022 15:45
More vision smoke tests

Temporary pointing to my repo for testing

Try 2 use atalman builder

Modify path

Fixing commits

Testing

Testing

Smoke test modifications

Refactor test code

Fix typo

Fixing image read

A little more refactoring

Addressing comments

Testing
atalman force-pushed the modifying_smoke_test branch from b5923db to 9f0f774 on September 20, 2022 15:46
atalman merged commit 2860f35 into pytorch:main on Sep 20, 2022
jithunnair-amd pushed a commit to jithunnair-amd/builder that referenced this pull request Nov 1, 2022
…torch#1124)

* Modify smoke test matrix

More vision smoke tests

Temporary pointing to my repo for testing

Try 2 use atalman builder

Modify path

Fixing commits

Testing

Testing

Smoke test modifications

Refactor test code

Fix typo

Fixing image read

A little more refactoring

Addressing comments

Testing

* Add same test for windows and macos

* Addressing comments
jithunnair-amd added a commit to ROCm/builder that referenced this pull request Apr 11, 2023
* Make sure package_type is set (pytorch#1139)

* Update check_binary.sh

* Update check_binary.sh

* Modifying smoke test to add more advanced validation as requested (pytorch#1124)

* Modify smoke test matrix

More vision smoke tests

Temporary pointing to my repo for testing

Try 2 use atalman builder

Modify path

Fixing commits

Testing

Testing

Smoke test modifications

Refactor test code

Fix typo

Fixing image read

A little more refactoring

Addressing comments

Testing

* Add same test for windows and macos

* Addressing comments

* Add manywheel special build for including pypi package (pytorch#1142)

* Add manywheel special build

Testing

Builder change

Testing

Adding manywheel cuda workflow

Simplify

Fix expr

* address comments

* checking for general setting

* Pass correct parameters for macos validations (pytorch#1143)

* Revert "Update check_binary.sh"

This reverts commit 6850bed.

* Revert "Update check_binary.sh"

This reverts commit 051b9d1.

* setup periodic test to run binary verification  pytorch/pytorch#84764: (pytorch#1144)

* add a reusable workflow to run all smoke tests/or smoke tests for a specific os/channel
* add workflows to schedule the periodic smoke tests for nightly and release channels

* Update aarch64 script to latest one (pytorch#1146)

* minor: fix the typo job name for windows binaries validation workflow (pytorch#1147)

* fix the typo in the job name for the release binaries validation workflow (pytorch#1148)

issue was introduced in pytorch#1144

* Move to rc2 of 3.11 python (pytorch#1149)

Need it to get several convenience functions

* Integrates CUDA pip wheels (pytorch#1136)

* Refactors rpath to externally set var. Adds mechanism to add metadata

* Sets RUNPATH when using cudnn and cublas wheels

* Escapes dollar sign

* Fix rpath for cpu builds

Co-authored-by: atalman <[email protected]>

* Uses RPATH instead of RUNPATH so that user strictly uses pypi libs (pytorch#1150)

* Binary Validation Workflow - Adding check binary script (pytorch#1127)

* Update action.yml

* Update validate-macos-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Update validate-linux-binaries.yml

* Fix check binary for arm64 (pytorch#1155)

* Fix check binary for arm64

* Update check_binary.sh

Co-authored-by: Nikita Shulga <[email protected]>

Co-authored-by: Nikita Shulga <[email protected]>

* Fix for including nvtx dll and cudart (pytorch#1156)

* Fix for including nvtx dll and cudart

* Fix for include nvtx

* Fix spaces

* Back out inclusion of cudart (pytorch#1157)

* Add cuda and date check to smoke test (pytorch#1145)

* shorten binary validation workflow names, so they are more readable in the HUD and GH job view (pytorch#1159)

* Fix anaconda torchaudio smoke test (pytorch#1161)

* Fix anaconda torchaudio smoke test

* Format using ufmt

* Fix whels tests for torchaudio (pytorch#1162)

* Pin condaforge version

Most recent version fails with invalid cert error when trying to update python

* Option to run resnet classifier on specific device

* Fix typo

`.test/smoke_test` -> `test/smoke_test`

Noticed when pushed pytorch@3b93537 and no tests were run

* Test resnet classifier on CUDA (pytorch#1163)

* [ROCm] support for rocm5.3 wheel builds (pytorch#1160)

* Updates to support rocm5.3 wheel builds (#6)

* Changes to support ROCm 5.3

* Updated as per comments

* Installing python before magma build

- In ROCm 5.3, libtorch builds are failing during the magma build due to a
  missing python binary, so added an install statement

* Move python install to libtorch/Dockerfile (#8)

* Updating the condition for noRCCL build (#9)

* Updating the condition for noRCCL build

* Updated changes as per comments

* Use MIOpen branch for ROCm5.3; Change all conditions to -eq

* Use staging branch of MIOpen for ROCm5.3

* Fix merge conflict

Fix merge conflict

Co-authored-by: Pruthvi Madugundu <[email protected]>
Co-authored-by: Pruthvi Madugundu <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>

* Validate python 3.11 (pytorch#1165)

* Validate python 3.11

* Validate linux binaries change

Add options

Import torchvision

Adding python 3.11 install

pass package to check nightly binaries date

Test

test

Add python 3.11 code

testing

Adding python 3.11 test

Add python 3.11 validation

Adding zlib develop install

Install zlib etc..

Adding zlib1g as well

testing

testing

Adding validate windows binary

Trying to workaround

testing

Refactor smoke test

Add import statement

fix datetime call

* Fix stripping dev

* fix import

* Strip pypi-cudnn from the version.py (pytorch#1167)

* Strip pypi-cudnn from the version.py

* small fix

* Regenerates RECORD file to reflect hash changes caused by sed'ing the version suffix (pytorch#1164)

* Add pypi cudnn package to tests (pytorch#1168)

* Add pypi cudnn package to tests

* Fix pypi installation check

* Fix pypi instructions setting

* Update DEVELOPER_DIR in build_pytorch.sh

Not sure why we are still expecting Xcode9 to be present there, update it to the same folder as wheel builds

May be fixes pytorch/pytorch#87637

* Fix to not use sccache if it's not setup properly (pytorch#1171)

* Revert "Fix to not use sccache if it's not setup properly (pytorch#1171)" (pytorch#1172)

This reverts commit 377efea.

* Remove cuda102 and cuda115 docker builds and regenerate manylinux docker (pytorch#1173)

* Rebuild manywheel

* Remove cuda102 and cuda115

* [aarch64] add mkldnn acl backend build support for pytorch cpu library (pytorch#1104)

* Only push to Docker and Anaconda repo from main (pytorch#1175)

We currently allow pushes from any branch to go to Docker (and Anaconda) prod. This is a dangerous practice because it allows unfinished work to reach prod and be used by other workflows

* Release 1.13 script changes (pytorch#1177)

* Test ResNet on MPS (pytorch#1176)

After pytorch/pytorch#86954 is fixed, we should be able to test resnet on MPS

* Revert "Test ResNet on MPS (pytorch#1176)" (pytorch#1180)

This reverts commit efa1bc7.

* Add v1.13 versions

* Update CMake to 3.18, needed for C++17 compilation (pytorch#1178)

* release: separate out version suffixes for torch pypi promotion (pytorch#1179)

* Fixup wheel published to PyPI (pytorch#1181)

* Fixup wheel published to PyPI

* Update prep_binary_for_pypi.sh

* Fix folder deletion for pypi prep

Co-authored-by: Andrey Talman <[email protected]>

* Update cmake version to 3.18 for libtorch docker

* Pins cuda runtime to 111.7.99 (pytorch#1182)

* Fixes cuda pypi rpaths and libnvrtc name (pytorch#1183)

* Allow ROCm minor releases to use the same MIOpen branch as the major release (pytorch#1170)

* Allow ROCm minor releases to use the same MIOpen branch as the major release

* correct logic to ensure rocm5.4 doesn't fall in wrong condition

* add 11.8 workflow for docker image build (pytorch#1186)

* Using windows runners from test-infra for validation workflows (pytorch#1188)

* Testing new windows runners

test

Testing

Testing

testing

testing

test

Test

Testing

testing

Testing

Testing

test

Test

test

testing

testing

Test

testing

test

testing

testing

testing

testing

testing

testing

test

test

testing

testing

testing

testing

Test

test

test

testing

testing

testing

testing

testing

testing

testing

testing

testing

Refactor code

* Adding details for the test-infra issue

* Update current CUDA supported matrix

* add magma build for CUDA11.8 (pytorch#1189)

* Test setting job name (pytorch#1191)

* Use official Python-3.11 tag (pytorch#1195)

* remove CUDA 10.2-11.5 builds (pytorch#1194)

* remove CUDA 10.2-11.5 builds

* remove 11.5 and 11.3 builds

* build libtorch and manywheel for 11.8 (pytorch#1190)

* build libtorch and manywheel for 11.8

* Update common/install_magma.sh

* use magma-cuda build-1 by default; remove CUDA 10.2-11.5 builds

Co-authored-by: Andrey Talman <[email protected]>

* [Validation] Pass ref:main to general worker (pytorch#1197)

* Pass ref:main to general worker

* Try to pass reference to workflow

* Pass ref:main to general worker

* Test

* Pass reference as input parameter

* Make new variable not required

* Fix typo

* Add workflow for manywheel cpu-cxx11-abi (pytorch#1198)

* [Validation] Use linux_job for linux workers (pytorch#1199)

* Use linux_job for linux workers

Test

Testing

Test

testing

Testing

testing

Change linux binary action

test

Simplify version check

* Fix if statement

* Fix typo

* Fix cuda version check

Fix Audio and Vision version check

Add check binary to libtorch

test

test

testing

testing

testing

Testing

Testing

testing

* Use macos generic workers (pytorch#1201)

* Use macos generic workers

fix workflow

testing

Add arm64 builds

test

Remove validate binary action

* add check binary step

* fix ld_library path

* add package type

* Adding ref to validate binaries (pytorch#1204)

* ROCm5.3 nightly wheels (pytorch#1193)

* Enable ROCm5.3 nightly wheels

* Enable ROCm5.3 docker builds

* Update amdgpu repo url for ROCm5.3

* ROCm5.3 not supported on Ubuntu 18.04

* empty

* Another empty commit

* Try disabling MLIR build to shorten docker build time

* Clean up disk space

* MLIR project changed names from ROCm5.4

* Retrigger CI to get around flaky magma git access error

* One more cmake-3.18.4 update

* Use cmake-3.18 for ROCM builds

* More cmake ROCM tweaks

* cmake-3.18 installation on ROCM (take 3)

* add conda builds for CUDA 11.8 (pytorch#1205)

* Enable nightly CUDA 11.8 builds (pytorch#1206)

* enable nightly builds for CUDA 11.8

* add CUDA 11.8 version to manywheel, remove 11.3 and 11.5

* Windows CUDA 11.8 changes (pytorch#1207)

* Add continue on error to validation jobs (pytorch#1209)

* Add continue on error to validation jobs

* test

* Delete unmaintained torchvision build scripts (pytorch#1210)

All build logic has long moved to torchvision repo and now is executed
by reusable workflow from https://github.com/pytorch/test-infra/tree/main/.github/workflows

* build_pytorch.sh replace tabs with spaces (pytorch#1211)

* Make PyTorch depend on TorchTrition (pytorch#1213)

Remove me when Triton is properly released elsewhere

* Remove smoke test script that is no longer used (pytorch#1212)

* Another tabs-to-spaces change

`s/\t/\ \ \ \ \ \ \ \ /`

* Disable continue on error (pytorch#1214)

* Add torchtrition dependency for wheels

* Make PyTorchConda depend on Triton (Take 2)

Multi-line environment variables are hard, so let's do it the traditional way

* Revert "Add torchtrition dependency for wheels"

This reverts commit 475100b.

* Add TorchTrition dependency for wheels (take 2)

Now tests should be green thanks to pytorch/pytorch#90017

* Add sympy to pytorch linux dependencies

* Mitigate windows nightly build regressions

By pinning conda to 22.9.0

Fixes pytorch/pytorch#90059

* Consolidating validation scripts (pytorch#1219)

* Consolidating validation scripts

* Fix validate script name

* Correct script path

* Correct script path

* test

* testing

* testing

* testing

* testing

* test

* test

* test

* testing

* testc

* test hook

* adding windows use case

* windows use case

* test

* testing

* Windows fixes

* more fixes

* Add package type

* testing more

* Truncate RECORD instead of delete (pytorch#1215)

* Refactor and fix windows smoke tests (pytorch#1218)

* Fix windows smoke test

* Fix first if statement

* Refactor not to cal install nightly package

* Revert "Refactor not to cal install nightly package"

This reverts commit ac580c8.

* Fix pip install command remove cu102

* Refactor the conda installation

* Add cuda profiler apu to cuda install 11.8 (pytorch#1221)

* Update CUDA upgrade runbook to mention subpackages changes

As per following doc: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html

* conda: Add CUDA_HOME, cuda binaries to path (pytorch#1224)

* Refactor macos-arm64 into separate group (pytorch#1226)

* Adding libcufft constraint (pytorch#1227)

* Adding libcufft constraint

* Adding rest of the dependencies

* Advance build number in pytorch-cuda (pytorch#1229)

* Make sympy mandatory dependency of PyTorch

Should fix 
https://github.com/pytorch/audio/actions/runs/3684598046/jobs/6234531675

* Revert me later: Fix conda package smoke tests

* Install `sympy` via pip rather than conda

Needs to be reverted as well

* Refactor smoke tests to configure module included in the release (pytorch#1223)

* Changes to prep for pypi script for release 1.13.1 (pytorch#1231)

* PyPi binary validation and size check (pytorch#1230)

* Validate binary size

* Validate binary size linux_job

* evaluate the fix from pytorch#1231

* Add an optional artifact upload, consolidate fixes to `prep_binary_for_pypi.sh`

* Adding new workflow to call from domain libraries to validate on domain libraries such as text (pytorch#1234)

* Testing new workflow

Fix naming

fix input

* Changed comments

* Add ability to call validate domain library manually (pytorch#1235)

* Adding test for validate dm workflow and fixing dm validation workflow (pytorch#1236)

* Test manywheel packages (pytorch#1239)

Change only docker file

* Bump scripts in release (pytorch#1241)

* release: Strip whitespace from version_with_suffix (pytorch#1242)

* Cuda 11.8 and removal of dev packages (pytorch#1243)

* Adding more OS's to validate domain library workflow (pytorch#1238)

* Adding more OS's to validate domain library workflow

* conda and wheel together

* add macos workflows

* fix workflow

* Add target os variable to windows validation (pytorch#1244)

* Update MKL to 2022.1 (pytorch#1245)

As the previous one occasionally crashes on AMD CPUs

May address pytorch/pytorch#89817

Please note that in order to get maximum perf on AMD CPUs, one needs to compile and LD_PRELOAD the following library:
```
int mkl_serv_intel_cpu_true() {
	return 1;
}
```

* Adds infra to use nvidia dependencies from pypi and cleans up patches (pytorch#1196)

* Installs NCCL from redist, uses system NCCL, and adds pypi RPATH

* Cleans up nvrtc patches and adds it using main script

* Fixes typo

* Adds more dependencies and builds torch with dynamic linking

* NCCL dirs have to be specified. Otherwise picks up different version

* Handles 11.8

* Adds echo message for nccl 2.15

* Remove invalid git option (pytorch#1246)

* Revert "Adds infra to use nvidia dependencies from pypi and cleans up patches (pytorch#1196)" (pytorch#1247)

This reverts commit ee59264.

* Add with_cuda flag (pytorch#1249)

* Add GPU architecture env variables (pytorch#1250)

* Add cuda to jobname for validate domain library (pytorch#1252)

* Remove pylief dependency (pytorch#1255)

* Fix PEP503 for packages with dashes

* Rename `torchtriton` to `pytorch-triton`

Companion change for pytorch/pytorch#91539

* s3_management: Hide specific packages between dates (pytorch#1256)

* s3_management: Pin requirements.txt

Packaging got updated and that's not what we want

Signed-off-by: Eli Uriegas <[email protected]>

* s3_management: except ValueError

Signed-off-by: Eli Uriegas <[email protected]>

* s3_management: Use the correct format for strptime

Signed-off-by: Eli Uriegas <[email protected]>

* s3_management: Bump bad dates to october 17th (pytorch#1257)

* s3_management: hide torchtriton (pytorch#1258)

* s3_management: Add PACKAGE_ALLOW_LIST for indices (pytorch#1259)

* s3_management: Bump bad date end to 12/30 (pytorch#1260)

* Adds infra to use nvidia dependencies from pypi and cleans up patches (pytorch#1248)

* Installs NCCL from redist, uses system NCCL, and adds pypi RPATH

* Cleans up nvrtc patches and adds it using main script

* Fixes typo

* Adds more dependencies and builds torch with dynamic linking

* NCCL dirs have to be specified. Otherwise picks up different version

* Handles 11.8

* Adds echo message for nccl 2.15

* Fixes logic for 11.8 and adds missing names for DEPS_SONAME

* s3_management: Account for underscore packages

pytorch-triton is listed as pytorch_triton

Signed-off-by: Eli Uriegas <[email protected]>
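
For context, PEP 503 defines a normalized project name: runs of `-`, `_` and `.` collapse to a single `-` and the result is lowercased, which is why `pytorch_triton` and `pytorch-triton` refer to the same index entry. A minimal sketch of that rule follows. The function name is hypothetical and this is not the s3_management code.

```python
import re


def normalize_project_name(name: str) -> str:
    """Normalize a package name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Both spellings map to the same normalized name.
print(normalize_project_name("pytorch_triton"))
print(normalize_project_name("pytorch-triton"))
```

Comparing normalized names on both sides avoids treating the underscore and dash spellings as two different packages.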

* s3_management: simplify allowlist, correct underscores

Signed-off-by: Eli Uriegas <[email protected]>

* Fix cuda version in nightly (pytorch#1261)

* Adding py311 validations (pytorch#1262)

* Use MATRIX_* variables instead of redefining a new var each time (pytorch#1265)

* Fix validation domain library (pytorch#1266)

remove ref main

fix workflow

more refactor

* Nightly: do test install with the dependencies better and skip CUDA tests on cpu only box (pytorch#1264)

* Refactor PyTorch wheel and libtorch build scripts for ROCm (pytorch#1232)

* Refactor wheel and libtorch build scripts (#7)

* Update to so patching for ROCm

Wildcard used in grep to grab the actual numbered so file referenced
in patchelf. This allows the removal of specifying the so number in
DEPS_LIST & DEPS_SONAME

This commit also adds the functionality for trimming so names to
build_libtorch.sh from build_common.sh

* Refactor to remove switch statement in build_rocm.sh

This commit refactors build_rocm.sh and brings in a few major updates:
 - No longer required to specify the full .so name (with number) for ROCm libraries
       - The .so versions are copied and the patching code will fix the links to point to this version
 - No longer required to specify paths for ROCm libraries allowing the removal of the large switch
       - Paths are acquired programmatically with find
 - No longer required to specify both the path and filename for the OS specific libraries
       - Programmatically extract file name from the path
 - Automatically extract Tensile/Kernels files for the architectures specified in PYTORCH_ROCM_ARCH
   and any non-arch specific files e.g. TensileLibrary.dat

* rocfft/hipfft link to libhiprtc.so in ROCm5.4 (#15)

Co-authored-by: Jack Taylor <[email protected]>

* add sm_90 to CUDA11.8 builds (pytorch#1263)

* add sm_90 to CUDA11.8 builds

* Manually invoke bash for Miniconda

* Revert "add sm_90 to CUDA11.8 builds (pytorch#1263)" (pytorch#1275)

This reverts commit e1453a4.

* Set ubuntu distribution correctly for ROCm5.3 and above (pytorch#1268)

* Fix unbound variable error (pytorch#1276)

Regression introduced (and ignored) by pytorch#1262
Test plan:
```
% bash -c 'set -u; if [[ -z "${FOO}" ]]; then echo "bar"; fi' 
bash: FOO: unbound variable
(base) nshulga@nshulga-mbp builder % bash -c 'set -u; if [[ -z "${FOO+x}" ]]; then echo "bar"; fi'
bar
(base) nshulga@nshulga-mbp builder % FOO=1 bash -c 'set -u; if [[ -z "${FOO+x}" ]]; then echo "bar"; fi'

```

* Manually invoke bash for miniconda (pytorch#1277)

Fixes build issues failing with:
```
./Miniconda3-latest-Linux-x86_64.sh: 438: ./Miniconda3-latest-Linux-x86_64.sh: [[: not found
```
as seen in e.g.: pytorch#1271

* Fix perm

Which somehow got changed by pytorch@62103bf

* add sm_90 to CUDA11.8 builds (pytorch#1278)

* libtinfo.so version update and logic fix for ROCm libtorch (pytorch#1270)

* Use libtinfo.so.6 for Ubuntu 2004

* Fix to origname grep

* Condition on ROCM_VERSION for libtinfo6

* Looks like it is not used anywhere. (pytorch#1273)

* Build Windows binaries with Visual Studio 2022 Build Tools (pytorch#1240)

* Build Windows binaries with Visual Studio 2022 Build Tools

* Unify casing in Batch files, remove VS 2017 installation

* Remove VS 2017 Conda scripts, unify casing in conda Batch scripts, minor Conda scripts tweaks

* Slim down `pytorch-cuda`

It should only contain runtime dependencies that PyTorch+domain
libraries depend on, namely:
 - cudart
 - cublas
 - cusparse
 - cufft
 - curand
 - nvtx
 - nvrtc
 - nvjpeg (for TorchVision)

This removes dependencies on NVCC, build/debug tools, etc which are not
needed for running the pytorch

Test Plan:
  `conda create -n tmp -c nvidia -c malfet cuda-toolkit==11.7` and
observe that only relevant packages are installed

Fixes pytorch/pytorch#91334

* [BE] Delete `unicode-flags` build options (pytorch#1284)

These were relevant only for Python<=3.3

* [BE] Define `openssl_flags` (pytorch#1285)

Rather than have two invocations of `./configure`

* Build with `--enabled-shared` if `patchelf` is found (pytorch#1283)

This is needed to make `manylinux-wheel` images usable for building new Triton binaries.

Test plan: Build docker and verify that following `CMakeLists.txt` finishes successfully:
```
cmake_minimum_required(VERSION 3.6)
find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
message(WARNING Executable ${Python3_EXECUTABLE})
message(WARNING IncludeDirs ${Python3_INCLUDE_DIRS})
message(WARNING Libraries ${Python3_LIBRARIES})
```

* Update cudnn to 8.7.0.84 for CUDA 11.8 builds (pytorch#1271)

* update cudnn to 8.7.0.84 for CUDA 11.8 builds

* workaround for pytorch#1272

* Revert "workaround for pytorch#1272"

This reverts commit c0b10d8.

* update cudnn==8.7.0.84 for windows

* [BE] Remove references to Python<3.6 (pytorch#1287)

* Upgrade desired python version to 3.8

For libtorch builds

* Fix how libtorch picks the python version

* Tweak conda builds to support 3.11

Add `-c malfet` when building for 3.11 (though perhaps it's better to
move numpy to pytorch channel)

Tweak some build time dependencies

* Fix typo

* Skip triton dependency for 3.11 CUDA builds

* Update build-number to 3

* Add ability to override cuda archs for conda (pytorch#1282)

* [ROCm] reduce disk space used in image (pytorch#1288)

Fixes pytorch#1286

* Extend MacOS/Windows builds to 3.11

By installing dependencies from pip
Should be a no-op for <=3.10

* ci: Migrate to checkout@v3 (pytorch#1290)

checkout@v2 is deprecated; moving to checkout@v3

Signed-off-by: Eli Uriegas <[email protected]>

* Fix typo

* Add 3.11 option for Windows builds

* Add python-3.11 download location for windows

* Add pypi with cudnn package test (pytorch#1289)

* Add pypi with cudnn package test

* Add pypi with cudnn package test

* test

* test

* More pypi cudnn changes

* test

* Fix pypi smoke test

* Remove debug comments

* Delete some ancient checks for MacOS builds

As we no longer build for Python-2.7 or 3.5

* Add libnvjpeg-dev package as fallback (pytorch#1294)

* Add libnvjpeg-dev package as fallback

* Move libnvjpeg and libnvjpeg-dev to required packages

* Update conda/pytorch-cuda/meta.yaml

---------

Co-authored-by: Nikita Shulga <[email protected]>

* Upgrade nightly wheels to rocm5.4.2 (pytorch#1225)

* Upgrade nightly wheels to rocm5.4

* Adding graphic architectures for ROCm 5.4

* Updated to use ROCm5.4.1

* Updated to use ROCm5.4.2

* Fixed syntax error

* Perform build on image with magma and miopen preinstalled

* Add dev packages for windows pytorch-cuda dependencies (pytorch#1295)

* Add dev packages for windows dependencies

* Adding architecture dependent builds

* Add notes around windows

* fix typo

* Bumping version to v3

* rocm libtorch prebuild magma; fix manylinux cmake version (pytorch#1296)

* Add manywheel:cpu-cxx11-abi checkup for check_binary.sh (pytorch#1251)

* Remove with_py311 flag (pytorch#1301)

* rocm manylinux now uses devtoolset 9 (pytorch#1300)

* fix ACL_ROOT_DIR setting and upgrade the ACL version to 22.11 (pytorch#1291)

* Add `-c malfet` for Windows builds as well

* Set torch._C._PYBIND11_BUILD_ABI version check only for GLIBCXX_USE_CXX11_ABI=0 (pytorch#1303)

* Adding limit windows builds logic (pytorch#1297)

* Adding limit windows builds logic

* Remove empty space

* Simplify mkl build dependencies (pytorch#1305)

On Linux and Mac, PyTorch must be built against `mkl=2020.x` in order to be compatible with both `mkl-2021` and `mkl-2022`, which added `.so.1` and `.so.2` files respectively; a binary linked against either of those versions would be incompatible with newer or older toolchains.

This is not an issue on Windows, as all mkl binaries there end with simple `.dll`

* "Fix" PyTorch CPU conda testing

It's still horribly broken, but make it a bit better by not installing
pytorch from the default anaconda channel (which installs 1.12.1, which
lacks the dependencies the 2.0 dev package is supposed to have)

For example, see this runlog
https://github.com/pytorch/pytorch/actions/runs/4155371267/jobs/7189101147

* Update torch._C._PYBIND11_BUILD_ABI version check (pytorch#1306)

* Skip tests for manywheel built with _GLIBCXX_USE_CXX11_ABI=1

* Put back smoke test label (pytorch#1310)

* [aarch64] add support for torchdata wheel building (pytorch#1309)

* Python 3.11 validation workflow tests (pytorch#1304)

* Test windows py311

* Nightly binaries

* Fix py311 tests

* fix python calling

* Revert "Nightly binaries"

This reverts commit cbf80ca.

* add a scheduled workflow for the nightly pypi binary size validation (compliments pytorch/test-infra#2681) (pytorch#1312)

* Add regression test for pytorch/pytorch#94751

* Add 3.11 and `--pytorch-only` options

* Add `lit` to list of allowed packages

As it is now a mandatory (albeit spurious) dependency of pytorch-triton

See https://pypi.org/project/lit/ for more details

* s3: Allow tar.gz as an accepted file extension (pytorch#1317)

* Changes for Python 3.11 and smoke Test RC cut (pytorch#1316)

* Smoke Test RC cut

* Validate binaries 3.11

* test

* Smoke test binaries

* Fix pytorch-cuda chan download

* Remove temp change

* Make sure we don't use GPU runners for any of libtorch validations (pytorch#1319)

* Make sure we don't use GPU runners for any of libtorch

* Make sure we don't use GPU runners for any of libtorch

* s3: Add pytorch_triton_rocm to index (pytorch#1323)

Signed-off-by: Eli Uriegas <[email protected]>

* s3: Add tqdm package req for text (pytorch#1324)

* Add `--analyze-stacks` option

Using `git rev-base`, it prints the total number of stacks and their
average, mean and max depth

At the time of submission, here are the top 10 ghstack users of pytorch:
```
ezyang has 462 stacks max depth is 15 avg depth is 1.70 mean is 1
awgu has 240 stacks max depth is 28 avg depth is 4.30 mean is 1
peterbell10 has 146 stacks max depth is 7 avg depth is 1.84 mean is 1
zou3519 has 128 stacks max depth is 7 avg depth is 1.98 mean is 1
jerryzh168 has 113 stacks max depth is 16 avg depth is 1.45 mean is 1
bdhirsh has 111 stacks max depth is 7 avg depth is 1.85 mean is 2
wconstab has 108 stacks max depth is 7 avg depth is 2.15 mean is 1
SherlockNoMad has 99 stacks max depth is 4 avg depth is 1.24 mean is 1
zasdfgbnm has 80 stacks max depth is 11 avg depth is 2.52 mean is 6
desertfire has 73 stacks max depth is 3 avg depth is 1.14 mean is 1
```
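
The per-author statistics above could be computed along these lines. This is only a sketch: `summarize_stacks` and its input format (one integer depth per stack) are hypothetical and do not reproduce the actual `--analyze-stacks` implementation. It reports a median alongside the average, which is an assumption about what a whole-number summary column could hold.

```python
from statistics import mean, median


def summarize_stacks(depths):
    """Summarize stack depths, given one integer depth per stack.

    Raises ValueError on an empty list, since max() has nothing to compare.
    """
    return {
        "stacks": len(depths),           # total number of stacks
        "max": max(depths),              # deepest stack
        "avg": round(mean(depths), 2),   # arithmetic average depth
        "median": median(depths),        # middle value (assumed statistic)
    }


if __name__ == "__main__":
    sample = [1, 1, 2, 5, 1]  # made-up depths, for illustration only
    s = summarize_stacks(sample)
    print(
        f"example has {s['stacks']} stacks max depth is {s['max']} "
        f"avg depth is {s['avg']:.2f} median is {s['median']}"
    )
```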

* Add filelock and networkx deps (pytorch#1327)

To match dependencies for wheel files defined in https://github.com/pytorch/pytorch/blob/ed1957dc1989417cb978d3070a4e3d20520674b4/setup.py#L1021-L1024

* Remove building magma from source

* Revert

* Upgrade cmake version to 3.22.1 to build triton (pytorch#1331)

* Upgrade cmake version to 3.22.1 to build triton

* Pin patchelf version

* Fix comment typo

* Smoke test for cuda runtime errors (pytorch#1315)

* Add test for cuda runtime errors

* Add cuda exception smoke test

* Move cuda runtime error to end

* Move cuda runtime error to end

* Address comments

* Address comments

* Add Jinja2 Dependency (pytorch#1332)

As part of the effort to fix pytorch/pytorch#95986

* Add MarkupSafe to S3 Index (pytorch#1335)

* Remove rocm5.1 rocm5.2 from libtorch Dockerfile

* [aarch64] Adding CI Scripts to build aarch64 wheels (pytorch#1302)

* add aarch64 ci scripts

* added readme. get branch from /pytorch

* Add smoke tests conv,linalg,compile. And better version check. (pytorch#1333)

* Add smoke tests conv,linalg,compile

* Add version check

* Fix typo

Fix version check

Add not

* Add exception for python 3.11

* fix typo

* Try to exit after CUDA Runtime exception

* Restrict crash test only to conda

* Restrict crash test only to conda

* Fix tests

* Turn off cuda runtime issue

* tests

* more tests

* test

* remove compile step

* test

* disable some of the tests

* testing

* Remove extra index url

* test

* Fix tests

* Additional smoke tests

Remove release blocking changes

* Aarch64 changes for PyTorch release 2.0 (pytorch#1336)

* Aarch64 changes for PyTorch release 2.0

* Fix spacing

* Update aarch64_linux/build_aarch64_wheel.py

Co-authored-by: Nikita Shulga <[email protected]>

* Update aarch64_linux/build_aarch64_wheel.py

Co-authored-by: Nikita Shulga <[email protected]>

---------

Co-authored-by: Nikita Shulga <[email protected]>

* Aarch64 build py3.11 fix (pytorch#1341)

* Fix nightly smoke test (pytorch#1340)

* Fix nightly smoke test

* Fix nightly builds

* Release 2.0 release scripts changes (pytorch#1342)

* Release 2.0 release scripts changes

* Release script modifications

* Add more packages to allow list (pytorch#1344)

* Add `jinja2` dependency to conda package

To be consistent with wheels, see
https://github.com/pytorch/pytorch/95961

* Restrict jinja to py 3.10 or less (pytorch#1345)

* Update `torchtriton` version to 2.1.0

* And update triton version here as well

* added smoke test for max-autotune (pytorch#1349)

Co-authored-by: agunapal <[email protected]>

* Refactor conda backup script (pytorch#1350)

* Refactor conda backup

* Fix space

* Minor style

* Revert "Upgrade cmake version to 3.22.1 to build triton (pytorch#1331)" (pytorch#1351)

* Revert "Upgrade cmake version to 3.22.1 to build triton (pytorch#1331)"

This reverts commit 18c5017.

* Selective revert

* Get cmake from pip

* Use 3.18.2 from conda

* Release script changes, add more release dependencies, bump version for aarch64 builds (pytorch#1352)

* Release script changes

* Add Jinja2 dependency

* Fix typo

* Add pytorch conda dependencies (pytorch#1353)

* Add latest dependencies for pytorch 2.0 release (pytorch#1357)

* Fix typo

* Revert "Revert me later: Fix conda package smoke tests"

This reverts commit d7f2a7c.

* [aarch64] update readme with the "--enable-mkldnn" option (pytorch#1362)

This needs to be enabled for official wheel building.

* Replace `--enable-mkldnn` with `--disable-mkldnn`

Also, change default to ubuntu-20.04

* Update AMIs

Using following images:
```
% aws ec2 describe-images --image-ids ami-078eece1d8119409f ami-052eac90edaa9d08f ami-0c6c29c5125214c77 --query "Images[].[ImageId, Description]"
[
    [
        "ami-078eece1d8119409f",
        "Canonical, Ubuntu, 18.04 LTS, arm64 bionic image build on 2023-03-02"
    ],
    [
        "ami-0c6c29c5125214c77",
        "Canonical, Ubuntu, 22.04 LTS, arm64 jammy image build on 2023-03-03"
    ],
    [
        "ami-052eac90edaa9d08f",
        "Canonical, Ubuntu, 20.04 LTS, arm64 focal image build on 2023-03-01"
    ]
]
```

* Update tags for domain libraries

* Add PyTorch version pinning to release wheels

* Fix flake8

* [BE] Introduce `build_domains` function

And call it to rebuild only domains if torch wheel is available

* Switch deprecated ubuntu-18.04 runner to ubuntu-latest (pytorch#1334)

* Switch deprecated ubuntu-18.04 runner to self-hosted 2xlarge

* Leave build-nvidia-docker for now

* Apply suggestions from code review

Co-authored-by: Nikita Shulga <[email protected]>

* Use ephemeral runners

* Use ubuntu-latest

* Apply suggestions from code review

Co-authored-by: Nikita Shulga <[email protected]>

* Switch from latest to 22.04 to pin the version

---------

Co-authored-by: Nikita Shulga <[email protected]>

* Introduce optional --build-number parameter

* Revert me later: Fix conda package smoke tests

(cherry picked from commit d7f2a7c)

Alas, it's still used and causes nightly build failures

* Fix aarch64 torchvision build (pytorch#1363)

* Fix torchvision image extension compilation

* Fix torchvision image extension compilation

* Set enable_mkldnn to pypi build

* Remove unused `enable_mkldnn` for configure_system

* [aarch64] Try to link statically with png/jpeg

Also, add testing (which is currently broken)

* Revert "Revert me later: Fix conda package smoke tests"

This reverts commit ce427de.

* [AARCH64] Fix image.so wheel

By adding explicit libz dependency

* [AARCH64] Pass `BUILD_S3` to torchdata

To make build consistent with Linux-x86_64

* Revert "[AARCH64] Pass `BUILD_S3` to torchdata"

This reverts commit ae8e825.

As it does not want to be built on aarch64

* Add portalocker (pytorch#1364)

* [BE] Error handling in build_aarch64_wheel

I've noticed that build errors in `build_ArmComputeLibrary` would be
ignored because a semicolon is used between the commands instead of `&&`.
Also, change the nightly version evaluation to rely on the individual
libraries rather than on torch
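
The difference between `;` and `&&` can be shown with a short sketch. It assumes a POSIX shell where `false` and `true` are available; the `run` helper is hypothetical and is not part of `build_aarch64_wheel.py`:

```python
import subprocess


def run(cmd: str) -> int:
    """Run `cmd` through the shell and return its exit status."""
    return subprocess.run(cmd, shell=True).returncode


# With ';' the shell reports the status of the LAST command only,
# so the failure of `false` is silently swallowed.
semicolon_status = run("false; true")

# With '&&' the chain stops at the first failing command and the
# non-zero status propagates to the caller.
and_status = run("false && true")

if __name__ == "__main__":
    print("';'  exit status:", semicolon_status)
    print("'&&' exit status:", and_status)
```

A caller that checks the return code can detect the failure only in the `&&` form.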

* [AArch64] Pass `args.instance_type` to `start_instance`

* use c++17 when building windows smoke tests (pytorch#1365)

Summary:
We are seeing failures during CI dealing with some headers that have
nested namespaces. This is expected to remedy them.

One such example:
https://github.com/pytorch/pytorch/actions/runs/4510336715/jobs/7942660912

Test Plan: Test this with CI.

---------

Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Co-authored-by: Andrey Talman <[email protected]>
Co-authored-by: andysamfb <[email protected]>
Co-authored-by: izaitsevfb <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Syed Tousif Ahmed <[email protected]>
Co-authored-by: Syed Tousif Ahmed <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Wei Wang <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Pruthvi Madugundu <[email protected]>
Co-authored-by: Pruthvi Madugundu <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: snadampal <[email protected]>
Co-authored-by: Eli Uriegas <[email protected]>
Co-authored-by: ptrblck <[email protected]>
Co-authored-by: zhuhong61 <[email protected]>
Co-authored-by: Greg Roodt <[email protected]>
Co-authored-by: Eli Uriegas <[email protected]>
Co-authored-by: Dmytro Dzhulgakov <[email protected]>
Co-authored-by: albanD <[email protected]>
Co-authored-by: Radek Bartoň <[email protected]>
Co-authored-by: divchenko <[email protected]>
Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Bo Li <[email protected]>
Co-authored-by: Mike Schneider <[email protected]>
Co-authored-by: Ankith Gunapal <[email protected]>
Co-authored-by: agunapal <[email protected]>
Co-authored-by: dagitses <[email protected]>
Development

Successfully merging this pull request may close these issues.

Provide common set of smoke tests for torch cpu, torch gpu, torchvision, torchaudio