Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.7.x] backport mixed type binary ops to v1.7.x #18649

Merged
merged 2 commits into from
Jul 5, 2020

Conversation

yijunc
Copy link
Contributor

@yijunc yijunc commented Jul 1, 2020

Description

Backport mixed type binary ops to v1.7.x branch (Mentioned in #18641 #18648 #18653)
Mainly code for #18250 and #18523

@mxnet-bot
Copy link

Hey @BenjaminCHEN2016 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, website, windows-cpu, unix-gpu, windows-gpu, edge, miscellaneous, unix-cpu, centos-gpu, clang, sanity]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@yijunc yijunc force-pushed the backport_v1.7_18523 branch 2 times, most recently from 306fcad to 454d401 Compare July 2, 2020 08:52
@yijunc yijunc changed the title backport mixed type to 1.7 backport mixed type to 1.7.x Jul 2, 2020
@yijunc yijunc changed the title backport mixed type to 1.7.x [v1.7.x] backport mixed type binary ops to v1.7.x Jul 2, 2020
@yijunc
Copy link
Contributor Author

yijunc commented Jul 2, 2020

@sxjscience @ciyongch

@ciyongch
Copy link
Contributor

ciyongch commented Jul 2, 2020

Thank you @BenjaminCHEN2016 to backport these fixes. If I understand correctly, this is the only remaining PR that fix the numpy operator and targeting to 1.7 release, am I right? @sxjscience
Can you also help to test the latest v1.7.x code base with this PR in advance to make sure it's working as expected, and no more other dependencies? Thanks!

@yijunc yijunc force-pushed the backport_v1.7_18523 branch from 6484b2a to 02d4fbf Compare July 2, 2020 12:53
@ciyongch
Copy link
Contributor

ciyongch commented Jul 3, 2020

@BenjaminCHEN2016 @sxjscience there's some build error on windows platform as below, please help to take a look. It'll be great if it can be solved within 24h, Thanks a lot!

[2020-07-02T13:28:03.699Z] c:\jenkins_slave\workspace\build-gpu\src\operator\tensor\elemwise_binary_broadcast_op.h(237) : fatal error C1002: compiler is out of heap space in pass 2
[2020-07-02T13:28:03.699Z] jom: C:\jenkins_slave\workspace\build-gpu\build\CMakeFiles\mxnet_52.dir\build.make [CMakeFiles\mxnet_52.dir\src\operator\numpy\np_elemwise_broadcast_op.cc.obj] Error 1

@yijunc
Copy link
Contributor Author

yijunc commented Jul 3, 2020

@ciyongch I will try to resolve this issue today.

@yijunc
Copy link
Contributor Author

yijunc commented Jul 3, 2020

I think that might be a compiler issue? I tried on my windows machine and it can be compiled. @ciyongch @sxjscience

@ciyongch
Copy link
Contributor

ciyongch commented Jul 3, 2020

Hi @BenjaminCHEN2016 , it looks more like a compilation error which probably introduced by the current complex expressions or something like that which ran out of memory instead of the compiler itself.
I remembered there's a discussion about the compilation memory usage here, which reported that several numpy operator files consume a large memory footprint during compilation.
But I'm curious if this is the case, why the master branch doesn't have the similar CI issue. Is there any additional or different change in this patch compared to the original PRs?
Ping @wkcn, @leezu to see if any suggestions for this case, thanks!

@yijunc
Copy link
Contributor Author

yijunc commented Jul 3, 2020

@leezu Is the CI different between master and v1.7.x branch? It seems that windows CI on v1.7.x is an older version?

@wkcn
Copy link
Member

wkcn commented Jul 3, 2020

Hi @ciyongch , the issue is related to the version of compiler.
I only build MXNet on gcc6, gcc7 and gcc10, and gcc10 takes more than 11GB memory.

If using visual studio, could you try add /Zm100 /GX into the additional options of Property/ C/C++ /All Options to enlarge the heap space ?

@ciyongch
Copy link
Contributor

ciyongch commented Jul 3, 2020

Thanks for your information @wkcn , according to @BenjaminCHEN2016 the failure only happened in MXNet windows CI pipeline but not his local environment. There's a concern that if this is the case and still not able to pass the CI, do we still need to include this in v1.7.0? @sandeep-krishnamurthy @szha @sxjscience .

@szha
Copy link
Member

szha commented Jul 3, 2020

Yes, we still need the change. It sounds like we may have missed some windows CI changes to backport. cc @ChaiBapchya to help clarify issues with windows CI

@ChaiBapchya
Copy link
Contributor

Windows CI issues in master branch in late March were fixed by PRs :

  1. Fix Windows GPU CI #17962
  • Upgrade VS from 2015 to 2019
  • use pre-installed cmake 3.17 [instead of installing 3.16]
  • upgrade from 32bit to 64bit toolchain
  • Cuda 10.2 [VS2019 requires 10.2]
  1. Fail build_windows.py if all retries failed #18177
  • Fail build if windows retries fail
  1. Re-enable build retries on MSVC #18230
  • Windows Cuda [thrust issue]

Infrastructure-wise, all branches run on same infra [meaning Same AMI, Same instance types, etc]
Code-wise I'm not sure if these PRs have been cherry-picked into other branches [specifically 1.7.x in this case].
Having said that, looking at past 15 CI runs on merged commits in 1.7.x [not 1 has failed on windows-cpu or windows-gpu].

@leezu any thoughts on those backports of your Windows CI specific fixes [that have been targeted towards master in the above mentioned PRs]?

@yijunc yijunc force-pushed the backport_v1.7_18523 branch from 84cd127 to 9060226 Compare July 4, 2020 03:45
@yijunc yijunc force-pushed the backport_v1.7_18523 branch from 9060226 to 398b4c5 Compare July 4, 2020 04:34
yijunc and others added 2 commits July 4, 2020 06:52
Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <[email protected]>
@yijunc yijunc force-pushed the backport_v1.7_18523 branch from 398b4c5 to 22de015 Compare July 4, 2020 06:52
@yijunc
Copy link
Contributor Author

yijunc commented Jul 4, 2020

@ChaiBapchya Thanks!
I applied the #17962 to bump the windows compiler to 64 bits. And, now the CI is working as expected.
@sxjscience @ciyongch I think it is ready to be merged now.

@yijunc
Copy link
Contributor Author

yijunc commented Jul 4, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu]

@yijunc
Copy link
Contributor Author

yijunc commented Jul 4, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu]

@ciyongch
Copy link
Contributor

ciyongch commented Jul 5, 2020

Thanks a lot @BenjaminCHEN2016 for your effort to make this patch pass all the CI tests. So we already have all we need for 1.7 release right now.
@leezu @ChaiBapchya @sxjscience @szha @TaoLv @pengzhao-intel please help to review and merge, then I will trigger the nightly test and prepare for rc0. Thanks!

@szha szha merged commit 477affe into apache:v1.7.x Jul 5, 2020
@samskalicky
Copy link
Contributor

@ciyongch @szha any reason this wasnt committed to 1.x branch?

@szha
Copy link
Member

szha commented Sep 18, 2020

I think it was just missed.

@DickJC123
Copy link
Contributor

DickJC123 commented Sep 25, 2020

A person might reasonably assume that v1.8 == v1.7 plus select commits from the 1.x branch. Will this PR be added to v1.x and v1.8, or will v1.8 and other future 1.x releases be missing this functionality?

I'm basically trying to cherry-pick commits on top of my "v1.7-ish" repo to get to v1.8. Do you think I should revert this PR commit on my repo to minimize future integration conflicts?

@ciyongch
Copy link
Contributor

If I remembered correctly, this PR only contains a minimum bug fix that are required for v1.7 other than the full features/bug fixes from master branch. If this is the case, then it might be better to try to pull back the full version of fix into v1.x as well as v1.8.x. Ping @BenjaminCHEN2016 to help confirm.

leezu added a commit to leezu/mxnet that referenced this pull request Oct 1, 2020
* Fix Windows GPU CI (apache#17962)

Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <[email protected]>

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
samskalicky pushed a commit that referenced this pull request Oct 2, 2020
* * Fix einsum gradient (#18482)

* [v1.7.x] Backport PRs of numpy features (#18653)

* add zero grad for npi_unique (#18080)

* fix np.clip scalar input case (#17788)

* fix true_divide (#18393)

Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>

* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)

* Fix Windows GPU CI (#17962)

Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <[email protected]>

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>

* revise activations (#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <[email protected]>

* Fail build_windows.py if all retries failed (#18177)

* Update to thrust 1.9.8 on Windows (#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
samskalicky pushed a commit to samskalicky/incubator-mxnet that referenced this pull request Oct 2, 2020
* * Fix einsum gradient (apache#18482)

* [v1.7.x] Backport PRs of numpy features (apache#18653)

* add zero grad for npi_unique (apache#18080)

* fix np.clip scalar input case (apache#17788)

* fix true_divide (apache#18393)

Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>

* [v1.7.x] backport mixed type binary ops to v1.7.x (apache#18649)

* Fix Windows GPU CI (apache#17962)

Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <[email protected]>

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>

* revise activations (apache#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (apache#18632) (apache#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <[email protected]>

* Fail build_windows.py if all retries failed (apache#18177)

* Update to thrust 1.9.8 on Windows (apache#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (apache#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
samskalicky added a commit that referenced this pull request Oct 3, 2020
* * Fix einsum gradient (#18482)

* [v1.7.x] Backport PRs of numpy features (#18653)

* add zero grad for npi_unique (#18080)

* fix np.clip scalar input case (#17788)

* fix true_divide (#18393)

Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>

* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)

* Fix Windows GPU CI (#17962)

Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <[email protected]>

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>

* revise activations (#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <[email protected]>

* Fail build_windows.py if all retries failed (#18177)

* Update to thrust 1.9.8 on Windows (#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants