Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#728: Avoid ICEs and compilation failures in Intel 19.x and Nvidia with gcc<=7.2.0 #1178

Merged
merged 5 commits into from
Dec 8, 2020

Conversation

PhilMiller
Copy link
Member

@PhilMiller PhilMiller commented Dec 4, 2020

Fixes #728

@PhilMiller
Copy link
Member Author

There are still these that I'm suspicious of:

src/vt/pipe/pipe_manager_tl.h:  template <typename T, ActiveTypedFnType<T>* f, typename CbkT = DefType<T>>
src/vt/pipe/pipe_manager_tl.h-  void addListener(CbkT const& cb, NodeType const& node);

However, there's nothing in the code that instantiates them in a way that makes the compiler unhappy, so I'm inclined to leave them alone for now. If there's consensus on that, then I think this is ready to merge.

@PhilMiller
Copy link
Member Author

The upshot of this is that the workaround was as straightforward as I expected, once the bug was isolated. It is rather clunky though

@lifflander
Copy link
Collaborator

The upshot of this is that the workaround was as straightforward as I expected, once the bug was isolated. It is rather clunky though

Does this fix the entire Intel 19 problem?

@PhilMiller
Copy link
Member Author

Yeah, the whole suite of examples and tests built, and the only test failures are the large process count ones we sometimes see with other compilers.

@lifflander
Copy link
Collaborator

The CUDA nvcc 10.1 problem seems real.. once we get that fixed we should merge this.

@PhilMiller
Copy link
Member Author

/vt/src/vt/collective/reduce/reduce.h: In instantiation of 'vt::collective::reduce::Reduce::PendingSendType vt::collective::reduce::Reduce::reduce(const NodeType&, MsgT*, vt::Callback<MsgT>, vt::collective::reduce::detail::ReduceStamp, const ReduceNumType&) [with OpT = vt::collective::reduce::operators::PlusOp<vt::runtime::component::detail::DiagnosticValueWrapper<long int> >; MsgT = vt::collective::reduce::operators::ReduceTMsg<vt::runtime::component::detail::DiagnosticValueWrapper<long int> >; vt::collective::reduce::Reduce::PendingSendType = vt::messaging::PendingSend; vt::NodeType = short int; vt::Callback<MsgT> = vt::pipe::callback::cbunion::CallbackTyped<vt::collective::reduce::operators::ReduceTMsg<vt::runtime::component::detail::DiagnosticValueWrapper<long int> > >; vt::collective::reduce::detail::ReduceStamp = vt::util::adt::AlignedCharUnion<vt::collective::reduce::detail::Strong<int, -1, vt::collective::reduce::detail::tags::TagTag>, vt::collective::reduce::detail::TagPair, vt::collective::reduce::detail::Strong<long unsigned int, 18446744073709551615, vt::collective::reduce::detail::tags::SeqTag>, vt::collective::reduce::detail::Strong<long unsigned int, 18446744073709551615, vt::collective::reduce::detail::tags::UserIDTag>, vt::collective::reduce::detail::Strong<long unsigned int, 18446744073709551615, vt::collective::reduce::detail::tags::EpochTag> >; vt::collective::reduce::Reduce::ReduceNumType = int]':
/vt/src/vt/runtime/component/diagnostic_value.cc:88:1:   required from 'void vt::runtime::component::detail::_GLOBAL__N__46_tmpxft_00000cab_00000000_6_unity_8_cxx_cpp1_ii_50dfb234::reduceHelper(vt::runtime::component::Diagnostic*, vt::runtime::component::DiagnosticErasedValue*, T, vt::runtime::component::DiagnosticUnit, vt::runtime::component::DiagnosticUpdate, bool, std::size_t) [with T = long int; std::size_t = long unsigned int]'
/vt/src/vt/runtime/component/diagnostic_value.cc:112:297:   required from here
/vt/src/vt/collective/reduce/reduce.h:199:8: error: no class template named 'msgHandler' in 'struct vt::collective::reduce::operators::ReduceTMsg<vt::runtime::component::detail::DiagnosticValueWrapper<long int> >'
       return reduce<
        ^~~~~~~~~~~~~

@PhilMiller
Copy link
Member Author

Could you add an Intel 19 build onto the CI set as part of this PR? Or is there a bunch of additional setup that would need to be done?

@lifflander
Copy link
Collaborator

Could you add an Intel 19 build onto the CI set as part of this PR? Or is there a bunch of additional setup that would need to be done?

No, that should be easy to do. Now that I think about it the reason I added the defaults was for NVCC.

@PhilMiller
Copy link
Member Author

Could you add an Intel 19 build onto the CI set as part of this PR? Or is there a bunch of additional setup that would need to be done?

No, that should be easy to do. Now that I think about it the reason I added the defaults was for NVCC.

Lol, that's rich. Let me try adding a using ReduceCombine::msgHandler to see if it can pick it up then.

@codecov
Copy link

codecov bot commented Dec 4, 2020

Codecov Report

Merging #1178 (b290e7e) into develop (c86eb77) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #1178   +/-   ##
========================================
  Coverage    79.59%   79.60%           
========================================
  Files          724      724           
  Lines        27776    27788   +12     
========================================
+ Hits         22108    22120   +12     
  Misses        5668     5668           
Impacted Files Coverage Δ
src/vt/collective/reduce/reduce.h 100.00% <100.00%> (ø)
src/vt/objgroup/proxy/proxy_objgroup.h 100.00% <100.00%> (ø)
src/vt/vrt/collection/reducable/reducable.h 100.00% <100.00%> (ø)
tests/unit/active/test_active_send_large.cc 93.02% <100.00%> (ø)
tests/unit/test_parallel_harness.h 100.00% <100.00%> (ø)

@PhilMiller PhilMiller marked this pull request as ready for review December 4, 2020 21:14
@PhilMiller
Copy link
Member Author

I tested some key files that had been failing locally, and the last commit made nvcc happy with them. The whole build of course takes forever. I think this is ready to go, though. CI will tell.

@PhilMiller
Copy link
Member Author

/azp list

@lifflander
Copy link
Collaborator

/azp help

@azure-pipelines
Copy link

Supported commands
  • help:
    • Get descriptions, examples and documentation about supported commands
    • Example: help "command_name"
  • list:
    • List all pipelines for this repository using a comment.
    • Example: "list"
  • run:
    • Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
    • Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
  • where:
    • Report back the Azure DevOps orgs that are related to this repository and org
    • Example: "where"

See additional documentation.

@lifflander
Copy link
Collaborator

/azp run PR tests extended (nvidia cuda 11.0, ubuntu, mpich)

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@lifflander
Copy link
Collaborator

/azp run "PR tests extended (nvidia cuda 11.0, ubuntu, mpich)"

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

Copy link
Contributor

@jstrzebonski jstrzebonski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Also I think I'll copycat this into my PR #1146

@PhilMiller
Copy link
Member Author

Looks good.
Also I think I'll copycat this into my PR #1146

Great! I had that in mind when I tagged you.

@PhilMiller
Copy link
Member Author

PhilMiller commented Dec 7, 2020

@lifflander I think the commas in the pipeline names are throwing off the command interface. It probably thinks those are multiple requested pipelines. Is there a quoting syntax? Should they be renamed with semicolons or dashes or something?

@PhilMiller PhilMiller marked this pull request as draft December 8, 2020 18:01
@PhilMiller
Copy link
Member Author

Nvidia build fails with stuff in callbacks. Need to fix before merging.

@PhilMiller
Copy link
Member Author

Nvidia build fails with stuff in callbacks. Need to fix before merging.

This was an issue of using a too-old nvcc, 9.2 vs 10 or 11. The last commit is to fix an ICE that arises in the later Nvidia compilers

@PhilMiller
Copy link
Member Author

OK, tests on nvcc 11.1 are happy after the last workaround.

@lifflander
Copy link
Collaborator

OK, tests on nvcc 11.1 are happy after the last workaround.

@PhilMiller Can you rebase this so we can merge it next?

@PhilMiller PhilMiller marked this pull request as ready for review December 8, 2020 19:58
@PhilMiller PhilMiller changed the title #728: Overload templates rather than instantiating in default argument to avoid ICE in Intel 19.x #728: Avoid ICEs and compilation failures in Intel 19.x and Nvidia with gcc<=7.2.0 Dec 8, 2020
@PhilMiller
Copy link
Member Author

The final nvcc ICE is actually in GCC, as a result of code that nvcc generates.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83204

Reported our repro case to Christian Trott and Max Katz to see if nvcc can be fixed to not trigger that for older versions

@PhilMiller
Copy link
Member Author

kokkos/kokkos-kernels#349 for reference on an earlier report of the issue

@PhilMiller
Copy link
Member Author

We didn't see the nvcc/gcc ICE in our CI because the container uses GCC 7.5.0, while the bug was fixed in 7.2.1/7.3.0

@PhilMiller PhilMiller merged commit 1f8e4f5 into develop Dec 8, 2020
@PhilMiller
Copy link
Member Author

For reference, here's the reduced test case for the Intel ICE:

using ActiveTypedFnType = void();

template < typename MsgT >
void basicHandler();

// *UN*commenting this makes the compiler run successfully;
// is the issue that instantiation is triggered by the template default argument?
//auto fn = &basicHandler<int>;

template <
  typename MsgT,
  ActiveTypedFnType f =
  basicHandler< MsgT>
  >
void reduce( )
{
  auto foo = f;
}

void __trans_tmp_2() {
  // Or uncommenting this
  //auto g = &basicHandler<int>;
  
  reduce<
    int
    // Or uncommenting this
    //, basicHandler<int>
    >();
}

cz4rs pushed a commit that referenced this pull request Dec 14, 2020
 #728: Avoid ICEs and compilation failures in Intel 19.x and Nvidia with gcc<=7.2.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build on develop or 1.0.0 fails with Intel 19 compilers
3 participants