
thrust::uninitialized_copy(_n) gives a runtime error when mixing memories #817

correaa opened this issue Aug 26, 2022 · 13 comments
Labels
thrust For all items related to Thrust.

Comments

correaa commented Aug 26, 2022

For example:

auto src =          std::allocator<double>{}.allocate(3);
auto dst = thrust::cuda::allocator<double>{}.allocate(3);
//  thrust::copy_n(src, 3, dst);   // this is ok
thrust::uninitialized_copy_n(src, 3, dst);

gives:

thrust::system::system_error: uninitialized_copy_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Using thrust::copy(_n) instead works.
I am using uninitialized_copy because it is the natural function to use in generic code (in this library: https://gitlab.com/correaa/boost-multi) when copy-constructing arrays (from host to device).

For example:

multi::array<double, 2> CPU = ...;
multi::array<double, 2> GPU{CPU};  
// ^^^ "allocate[gpu] + uninit_copy[cpu->gpu]" seems more correct  than "allocate[gpu] + copy[cpu->cpu]", generically speaking
// however, uninit_copy seems to be more limited than copy in thrust.

I am using nvcc 11.7 and thrust 1.15.

I understand that not all algorithms should work for all combinations of memory spaces (e.g. thrust::equal(cpu, cpu + n, gpu)), but I feel that thrust::uninitialized_copy should.

Is this a defect in thrust::uninitialized_copy(_n), or is it by design?

pauleonix commented Sep 16, 2022

Seems like thrust::uninitialized_copy isn't capable of inter-system copies at all. While thrust::copy does have these capabilities, they aren't documented. Most examples use the vector constructors/assignment operators for these inter-system copies. The docs for thrust::copy state that

it performs the assignments *result = *first, *(result + 1) = *(first + 1), and so on.

which would imply that it shouldn't work here either. For thrust::uninitialized_copy it states that

for each iterator i in the input, uninitialized_copy_n creates a copy of *i in the location pointed to by the corresponding iterator in the output range by InputIterator's value_type's copy constructor with *i as its argument.

which can only work if both memories are accessible from the device (or the host).

So another workaround would be to exchange std::allocator with thrust::cuda::universal_allocator (managed memory) or defining

template <typename T>
using universal_host_pinned_allocator = thrust::mr::stateless_resource_allocator<
  T, thrust::cuda::universal_host_pinned_memory_resource>;

to actually get memory which is accessible on both sides.

No idea why this allocator isn't defined in thrust/system/cuda/memory.h yet. Maybe development on that part of Thrust is paused, waiting for the implementation of CUDA resources and allocators in libcudacxx/issues/129.

correaa commented Sep 16, 2022

"which would imply that [copy] shouldn't work here either."

This is not correct, semantically speaking at least.

Dereferencing Thrust iterators yields fancy references that can be assigned from/to, so the equivalent instruction, while not optimal, should work.

In practice, Thrust could (hopefully) optimize that operation by using cudaMemcpy or strided copies, for specific types, specific iterators, and specific combinations of devices.
None of this machinery is available to thrust::uninitialized_copy because the inter-device case is simply not implemented, even though in many cases (e.g. for trivially copyable types) uninitialized_copy and copy should do the same thing behind the scenes.

There is an expectation that inter device copy works, not only because the library examples are full of these cases but also because without it the library would be almost useless.

Generic code expects uninitialized_copy to be semantically equivalent to copy for many trivial types, but currently it is not, because of the inter-device limitation.

pauleonix commented

Yeah, I didn't think of these fancy references, probably because accessing them from the host is rarely the right thing to do. But since std::allocator doesn't give you a fancy reference (and there is no way to copy from non-pinned host memory in device code), this can't work for thrust::uninitialized_copy as it is described in the docs. So I would say that this is "by design", even though the design may be flawed, as you are alluding to.

Sorry if I didn't make it clear, but my point wasn't that it shouldn't be implemented because of the docs, but that thrust::copy is in need of better documentation. I mean, these fancy references aren't documented either, I think.

Implementing this such that it works for nontrivial types sounds complicated, although I haven't really looked at the thrust::copy implementation for nontrivial types across different memories either.

pauleonix commented Sep 16, 2022

Also, regarding fancy references: they seem never to be used inside algorithms (by design?). I.e.

dst[0] = 42.;
dst[1] = 42.;
dst[2] = 42.;

works, but

thrust::fill(thrust::host, dst, dst + 3, 42.);

doesn't.

So even knowing about fancy references still leaves the thrust::copy documentation missing any information about inter-system copies.

This is also the reason why even this inverted version of your code doesn't work (independent of the execution policy):

auto src = thrust::cuda::allocator<double>{}.allocate(3);
auto dst = std::allocator<double>{}.allocate(3);
thrust::uninitialized_copy_n(thrust::host, src, 3, dst);

correaa commented Sep 16, 2022

If thrust::host is forced in the algorithm, I bet not even thrust::copy_n will work. Will it? (I can't try right now.)

pauleonix commented

No, it won't. Exactly because fancy references aren't part of the equation here.

pauleonix commented Sep 16, 2022

The fact that the thrust::fill above compiles but then segfaults is especially interesting, as a user might expect this to be a compile-time error due to the device_ptr wrapper.

For reference, I'm using CUDA 11.7.0 and the included version of Thrust, i.e. version 1.15. But I don't think that these things changed since then.

correaa commented Sep 16, 2022

I see your point. My guess is that fill (for example), given the types, the kinds of iterators, and the (host) policy, is doing an optimization (such as calling memset) that ultimately segfaults. The same could be said of copy / memcpy. I guess the policy specifier takes precedence over semantic correctness, for lack of a better way to put it.

pauleonix commented Sep 16, 2022

Yeah, you are right again... 😄
When forcing an "unoptimized" version, it doesn't segfault:

thrust::copy_n(thrust::host, thrust::counting_iterator(0.), 3, dst);

pauleonix commented

So fancy references actually are used inside algorithms (which makes a lot of sense), but they only work in the device-to-host direction. And your argument is that, semantically, when no execution policy is specified, thrust::copy's assignment could be performed on the host side, with cudaMemcpy as the optimization.

I will have to fall back to my argument about fancy iterators not being documented, I guess 😆 Thanks for the (somewhat off-topic) discussion, I certainly learned something!

pauleonix commented

I guess the above thrust::fill and similar segfaulting behavior is just past the point where one can expect a perfect implementation. Mixing tagged iterators, execution policies, fancy references and system specific optimizations and expecting all those things to be evaluated at compile time to prevent the sharp edge of runtime segfaults might just not be feasible.
(Still not saying that thrust::uninitialized_copy shouldn't be able to mix memories.)

correaa commented Sep 16, 2022

Yes, I agree; the documentation is too Doxygen-centric, which, as usual, leaves lots of questions open. Doxygen documentation IMO tends to document syntax and interfaces rather than semantics. Most of what I know about Thrust comes from the few examples that are online, reading the implementation code, and having solid preliminary knowledge of STL techniques.

@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023
correaa commented Nov 13, 2023

This is Godbolt replicating the problem: https://godbolt.org/z/Yr8z5sMb9
