
thrust::uninitialized_copy(_n) gives a runtime error when mixing memories #817

correaa opened this issue Aug 26, 2022 · 13 comments
Labels
thrust For all items related to Thrust.

Comments

correaa commented Aug 26, 2022

For example:

auto src =          std::allocator<double>{}.allocate(3);
auto dst = thrust::cuda::allocator<double>{}.allocate(3);
//  thrust::copy_n(src, 3, dst);   // this is ok
thrust::uninitialized_copy_n(src, 3, dst);

gives:

thrust::system::system_error: uninitialized_copy_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Using thrust::copy(_n) instead works.
I am using uninitialized_copy because it is the natural function to use in generic code (in this library: https://gitlab.com/correaa/boost-multi) when copy-constructing arrays (from host to device).

For example:

multi::array<double, 2> CPU = ...;
multi::array<double, 2> GPU{CPU};  
// ^^^ "allocate[gpu] + uninit_copy[cpu->gpu]" seems more correct  than "allocate[gpu] + copy[cpu->cpu]", generically speaking
// however, uninit_copy seems to be more limited than copy in thrust.

I am using nvcc 11.7 and thrust 1.15.

I understand that not all algorithms should work for all combinations of memory spaces (e.g. thrust::equal(cpu, cpu + n, gpu)), but I feel that thrust::uninitialized_copy should.

Is this a defect in thrust::uninitialized_copy(_n), or is it by design?

pauleonix commented Sep 16, 2022

Seems like thrust::uninitialized_copy isn't capable of inter-system copies at all. While thrust::copy does have these capabilities, they aren't documented. Most examples use the vector constructors/assignment operators for these inter-system copies. The docs for thrust::copy state that

it performs the assignments *result = *first, *(result + 1) = *(first + 1), and so on.

which would imply that it shouldn't work here either. For thrust::uninitialized_copy it states that

for each iterator i in the input, uninitialized_copy_n creates a copy of *i in the location pointed to by the corresponding iterator in the output range by InputIterator's value_type's copy constructor with *i as its argument.

which can only work if both memories are accessible from the device (or the host).

So another workaround would be to exchange std::allocator with thrust::cuda::universal_allocator (managed memory) or defining

template <typename T>
using universal_host_pinned_allocator = thrust::mr::stateless_resource_allocator<
  T, thrust::cuda::universal_host_pinned_memory_resource>;

to actually get memory which is accessible on both sides.

No idea why this allocator isn't defined in thrust/system/cuda/memory.h yet. Maybe development on that part of Thrust is paused, waiting for the implementation of CUDA resources and allocators in libcudacxx/issues/129.

correaa commented Sep 16, 2022

"which would imply that [copy] shouldn't work here either."

This is not correct, semantically speaking at least.

Dereferencing Thrust iterators yields fancy references that can be assigned from/to, so the equivalent instruction, while not optimal, should work.

In practice, Thrust could (hopefully) optimize that operation by using cudaMemcpy or strided copies, for specific types, specific iterators, and specific combinations of devices.
None of this machinery is available to thrust::uninitialized_copy because the inter-device case is simply not implemented, even though in many cases (e.g. for trivially copyable types) uninitialized_copy and copy should do the same thing behind the scenes.

There is an expectation that inter device copy works, not only because the library examples are full of these cases but also because without it the library would be almost useless.

Generic code expects uninitialized_copy to be semantically equivalent to copy for many trivial types, but currently it is not, because of the inter-device limitation.

pauleonix commented

Yeah, I didn't think of these fancy references, probably because accessing them from the host is rarely the right thing to do. But since std::allocator doesn't give you a fancy reference (and there is no way to copy from non-pinned host memory in device code), this can't work for thrust::uninitialized_copy as it is described in the docs. So I would say that this is "by design", even though the design may be flawed, as you are alluding to.

Sorry if I didn't make it clear, but my point wasn't that it shouldn't be implemented because of the docs, but that thrust::copy is in need of better documentation. I mean, these fancy references aren't documented either, I think.

Implementing this such that it works for nontrivial types sounds complicated, although I haven't really looked at the thrust::copy implementation for nontrivial types across different memories either.

pauleonix commented Sep 16, 2022

Also, regarding fancy references: they seem never to be used inside algorithms (by design?). I.e.

dst[0] = 42.;
dst[1] = 42.;
dst[2] = 42.;

works, but

thrust::fill(thrust::host, dst, dst + 3, 42.);

doesn't.

So even knowing about fancy references still leaves the thrust::copy documentation missing any information about inter-system copies.

This is also the reason why even this inverted version of your code doesn't work (independent of the execution policy):

auto src = thrust::cuda::allocator<double>{}.allocate(3);
auto dst = std::allocator<double>{}.allocate(3);
thrust::uninitialized_copy_n(thrust::host, src, 3, dst);

correaa commented Sep 16, 2022

If thrust::host is forced in the algorithm, I bet not even thrust::copy_n will work. Will it? (I can't try right now.)

pauleonix commented

No, it won't. Exactly because fancy references aren't part of the equation here.

pauleonix commented Sep 16, 2022

The fact that the thrust::fill above compiles but then segfaults is especially interesting, as a user might expect this to be a compile-time error due to the device_ptr wrapper.

For reference, I'm using CUDA 11.7.0 and the included version of Thrust, i.e. version 1.15. But I don't think that these things changed since then.

correaa commented Sep 16, 2022

I see your point. My guess is that fill (for example), given the types, the kinds of iterators, and the (host) policy, is doing an optimization (such as calling memset) that ultimately segfaults. The same could be said of copy / memcpy. I guess the policy specifier takes precedence over semantic correctness, for lack of a better way to put it.

pauleonix commented Sep 16, 2022

Yeah, you are right again... 😄
When forcing an "unoptimized" version, it doesn't segfault:

thrust::copy_n(thrust::host, thrust::counting_iterator(0.), 3, dst);

pauleonix commented

So fancy references actually are used inside algorithms (which makes a lot of sense), but they only work in the device-to-host direction. And your argument is that, semantically, when no execution policy is specified, thrust::copy's assignment could be performed on the host side, with cudaMemcpy as the optimization.

I will have to fall back to my argument about fancy iterators not being documented, I guess 😆 Thanks for the (somewhat off-topic) discussion, I certainly learned something!

pauleonix commented

I guess the above thrust::fill and similar segfaulting behavior is just past the point where one can expect a perfect implementation. Mixing tagged iterators, execution policies, fancy references and system specific optimizations and expecting all those things to be evaluated at compile time to prevent the sharp edge of runtime segfaults might just not be feasible.
(Still not saying that thrust::uninitialized_copy shouldn't be able to mix memories.)

correaa commented Sep 16, 2022

Yes, I agree; the documentation is too Doxygen-centric, which, as usual, leaves lots of questions open. Doxygen documentation IMO tends to document syntax and interfaces rather than semantics. Most of what I know about Thrust comes from the few examples that are online, reading the implementation code, and having solid preliminary knowledge of STL techniques.

@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023
correaa commented Nov 13, 2023

This is Godbolt replicating the problem: https://godbolt.org/z/Yr8z5sMb9
