thrust::uninitialized_copy(_n) gives a runtime error when mixing memories #817
Seems like […], which would imply that it shouldn't work here either. For […], which can only work if both memories are accessible from the device (or the host). So another workaround would be to exchange the host-side allocator for

```cpp
template <typename T>
using universal_host_pinned_allocator = thrust::mr::stateless_resource_allocator<
    T, thrust::cuda::universal_host_pinned_memory_resource>;
```

to actually get memory which is accessible on both sides. No idea why this allocator isn't defined in […].
"which would imply that [copy] shouldn't work here either." this is not correct, semantically speaking at least. Thrust generates fancy references from dereference that can assigned from/to. so the equivalent instruction, while not optimal, should work. In practice Thrust could (hopefully) find an optimization of that operation by using cudaMemcpy or strided copy, for specific types, specific iterators and for combination of devices. There is an expectation that inter device copy works, not only because the library examples are full of these cases but also because without it the library would be almost useless. Generic code expects to use uninit_copy to be semantically equivalent to copy for many trivial types but currently it is not because of the inter device limitation. |
---

Yeah, I didn't think of these fancy references, probably because accessing them from the host is rarely the right thing to do. But as the […]

Sorry if I didn't make it clear, but my point wasn't that it shouldn't be implemented because of the docs, but that […]. Implementing this such that it works for nontrivial types sounds complicated, although I haven't really looked at the implementation.
---

Also, regarding fancy references: they seem to never be used inside algorithms (by design?). I.e.

```cpp
dst[0] = 42.;
dst[1] = 42.;
dst[2] = 42.;
```

works, but

```cpp
thrust::fill(thrust::host, dst, dst + 3, 42.);
```

doesn't. So even knowing about fancy references still leaves the […]

This is also the reason why even this inverted version of your code doesn't work (independent of the execution policy):

```cpp
auto src = thrust::cuda::allocator<double>{}.allocate(3);
auto dst = std::allocator<double>{}.allocate(3);
thrust::uninitialized_copy_n(thrust::host, src, 3, dst);
```
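For reference, a self-contained version of the first pair of snippets, assuming (as in the inverted example) that `dst` is device memory obtained from `thrust::cuda::allocator`:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/fill.h>
#include <thrust/system/cuda/memory.h>

int main() {
  // Device memory, as in the inverted example above.
  auto dst = thrust::cuda::allocator<double>{}.allocate(3);

  // Each assignment goes through a fancy reference, which performs an
  // implicit host-to-device transfer; this works from host code.
  dst[0] = 42.;
  dst[1] = 42.;
  dst[2] = 42.;

  // Forcing the host backend on device memory is what the comment
  // reports as failing.
  thrust::fill(thrust::host, dst, dst + 3, 42.);

  thrust::cuda::allocator<double>{}.deallocate(dst, 3);
}
```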
---

If host is forced in the algorithm, I bet not even […]
---

No, it won't. Exactly because fancy references aren't part of the equation here.
---

The fact that the […]

For reference, I'm using CUDA 11.7.0 and the included version of Thrust, i.e. version 1.15. But I don't think these things have changed since then.
---

I see your point. My guess is that […]
---

Yeah, you are right again... 😄

```cpp
thrust::copy_n(thrust::host, thrust::counting_iterator(0.), 3, dst);
```
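A self-contained version of that check (with an explicit value type instead of class template argument deduction, and `dst` again assumed to be device memory):

```cpp
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system/cuda/memory.h>

int main() {
  auto dst = thrust::cuda::allocator<double>{}.allocate(3);

  // Host policy, device destination: each write goes through a fancy
  // reference, and this call is reported to work.
  thrust::copy_n(thrust::host, thrust::counting_iterator<double>(0.), 3, dst);

  thrust::cuda::allocator<double>{}.deallocate(dst, 3);
}
```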
---

So then fancy references are actually used inside algorithms (makes a lot of sense), but they only work for the device-to-host direction. And your argument then is that semantically, when not specifying an execution policy, […]. I will have to fall back to my argument about fancy iterators not being documented, I guess 😆

Thanks for the (somewhat off-topic) discussion, I certainly learned something!
---

I guess the above […]
---

Yes, I agree, the documentation is too Doxygen-centric, which as usual leaves lots of questions open. Doxygen documentation IMO tends to document syntax and interfaces rather than semantics. Most of what I know about Thrust is through the few examples that are online, reading the implementation code, and having solid preliminary knowledge of STL techniques.
---

This is a Godbolt link replicating the problem: https://godbolt.org/z/Yr8z5sMb9
---

For example:

[code block lost in extraction; a hedged reconstruction follows at the end]

gives:

[runtime error output lost in extraction]

Using `thrust::copy(_n)` instead works. I am using `uninitialized_copy` because it is the natural function to use in generic code (of this library: https://gitlab.com/correaa/boost-multi) when copy-constructing arrays (from host to device). For example: […]

I am using nvcc 11.7 and Thrust 1.15. I understand that not all algorithms should work for all combinations of memory spaces (e.g. `thrust::equal(cpu, cpu + n, gpu)`), but I feel that `thrust::uninitialized_copy` should.

Is this a defect in `thrust::uninitialized_copy(_n)`, or is it by design?
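The original code block did not survive extraction; based on the issue title, the Godbolt link above, and the inverted example quoted in the comments, the failing case was presumably of this shape (a reconstruction under those assumptions, not the author's exact code):

```cpp
#include <thrust/system/cuda/memory.h>
#include <thrust/uninitialized_copy.h>
#include <memory>

int main() {
  // Host memory from the standard allocator...
  auto src = std::allocator<double>{}.allocate(3);
  src[0] = 1.; src[1] = 2.; src[2] = 3.;

  // ...and device memory from Thrust's CUDA allocator.
  auto dst = thrust::cuda::allocator<double>{}.allocate(3);

  // Reported to fail at runtime, while thrust::copy_n(src, 3, dst) works.
  thrust::uninitialized_copy_n(src, 3, dst);

  thrust::cuda::allocator<double>{}.deallocate(dst, 3);
  std::allocator<double>{}.deallocate(src, 3);
}
```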