This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

CUDA thrust::lower_bound fails when given a custom output iterator and compiled with -G option [NVBug 3322776] #1452

Closed
davidwendt opened this issue Jun 4, 2021 · 5 comments · Fixed by #1882
Labels: nvbug (Has an associated internal NVIDIA NVBug) · P1: should have (Necessary, but not critical) · repro: verified (The provided repro has been validated) · thrust · type: bug: functional (Does not work as intended)

Comments

@davidwendt

I believe this is a compiler issue since the problem only appears when using the -G option on nvcc. Unfortunately I'm not able to follow the thrust source code here well enough to see where the problem occurs. I've attached a smallish testcase that can reproduce the error consistently.

lb_output_itr.cu source file to reproduce the error
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <iostream>
#include <type_traits>
#include <vector>

// build failing one using:
//   nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
//
// build non-failing one by just removing the -G option

struct output_indexalator {
  using difference_type   = ptrdiff_t;
  using value_type        = int;
  using pointer           = int *;
  using iterator_category = std::random_access_iterator_tag;
  using reference         = output_indexalator const &;

  output_indexalator()                           = default;
  output_indexalator(output_indexalator const &) = default;
  output_indexalator(output_indexalator &&)      = default;
  output_indexalator &operator=(output_indexalator const &) = default;
  output_indexalator &operator=(output_indexalator &&) = default;

  __host__ __device__ output_indexalator &operator+=(difference_type offset)
  {
    p_ += offset * width_;
    return *this;
  }

  __host__ __device__ output_indexalator operator+(difference_type offset) const
  {
    auto tmp = output_indexalator(*this);
    tmp.p_ += (offset * width_);
    return tmp;
  }

  __device__ output_indexalator const operator[](int idx) const
  {
    output_indexalator tmp{*this};
    tmp.p_ += (idx * width_);
    return tmp;
  }

  __device__ output_indexalator const &operator*() const
  {
    printf("%p: op*() %p\n", this, p_);
    return *this;
  }

  __device__ output_indexalator const &operator=(int const value) const
  {
    void *tp = p_;
    printf("%p: set(%p,%d)\n", this, tp, value);  // class data is garbage
    // crashes here:
    (*static_cast<int *>(tp)) = static_cast<int>(value);
    return *this;
  }

  output_indexalator(void *data, int width) : width_(width), p_{static_cast<char *>(data)} {}

  int width_;  /// integer type width = 1,2,4, or 8
  char *p_;    /// pointer to the integer data in device memory
};

template <typename T>
thrust::device_vector<T> make_device_vector(std::vector<T> data)
{
  thrust::host_vector<T> h_data{data.begin(), data.end()};
  thrust::device_vector<T> d_data{h_data};
  return d_data;
}

int main(void)
{
  std::vector<int> input{0, 1, 2, 3, 4};
  auto d_input = make_device_vector(input);
  std::vector<int> values{9, 1, 0, 2};
  auto d_values = make_device_vector(values);

  thrust::device_vector<int> output(d_values.size());
  auto itr = output_indexalator(output.data().get(), sizeof(int));

  thrust::lower_bound(thrust::device,
                      d_input.begin(),
                      d_input.end(),
                      d_values.begin(),
                      d_values.end(),
                      itr,  // output.begin(),
                      thrust::less<long>());

  thrust::host_vector<int> h_output(output);
  for (auto v : h_output) std::cout << " " << v;
  std::cout << std::endl;
  return 0;
}

Compile the source file using the following command:

nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr

Running the resulting lb_output_itr executable gives the following result:

$ ./lb_output_itr 
0x7f65eefffbd0: op*() 0x7f65b7e00400
0x7f65eefffbd0: op*() 0x7f65b7e00404
0x7f65eefffbd0: op*() 0x7f65b7e00408
0x7f65eefffbd0: op*() 0x7f65b7e0040c
0x7f65eefffbd0: set(0x400000000,5)
0x7f65eefffbd0: set(0x400000000,1)
0x7f65eefffbd0: set(0x400000000,0)
0x7f65eefffbd0: set(0x400000000,2)
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted

The 0x400000000 should be the device pointer but the iterator object is getting trashed somewhere.

Building without the -G option will produce the correct result:

$ nvcc -std=c++17 -g lb_output_itr.cu -o lb_output_itr
$ ./lb_output_itr 
0x7f745cfffcc8: op*() 0x7f7427e00400
0x7f745cfffcc8: op*() 0x7f7427e00404
0x7f745cfffcc8: op*() 0x7f7427e00408
0x7f745cfffcc8: op*() 0x7f7427e0040c
0x7f745cfffcc8: set(0x7f7427e00400,5)
0x7f745cfffcc8: set(0x7f7427e00404,1)
0x7f745cfffcc8: set(0x7f7427e00408,0)
0x7f745cfffcc8: set(0x7f7427e0040c,2)
 5 1 0 2

The output_indexalator iterator being used here is a simplified version from a much larger set of code and has been pared down to provide a minimal reproducer for this issue.

I've verified the error occurs, with the same results, on my Linux 18.04 system with the following nvcc compiler versions (and their associated Thrust versions): V11.0.221, V11.1.105, V11.2.142, and V11.3.109.

@alliepiper
Collaborator

Took a quick look at this in the debugger; it appears that memory is getting corrupted -- the parallel_for algorithm's tile_base is consistently set to 0, but is a very large negative value by the time the crash happens. I haven't found a reason for this, and it's odd that it only happens in debug mode.

@davidwendt Has the RAPIDS team filed an nvcc bug for this?

@davidwendt
Author

No, not yet. I wanted to get some help from you on writing up the details for an nvcc bug, since I don't know what is happening.
Do you think the information here is enough to open an nvcc bug?

@alliepiper
Collaborator

OK, just wanted to make sure. I'm planning to spend a bit more time looking at this, in case the bug is in our libraries, before we escalate.

@alliepiper
Collaborator

I spent a couple more hours looking into this, and things seem to go off the rails around this line. The lhs of the assignment (thrust::get<1>(t)) is producing a null reference to an output_indexalator, but only in debug mode -- it is a valid object otherwise. This is why the class data is invalid during the output_indexalator::operator= method.

This tuple is produced from a zip_iterator of values_begin and output here.

I can't see anything going wrong in the source code, so this does seem like a compiler bug. I've filed NVBug 3322776 to have the compiler folks check it out.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jun 11, 2021
…tor in thrust::lower_bound (#8432)

Closes #6521 

The `thrust::lower_bound` call is crashing on a libcudf debug build when using the `output_indexalator`. I've opened [an issue in the thrust github](NVIDIA/thrust#1452) to keep track of this. The problem only occurs when using the `-G` nvcc compile option.

I found a workaround using a `thrust::transform` along with a device lambda containing a `thrust::lower_bound(seq)` call for each element. This PR adds the workaround, which is only used in a debug build, since the error occurs in functions that are used as utilities for other functions when using dictionary columns.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #8432
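
The workaround described in the commit above can be sketched as follows. This is a hedged reconstruction, not the exact cudf patch: variable names and structure are illustrative, and it requires the nvcc `--extended-lambda` flag. Each output element is computed by a per-element sequential `lower_bound` inside a device lambda, which sidesteps the custom output iterator entirely:

```cuda
// Sketch of the workaround: replace the single bulk thrust::lower_bound call
// (which crashes under -G) with a thrust::transform that performs a
// sequential lower_bound per element.
// Compile with: nvcc -std=c++17 --extended-lambda workaround.cu
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>
#include <vector>

int main()
{
  std::vector<int> input{0, 1, 2, 3, 4};
  std::vector<int> values{9, 1, 0, 2};
  thrust::device_vector<int> d_input(input.begin(), input.end());
  thrust::device_vector<int> d_values(values.begin(), values.end());
  thrust::device_vector<int> output(d_values.size());

  int const* first = thrust::raw_pointer_cast(d_input.data());
  int const* last  = first + d_input.size();

  thrust::transform(thrust::device,
                    d_values.begin(),
                    d_values.end(),
                    output.begin(),
                    [first, last] __device__(int v) {
                      // per-element sequential search in place of the bulk call
                      return static_cast<int>(
                        thrust::lower_bound(thrust::seq, first, last, v) - first);
                    });

  for (int v : thrust::host_vector<int>(output)) std::cout << " " << v;  // 5 1 0 2
  std::cout << std::endl;
  return 0;
}
```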
@alliepiper alliepiper changed the title CUDA thrust::lower_bound fails when give a custom output iterator and compiled with -G option CUDA thrust::lower_bound fails when give a custom output iterator and compiled with -G option [NVBug 3322776] Aug 17, 2021
@alliepiper alliepiper added nvbug Has an associated internal NVIDIA NVBug. P1: should have Necessary, but not critical. labels Aug 17, 2021
@alliepiper alliepiper modified the milestones: 1.14.0, 1.15.0 Aug 17, 2021
@alliepiper alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 15, 2021
@alliepiper alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022
@alliepiper alliepiper modified the milestones: 1.17.0, Backlog Apr 25, 2022
@ericniebler
Collaborator

This is a tricky one. The root issue is that for output_indexalator, which is a random access iterator, operator* returns by reference but operator[] returns by value. This is allowed by the C++17 iterator concepts, but the Thrust for_each[_n] algorithms are not expecting this.

  template <class Input, class UnaryOp>
  struct for_each_f
  {
    Input input;
    UnaryOp op;

    THRUST_FUNCTION
    for_each_f(Input input, UnaryOp op)
        : input(input), op(op) {}

    template <class Size>
    THRUST_DEVICE_FUNCTION void operator()(Size idx)
    {
      op(raw_reference_cast(input[idx])); // HERE
    }
  };

This has a bad interaction with zip_iterator's dereference_iterator callable, leading to a function returning a reference to a local.

The fix is quite simple: Thrust should avoid using operator[] on random access iterators. This eliminates the crash:

    template <class Size>
    THRUST_DEVICE_FUNCTION void operator()(Size idx)
    {
      op(raw_reference_cast(*(input + idx)));
    }

ericniebler added a commit to ericniebler/thrust that referenced this issue Mar 8, 2023
Too much code in Thrust assumes that it[n] returns the same type as
*(it+n), but the standard only requires that it[n] is convertible to the
type of *(it+n). Thrust should avoid using operator[] on iterators and
prefer instead to use addition/dereference.

Fixes NVIDIA#1452
4 participants