This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

CUDA thrust::lower_bound fails when given a custom output iterator and compiled with -G option [NVBug 3322776] #1452

Closed
davidwendt opened this issue Jun 4, 2021 · 5 comments · Fixed by #1882
Labels: nvbug (Has an associated internal NVIDIA NVBug) · P1: should have (Necessary, but not critical) · repro: verified (The provided repro has been validated) · thrust · type: bug: functional (Does not work as intended)

Comments

@davidwendt

I believe this is a compiler issue since the problem only appears when using the -G option on nvcc. Unfortunately I'm not able to follow the thrust source code here well enough to see where the problem occurs. I've attached a smallish testcase that can reproduce the error consistently.

lb_output_itr.cu source file to reproduce the error
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <iostream>
#include <type_traits>
#include <vector>

// build failing one using:
//   nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
//
// build non-failing one by just removing the -G option

struct output_indexalator {
  using difference_type   = ptrdiff_t;
  using value_type        = int;
  using pointer           = int *;
  using iterator_category = std::random_access_iterator_tag;
  using reference         = output_indexalator const &;

  output_indexalator()                           = default;
  output_indexalator(output_indexalator const &) = default;
  output_indexalator(output_indexalator &&)      = default;
  output_indexalator &operator=(output_indexalator const &) = default;
  output_indexalator &operator=(output_indexalator &&) = default;

  __host__ __device__ output_indexalator &operator+=(difference_type offset)
  {
    p_ += offset * width_;
    return *this;
  }

  __host__ __device__ output_indexalator operator+(difference_type offset) const
  {
    auto tmp = output_indexalator(*this);
    tmp.p_ += (offset * width_);
    return tmp;
  }

  __device__ output_indexalator const operator[](int idx) const
  {
    output_indexalator tmp{*this};
    tmp.p_ += (idx * width_);
    return tmp;
  }

  __device__ output_indexalator const &operator*() const
  {
    printf("%p: op*() %p\n", this, p_);
    return *this;
  }

  __device__ output_indexalator const &operator=(int const value) const
  {
    void *tp = p_;
    printf("%p: set(%p,%d)\n", this, tp, value);  // class data is garbage
    // crashes here:
    (*static_cast<int *>(tp)) = static_cast<int>(value);
    return *this;
  }

  output_indexalator(void *data, int width) : width_(width), p_{static_cast<char *>(data)} {}

  int width_;  /// integer type width = 1,2,4, or 8
  char *p_;    /// pointer to the integer data in device memory
};

template <typename T>
thrust::device_vector<T> make_device_vector(std::vector<T> data)
{
  thrust::host_vector<T> h_data{data.begin(), data.end()};
  thrust::device_vector<T> d_data{h_data};
  return d_data;
}

int main(void)
{
  std::vector<int> input{0, 1, 2, 3, 4};
  auto d_input = make_device_vector(input);
  std::vector<int> values{9, 1, 0, 2};
  auto d_values = make_device_vector(values);

  thrust::device_vector<int> output(d_values.size());
  auto itr = output_indexalator(output.data().get(), sizeof(int));

  thrust::lower_bound(thrust::device,
                      d_input.begin(),
                      d_input.end(),
                      d_values.begin(),
                      d_values.end(),
                      itr,  // output.begin(),
                      thrust::less<long>());

  thrust::host_vector<int> h_output(output);
  for (auto v : h_output) std::cout << " " << v;
  std::cout << std::endl;
  return 0;
}

Compile the source file using the following command:

nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr

Running the resulting lb_output_itr executable gives the following result:

$ ./lb_output_itr 
0x7f65eefffbd0: op*() 0x7f65b7e00400
0x7f65eefffbd0: op*() 0x7f65b7e00404
0x7f65eefffbd0: op*() 0x7f65b7e00408
0x7f65eefffbd0: op*() 0x7f65b7e0040c
0x7f65eefffbd0: set(0x400000000,5)
0x7f65eefffbd0: set(0x400000000,1)
0x7f65eefffbd0: set(0x400000000,0)
0x7f65eefffbd0: set(0x400000000,2)
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted

The 0x400000000 should be the device pointer but the iterator object is getting trashed somewhere.

Building without the -G option will produce the correct result:

$ nvcc -std=c++17 -g lb_output_itr.cu -o lb_output_itr
$ ./lb_output_itr 
0x7f745cfffcc8: op*() 0x7f7427e00400
0x7f745cfffcc8: op*() 0x7f7427e00404
0x7f745cfffcc8: op*() 0x7f7427e00408
0x7f745cfffcc8: op*() 0x7f7427e0040c
0x7f745cfffcc8: set(0x7f7427e00400,5)
0x7f745cfffcc8: set(0x7f7427e00404,1)
0x7f745cfffcc8: set(0x7f7427e00408,0)
0x7f745cfffcc8: set(0x7f7427e0040c,2)
 5 1 0 2

The output_indexalator iterator being used here is a simplified version from a much larger set of code and has been pared down to provide a minimal reproducer for this issue.

I've verified the error occurs, with the same results, on my Linux 18.04 system with the following nvcc compiler versions (and their associated Thrust versions): V11.0.221, V11.1.105, V11.2.142, and V11.3.109.

@alliepiper
Collaborator

Took a quick look at this in the debugger; it appears that memory is getting corrupted -- the parallel_for algorithm's tile_base is consistently set to 0, but is a very large negative value by the time the crash happens. I haven't found a reason for this, and it's odd that it only happens in debug mode.

@davidwendt Has the RAPIDS team filed an nvcc bug for this?

@davidwendt
Author

No, not yet. I wanted to get some help from you on writing up the details for an nvcc bug, since I don't know what is happening.
Do you think the information here is enough to open an nvcc bug?

@alliepiper
Collaborator

OK, just wanted to make sure. I'm planning to spend a bit more time looking at this, in case the bug is in our libraries, before we escalate.

@alliepiper
Collaborator

I spent a couple more hours looking into this, and things seem to go off the rails around this line. The lhs of the assignment (thrust::get<1>(t)) is producing a null reference to an output_indexalator, but only in debug mode -- it is a valid object otherwise. This is why the class data is invalid during the output_indexalator::operator= method.

This tuple is produced from a zip_iterator of values_begin and output here.

I can't see anything going wrong in the source code, so this does seem like a compiler bug. I've filed NVBug 3322776 to have the compiler folks check it out.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jun 11, 2021
…tor in thrust::lower_bound (#8432)

Closes #6521 

The `thrust::lower_bound` call is crashing on a libcudf debug build when using the `output_indexalator`. I've opened [an issue in the thrust github](NVIDIA/thrust#1452) to keep track of this. The problem only occurs when using the `-G` nvcc compile option.

I found a workaround using a `thrust::transform` along with a device lambda containing a `thrust::lower_bound(seq)` call for each element. This PR adds the workaround, which is only used in a debug build, since the error occurs in functions that are used as utilities for other functions when using dictionary columns.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #8432
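
The workaround described in the commit above can be sketched as follows. This is a hedged reconstruction, not the exact cudf patch: variable names and structure are illustrative, and it requires the nvcc `--extended-lambda` flag. Each output element is computed by a per-element sequential `lower_bound` inside a device lambda, which sidesteps the custom output iterator entirely:

```cuda
// Sketch of the workaround: replace the single bulk thrust::lower_bound call
// (which crashes under -G) with a thrust::transform that performs a
// sequential lower_bound per element.
// Compile with: nvcc -std=c++17 --extended-lambda workaround.cu
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>
#include <vector>

int main()
{
  std::vector<int> input{0, 1, 2, 3, 4};
  std::vector<int> values{9, 1, 0, 2};
  thrust::device_vector<int> d_input(input.begin(), input.end());
  thrust::device_vector<int> d_values(values.begin(), values.end());
  thrust::device_vector<int> output(d_values.size());

  int const* first = thrust::raw_pointer_cast(d_input.data());
  int const* last  = first + d_input.size();

  thrust::transform(thrust::device,
                    d_values.begin(),
                    d_values.end(),
                    output.begin(),
                    [first, last] __device__(int v) {
                      // per-element sequential search in place of the bulk call
                      return static_cast<int>(
                        thrust::lower_bound(thrust::seq, first, last, v) - first);
                    });

  for (int v : thrust::host_vector<int>(output)) std::cout << " " << v;  // 5 1 0 2
  std::cout << std::endl;
  return 0;
}
```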
@alliepiper alliepiper changed the title CUDA thrust::lower_bound fails when give a custom output iterator and compiled with -G option CUDA thrust::lower_bound fails when give a custom output iterator and compiled with -G option [NVBug 3322776] Aug 17, 2021
@alliepiper alliepiper added nvbug Has an associated internal NVIDIA NVBug. P1: should have Necessary, but not critical. labels Aug 17, 2021
@alliepiper alliepiper modified the milestones: 1.14.0, 1.15.0 Aug 17, 2021
@alliepiper alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 15, 2021
@alliepiper alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022
@alliepiper alliepiper modified the milestones: 1.17.0, Backlog Apr 25, 2022
@ericniebler
Collaborator

This is a tricky one. The root issue is that for output_indexalator, which is a random access iterator, operator* returns by reference but operator[] returns by value. This is allowed by the C++17 iterator concepts, but the Thrust for_each[_n] algorithms are not expecting this.

  template <class Input, class UnaryOp>
  struct for_each_f
  {
    Input input;
    UnaryOp op;

    THRUST_FUNCTION
    for_each_f(Input input, UnaryOp op)
        : input(input), op(op) {}

    template <class Size>
    THRUST_DEVICE_FUNCTION void operator()(Size idx)
    {
      op(raw_reference_cast(input[idx])); // HERE
    }
  };

This has a bad interaction with zip_iterator's dereference_iterator callable, leading to a function returning a reference to a local.

The fix is quite simple: Thrust should avoid using operator[] on random access iterators. This eliminates the crash:

    template <class Size>
    THRUST_DEVICE_FUNCTION void operator()(Size idx)
    {
      op(raw_reference_cast(*(input + idx)));
    }

ericniebler added a commit to ericniebler/thrust that referenced this issue Mar 8, 2023
Too much code in Thrust assumes that it[n] returns the same type as
*(it+n), but the standard only requires that it[n] is convertible to the
type of *(it+n). Thrust should avoid using operator[] on iterators and
prefer instead to use addition/dereference.

Fixes NVIDIA#1452
4 participants