[FEA] Add a new parameter for `all_integer` & `all_float` to evaluate null values as True #5136

galipremsagar · 2020-05-08T03:00:59Z

Is your feature request related to a problem? Please describe.
This is a follow-up feature request from the discussion in #5130.

From that discussion, it appears that all_integer/all_float were added as a replacement for is_integer/is_float followed by an all reduction.

However, is_integer and is_float return null for null rows, and the all reduction then ignores those nulls. We'd like all_integer/all_float to offer the same behavior.

Currently, all_integer and all_float seem to return False when the column contains a null.

>>> import cudf
>>> import numpy as np
>>> import cudf._lib.strings.char_types as c
>>> s = cudf.Series(["10", np.nan, "1"])
>>> cudf.Series(c.is_integer(s._column)).all()
True
>>> s = cudf.Series(["abc", None, "1"])
>>> cudf.Series(c.is_integer(s._column))
0    False
1     null
2     True
dtype: bool
>>> s = cudf.Series(["10", None, "1"])
>>> cudf.Series(c.is_integer(s._column)).all()
True
>>> c.all_integers(s._column)
False

>>> cudf.Series(c.is_float(s._column))
0    True
1    null
2    True
dtype: bool
>>> cudf.Series(c.is_float(s._column)).all()
True
>>> c.all_floats(s._column)
False

Describe the solution you'd like
One solution is to add a null_policy parameter to these functions and evaluate nulls according to the inclusion/exclusion policy.
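A minimal pure-Python sketch of what such a parameter could look like. `NullPolicy`, `is_integer_str`, and `all_integer` below are hypothetical stand-ins rather than the libcudf or cudf API, `None` models a null row, and the integer check is only a rough approximation. The null handling mirrors the sample C++ code posted later in this thread: a null row passes only under `INCLUDE`.

```python
from enum import Enum
from typing import Optional, Sequence


class NullPolicy(Enum):
    """Hypothetical stand-in for libcudf's null_policy enum."""
    EXCLUDE = 0
    INCLUDE = 1


def is_integer_str(s: str) -> bool:
    """Rough check: an optional sign followed by one or more ASCII digits."""
    body = s[1:] if s[:1] in ("+", "-") else s
    return body.isascii() and body.isdigit()


def all_integer(values: Sequence[Optional[str]],
                null_handling: NullPolicy = NullPolicy.EXCLUDE) -> bool:
    """Return True if every row passes; a null row passes only under INCLUDE."""
    return all(
        (null_handling is NullPolicy.INCLUDE) if v is None else is_integer_str(v)
        for v in values
    )


default_result = all_integer(["10", None, "1"])
include_result = all_integer(["10", None, "1"], NullPolicy.INCLUDE)
```

With the default policy the null row makes the result False, which matches the current behavior shown above; with `INCLUDE` the null is evaluated as True.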

Describe alternatives you've considered
The current alternative is to use is_integer/is_float + all.

Additional context
PR that has cython bindings added: #5054

kkraus14 · 2020-05-08T03:20:30Z

@galipremsagar I thought we decided that we would use is_integer / is_float with all. This way people can control nulls as desired without needing to expand the libcudf API footprint unnecessarily.
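To illustrate how the two-step pattern leaves null handling to the caller, here is a small pure-Python model. The helpers `is_integer`, `all_valid`, and `fill_null` are hypothetical and only imitate the behavior shown in the session above (nulls propagate through the per-row check and are ignored by the reduction); they are not cudf functions.

```python
import re
from typing import List, Optional, Sequence

_INT_RE = re.compile(r"[+-]?[0-9]+")  # simplified integer pattern (assumption)


def is_integer(values: Sequence[Optional[str]]) -> List[Optional[bool]]:
    """Per-row check that propagates nulls: None in, None out."""
    return [None if v is None else _INT_RE.fullmatch(v) is not None
            for v in values]


def all_valid(flags: Sequence[Optional[bool]]) -> bool:
    """Reduce with all(), ignoring null entries."""
    return all(f for f in flags if f is not None)


def fill_null(flags: Sequence[Optional[bool]], value: bool) -> List[bool]:
    """Replace null entries with a chosen boolean before reducing."""
    return [value if f is None else f for f in flags]


flags = is_integer(["10", None, "1"])
nulls_ignored = all_valid(flags)                      # nulls are skipped
nulls_as_failure = all_valid(fill_null(flags, False)) # nulls count as failing
```

The caller picks the null semantics by deciding whether to fill nulls before the reduction, so no extra parameter is needed on the check itself.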

galipremsagar · 2020-05-08T14:40:48Z

I think calling all_integer/all_float once, rather than is_integer/is_float followed by all, will result in better performance.

After benchmarking a few different inputs, I found some interesting results.

The current all_integer/all_float call thrust::all_of, which seems to be slower than (is_integer/is_float) + all. After some digging, I found this issue: https://github.com/thrust/thrust/issues/1016

Judging from some previous review comments, we also seem to be trying to avoid all_of where it isn't necessary:

So, I went ahead and made changes to the all_integer and all_float APIs in libcudf to see the performance differences.

str to int conversion check

import cudf._lib.strings.char_types as c
%timeit c.is_integer(data._column).all()
%timeit c.all_integers(data._column)

| data | `thrust::transform`(`is_integer`) + `.all()` | `all_integer`(`thrust::all_of`) | `all_integer`(`thrust::transform_reduce`) | `all_integer`(`thrust::count_if`) |
| --- | --- | --- | --- | --- |
| `cudf.Series(["10", '2.0', "1.0"])` | 192 µs ± 1.25 µs | 80.3 µs ± 1.01 µs | **59.2 µs ± 520 ns** | 191 µs ± 1.55 µs |
| `cudf.datasets.randomdata(600_00_00_00, {'s':int})['s'].astype('str')` | **61 ms ± 298 µs** | 87.9 ms ± 596 µs | 69.8 ms ± 127 µs | 68 ms ± 250 µs |
| `cudf.datasets.randomdata(600_00_00, {'s':str})['s']` | 1.77 ms ± 235 µs | **500 µs ± 4.88 µs** | 847 µs ± 3.16 µs | 919 µs ± 17.5 µs |
| `cudf.datasets.randomdata(600_00_000, {'s':float})['s'].astype('str')` | 8.76 ms ± 34.8 µs | **511 µs ± 3.88 µs** | 8.24 ms ± 15.3 µs | 9.03 ms ± 10.9 µs |
| `cudf.Series(["-100.763767", "23545676", "+32674864873", "-0.37628746348734"]*150000)` | 707 µs ± 4.43 µs | 516 µs ± 2.76 µs | **480 µs ± 1.31 µs** | 508 µs ± 10.6 µs |

str to float conversion check

import cudf._lib.strings.char_types as c
%timeit c.is_float(data._column).all()
%timeit c.all_floats(data._column)

| data | transform(`is_float`) + `.all()` | `all_float`(`thrust::all_of`) | `all_float`(`thrust::transform_reduce`) | `all_float`(`thrust::count_if`) |
| --- | --- | --- | --- | --- |
| `cudf.Series(["10", '2.0', "1.0"])` | 195 µs ± 1.85 µs | 84.6 µs ± 416 ns | 61.3 µs ± 574 ns | **60.5 µs ± 707 ns** |
| `cudf.datasets.randomdata(600_00_00_00, {'s':int})['s'].astype('str')` | **40.6 ms ± 53.8 µs** | 67.5 ms ± 228 µs | 41.4 ms ± 960 µs | 44.8 ms ± 253 µs |
| `cudf.datasets.randomdata(600_00_00, {'s':str})['s']` | 1.5 ms ± 32.6 µs | **471 µs ± 6.25 µs** | 728 µs ± 3.34 µs | 728 µs ± 2.81 |
| `cudf.datasets.randomdata(600_00_000, {'s':float})['s'].astype('str')` | 8.15 ms ± 84.3 µs | 9.56 ms ± 28.6 µs | **7.51 ms ± 25.7 µs** | 8.89 ms ± 24.8 µs |
| `cudf.Series(["-100.763767", "23545676", "+32674864873", "-0.37628746348734"]*150000)` | 697 µs ± 41.8 µs | 483 µs ± 4.57 µs | **458 µs ± 3.52 µs** | 468 µs ± 8.45 µs |

Note: The numbers in bold indicate the best-performing cell in each row.

It seems like transform_reduce and count_if perform well overall, but in the str to int benchmarks there are a few outlier cases where all_of is far faster. So I'm unsure which Thrust implementation to go ahead with.

@kkraus14 @davidwendt @jrhemstad Could you suggest which of these APIs we should pick?

galipremsagar · 2020-05-08T14:42:52Z

Sample all_float implementations using the different Thrust APIs that were benchmarked:

thrust::transform_reduce

bool all_float(strings_column_view const& strings,
               null_policy null_handling           = null_policy::EXCLUDE,
               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
               cudaStream_t stream                 = 0)
{
  auto strings_column     = column_device_view::create(strings.parent(), stream);
  auto d_column           = *strings_column;
  size_type strings_count = strings.size();
  // Map each row to a bool, then AND all results together starting from true.
  return thrust::transform_reduce(
    rmm::exec_policy(stream)->on(stream),
    thrust::make_counting_iterator<size_type>(0),
    thrust::make_counting_iterator<size_type>(strings_count),
    [d_column, null_handling] __device__(size_type idx) {
      // A null row passes only when null_handling is INCLUDE.
      if (d_column.is_null(idx)) return null_handling == null_policy::INCLUDE;
      return string::is_float(d_column.element<string_view>(idx));
    },
    true,
    thrust::logical_and<bool>{});
}

thrust::count_if

bool all_float(strings_column_view const& strings,
               null_policy null_handling           = null_policy::EXCLUDE,
               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
               cudaStream_t stream                 = 0)
{
  auto strings_column     = column_device_view::create(strings.parent(), stream);
  auto d_column           = *strings_column;
  size_type strings_count = strings.size();
  // Count the rows that pass; every row is a float only if the count equals the row count.
  return thrust::count_if(
           rmm::exec_policy(stream)->on(stream),
           thrust::make_counting_iterator<size_type>(0),
           thrust::make_counting_iterator<size_type>(strings_count),
           [d_column, null_handling] __device__(size_type idx) {
             // A null row passes only when null_handling is INCLUDE.
             if (d_column.is_null(idx)) return null_handling == null_policy::INCLUDE;
             return string::is_float(d_column.element<string_view>(idx));
           }) == strings_count;
}

thrust::all_of

bool all_float(strings_column_view const& strings,
               null_policy null_handling           = null_policy::EXCLUDE,
               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
               cudaStream_t stream                 = 0)
{
  auto strings_column = column_device_view::create(strings.parent(), stream);
  auto d_column       = *strings_column;
  // Lazily map each row index to a bool (is this row a float?).
  auto transformer_itr =
    thrust::make_transform_iterator(thrust::make_counting_iterator<size_type>(0),
                                    [d_column, null_handling] __device__(size_type idx) {
                                      // A null row passes only when null_handling is INCLUDE.
                                      if (d_column.is_null(idx))
                                        return null_handling == null_policy::INCLUDE;
                                      return string::is_float(d_column.element<string_view>(idx));
                                    });
  // Check that every mapped value is true.
  return thrust::all_of(rmm::exec_policy(stream)->on(stream),
                        transformer_itr,
                        transformer_itr + strings.size(),
                        thrust::identity<bool>());
}
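For readers less familiar with Thrust, the three variants above differ only in how the per-row booleans are combined. A rough sequential Python analogue follows; it is not the CUDA code, the names are hypothetical, `_FLOAT_RE` is a simplified stand-in for `string::is_float`, and `nulls_pass=True` plays the role of `null_policy::INCLUDE`.

```python
import re
from functools import reduce
from operator import and_
from typing import Optional, Sequence

# Simplified float pattern (assumption): optional sign, digits, optional fraction.
_FLOAT_RE = re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)")


def row_ok(v: Optional[str], nulls_pass: bool = False) -> bool:
    """Per-row predicate: a null row passes only when nulls_pass is True."""
    if v is None:
        return nulls_pass
    return _FLOAT_RE.fullmatch(v) is not None


def via_transform_reduce(values: Sequence[Optional[str]], nulls_pass: bool = False) -> bool:
    # Map every row to a bool, then AND them together starting from True.
    return reduce(and_, (row_ok(v, nulls_pass) for v in values), True)


def via_count_if(values: Sequence[Optional[str]], nulls_pass: bool = False) -> bool:
    # Count the passing rows and compare with the total row count.
    return sum(1 for v in values if row_ok(v, nulls_pass)) == len(values)


def via_all_of(values: Sequence[Optional[str]], nulls_pass: bool = False) -> bool:
    # Stop at the first failing row.
    return all(row_ok(v, nulls_pass) for v in values)
```

All three return the same answer for a given input; what the benchmarks compare is how much work each strategy does to reach it.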

davidwendt · 2020-05-08T15:56:21Z

My opinion is that we should optimize for the expected case of valid data. We should not optimize for bad data, since such data usually cannot be processed any further (i.e. converting to integers or floats will fail). thrust::all_of will only be faster on bad data.

My recommendation is to just use the is_integer() + all() pattern, or to change all_integer() and all_float() to use thrust::transform_reduce instead of thrust::all_of.
Both of these appear faster on valid data in the chart above.
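The valid-data versus bad-data distinction can be sketched by counting predicate calls in plain Python. This assumes that an all_of-style check can stop at or near the first failing row while a reduction always visits every row; it is a sequential model with a hypothetical helper name, not a description of how Thrust schedules work on the GPU.

```python
from typing import List, Tuple


def count_evaluations(values: List[str], early_exit: bool) -> Tuple[bool, int]:
    """Return (result, number of predicate calls) for an all-rows check."""
    calls = 0

    def pred(v: str) -> bool:
        nonlocal calls
        calls += 1
        return v.isdigit()

    if early_exit:
        # Generator: all() stops at the first False.
        result = all(pred(v) for v in values)
    else:
        # List: every row is evaluated before the reduction runs.
        result = all([pred(v) for v in values])
    return result, calls


valid = ["1"] * 1000
bad = ["x"] + ["1"] * 999

valid_early = count_evaluations(valid, early_exit=True)
valid_full = count_evaluations(valid, early_exit=False)
bad_early = count_evaluations(bad, early_exit=True)
bad_full = count_evaluations(bad, early_exit=False)
```

On valid data both strategies evaluate every row, so early exit saves nothing; it only helps when a failing row shows up early.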

kkraus14 · 2020-05-08T19:07:08Z

For the sake of reducing API footprint I would suggest we continue with is_integer() + all().

galipremsagar · 2020-05-11T17:12:29Z

Thanks everyone for the input! Closing this issue as we'll be using is_integer() + all() in #5054.

galipremsagar added the `feature request`, `libcudf`, and `strings` labels May 8, 2020

galipremsagar self-assigned this May 8, 2020

galipremsagar mentioned this issue May 8, 2020 in [REVIEW] Change String typecasting to be inline with Pandas #5054 (merged)

galipremsagar closed this as completed May 11, 2020
