[FEA] Add a new parameter for all_integer & all_float to evaluate null values as True #5136

Closed
galipremsagar opened this issue May 8, 2020 · 6 comments · Fixed by #5054
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@galipremsagar
Contributor

Is your feature request related to a problem? Please describe.
This is a followup feature request from the discussion that happened in #5130

From that discussion, it appears that all_integer/all_float were added as a replacement for is_integer/is_float followed by an all reduction.

But since is_integer and is_float return null for a null input, and the all reduction ignores nulls, we'd like all_integer/all_float to offer similar behavior.

Currently all_integer & all_float seem to return False for null.

>>> import numpy as np
>>> import cudf
>>> import cudf._lib.strings.char_types as c
>>> s = cudf.Series(["10", np.nan, "1"])
>>> cudf.Series(c.is_integer(s._column)).all()
True
>>> s = cudf.Series(["abc", None, "1"])
>>> cudf.Series(c.is_integer(s._column))
0    False
1     null
2     True
dtype: bool
>>> s = cudf.Series(["10", None, "1"])
>>> cudf.Series(c.is_integer(s._column)).all()
True
>>> c.all_integers(s._column)
False

>>> cudf.Series(c.is_float(s._column))
0    True
1    null
2    True
dtype: bool
>>> cudf.Series(c.is_float(s._column)).all()
True
>>> c.all_floats(s._column)
False

Describe the solution you'd like
One solution is to have a null_policy parameter added to the functions and evaluate based on the inclusion/ exclusion policy.
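
A minimal pure-Python sketch of how such a parameter might behave. `NullPolicy`, `is_integer`, and `all_integer` here are hypothetical stand-ins written for illustration, not the libcudf or cudf API, and `None` plays the role of a null row. The sketch follows the convention used by the sample implementations later in this thread: a null row counts as passing only under INCLUDE, so the default EXCLUDE keeps the current behavior of returning False when a null is present.

```python
from enum import Enum


class NullPolicy(Enum):
    """Hypothetical stand-in for libcudf's null_policy."""
    EXCLUDE = 0
    INCLUDE = 1


def is_integer(value: str) -> bool:
    """Simplified check: an optional sign followed by ASCII digits only."""
    body = value[1:] if value[:1] in ("+", "-") else value
    return body.isascii() and body.isdigit()


def all_integer(values, null_policy=NullPolicy.EXCLUDE) -> bool:
    """Return True if every row passes; None models a null row."""
    def row_ok(value):
        if value is None:
            # Assumption: a null passes only when the policy is INCLUDE.
            return null_policy is NullPolicy.INCLUDE
        return is_integer(value)

    return all(row_ok(v) for v in values)


print(all_integer(["10", None, "1"]))                       # default policy
print(all_integer(["10", None, "1"], NullPolicy.INCLUDE))   # nulls pass
```

With the default policy the null row fails the check; with `NullPolicy.INCLUDE` the same input passes, which is the result `is_integer` + `all` gives today.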

Describe alternatives you've considered
The current alternative is to do is_integer/is_float + all.

Additional context
PR that has cython bindings added: #5054

@galipremsagar galipremsagar added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels May 8, 2020
@galipremsagar galipremsagar self-assigned this May 8, 2020
@kkraus14
Collaborator

kkraus14 commented May 8, 2020

@galipremsagar I thought we decided that we would use is_integer / is_float with all. This way people can control nulls as desired without needing to expand the libcudf API footprint unnecessarily.

@galipremsagar
Contributor Author

galipremsagar commented May 8, 2020

I think calling all_integer/all_float once, rather than is_integer/is_float followed by all, will result in better performance.

After benchmarking a few different inputs, I found some interesting results.

The current all_integer/all_float call thrust::all_of, which seems to be slower than (is_integer/is_float) + all. After some digging, I found this issue: https://github.com/thrust/thrust/issues/1016

Judging from some previous review comments, it seems we try to avoid all_of when it is not necessary:

  1. [REVIEW] Port scatter to libcudf++ #3354 (comment)
  2. [REVIEW] Define and implement new stream compaction APIs copy_if, drop_nulls, apply_boolean_mask, drop_duplicate and unique_count. #3303 (comment)

So, I went ahead and made changes to all_integer & all_float APIs in libcudf to see performance differences.

str to int conversion check

import cudf._lib.strings.char_types as c
%timeit c.is_integer(data._column).all()
%timeit c.all_integers(data._column)
| data | thrust::transform(is_integer) + .all() | all_integer(thrust::all_of) | all_integer(thrust::transform_reduce) | all_integer(thrust::count_if) |
|---|---|---|---|---|
| `cudf.Series(["10", '2.0', "1.0"])` | 192 µs ± 1.25 µs | 80.3 µs ± 1.01 µs | **59.2 µs ± 520 ns** | 191 µs ± 1.55 µs |
| `cudf.datasets.randomdata(600_00_00_00, {'s':int})['s'].astype('str')` | **61 ms ± 298 µs** | 87.9 ms ± 596 µs | 69.8 ms ± 127 µs | 68 ms ± 250 µs |
| `cudf.datasets.randomdata(600_00_00, {'s':str})['s']` | 1.77 ms ± 235 µs | **500 µs ± 4.88 µs** | 847 µs ± 3.16 µs | 919 µs ± 17.5 µs |
| `cudf.datasets.randomdata(600_00_000, {'s':float})['s'].astype('str')` | 8.76 ms ± 34.8 µs | **511 µs ± 3.88 µs** | 8.24 ms ± 15.3 µs | 9.03 ms ± 10.9 µs |
| `cudf.Series(["-100.763767", "23545676", "+32674864873", "-0.37628746348734"]*150000)` | 707 µs ± 4.43 µs | 516 µs ± 2.76 µs | **480 µs ± 1.31 µs** | 508 µs ± 10.6 µs |

str to float conversion check

import cudf._lib.strings.char_types as c
%timeit c.is_float(data._column).all()
%timeit c.all_floats(data._column)
| data | transform(is_float) + .all() | all_float(thrust::all_of) | all_float(thrust::transform_reduce) | all_float(thrust::count_if) |
|---|---|---|---|---|
| `cudf.Series(["10", '2.0', "1.0"])` | 195 µs ± 1.85 µs | 84.6 µs ± 416 ns | 61.3 µs ± 574 ns | **60.5 µs ± 707 ns** |
| `cudf.datasets.randomdata(600_00_00_00, {'s':int})['s'].astype('str')` | **40.6 ms ± 53.8 µs** | 67.5 ms ± 228 µs | 41.4 ms ± 960 µs | 44.8 ms ± 253 µs |
| `cudf.datasets.randomdata(600_00_00, {'s':str})['s']` | 1.5 ms ± 32.6 µs | **471 µs ± 6.25 µs** | 728 µs ± 3.34 µs | 728 µs ± 2.81 |
| `cudf.datasets.randomdata(600_00_000, {'s':float})['s'].astype('str')` | 8.15 ms ± 84.3 µs | 9.56 ms ± 28.6 µs | **7.51 ms ± 25.7 µs** | 8.89 ms ± 24.8 µs |
| `cudf.Series(["-100.763767", "23545676", "+32674864873", "-0.37628746348734"]*150000)` | 697 µs ± 41.8 µs | 483 µs ± 4.57 µs | **458 µs ± 3.52 µs** | 468 µs ± 8.45 µs |

Note: The numbers in Bold indicate best performing cells/APIs.

It seems that transform_reduce and count_if perform well overall, but all_of is an outlier in some of the str-to-int benchmarks. So I'm unsure which Thrust implementation to go ahead with.

@kkraus14 @davidwendt @jrhemstad Could you suggest if we can pick one of the APIs?

@galipremsagar
Contributor Author

Sample all_float implementations, one for each of the Thrust APIs that were benchmarked:

thrust::transform_reduce

// Reduce the per-row results with logical AND.
bool all_float(strings_column_view const& strings,
               null_policy null_handling           = null_policy::EXCLUDE,
               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
               cudaStream_t stream                 = 0)
{
  auto strings_column     = column_device_view::create(strings.parent(), stream);
  auto d_column           = *strings_column;
  size_type strings_count = strings.size();
  return thrust::transform_reduce(
    rmm::exec_policy(stream)->on(stream),
    thrust::make_counting_iterator<size_type>(0),
    thrust::make_counting_iterator<size_type>(strings_count),
    // A null row counts as true only under null_policy::INCLUDE.
    [d_column, null_handling] __device__(size_type idx) {
      if (d_column.is_null(idx)) return null_handling == null_policy::INCLUDE;
      return string::is_float(d_column.element<string_view>(idx));
    },
    true,
    thrust::logical_and<bool>{});
}

thrust::count_if

// Count the rows that pass and compare against the total row count.
bool all_float(strings_column_view const& strings,
               null_policy null_handling           = null_policy::EXCLUDE,
               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
               cudaStream_t stream                 = 0)
{
  auto strings_column     = column_device_view::create(strings.parent(), stream);
  auto d_column           = *strings_column;
  size_type strings_count = strings.size();
  return thrust::count_if(
           rmm::exec_policy(stream)->on(stream),
           thrust::make_counting_iterator<size_type>(0),
           thrust::make_counting_iterator<size_type>(strings_count),
           // A null row counts as true only under null_policy::INCLUDE.
           [d_column, null_handling] __device__(size_type idx) {
             if (d_column.is_null(idx)) return null_handling == null_policy::INCLUDE;
             return string::is_float(d_column.element<string_view>(idx));
           }) == strings_count;
}

thrust::all_of

// Wrap the per-row check in a transform iterator and test it with all_of.
bool all_float(strings_column_view const& strings,
               null_policy null_handling           = null_policy::EXCLUDE,
               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
               cudaStream_t stream                 = 0)
{
  auto strings_column = column_device_view::create(strings.parent(), stream);
  auto d_column       = *strings_column;
  // A null row counts as true only under null_policy::INCLUDE.
  auto transformer_itr =
    thrust::make_transform_iterator(thrust::make_counting_iterator<size_type>(0),
                                    [d_column, null_handling] __device__(size_type idx) {
                                      if (d_column.is_null(idx))
                                        return null_handling == null_policy::INCLUDE;
                                      return string::is_float(d_column.element<string_view>(idx));
                                    });
  return thrust::all_of(rmm::exec_policy(stream)->on(stream),
                        transformer_itr,
                        transformer_itr + strings.size(),
                        thrust::identity<bool>());
}
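
For readers less familiar with Thrust, here is a rough pure-Python analogue of the three formulations above. It only illustrates that the three compute the same answer and says nothing about GPU performance. The helper names and the `nulls_pass` flag are hypothetical, `None` models a null row, and Python's `float()` is used as a loose stand-in for the libcudf float check (it accepts some inputs, such as "nan", that the real check may treat differently).

```python
from functools import reduce
from operator import and_


def is_float(value: str) -> bool:
    """Loose stand-in for the real check: whatever Python's float() accepts."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def row_ok(value, nulls_pass: bool) -> bool:
    """Per-row predicate; None models a null row."""
    if value is None:
        return nulls_pass
    return is_float(value)


def all_float_transform_reduce(values, nulls_pass=False) -> bool:
    # Map every row to a bool, then fold with AND starting from True.
    return reduce(and_, (row_ok(v, nulls_pass) for v in values), True)


def all_float_count_if(values, nulls_pass=False) -> bool:
    # Count passing rows and compare with the total number of rows.
    return sum(1 for v in values if row_ok(v, nulls_pass)) == len(values)


def all_float_all_of(values, nulls_pass=False) -> bool:
    # Built-in all() stops at the first failing row.
    return all(row_ok(v, nulls_pass) for v in values)


data = ["-100.763767", "23545676", None, "-0.37628746348734"]
for fn in (all_float_transform_reduce, all_float_count_if, all_float_all_of):
    print(fn.__name__, fn(data), fn(data, nulls_pass=True))
```

All three agree on every input; they differ only in how much work they do, since the first two always visit every row while the last can stop early.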

@davidwendt
Contributor

davidwendt commented May 8, 2020

My opinion is we should optimize for the expected case of valid data. We should not optimize for expecting bad data since the data can usually not be processed any further (i.e. converting to integers or floats will fail). The thrust::all_of will only be faster on bad data.

My recommendation is to just use the is_integer() - all() pattern or change the all_integer() and all_float() to use thrust::transform_reduce instead of thrust::all_of.
Both of these appear faster on valid data from the chart above.
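
The point about valid versus bad data can be illustrated with a small serial model that counts predicate evaluations. This is only a sketch: `count_evaluations` is a hypothetical helper, and it does not model GPU parallelism or how Thrust actually implements all_of. It shows only that early exit saves work when a failing row exists and saves nothing when every row is valid.

```python
def count_evaluations(values, predicate, short_circuit):
    """Return (all rows pass?, number of predicate calls made)."""
    calls = 0
    result = True
    for value in values:
        calls += 1
        if not predicate(value):
            result = False
            if short_circuit:
                break  # early exit, as an all_of-style search would do
    return result, calls


valid = ["1"] * 1000          # every row is an integer string
bad = ["x"] + ["1"] * 999     # the very first row fails

print(count_evaluations(valid, str.isdigit, short_circuit=True))
print(count_evaluations(valid, str.isdigit, short_circuit=False))
print(count_evaluations(bad, str.isdigit, short_circuit=True))
print(count_evaluations(bad, str.isdigit, short_circuit=False))
```

On the valid input both strategies evaluate all 1000 rows, so any extra bookkeeping for early exit is pure overhead. On the bad input the short-circuit version stops after one row.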

@kkraus14
Collaborator

kkraus14 commented May 8, 2020

For the sake of reducing API footprint I would suggest we continue with is_integer() + all().

@galipremsagar
Contributor Author

Thanks everyone for the inputs! Closing this issue as we'll be using is_integer() + all() in #5054
