
[FEA] Improve performance of loading long strings from Parquet #7545

Closed
jlowe opened this issue Mar 10, 2021 · 3 comments
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Performance (Performance related issue), strings (strings issues, C++ and Python)

Comments

@jlowe (Member) commented Mar 10, 2021

Is your feature request related to a problem? Please describe.
Loading a Parquet dataset containing relatively long strings per row (e.g.: 500+ characters per row) takes quite a bit of time due to the time spent in make_strings_column as shown in this Nsight Systems trace:
[Screenshot: Nsight Systems trace showing the time spent in make_strings_column]

It looks like make_strings_column may be using a row-level parallelism algorithm (one thread per row), which will not perform well when there is a large number of characters per row.

Describe the solution you'd like
Ideally Parquet string decoding for long strings should be fast, whether that be via optimizing make_strings_column or using a different approach to string decoding altogether. Updating make_strings_column to use a char-parallel algorithm may be appropriate.
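The row-parallel vs. char-parallel distinction can be illustrated with a CPU sketch (not cudf's actual implementation; the helper name and shapes are invented for illustration). Each loop iteration below stands in for one GPU thread: a worker handles a single output character and finds its source row by binary-searching the offsets, so the work stays balanced even when a few rows hold most of the characters:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Character-parallel copy sketch: one "thread" (iteration) per output
// character. The row owning character i is found by binary search on the
// offsets array, where offsets[r]..offsets[r+1] is row r's character range.
std::vector<char> char_parallel_copy(std::vector<std::string> const& rows,
                                     std::vector<int32_t> const& offsets) {
  std::vector<char> out(offsets.back());
  for (int32_t i = 0; i < static_cast<int32_t>(out.size()); ++i) {
    // First offset strictly greater than i marks the row boundary after i.
    auto it  = std::upper_bound(offsets.begin(), offsets.end(), i);
    auto row = static_cast<size_t>(it - offsets.begin()) - 1;
    out[i]   = rows[row][i - offsets[row]];
  }
  return out;
}
```

Here `offsets` is the exclusive scan of the row lengths, as in a strings column; empty rows are naturally skipped because no character index falls in their (empty) range.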

@jlowe jlowe added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Mar 10, 2021
@jrhemstad jrhemstad removed the Needs Triage Need team to review and classify label Mar 10, 2021
@jrhemstad (Contributor) commented:
I believe this is the code that is suffering on long strings:

auto copy_chars = [d_chars] __device__(auto item) {
  string_index_pair str = thrust::get<0>(item);
  size_type offset      = thrust::get<1>(item);
  if (str.first != nullptr) memcpy(d_chars + offset, str.first, str.second);
};
thrust::for_each_n(rmm::exec_policy(stream),
                   thrust::make_zip_iterator(thrust::make_tuple(
                     begin, offsets_column->view().template begin<int32_t>())),
                   strings_count,
                   copy_chars);

This could use the same treatment as your optimization to gather. In fact, I wonder if there's a way to cast this factory as a gather in order to take advantage of the optimization that is already there.

@davidwendt (Contributor) commented:

This seems similar to, if not the same as, #7571.
make_strings_column was improved for long strings in #7576.
Can this be closed?

@jlowe (Member, Author) commented Mar 23, 2021

Yes, this is much improved.

@jlowe jlowe closed this as completed Mar 23, 2021