[FEA] Improve performance of loading long strings from Parquet #7545
Labels: `feature request` (New feature or request), `libcudf` (Affects libcudf C++/CUDA code), `Performance` (Performance related issue), `strings` (strings issues, C++ and Python)
**Is your feature request related to a problem? Please describe.**
Loading a Parquet dataset containing relatively long strings per row (e.g., 500+ characters per row) takes quite a bit of time due to the time spent in `make_strings_column`, as shown in this Nsight Systems trace:

[Nsight Systems trace screenshot]

It looks like `make_strings_column` may be using a row-level parallelism algorithm, which will not perform well when there are a large number of characters per row (roughly the pattern in the sketch below).
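For illustration, here is a minimal sketch of what a row-parallel character gather looks like. This is not cudf's actual implementation; all names and the buffer layout are hypothetical. One thread handles one row and copies its bytes serially, so rows with hundreds of characters turn into long, unbalanced per-thread loops:

```cuda
#include <cstdint>

// Hypothetical row-parallel gather: one thread per row. `d_src_offsets`
// marks where each row's characters start in the decoded source buffer;
// `d_out_offsets` is the usual strings-column offsets array.
__global__ void gather_chars_row_parallel(char const* d_src,
                                          int32_t const* d_src_offsets,
                                          int32_t const* d_out_offsets,
                                          int32_t num_rows,
                                          char* d_out_chars)
{
  auto const row = static_cast<int32_t>(blockIdx.x * blockDim.x + threadIdx.x);
  if (row >= num_rows) return;
  auto const len  = d_out_offsets[row + 1] - d_out_offsets[row];
  char const* src = d_src + d_src_offsets[row];
  char* dst       = d_out_chars + d_out_offsets[row];
  // Serial per-row loop: with long strings, each thread copies hundreds
  // of bytes while threads assigned shorter rows in the same warp sit idle.
  for (int32_t i = 0; i < len; ++i) { dst[i] = src[i]; }
}
```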
**Describe the solution you'd like**

Ideally, Parquet string decoding for long strings should be fast, whether via optimizing `make_strings_column` or using a different approach to string decoding altogether. Updating `make_strings_column` to use a character-parallel algorithm may be appropriate; a sketch of that idea follows.
to use a char-parallel algorithm may be appropriate.The text was updated successfully, but these errors were encountered: