Parquet writer dictionary encoding refactor #8476

Merged
Changes from 62 commits
76 commits
32dce41
Add make_zero_device_uvector_async
harrism Mar 30, 2021
d922868
device_vector->uvector in gather
harrism Mar 30, 2021
10c8b38
zero->zeroed
harrism Mar 31, 2021
404ddc4
uvector in dictionary
devavret Apr 2, 2021
dff6e5d
uvector in stats
devavret Apr 2, 2021
48618ce
Merge remote-tracking branch 'harrism/fea-gather-uvector' into parque…
devavret Apr 2, 2021
361878e
uvector in all remaining places in parquet writer
devavret Apr 5, 2021
edcd6d7
uvector in read parquet
devavret Apr 5, 2021
a838118
Use hostdevice_2dvector and spans for fragments
devavret Apr 5, 2021
7b446ab
fragment update in stats
devavret Apr 6, 2021
5685584
spans in frag stats
devavret Apr 6, 2021
a471bdb
2dvector for chunks part 1
devavret Apr 7, 2021
aa43df2
2d span for chunks
devavret Apr 7, 2021
f1639c4
flat span for chunk part 1
devavret Apr 8, 2021
34952e0
chunks flat span part 2
devavret Apr 8, 2021
bf3bdd3
span for pages part 1
devavret Apr 9, 2021
bb169bb
Merge branch 'branch-0.20' into parquet-writer-spans
devavret Apr 9, 2021
4a02435
spans for columndesc
devavret Apr 9, 2021
89cc637
chunk span for dictionary
devavret Apr 12, 2021
0ca82ea
spans for compression structs
devavret Apr 12, 2021
ab656d8
Clean up function arguments
devavret Apr 13, 2021
08ab80b
small cleanup in span for chunk fragment
devavret Apr 13, 2021
421e909
Merge branch 'branch-0.20' into parquet-writer-spans
devavret Apr 14, 2021
920ba7b
Merge branch 'branch-0.20' into parquet-writer-spans
devavret Apr 17, 2021
3c050bb
Merge branch 'branch-0.20' into parquet-writer-spans
devavret Apr 19, 2021
765b166
review fix
devavret Apr 19, 2021
9a2789f
Working insertion for 100 unique values
devavret Apr 27, 2021
37e8952
Misc review fixes
devavret Apr 29, 2021
a02c749
Merge branch 'branch-0.20' into parquet-writer-spans
devavret Apr 29, 2021
b9f49c4
Put a pointer in the page struct to its corresponding comp_stat
devavret Apr 29, 2021
90d3c7f
Clean up unnecessary arguments for comp_status
devavret Apr 29, 2021
b61dc48
Give chunk struct a pointer to its pages
devavret Apr 29, 2021
36ed3ad
Remove start_page from last remaining kernel
devavret Apr 30, 2021
a65eef7
Merge branch 'parquet-writer-spans' into parquet-writer-dict-refactor
devavret Apr 30, 2021
3dc3d44
Merge branch 'branch-0.20' into parquet-writer-dict-refactor
devavret May 3, 2021
638b554
Fixed the issue with hash and initializer.
devavret May 5, 2021
ac2173e
Merge branch 'branch-0.20' into parquet-writer-dict-refactor
devavret May 14, 2021
f8febb3
tested large num uniq
devavret May 18, 2021
ad64143
Pull from cuco
devavret May 19, 2021
b223cc4
Add dict size counting and decision making
devavret May 19, 2021
17acb35
Add dictionary compaction
devavret May 20, 2021
3d3ea90
Get dictionary indices. Slow but hopefully working
devavret May 21, 2021
8376ae4
Merge branch 'branch-21.06' into parquet-writer-dict-refactor
devavret May 21, 2021
214b756
change get indices launch to 1blk/5000 rows from 1blk/ck
devavret May 24, 2021
6d97224
tuned block sizes for dict insert and find kernels
devavret May 24, 2021
0152f88
Plug new dict into encoder. works for int8 test
devavret Jun 1, 2021
b04fb7b
Disable dict for bool cols
devavret Jun 2, 2021
e3093d1
Fix bug where num_dict_entries is 0 so nbits is 32
devavret Jun 3, 2021
5bc604e
Fix dict_index writing.
devavret Jun 3, 2021
1939ce5
Merge branch 'branch-21.08' into parquet-writer-dict-refactor
devavret Jun 7, 2021
bda722f
Complete replacing old dict code with new
devavret Jun 8, 2021
026ed8c
Clenup dict_data and dict_index
devavret Jun 8, 2021
23d4346
Don't launch dict kernels for 0 chunks
devavret Jun 9, 2021
d8a701f
dict_rle_bits_plus1 -> dict_rle_bits
devavret Jun 9, 2021
7fbd26b
Misc cleanups
devavret Jun 9, 2021
0d2cb6f
dict code cleanups
devavret Jun 9, 2021
a62f7f3
Remove old dict code from initFrags
devavret Jun 9, 2021
2e871ba
Completely remove old dict code
devavret Jun 9, 2021
cce3b8b
Documentation
devavret Jun 10, 2021
67997b5
Replace copied cuco headers with cpm included repo
devavret Jun 10, 2021
f4afda4
Revert changes to benchmark and test code
devavret Jun 10, 2021
9e8d666
Testing CI for deadlock
devavret Jun 10, 2021
797ba36
Merge branch 'branch-21.10' into parquet-writer-dict-refactor
devavret Aug 5, 2021
ff8b885
Build breakage caused by merge
devavret Aug 5, 2021
bb07ded
Confirming that newest version fixes CI for pascal
devavret Aug 6, 2021
15d15b4
Add missing syncthreads
devavret Aug 9, 2021
b401a5f
Fix for #8890
devavret Aug 9, 2021
533ccab
Review cleanups
devavret Aug 9, 2021
cd375a0
Fix a potential bug in dictionary creation where loop exits before sy…
devavret Aug 10, 2021
c717e72
Cmake review changes
devavret Aug 11, 2021
520cb84
More cmake review fixes
devavret Aug 11, 2021
8b74b96
More cmake fix
devavret Aug 11, 2021
487ffd3
no more camelCase
devavret Aug 12, 2021
09d02b5
Review fixes
devavret Aug 17, 2021
1f22996
MAX_DICT_SIZE was off by one
devavret Aug 18, 2021
81be63f
Update cpp/src/io/parquet/chunk_dict.cu
devavret Aug 18, 2021
5 changes: 4 additions & 1 deletion cpp/CMakeLists.txt
@@ -133,6 +133,8 @@ include(cmake/thirdparty/CUDF_GetArrow.cmake)
include(cmake/thirdparty/CUDF_GetDLPack.cmake)
# find libcu++
include(cmake/thirdparty/CUDF_GetLibcudacxx.cmake)
+# find cuCollections
+include(cmake/thirdparty/CUDF_GetcuCollections.cmake)
# find or install GoogleTest
include(cmake/thirdparty/CUDF_GetGTest.cmake)
# preprocess jitify-able kernels
@@ -245,7 +247,7 @@ add_library(cudf
src/io/orc/writer_impl.cu
src/io/parquet/compact_protocol_writer.cpp
src/io/parquet/page_data.cu
-src/io/parquet/page_dict.cu
+src/io/parquet/chunk_dict.cu
src/io/parquet/page_enc.cu
src/io/parquet/page_hdr.cu
src/io/parquet/parquet.cpp
@@ -438,6 +440,7 @@ target_compile_definitions(cudf PRIVATE "JITIFY_PRINT_LOG=0")
target_include_directories(cudf
PUBLIC "$<BUILD_INTERFACE:${DLPACK_INCLUDE_DIR}>"
"$<BUILD_INTERFACE:${JITIFY_INCLUDE_DIR}>"
+"$<BUILD_INTERFACE:${CUCO_INCLUDE_DIR}>"
"$<BUILD_INTERFACE:${LIBCUDACXX_INCLUDE_DIR}>"
"$<BUILD_INTERFACE:${CUDF_SOURCE_DIR}/include>"
"$<BUILD_INTERFACE:${CUDF_GENERATED_INCLUDE_DIR}/include>"
38 changes: 38 additions & 0 deletions cpp/cmake/thirdparty/CUDF_GetcuCollections.cmake
@@ -0,0 +1,38 @@
#=============================================================================
# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

function(find_and_configure_cucollections)

if(TARGET cuCollections::cuco)
return()
endif()

# Find or install cuCollections
CPMFindPackage(NAME cuco
GITHUB_REPOSITORY NVIDIA/cuCollections
GIT_TAG dev
OPTIONS "BUILD_TESTS OFF"
"BUILD_BENCHMARKS OFF"
"BUILD_EXAMPLES OFF"
)

set(CUCO_INCLUDE_DIR "${cuco_SOURCE_DIR}/include" PARENT_SCOPE)

# Make sure consumers of cudf can also see cuCollections::cuco target
fix_cmake_global_defaults(cuCollections::cuco)
endfunction()

find_and_configure_cucollections()
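
Note on how `CUCO_INCLUDE_DIR` reaches `cpp/CMakeLists.txt`: the new file sets it with `PARENT_SCOPE` from inside a function, and `include()` does not open a new variable scope, so the value should land in the including file's scope. The standalone script below is a minimal sketch of that mechanism only. The names `find_and_configure_demo` and `DEMO_INCLUDE_DIR` and the path are hypothetical and not part of this PR, and nothing here fetches cuCollections.

```cmake
# demo.cmake: a sketch of the PARENT_SCOPE pattern, runnable with
#   cmake -P demo.cmake
# All names below are illustrative assumptions, not taken from the PR.

function(find_and_configure_demo)
  # PARENT_SCOPE writes the variable into the caller's scope only.
  # It does not create or update a variable in the function's own scope.
  set(DEMO_INCLUDE_DIR "/path/to/demo/include" PARENT_SCOPE)

  # Still unset here, so this prints an empty value.
  message(STATUS "inside function:  '${DEMO_INCLUDE_DIR}'")
endfunction()

find_and_configure_demo()

# The caller sees the value, which is what lets a later
# target_include_directories() call reference it.
message(STATUS "after the call:   '${DEMO_INCLUDE_DIR}'")
```

One consequence of this pattern is that the variable is only visible to code that runs after the function call in the same (or a nested) scope, so the `include(...)` line has to precede the `target_include_directories(cudf ...)` block that reads `CUCO_INCLUDE_DIR`, as it does in the hunks above.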