
[C++][Parquet] Utilize revamped common hashing machinery for dictionary encoding #42863

Closed
asfimport opened this issue Nov 25, 2018 · 2 comments

Comments


asfimport commented Nov 25, 2018

@pitrou has recently made some significant improvements to the hashing / dictionary encoding machinery in Apache Arrow:

eaf8d32

parquet-cpp is using a custom hash table:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding-internal.h#L456

It would be nice to utilize the common hash table machinery if possible. We should of course make sure that such a change does not cause performance regressions (hashing performance improved due to Antoine's patch, so performance may also improve on the Parquet write path).
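For context (an illustrative sketch, not from the original issue): on the write path, dictionary encoding boils down to memoizing each value into an index via a hash table, which is why the hash table implementation sits on the hot path. A toy Python version of the idea, using hypothetical names:

def dictionary_encode(values):
    # value -> dictionary index; this dict plays the role of the memo/hash table
    memo = {}
    dictionary = []   # unique values, in order of first appearance
    indices = []      # per-value dictionary indices written to the data pages
    for v in values:
        idx = memo.get(v)
        if idx is None:
            idx = len(dictionary)
            memo[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices

# dictionary_encode([3, 1, 3, 2, 1]) -> ([3, 1, 2], [0, 1, 0, 2, 1])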

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as PARQUET-1463. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Issue resolved by pull request #3036


Wes McKinney / @wesm:
Using the following setup:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

K = 1000
N = 50000000
df = pd.DataFrame({'ints': np.tile(np.arange(K), N // K)})
table = pa.Table.from_pandas(df)

Here all values end up represented in the dictionary, with the indices written as literal runs. I didn't find any appreciable difference in performance; timings before and after the change are below.
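As a side check (not part of the original comment), the Parquet metadata can be inspected to confirm that the column chunk is actually dictionary-encoded; a minimal pyarrow sketch, reusing the table from the setup above:

sink = pa.BufferOutputStream()
pq.write_table(table, sink, use_dictionary=True)
reader = pq.ParquetFile(pa.BufferReader(sink.getvalue()))
# Expect a dictionary encoding to be listed for the 'ints' column chunk,
# e.g. something like ('PLAIN_DICTIONARY', 'PLAIN', 'RLE') depending on the version.
print(reader.metadata.row_group(0).column(0).encodings)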

# BEFORE

# In [12]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=True)
# CPU times: user 1.5 s, sys: 132 ms, total: 1.64 s
# Wall time: 1.63 s

# In [13]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=False)
# CPU times: user 1.28 s, sys: 148 ms, total: 1.43 s
# Wall time: 1.43 s

# AFTER

# In [4]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=True)
# CPU times: user 1.56 s, sys: 120 ms, total: 1.68 s
# Wall time: 1.69 s

# In [5]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=True)
# CPU times: user 1.5 s, sys: 116 ms, total: 1.62 s
# Wall time: 1.61 s

# In [6]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=True)
# CPU times: user 1.5 s, sys: 108 ms, total: 1.61 s
# Wall time: 1.61 s

# In [8]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=False)
# CPU times: user 1.31 s, sys: 120 ms, total: 1.43 s
# Wall time: 1.44 s

# In [9]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=False)
# CPU times: user 1.29 s, sys: 116 ms, total: 1.41 s
# Wall time: 1.41 s

# In [10]: %time pq.write_table(table, pa.BufferOutputStream(), use_dictionary=False)
# CPU times: user 1.32 s, sys: 112 ms, total: 1.44 s
# Wall time: 1.44 s

I was mainly interested in whether the inner workings of the hash table caused any overhead in this case. I would guess string dictionary writes are a bit faster now with the better hashing path.
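A natural follow-up (not measured here) would be the same write benchmark on a string column, where hashing variable-length values is more expensive; a minimal sketch under the same K/N setup:

# Same shape as the integer benchmark, but with string values so the memo
# table has to hash variable-length data on the write path.
keys = np.array(['key_%04d' % i for i in range(K)], dtype=object)
df_str = pd.DataFrame({'strs': np.tile(keys, N // K)})
table_str = pa.Table.from_pandas(df_str)

# %time pq.write_table(table_str, pa.BufferOutputStream(), use_dictionary=True)
# %time pq.write_table(table_str, pa.BufferOutputStream(), use_dictionary=False)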
