PARQUET-1463: [C++] Utilize common hashing machinery for dictionary encoding #3036
Conversation
Force-pushed from 84a88f3 to f909f0a
A crude micro-benchmark doesn't show any significant performance difference. I don't know if we have more significant benchmarks.
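For readers unfamiliar with the mechanism being optimized here: dictionary encoding replaces repeated values with small integer indices into a dictionary, and the dictionary itself is built with a hash map (a "memo table"). A minimal pure-Python sketch of the idea (the names below are illustrative only; the actual implementation is Arrow's C++ hashing machinery):

```python
def dict_encode(values):
    """Dictionary-encode a sequence: return (dictionary, indices).

    Each distinct value is assigned the next integer id the first time
    it is seen; later occurrences reuse that id via the memo table.
    """
    memo = {}          # value -> dictionary index (the "memo table")
    dictionary = []    # distinct values, in first-seen order
    indices = []
    for v in values:
        idx = memo.get(v)
        if idx is None:
            idx = len(dictionary)
            memo[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices
```

For example, `dict_encode(["N", "N", "Y", "N"])` returns `(["N", "Y"], [0, 0, 1, 0])`: a low-cardinality column like `store_and_fwd_flag` compresses to one small index per row plus a tiny dictionary. The PR's change is about making that memo-table lookup path shared and fast, not about the encoding scheme itself.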
Force-pushed from f909f0a to 3c12c88
@pitrou thanks so much for doing this. I will review the code and also run some before/after write perf benchmarks to kick the tires.
LGTM. In the absence of benchmarks, I often use http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml as a source of real-world data for performance testing.
Ok, I tested on a subset of taxi data. It's a bit faster with this PR. Actually, dictionary encoding does not seem to slow things down (anymore?):

>>> %time table = csv.read_csv("yellow.csv")
CPU times: user 1.11 s, sys: 309 ms, total: 1.41 s
Wall time: 105 ms
>>> table
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[s]
tpep_dropoff_datetime: timestamp[s]
passenger_count: int64
trip_distance: double
RatecodeID: int64
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
>>> table.num_rows
999998
>>> %time pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True)
CPU times: user 337 ms, sys: 3 ms, total: 340 ms
Wall time: 339 ms
>>> %time pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False)
CPU times: user 334 ms, sys: 11.4 ms, total: 345 ms
Wall time: 345 ms

(note how reading the data from CSV is faster in wall-clock time than saving it to Parquet ;-))
Oops, I should have disabled compression. Updated results:
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False, compression=None)
203 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True, compression=None)
338 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False, compression=None)
204 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True, compression=None)
297 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it looks like a 10% improvement when dictionary encoding is enabled.
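The shape of this comparison can be mimicked in pure Python with `timeit` (a toy stand-in to show the measurement pattern, not parquet-cpp's encoder; `dict_encode` below is a hypothetical hash-map-based encoder, and the column is synthetic low-cardinality data in the spirit of the taxi dataset's string columns):

```python
import timeit

def dict_encode(values):
    # Hash-map ("memo table") based dictionary encoding:
    # first sighting of a value assigns it the next integer id.
    memo, dictionary, indices = {}, [], []
    for v in values:
        idx = memo.get(v)
        if idx is None:
            idx = len(dictionary)
            memo[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices

# Synthetic low-cardinality column (like store_and_fwd_flag).
column = ["N", "Y", "N", "N"] * 100_000

runs = 3
t = timeit.timeit(lambda: dict_encode(column), number=runs)
print(f"dict_encode: {t / runs:.3f} s per pass over {len(column):,} values")
```

As in the session above, running each variant more than once (and without compression) matters: the first `use_dictionary=True` run measured 338 ms and the second 297 ms, so warm-up effects are visible even at this scale.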
+1 LGTM. This is a nice improvement!
Thanks @pitrou for the optimization. The Parquet write path is something we have never looked into optimizing for performance, as it was already quite fast in its basic form. So there's probably still some low-hanging fruit.
Will merge soon if nobody objects. |
+1. I'll run some ad hoc perf tests when I can, out of my own curiosity, and post results here.