PARQUET-1463: [C++] Utilize common hashing machinery for dictionary encoding #3036
Conversation
Force-pushed from 84a88f3 to f909f0a
A crude micro-benchmark doesn't show any significant performance difference. I don't know if we have more significant benchmarks.
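For readers unfamiliar with the mechanism being optimized here: dictionary encoding replaces repeated values with small integer indices into a dictionary, and the dictionary itself is built with a hash map (a "memo table"). A minimal pure-Python sketch of the idea (the names below are illustrative only; the actual implementation is Arrow's C++ hashing machinery):

```python
def dict_encode(values):
    """Dictionary-encode a sequence: return (dictionary, indices).

    Each distinct value is assigned the next integer id the first time
    it is seen; later occurrences reuse that id via the memo table.
    """
    memo = {}          # value -> dictionary index (the "memo table")
    dictionary = []    # distinct values, in first-seen order
    indices = []
    for v in values:
        idx = memo.get(v)
        if idx is None:
            idx = len(dictionary)
            memo[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices
```

For example, `dict_encode(["N", "N", "Y", "N"])` returns `(["N", "Y"], [0, 0, 1, 0])`: a low-cardinality column like `store_and_fwd_flag` compresses to one small index per row plus a tiny dictionary. The PR's change is about making that memo-table lookup path shared and fast, not about the encoding scheme itself.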
Force-pushed from f909f0a to 3c12c88
@pitrou thanks so much for doing this. I will review the code and also run some before/after write perf benchmarks to kick the tires.
LGTM. In the absence of benchmarks, I often use http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml as a source of real-world data for performance testing.
Ok, I tested on a subset of taxi data. It's a bit faster with this PR. Actually, dictionary encoding does not seem to slow things down (anymore?):

>>> %time table = csv.read_csv("yellow.csv")
CPU times: user 1.11 s, sys: 309 ms, total: 1.41 s
Wall time: 105 ms
>>> table
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[s]
tpep_dropoff_datetime: timestamp[s]
passenger_count: int64
trip_distance: double
RatecodeID: int64
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
>>> table.num_rows
999998
>>> %time pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True)
CPU times: user 337 ms, sys: 3 ms, total: 340 ms
Wall time: 339 ms
>>> %time pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False)
CPU times: user 334 ms, sys: 11.4 ms, total: 345 ms
Wall time: 345 ms

(note how reading the data from CSV is faster in wall-clock time than saving it to Parquet ;-))
Oops, I should have disabled compression. Updated results:
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False, compression=None)
203 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True, compression=None)
338 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=False, compression=None)
204 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pq.write_table(table, pa.BufferOutputStream(), coerce_timestamps='ms', use_dictionary=True, compression=None)
297 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it looks like a 10% improvement when dictionary encoding is enabled.
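The shape of this comparison can be mimicked in pure Python with `timeit` (a toy stand-in to show the measurement pattern, not parquet-cpp's encoder; `dict_encode` below is a hypothetical hash-map-based encoder, and the column is synthetic low-cardinality data in the spirit of the taxi dataset's string columns):

```python
import timeit

def dict_encode(values):
    # Hash-map ("memo table") based dictionary encoding:
    # first sighting of a value assigns it the next integer id.
    memo, dictionary, indices = {}, [], []
    for v in values:
        idx = memo.get(v)
        if idx is None:
            idx = len(dictionary)
            memo[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices

# Synthetic low-cardinality column (like store_and_fwd_flag).
column = ["N", "Y", "N", "N"] * 100_000

runs = 3
t = timeit.timeit(lambda: dict_encode(column), number=runs)
print(f"dict_encode: {t / runs:.3f} s per pass over {len(column):,} values")
```

As in the session above, running each variant more than once (and without compression) matters: the first `use_dictionary=True` run measured 338 ms and the second 297 ms, so warm-up effects are visible even at this scale.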
+1 LGTM. This is a nice improvement!
Thanks @pitrou for the optimization. The Parquet write path is something we have never looked into optimizing for performance, as it was already quite fast in its basic form. So there's probably still some low-hanging fruit.
Will merge soon if nobody objects. |
+1. I'll run some ad hoc perf tests when I can, out of my own curiosity, and post results here.