Speed up reading Datetime64 in numpy mode. #365

Merged
merged 1 commit into from
Mar 26, 2023

joelgibson
Contributor

@joelgibson joelgibson commented Mar 26, 2023

Converting DateTime64 columns into the datetime64[ns] dtype currently takes a lot of time, most of it spent in copies, multiplications, divisions, and conversions. This change speeds up the conversion by about 3x.

My apologies: I don't see where I could include a test for this sort of change (performance-only), but I would be happy to take guidance and modify the PR.

Benchmarking
I'm running amd64 Linux with the latest ClickHouse "quick installation" (version 23.3.1.2286) running off a RAM disk. I ran these benchmarks in Python 3.11.2 and 3.8.16, and the results were similar. The NumPy version is 1.24.2.

I set up a table containing 10 million Datetime64(3, 'UTC') timestamps:

CREATE TABLE times (time Datetime64(3, 'UTC'))
ENGINE = MergeTree
PRIMARY KEY (time)

INSERT INTO times
SELECT timestamp_add(now(), INTERVAL n.number SECOND) AS time
FROM numbers(10000000) n

OPTIMIZE TABLE times FINAL

and a short benchmarking script:

import timeit
from clickhouse_driver import Client

REPS = 10

client = Client(host='localhost', settings=dict(use_numpy=True))
time = timeit.timeit(lambda: client.query_dataframe('SELECT * FROM times'), number=REPS)
print(f"{time/REPS:.3f} seconds per iteration ({REPS} iterations)")
client.disconnect()

The current master branch takes about 0.25 seconds per iteration, while my change takes about 0.07 seconds, roughly 30% of the time.

Correctness
ClickHouse's DateTime64(s) field returns a timestamp of $t$ seconds as the integer $t \cdot 10^s$, where $0 \leq s \leq 9$ is the scale parameter. NumPy's datetime64[ns] dtype represents the same time as $t \cdot 10^9$, so to convert from ClickHouse to NumPy we need to multiply by $10^{9 - s}$, which is always an integer since $s \leq 9$.
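The scaling argument above can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, not the code from this PR: the helper name `to_datetime64_ns` and the sample values are made up for the example.

```python
import numpy as np


def to_datetime64_ns(raw, scale):
    """Convert raw DateTime64(scale) integer ticks to datetime64[ns].

    Hypothetical helper for illustration; not part of clickhouse-driver.
    """
    if not 0 <= scale <= 9:
        raise ValueError("scale must be between 0 and 9")
    ticks = np.asarray(raw, dtype=np.int64)
    # 10 ** (9 - scale) is an exact integer because scale <= 9.
    factor = 10 ** (9 - scale)
    if factor != 1:
        ticks = ticks * np.int64(factor)
    # Reinterpret the int64 nanosecond counts as datetime64[ns] without copying.
    return ticks.view("datetime64[ns]")


# Millisecond ticks (scale 3): 0 ms, 1 ms and 1500 ms after the epoch.
raw = np.array([0, 1, 1500], dtype=np.int64)
result = to_datetime64_ns(raw, 3)
print(result)
```

A single integer multiplication followed by a `view` avoids any floating-point division and needs at most one intermediate array, which is the kind of saving the description above is after.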

There is one other subtlety, which is that ClickHouse can return negative numbers for timestamps (i.e. the receiving datatype for the column should be int64 rather than uint64). The existing code works fine in this case, since NumPy's datetime64[ns] is a signed type underneath. I've added a test for it anyway, and changed the receiving datatype for the column to better reflect what is being sent across from ClickHouse.
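To make the signed-versus-unsigned point concrete, here is a small sketch that decodes the same eight bytes both ways. The byte buffer stands in for what might arrive over the wire; it is an assumption for illustration and does not reproduce the driver's actual reading code.

```python
import numpy as np

# One pre-epoch timestamp: -1500 ms, i.e. 1.5 seconds before 1970-01-01.
raw_bytes = np.array([-1500], dtype="<i8").tobytes()

# The same bytes interpreted as signed and as unsigned 64-bit integers.
as_signed = np.frombuffer(raw_bytes, dtype="<i8")
as_unsigned = np.frombuffer(raw_bytes, dtype="<u8")
print(as_signed[0], as_unsigned[0])

# Scale 3 (milliseconds) to nanoseconds, then reinterpret as datetime64[ns].
scale = 3
factor = 10 ** (9 - scale)
signed_ns = (as_signed * factor).view("datetime64[ns]")
print(signed_ns[0])
```

Read as uint64, the value looks like an enormous positive number, whereas int64 shows the negative tick count directly, so int64 describes the column more faithfully.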

Checklist

  • Add tests that demonstrate the correct behavior of the change. Tests should fail without the change.
  • Add or update relevant docs, in the docs folder and in code.
  • Ensure PR doesn't contain untouched code reformatting: spaces, etc.
  • Run flake8 and fix issues.
  • Run pytest and confirm no tests failed. See https://clickhouse-driver.readthedocs.io/en/latest/development.html.

@coveralls

Coverage Status

Coverage: 96.286% (-0.002%) from 96.288% when pulling 5849ce5 on joelgibson:datetime64-speedup into 2942ae3 on mymarilyn:master.

@xzkostyan
Member

There are currently no performance tests in the project.

@xzkostyan xzkostyan merged commit 94f2585 into mymarilyn:master Mar 26, 2023