Speed up reading Datetime64 in numpy mode. #365
Merged
Currently a lot of time is spent converting DateTime64 columns into the `datetime64[ns]` dtype: copies, multiplications, divisions, and conversions all add up. This change speeds up conversion by about 3x.

My apologies: I don't see where I could include a test for this sort of change (performance-only), but I would be happy to take guidance and modify the PR.
Benchmarking

I'm running `amd64` Linux, with the latest Clickhouse "quick installation" (version 23.3.1.2286) running off a RAMdisk. I checked these benchmarks in Python 3.11.2 and 3.8.16, and the results were similar. The numpy version is 1.24.2.

I set up a table containing 10 million `Datetime64(3, 'UTC')` timestamps and a short benchmarking script.

The current `master` branch takes about 0.25 seconds per iteration, while my change takes about 0.07 seconds, about 30% of the time.

Correctness

Clickhouse's `DateTime64(s)` field returns a timestamp of $t$ seconds as $t \cdot 10^s$, where $0 \leq s \leq 9$ is the scale parameter. Numpy's `datetime64[ns]` dtype represents the same time as $t \cdot 10^9$, and so to convert from Clickhouse to Numpy we need to multiply by $10^{9 - s}$, which is always an integer since $s \leq 9$.

There is one other subtlety, which is that Clickhouse can return negative numbers for timestamps (i.e. the receiving datatype for the column should be `int64` rather than `uint64`). The existing code works fine in this case, since Numpy's `datetime64[ns]` is a signed type underneath. I've added a test for it anyway, and changed the receiving datatype for the column to better reflect what is being sent across from CH.

Checklist
- Run `flake8` and fix issues.
- Run `pytest`; no tests failed. See https://clickhouse-driver.readthedocs.io/en/latest/development.html.
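
For reviewers, the scale conversion described under Correctness can be sketched in a few lines of numpy. This is only an illustration of the arithmetic, not the code in this PR: the function name `ticks_to_datetime64_ns` and the sample values are hypothetical, and it assumes the raw column arrives as integers counting $10^{-s}$-second ticks since the epoch.

```python
import numpy as np


def ticks_to_datetime64_ns(ticks, scale):
    """Convert raw DateTime64(scale) ticks to numpy datetime64[ns].

    Hypothetical helper for illustration only; not the driver's API.
    """
    if not 0 <= scale <= 9:
        raise ValueError("scale must be between 0 and 9")
    # Receive as signed 64-bit integers: timestamps before the epoch are negative.
    ticks = np.asarray(ticks, dtype=np.int64)
    # t * 10**s -> t * 10**9 needs a factor of 10**(9 - s), which is an integer.
    factor = 10 ** (9 - scale)
    nanos = ticks * factor if factor != 1 else ticks
    # Reinterpret the int64 buffer as datetime64[ns] without another copy.
    return nanos.view("datetime64[ns]")


# DateTime64(3): ticks are milliseconds since the epoch (sample values).
ms = [0, 1_500, -1]
result = ticks_to_datetime64_ns(ms, scale=3)
print(result)
```

The single integer multiplication followed by a `view` avoids intermediate float or object conversions, and the `-1` sample shows why the receiving dtype should be `int64`: it maps to one millisecond before the epoch rather than a huge unsigned value.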