Pandas interop #9

Closed
kszucs opened this issue Sep 19, 2017 · 10 comments

@kszucs
Contributor

kszucs commented Sep 19, 2017

It might be worth considering optional support for pandas DataFrames as inputs and outputs in clickhouse-driver.
Creating a pandas DataFrame from the block stream would be quite straightforward if done before the blocks are transposed to row-oriented form.
Here is a basic pandas dtype -> clickhouse type mapping: https://github.com/kszucs/ibis/blob/0101250a9d96f6a387129fcd770f3e092856dc56/ibis/clickhouse/types.py#L14

@xzkostyan would you like to include pandas support?
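
To make the idea of a dtype mapping concrete, here is a minimal sketch. It is not the table from the linked ibis module: the dictionary `DTYPE_TO_CLICKHOUSE` and the helper `clickhouse_type` are hypothetical names, and the choices for `bool`, `object` and `datetime64[ns]` are assumptions that a real wrapper would need to revisit.

```python
# Hypothetical lookup from NumPy/pandas dtype names to ClickHouse type names.
# Keys are the strings produced by str(series.dtype) in pandas.
DTYPE_TO_CLICKHOUSE = {
    'int8': 'Int8', 'int16': 'Int16', 'int32': 'Int32', 'int64': 'Int64',
    'uint8': 'UInt8', 'uint16': 'UInt16', 'uint32': 'UInt32', 'uint64': 'UInt64',
    'float32': 'Float32', 'float64': 'Float64',
    # Assumption: booleans are stored as 0/1 in a UInt8 column.
    'bool': 'UInt8',
    # Assumption: object columns hold only strings.
    'object': 'String',
    # Assumption: sub-second precision may be dropped.
    'datetime64[ns]': 'DateTime',
}


def clickhouse_type(dtype_name):
    """Return the ClickHouse type name for a dtype name, or raise TypeError."""
    try:
        return DTYPE_TO_CLICKHOUSE[dtype_name]
    except KeyError:
        raise TypeError('no ClickHouse type known for dtype %r' % dtype_name)
```

With pandas installed, `clickhouse_type(str(df['a'].dtype))` would look up the column type for a DataFrame column named `a`.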

@xzkostyan
Member

Hi, @kszucs!

Integration with different data sources is a good idea, but it should live in wrapper packages, not in the core package. There are many sources to integrate with, and their interfaces can change really fast. Unfortunately I'm not familiar with all of these data sources and I can't keep up with their changes.

You can write your own wrapper and publish it on PyPI. Feel free to ask if you need any information about integration. I'll try to help.

Note: the latest release exposes the driver version:

from clickhouse_driver import VERSION

It might help a wrapper integrate smoothly across driver versions.
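
A wrapper could use that constant to guard against unsupported driver versions. The sketch below assumes `VERSION` is a tuple of integers; the helper `is_supported` and the minimum `(0, 0, 8)` are illustrative choices, not part of the driver.

```python
def is_supported(version, minimum=(0, 0, 8)):
    """Compare version tuples element-wise, e.g. (0, 1, 0) >= (0, 0, 8)."""
    return tuple(version) >= tuple(minimum)


try:
    # Assumption: VERSION is a tuple of ints such as (0, 0, 8).
    from clickhouse_driver import VERSION
except ImportError:
    VERSION = None  # driver not installed in this environment

# True only when the driver is present and recent enough for the wrapper.
DRIVER_OK = VERSION is not None and is_supported(VERSION)
```

Tuple comparison avoids the pitfalls of comparing version strings, where `'0.0.10'` would sort before `'0.0.8'`.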

@kszucs
Contributor Author

kszucs commented Oct 11, 2017

Hi @xzkostyan

Actually, I'm trying to implement a columnar version of QueryResult, but there are a couple of inconsistencies in the Block implementation. AFAIK, block.data stores column-wise data when receiving, whereas it contains row-wise data when sending.

I'm kind of blocked because the tests run really slowly (I don't know why), so I'm sharing my findings instead: master...kszucs:columnar_block

@xzkostyan
Member

Hi, @kszucs.

Yup, you are right. There are some inconsistencies in how Block.data is stored:

  • When you issue an INSERT query, the data to insert is stored in the block row-wise.
  • On a SELECT statement, the data received from CH is stored in Block.data column-wise.

That's why the get_rows method is used to transpose the received column-wise data to row-wise. These two behaviors should be split into different block types later.
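
The transposition between the two layouts can be sketched in a few lines. This is an illustration of the idea only, not the driver's actual get_rows implementation; `columns_to_rows` and `rows_to_columns` are hypothetical helper names.

```python
def columns_to_rows(columns):
    """Transpose column-wise data (one sequence per column) into row tuples."""
    return list(zip(*columns))


def rows_to_columns(rows):
    """Transpose row-wise data (one tuple per row) into per-column lists."""
    return [list(column) for column in zip(*rows)]


# Column-wise layout, as received on SELECT: one sequence per column.
received = [[1, 2, 3], ['a', 'b', 'c']]
rows = columns_to_rows(received)
```

A columnar consumer such as a DataFrame builder can skip this step entirely and use the column-wise data as received.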

You can check this branch: https://github.com/mymarilyn/clickhouse-driver/tree/feature-deferred-rows-length-validation. There are some speed optimizations on SELECT.

If you want to do some research on performance, you can use the following profiling snippets in IPython:

# Profile a SELECT (%prun is an IPython magic)
from clickhouse_driver import Client
c = Client('localhost')
%prun c.execute('SELECT * FROM large_table')

# Profile an INSERT; N is the number of rows to generate
from clickhouse_driver import Client
c = Client('localhost')
%prun c.execute('INSERT INTO test (a, b, c) VALUES', [(x, x, x) for x in range(N)])
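
Outside IPython, a similar measurement can be sketched with the standard library's cProfile. The helper `profile_call` is a hypothetical name, and the workload used here is a stand-in so the example runs without a ClickHouse server.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buf = io.StringIO()
    # Show the ten entries with the highest cumulative time.
    pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(10)
    return result, buf.getvalue()


# Stand-in workload; with a server available this could instead be
# profile_call(c.execute, 'SELECT * FROM large_table').
total, report = profile_call(sum, range(1000))
```

Printing `report` shows where the time goes, which helps separate network reads from Python-side work such as transposition.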

@xzkostyan
Member

If you only need to implement a columnar version of QueryResult, you can implement a get_columns method that picks up the raw block.data. After that, you can iteratively .extend() the result with this data.

That's it.
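
A rough sketch of that accumulation, assuming each block delivers its data as one sequence per column. `ColumnarResult` and `add_block` are hypothetical names used for illustration; this is not the driver's QueryResult.

```python
class ColumnarResult:
    """Hypothetical accumulator that keeps query results column-wise."""

    def __init__(self):
        self.columns = []

    def add_block(self, block_columns):
        """Append one block's column-wise data to the accumulated columns."""
        if not self.columns:
            # First block: copy so later extends do not mutate the input.
            self.columns = [list(column) for column in block_columns]
            return
        if len(block_columns) != len(self.columns):
            raise ValueError('block has a different number of columns')
        for accumulated, column in zip(self.columns, block_columns):
            accumulated.extend(column)


result = ColumnarResult()
result.add_block([[1, 2], ['a', 'b']])  # first block: two rows
result.add_block([[3], ['c']])          # second block: one row
```

Given the column names, a wrapper could then build a DataFrame with `pandas.DataFrame(dict(zip(names, result.columns)))` without ever transposing to rows.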

@kszucs
Contributor Author

kszucs commented Oct 11, 2017

I've created a PR according to your comment.

@kszucs
Contributor Author

kszucs commented Oct 20, 2017

@xzkostyan would you mind drafting a new release? I'd like to use the columnar result extending fix here.

@xzkostyan
Member

Sure! I'll make a new release on Saturday or Sunday.

@kszucs
Contributor Author

kszucs commented Oct 22, 2017

Great! Thanks Kostya!

@xzkostyan
Member

Hi, @kszucs!

Version 0.0.8 is released.

@kszucs
Contributor Author

kszucs commented Jan 19, 2018

Eventually pandas interop will be released in ibis, so I'm closing this ticket now. Thanks!

@kszucs kszucs closed this as completed Jan 19, 2018