This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Test integration with dask.dataframe #107

Closed
xhochy opened this issue Feb 24, 2020 · 7 comments
Labels: good first issue (Good for newcomers), hackathon-2020-03, integration (Integration with other tools: Arrow, Dask, ..)

Comments

@xhochy
Owner

xhochy commented Feb 24, 2020

dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should at least have tests that confirm the following (a rough sketch follows below):

  • dask.dataframe can have fletcher.Fletcher{Chunked,Continuous}Array columns
  • The fr_text accessor works with dask.dataframe
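
A minimal sketch of what the first test could look like, assuming a pytest-style test; the test name and assertion are illustrative, not fletcher's actual test suite:

import dask.dataframe as dd
import fletcher as fr
import pandas as pd


def test_fletcher_column_in_dask():
    # Build a pandas DataFrame with a fletcher-backed string column.
    df = pd.DataFrame({"str": fr.FletcherChunkedArray(["a", "b", None])})
    ddf = dd.from_pandas(df, npartitions=2)
    # The fletcher dtype should survive the round trip through dask.
    result = ddf.compute()
    assert str(result["str"].dtype) == "fletcher_chunked[string]"
    # A similar test would exercise the fr_text accessor on ddf["str"]
    # once it is hooked up to dask.dataframe.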
@xhochy xhochy added good first issue Good for newcomers hackathon-2020-03 integration Integration with other tools (Arrow, Dask, ..) labels Feb 24, 2020
@mrocklin

@TomAugspurger is this possible today?

@mrocklin

For context, as a side project I'm currently looking at text handling in dask.dataframe. It seems to be a common concern in benchmarks, particularly due to memory blow-up.

@xhochy
Owner Author

xhochy commented Jun 28, 2020

Yes, as of Thursday this is working on master: #147

@xhochy xhochy closed this as completed Jun 28, 2020
@xhochy
Owner Author

xhochy commented Jun 29, 2020

@mrocklin This has already been working for a year, see the dask blog post https://blog.dask.org/2019/01/22/dask-extension-arrays 😃

Not sure why cyberpandas hasn't merged it yet: ContinuumIO/cyberpandas#39

What is missing from the fletcher<->dask support is the fr_text accessor. If you want to play with it, I can quickly implement it; otherwise I'll take a stab at it once I've tackled #115.

The project here isn't fully functional yet, but it shows what exists in Arrow & pandas to support native string arrays. It was dormant for ~6 months as other things had higher priority, but we're now continuing in Arrow to build string kernels and will hopefully ship a lot of them in 1.0 / 1.1 in the next 2-3 months, making this setup usable. If there is specific functionality you're looking for, just give us a heads-up and we can implement it first.

@mrocklin

mrocklin commented Jun 29, 2020 via email

@mrocklin

mrocklin commented Jun 29, 2020 via email

@xhochy
Owner Author

xhochy commented Jun 29, 2020

If you want to create such a column from a Parquet file without going through object dtype, check out the types_mapper argument of pyarrow.Table.to_pandas. This also works for other ExtensionArrays, not only fletcher, and can save quite some overhead / GIL contention.

import fletcher as fr
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

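# Write a small string column to Parquet and read it back as an Arrow Table.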
df = pd.DataFrame({'str': ['a', 'b', 'c']})
df.to_parquet("test.parquet")
table = pq.read_table("test.parquet")

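# Default conversion: the string column comes back as a plain object column.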
table.to_pandas().info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   str     3 non-null      object
# dtypes: object(1)
# memory usage: 152.0+ bytes

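# Map Arrow string columns to fletcher's chunked extension dtype via types_mapper.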
table.to_pandas(types_mapper={pa.string(): fr.FletcherChunkedDtype(pa.string())}.get).info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype                   
# ---  ------  --------------  -----                   
#  0   str     3 non-null      fletcher_chunked[string]
# dtypes: fletcher_chunked[string](1)
# memory usage: 147.0 bytes
