This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Test integration with dask.dataframe #107

Closed
xhochy opened this issue Feb 24, 2020 · 7 comments
Labels: good first issue (Good for newcomers), hackathon-2020-03, integration (Integration with other tools: Arrow, Dask, ..)

Comments

@xhochy
Owner

xhochy commented Feb 24, 2020

dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should at least have tests that confirm the following (a rough sketch follows below):

  • dask.dataframe can have fletcher.Fletcher{Chunked,Continuous}Array columns
  • The fr_text accessor works with dask.dataframe
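
A minimal sketch of what the first test could look like, assuming a pytest-style test; the test name and assertion are illustrative, not fletcher's actual test suite:

import dask.dataframe as dd
import fletcher as fr
import pandas as pd


def test_fletcher_column_in_dask():
    # Build a pandas DataFrame with a fletcher-backed string column.
    df = pd.DataFrame({"str": fr.FletcherChunkedArray(["a", "b", None])})
    ddf = dd.from_pandas(df, npartitions=2)
    # The fletcher dtype should survive the round trip through dask.
    result = ddf.compute()
    assert str(result["str"].dtype) == "fletcher_chunked[string]"
    # A similar test would exercise the fr_text accessor on ddf["str"]
    # once it is hooked up to dask.dataframe.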
@xhochy xhochy added good first issue Good for newcomers hackathon-2020-03 integration Integration with other tools (Arrow, Dask, ..) labels Feb 24, 2020
@mrocklin

@TomAugspurger is this possible today?

@mrocklin

For context, as a side project I'm currently looking at text handling in dask.dataframe. It seems to be a common concern in benchmarks, particularly due to memory blow-up.

@xhochy
Owner Author

xhochy commented Jun 28, 2020

Yes, as of Thursday this is working on master: #147

@xhochy xhochy closed this as completed Jun 28, 2020
@xhochy
Owner Author

xhochy commented Jun 29, 2020

@mrocklin This has already been working for a year, see the dask blog post https://blog.dask.org/2019/01/22/dask-extension-arrays 😃

Not sure why cyberpandas hasn't merged it yet: ContinuumIO/cyberpandas#39

What is missing from the fletcher<->dask support is the fr_text accessor. If you want to play with it, I can quickly implement it; otherwise I'll take a stab at it once I've tackled #115.

The project here isn't fully functional yet, but it shows what exists in Arrow & pandas to support native string arrays. It was dormant for ~6 months as other things had higher priority, but we're now continuing in Arrow to build string kernels and will hopefully ship a lot of them in 1.0 / 1.1 in the next 2-3 months, making this setup usable. If there is specific functionality you're looking for, just give us a heads-up and we can implement it first.

@mrocklin

mrocklin commented Jun 29, 2020 via email

@mrocklin

mrocklin commented Jun 29, 2020 via email

@xhochy
Owner Author

xhochy commented Jun 29, 2020

If you want to create such a column from a Parquet file without going through object dtype, check out the types_mapper argument of pyarrow.Table.to_pandas. This also works for other ExtensionArrays, not only fletcher, and can save quite some overhead / GIL contention.

import fletcher as fr
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

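# Write a small string column to Parquet and read it back as an Arrow Table.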
df = pd.DataFrame({'str': ['a', 'b', 'c']})
df.to_parquet("test.parquet")
table = pq.read_table("test.parquet")

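# Default conversion: the string column comes back as a plain object column.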
table.to_pandas().info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   str     3 non-null      object
# dtypes: object(1)
# memory usage: 152.0+ bytes

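# Map Arrow string columns to fletcher's chunked extension dtype via types_mapper.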
table.to_pandas(types_mapper={pa.string(): fr.FletcherChunkedDtype(pa.string())}.get).info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype                   
# ---  ------  --------------  -----                   
#  0   str     3 non-null      fletcher_chunked[string]
# dtypes: fletcher_chunked[string](1)
# memory usage: 147.0 bytes
