Test integration with dask.dataframe #107
Comments
@TomAugspurger is this possible today? |
For context, as a side project today I'm looking at text handling in dask dataframe. It seems to be a common concern in benchmarks, particularly due to memory-blowup. |
Yes, since Thursday this is working on master: #147 |
@mrocklin This has already been working for a year, see the dask blog https://blog.dask.org/2019/01/22/dask-extension-arrays 😃 Not sure why cyberpandas hasn't merged it yet: ContinuumIO/cyberpandas#39
What is missing from the fletcher<->dask support is the fr_text accessor. If you want to play with it, I can quickly implement it; otherwise I would take a stab at that once I have tackled #115.
The project here isn't yet fully functional, but it shows what is there in Arrow & pandas to support native string arrays. It was dormant for ~6 months as other things had a higher priority, but we're now continuing in Arrow to build string kernels and will hopefully ship a lot of them in 1.0 / 1.1 in the next 2-3 months, making this setup usable. If you have specific functionality you're looking for, just give us a heads-up and we can implement it first. |
Ah, that's a nice blog. I should probably check it out more often :)
Mostly right now I'm just exploring this space. Short term I would be
curious how we convert existing columns of text data in a dask dataframe to
use Fletcher. The examples today all seem to start from a Pandas
dataframe, which is an atypical starting point in the real world. Given a
dask series of object dtype, how does one make a dask series backed by
fletcher arrays?
|
Ah, I'm seeing
https://gist.github.com/xhochy/edb7d01364db87ed2e44ac828472e663 now
|
If you want to create such a column from a Parquet file without going through the
|
dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should have at least tests that confirm:
- dask.dataframe can have fletcher.Fletcher{Chunked,Continuous}Array columns
- fr_text accessor is working with dask.dataframe