Full support for multiindex in dataframes #1493
I encourage you to prototype this, perhaps with …
I was originally thinking of doing this as a …
Unfortunately, when I try to use them to build a …
I don't have much experience with …
Can you do me a favor and try this from git master?
That worked, thanks. So, what happened is that all of the 'aapl' data was concatenated to the end of the 'msft' data in one large dataframe. However, in this case it would be more desirable to have a top-level index that uses …
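For reference, the structure being described can be built in plain pandas with `pd.concat` and its `keys=` argument (a minimal sketch; the frames and ticker names are illustrative):

```python
import pandas as pd

# Illustrative per-ticker frames with a shared DatetimeIndex.
dates = pd.to_datetime(["2016-08-22", "2016-08-23"])
aapl = pd.DataFrame({"close": [100.0, 101.5]}, index=dates)
msft = pd.DataFrame({"close": [57.0, 57.9]}, index=dates)

# keys= adds a top level, yielding a (ticker, date) MultiIndex --
# exactly the structure dask.dataframe could not represent at the time.
combined = pd.concat([aapl, msft], keys=["aapl", "msft"])
print(combined.loc["aapl"])  # select one ticker's sub-frame
```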
Then perhaps you're right that your dict-of-dataframes idea would suit better.
Just curious, but why can't the …
Eventually yes, it would be nice for DataFrame to support multiindices. It's non-trivial to change all functions within dask.dataframe to support this. I budget this task at somewhere between a week and a month of developer time, though I am often pessimistic about things like this. Have you read through the design documentation of dask.dataframe? http://dask.readthedocs.io/en/latest/dataframe-partitions.html
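For context, a small illustration of the partitioning model that document describes, using the public dask.dataframe API:

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"x": range(8)},
                  index=pd.date_range("2016-01-01", periods=8))
ddf = dd.from_pandas(df, npartitions=4)

# Each partition is an ordinary pandas DataFrame; `divisions` records
# the index values at partition boundaries. Every division is a single
# scalar today, which is the assumption a MultiIndex would break.
print(ddf.divisions)
```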
So if I'm understanding correctly, it seems that the best way to support a multiindex would be to map it to multiple dimensions of partitions, since the multiindex itself provides a natural place to create a partition. My example above would only add a second dimension to the partitions (partitions would span the time index and then first-level column keys). It would be a lot easier to maintain these partitions outside of dask by managing multiple …
Yes, that seems like a reasonable synopsis. We would choose some depth of the multi-index along which to partition. For example, we might partition along the second or third level of the multi-index. Partitions would then hold a list of tuples of values rather than a list of single values. Many of the operations can probably be changed in bulk, by changing some of the heavier functions like elemwise and reduction, but I would expect groupbys, joins, etc. to take a fair amount of finesse. I don't yet see a way to do this incrementally.
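A toy sketch of what tuple-valued divisions might look like (a hypothetical structure, not an existing dask API):

```python
from bisect import bisect_right

# Hypothetical divisions over the first two levels of a
# (year, month, day) MultiIndex: each boundary is a tuple.
divisions = [(2015, 1), (2015, 7), (2016, 1), (2016, 7)]

def partition_of(key):
    """Return the partition number that would hold `key`."""
    # Tuples compare lexicographically, so bisection works unchanged.
    return max(0, min(bisect_right(divisions, key) - 1, len(divisions) - 2))

print(partition_of((2015, 9)))  # -> 1
```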
This may be a bit of a stretch, but maybe it's worth considering more abstract partitioning. I got some inspiration from the paper A Hierarchical Aggregation Framework for Efficient Multilevel Visual Exploration and Analysis, which breaks data into hierarchical chunks of smaller and smaller index-slices to make data exploration faster. Partitioning the data would be an expensive operation, but it can be done as data is collected. You would need at least one hierarchy for the index and possibly one for the columns (column groupings can come from an existing storage hierarchy or can be made dynamically using groupbys). The index hierarchy would define the partitions similar to dask's current structure, except using multiple levels (i.e. days are grouped into months, which are grouped into years, etc.). Columns could use a pseudo-index to map to the main index (i.e. a range of years, months, or specific days) to keep the data dense (no filler NaNs) and allow calculations to quickly skip regions with no data. Index and column groupings would be exposed to the end user via indexing and slicing methods and would provide natural partition boundaries for applied computation. A column hierarchy also provides an organized structure for caching intermediate computation results.
Sounds very cool. I encourage you to explore that further.
I started a prototype using basic Python structures (dicts and subclasses of lists) and realized that data columns either need to use a sequence of index labels to identify each element (because of the hierarchical index), or the columns can map to a flat representation of the hierarchical index (using a pseudo-index). I couldn't think of another way to do this, and mapping each element individually with labels would be very wasteful. The problem with using a pseudo-index is that when data is appended to the data set, the pseudo-index needs to be recalculated. I'm starting to re-think the use of hierarchies at all. Relational databases can represent hierarchical structures by referencing keys between tables of data, and joining tables on a specific column already aligns tables to each other. Perhaps it's better to treat every chunk of data (of n columns) as a regular 2D dataframe, and use a relational representation to tie all of the dataframes together. Each dataframe would have its own independent index, avoiding the pseudo-index problem, and only when chunks are joined would the index need to be adjusted. The end user could still reference a specific subset of data using slices and labels, but chunks of data would be dynamically joined (or split if necessary) behind the scenes. I'm going to try to prototype something using sqlite and pandas with some more stock data and see how that might work.
Any update on this issue? |
I ended up using a regular SQL database to track chunks of data and assembling them as necessary into Pandas DataFrames.
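Roughly, that pattern might look like this (a sketch with a hypothetical schema; chunks stored as parquet files and cataloged in sqlite):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("chunks.db")
# Catalog table: one row per chunk, recording its key range and path.
conn.execute("""CREATE TABLE IF NOT EXISTS chunks
                (ticker TEXT, start_ts TEXT, end_ts TEXT, path TEXT)""")

def load(ticker, start, end):
    """Assemble one DataFrame from every chunk overlapping [start, end]."""
    rows = conn.execute(
        "SELECT path FROM chunks "
        "WHERE ticker = ? AND end_ts >= ? AND start_ts <= ?",
        (ticker, start, end)).fetchall()
    frames = [pd.read_parquet(path) for (path,) in rows]
    return pd.concat(frames).sort_index().loc[start:end]
```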
Does dask support reading multi-level indices yet? I'm particularly interested in reading a table written to parquet with a multi-level column index, and I'm getting the following traceback when I try to do this:
If multi-level indices are generally supported, but not in …
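One commonly used workaround (a sketch, independent of the elided traceback) is to flatten the column MultiIndex to plain strings before writing to parquet and to restore it after reading:

```python
import pandas as pd

SEP = "|"  # any character that never appears in a column name

def flatten_columns(df):
    """Join each column tuple into one string so parquet sees flat names."""
    out = df.copy()
    out.columns = [SEP.join(map(str, tup)) for tup in out.columns]
    return out

def restore_columns(df):
    """Rebuild the column MultiIndex from the flattened names."""
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(
        tuple(name.split(SEP)) for name in out.columns)
    return out
```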
To add my 5 cents: the absence of MultiIndex support is the show-stopper for me in terms of doing anything with Dask beyond poking around a bit. It is the most important missing feature. Please do consider implementing it in some form soon.
I definitely agree @vss888. If this is something that you'd like to contribute, that would be very welcome!
This may be a simple but not ideal hack, with a less than ideal resolution. If there were a way to do element-wise concatenation on the two indexes, you could create a unique multi-index value (sort of). The issue that I am running into is that I can't figure out how to do element-wise concat on two dask arrays. Any way to do the following? …
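For dask Series (and dask.dataframe columns), element-wise concatenation can be done by casting to string, which gives a single composite key to index on (a sketch with made-up column names):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": ["x", "y", "x"], "b": [1, 2, 3], "v": [1.0, 2.0, 3.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Build one composite key column element-wise; this fakes a two-level
# MultiIndex value, at the cost of losing the per-level dtypes.
ddf["key"] = ddf["a"].astype(str) + "_" + ddf["b"].astype(str)
ddf = ddf.set_index("key")
```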
Hi there. I'm trying to get a handle on what all might be involved in supporting development on this. It sounds like a few options were previously explored, but the method discussed by @dirkbike and @mrocklin above is the preferred path forward, although the main blocker to that is the amount of work and inability to implement such a change incrementally. @mrocklin do you have a ballpark on the number of functions in the DataFrame API that would be affected by this? I see that it's a complex issue, but I'd like to at least look into supporting this or breaking it down and finding some people to help chip away at it.
I don't personally have a ballpark estimate, no. Others might though.
@TomAugspurger, do you have any thoughts on this one?
Still open, still worth doing.
In the interim, does anyone have a workaround? I don't actually need the multi-index, but all of the intermediate operations I want to use output a multi-index dataframe (…). These two operations produce the same result for my data, but both produce a multi-index as an intermediate step:
My only workaround is to iterate over the unique values of one of the desired indices (in this case, …).
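That workaround looks roughly like this (a sketch with hypothetical key columns a and b):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"], "v": [1.0, 2.0, 3.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Loop over one key so each groupby only ever needs a flat index.
pieces = [
    ddf[ddf["a"] == val].groupby("b")["v"].sum().to_frame().assign(a=val)
    for val in ddf["a"].unique().compute()
]
result = dd.concat(pieces)
```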
That's what I've done to get around this in the past.
@TheXu How do we do this when using read_sql_table, please?
Since dask doesn't support MultiIndex yet, you might have to concatenate column1 and column2 within SQL (naming it 'column1_and_column2' for example), then: daskDF = ddf.read_sql_table('test_table', sqluri, index_col='column1_and_column2')
@TheXu Thank you. Can we construct this in dask, something like …? If yes, I would appreciate it if you could share the syntax.
Alternatively, can we do this? I get a syntax error: sa_meta = sa.MetaData() …
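For what it's worth, dask's read_sql_table documents that index_col may be a labeled SQLAlchemy expression, so the concatenation can be pushed into the query instead of altering the table. A sketch with hypothetical connection string and column names (note that a string-valued index also needs explicit divisions):

```python
import sqlalchemy as sa
import dask.dataframe as dd

sqluri = "postgresql://user:pass@host/db"  # hypothetical connection string
idx = sa.func.concat(sa.column("column1"), "_",
                     sa.column("column2")).label("column1_and_column2")

# With a non-numeric index, partition boundaries must be given by hand.
daskDF = dd.read_sql_table("test_table", sqluri, index_col=idx,
                           divisions=["a_0", "m_5", "z_9"])
```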
I ran into this problem earlier in the week when converting a package to use Dask instead of Pandas for the first time. It was a serious bummer, since the performance benefits I was starting to see were dramatic. If I have more time in the future, I would be more than happy to do the implementation if one has already been designed.
@0x00b1 see #1493 (comment) / https://github.com/TomAugspurger/dask/tree/multiindex. I don't recall where that branch is at, but the main choice is how to represent the data. The two choices are a list of arrays, or an ndarray of tuples. pandas does a bit of both, but tries to avoid "materializing the tuples" as long as possible. I think Dask has to take the tuple approach. Things like …
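To illustrate the two candidate representations (plain numpy, toy data):

```python
import numpy as np

# (1) List-of-arrays representation: columnar and compact.
level0 = np.array(["aapl", "aapl", "msft"])
level1 = np.array([2015, 2016, 2015])

# (2) Materialized tuples: one object per row. Tuples compare
# lexicographically, which is what partition boundaries and
# division lookups need.
tuples = np.empty(len(level0), dtype=object)
tuples[:] = list(zip(level0, level1))
print(tuples[1] < tuples[2])  # ('aapl', 2016) < ('msft', 2015) -> True
```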
@TomAugspurger What about representing the MultiIndex as a sparse CSD array? Being more fancy, we could also have multi-dimensional partitioning along multiple dimensions of the array.
I'm not sure; would any advantages of using a sparse CSD array apply equally well to a pandas MultiIndex?
I think I looked into this briefly, but struggled with the requirement that DataFrames are ordered, so (AFAICT) the partitioning strategy needs to include information from every level of the MultiIndex.
Yes, you are right @TomAugspurger. Thinking it over, this would only make sense if the whole index is carried in memory. Likely it makes more sense to have a RangePartitioner and a HashPartitioner together with skip indices, as done in Spark.
Then it would be easy to implement a ".from_pandas()" / ".to_pandas()" by mirroring the multiindex as columns.
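A minimal sketch of what a hash partitioner over several key columns could look like (a hypothetical helper, not an existing dask or Spark API):

```python
import pandas as pd

def hash_partition(df, keys, npartitions):
    """Assign rows to partitions by hashing the key columns."""
    # Stable row-wise hash of the key columns, modulo partition count.
    codes = pd.util.hash_pandas_object(df[keys], index=False) % npartitions
    return [df[codes == i] for i in range(npartitions)]

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": list("xyxy"), "v": range(4)})
parts = hash_partition(df, ["a", "b"], npartitions=2)
```

A RangePartitioner would instead sort on the key tuples, preserving order (and thus efficient slicing) at the cost of a shuffle.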
My company is interested in paying someone to implement this functionality. Would significantly simplify our infrastructure and be a great contribution to the community. Please reach out to me if you're interested or would like to suggest someone who can do it. @TomAugspurger @Hoeze @jsignell
^ @mrocklin, I don't know if Coiled does paid feature requests or if there is any capacity. Perhaps you have a lead for this.
Hey @terramars, nice to virtually meet you :) Coiled would certainly like to try and help out! Especially since it would help out the community as well as make you and your teammates' lives easier. Are you free sometime next week to talk over the specifics? Here's a link to my calendar where you can find a time that works best for you.
Thanks for reaching out, definitely would love to chat next week. After I wrote this I actually realized a bigger problem for us is the inability of dask to shuffle large data frames, which seems like an ongoing issue, although at least it's being worked on. We can see what makes sense and if there's some way to start an engagement.
Sounds great! Excited to help you and hopefully the community too! :)
Hello, I'm also interested in this feature, but my use-case is perhaps simpler, and there might be a way to implement this in an acceptable way that's very simple. In my case, partitioning and chunking only need to be performed on the first level of the index. Could a simple implementation involve Dask just keeping track of the first level of the index as a simple index, and delegating everything else to Pandas? This would restrict partitioning and chunking to the first level, but I think in a lot of cases that might be good enough.
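Something close to that restriction can already be emulated by hand today (a sketch using the public API: the first level becomes the dask index, the remaining levels stay as columns, and pandas restores the full MultiIndex inside each partition):

```python
import pandas as pd
import dask.dataframe as dd

midx = pd.MultiIndex.from_product([["aapl", "msft"], [2015, 2016]],
                                  names=["ticker", "year"])
pdf = pd.DataFrame({"v": range(4)}, index=midx)

# Partition on the first level only; demote the other levels to columns.
flat = pdf.reset_index().set_index("ticker")
ddf = dd.from_pandas(flat, npartitions=2)

# Per partition, pandas handles the remaining levels; note that dask
# itself still only knows about the flat "ticker" index.
restored = ddf.map_partitions(lambda df: df.set_index("year", append=True))
```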
I’ll upvote this. In my work, many of our datasets have multiple “natural index” columns that we frequently join together on, or filter/group by. Or sometimes there isn’t a single column with high enough cardinality to effectively distribute a DF across many partitions. The result is that we can’t get the optimizations that dask has when operating with the index. This can also be a stumbling block when learning to transition from pandas code.
Dask can load a dataframe from a pytables hdf5 file, and pytables already supports a hierarchy of tables. Why not simulate a multiindex (like in pandas) by loading all tables from an hdf5 file into one dask dataframe with nested column indices?
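A sketch of that idea with the current API (a hypothetical file layout; since dask lacks nested column indices, the hierarchy is emulated by prefixing column names with the table name):

```python
import dask.dataframe as dd

# Hypothetical HDF5 file with one table per ticker under /stocks.
names = ["aapl", "msft"]
frames = {n: dd.read_hdf("data.h5", f"/stocks/{n}") for n in names}

# Prefix each table's columns with its name to simulate a two-level
# column index, then align the tables on their shared index.
renamed = [df.rename(columns={c: f"{n}.{c}" for c in df.columns})
           for n, df in frames.items()]
combined = dd.concat(renamed, axis=1)  # needs known, matching divisions
```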