
#2786: applymap fails with dupe columns, ObjectBlock convert() method bombs #3102

Closed
wants to merge 1 commit

Conversation

@ghost commented Mar 20, 2013

@jreback, does this look ok? Travis is still running.
#2786

@jreback (Contributor) commented Mar 20, 2013

If you delete and then insert a column in the same block at a different position, then try this, you will get mixed-up columns.

I would only do this on a non-unique index (e.g. leave the unique case as is), but I am not sure of the guarantees w.r.t. the ordering of the items in a block when there are dups.

I think blocks need some method to deal with the location-mapping problem.

Push to 0.12?

@wesm any thoughts?

@ghost (Author) commented Mar 20, 2013

I can't believe these two are not equivalent.

for i in range(len(self.items)):
    values = self.values[i]

and

for i, c in enumerate(self.items):
    values = self.get(c)

That would mean the order of items returned via enumerate(items) is different from that of the values in the ndarray. That would be very strange, and in the dupe case even more so: what else could serve as the mapping from one to the other? Not the label, certainly.
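A user-level sanity check (public API only, not Block internals) of why a duplicated label can't serve as the mapping:

```python
import numpy as np
import pandas as pd

# Toy frame with duplicated column labels
df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["a", "a"])

# Positional access distinguishes the two columns
by_position = [df.iloc[:, i].tolist() for i in range(df.shape[1])]
assert by_position[0] == [0, 2, 4, 6]
assert by_position[1] == [1, 3, 5, 7]

# Label access cannot: 'a' is ambiguous and returns both columns at once
assert df["a"].shape == (4, 2)
```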

@jreback (Contributor) commented Mar 20, 2013

get uses get_loc on the items indexer. This is normally equivalent to range, except after an insertion/deletion operation, when it could change (to avoid actually moving the data around).

@ghost (Author) commented Mar 20, 2013

I'm looking at get_loc right now.

@ghost (Author) commented Mar 20, 2013

For monotonic it does a binary search; for duplicates it does a linear scan over self._get_index_values(); for unique, non-monotonic it does self.mapping.get_item(val), which is an implementation detail of khash. So maybe, maybe not.

Ok, will limit to dupes.
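The three code paths are still observable through the return type of the public Index.get_loc (a sketch checked against a recent pandas, not the 0.11-era internals):

```python
import numpy as np
import pandas as pd

# Unique index: get_loc returns a single integer position
assert pd.Index(["a", "b", "c"]).get_loc("b") == 1

# Monotonic with duplicates: a slice (found via binary search)
assert pd.Index(["a", "a", "b"]).get_loc("a") == slice(0, 2)

# Non-monotonic with duplicates: a boolean mask (linear scan)
mask = pd.Index(["b", "a", "a", "b"]).get_loc("b")
assert np.array_equal(mask, [True, False, False, True])
```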

@jreback (Contributor) commented Mar 20, 2013

in the dup case the indexer doesn't exist

@ghost (Author) commented Mar 20, 2013

which I guess is the reason why get_loc delegates to get_loc_duplicates, which does a linear scan.

This, btw, could be made more efficient by hash chaining, couldn't it? That's linear in the worst case, but this is linear in every case for duplicated columns, and O(nindex) rather than O(ndupeX).
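The hash-chaining idea can be sketched outside pandas: one O(nindex) pass builds label-to-positions lists, after which each lookup is O(1) amortized instead of a fresh linear scan (hypothetical helper, not pandas code):

```python
from collections import defaultdict

def build_locs(labels):
    """One O(n) pass: map each label to the list of its positions."""
    locs = defaultdict(list)
    for i, label in enumerate(labels):
        locs[label].append(i)
    return locs

locs = build_locs(["a", "b", "a", "c", "a"])
assert locs["a"] == [0, 2, 4]   # duplicate label chains to all its slots
assert locs["b"] == [1]
```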

@jreback (Contributor) commented Mar 20, 2013

fyi

In [3]: df = pd.DataFrame(np.random.rand(8,3),columns=list('abc'))
In [4]: df
Out[4]: 
          a         b         c
0  0.112553  0.612335  0.154043
1  0.962762  0.433515  0.028732
2  0.886075  0.268969  0.448379
3  0.116026  0.160916  0.350559
4  0.457144  0.815492  0.648910
5  0.555528  0.866252  0.681121
6  0.777283  0.963508  0.630396
7  0.111525  0.926707  0.904593

In [6]: df._data.blocks[0].items
Out[6]: Index([a, b, c], dtype=object)

In [7]: s = df.pop('b')

In [8]: df._data.blocks[0].items
Out[8]: Index([a], dtype=object)

In [9]: df._data.blocks
Out[9]: [FloatBlock: [a], 1 x 8, dtype float64, FloatBlock: [c], 1 x 8, dtype float64]

In [10]: df._consolidate_inplace()

In [11]: df._data.blocks
Out[11]: [FloatBlock: [a, c], 2 x 8, dtype float64]

In [12]: df['b'] = s

In [13]: df._data.blocks
Out[13]: 
[FloatBlock: [a, c], 2 x 8, dtype float64,
 FloatBlock: [b], 1 x 8, dtype float64]

In [14]: df._consolidate_inplace()

In [15]: df._data.blocks
Out[15]: [FloatBlock: [a, c, b], 3 x 8, dtype float64]

In [16]: df._data.blocks[0].items
Out[16]: Index([a, c, b], dtype=object)
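The session shows block items ending up as [a, c, b] after the pop/re-add. A quick check with only the public API (a sketch against a recent pandas) that label-based lookup stays correct even though the physical order changed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))
orig_b = df["b"].copy()

s = df.pop("b")
df["b"] = s          # re-added at the end, so column order is now a, c, b

assert list(df.columns) == ["a", "c", "b"]
# label-based lookup still finds the right data despite the reorder
assert df["b"].equals(orig_b)
```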

@ghost (Author) commented Mar 20, 2013

got it.

@ghost (Author) commented Mar 20, 2013

The test case in the issue puts you in ObjectBlock.convert() when self is an instance of FloatBlock. There's a comment suggesting that shouldn't happen. Is that as it should be, or is convert() exactly where the block gets converted into an ObjectBlock?

@jreback (Contributor) commented Mar 20, 2013

I looked at your fix, looks fine to me... I was just pointing out above that it could possibly be a problem when the block is mutated.

The above comment is testing something else; I was passing convert_numeric=True by default (which is wrong), so it was force-converting string-like numbers, e.g. '2.0', to floats.

@ghost (Author) commented Mar 20, 2013

It's clear to me I don't have the full picture, holding off till I do.

@jreback (Contributor) commented Mar 20, 2013

I just realized something... you can't delete a single item from a frame by position (because the block doesn't allow it). I think Wes punted on the whole issue and just made it so that you can't mutate when you have a non-unique index, so to do anything useful you have to reindex anyhow...

Your fix is correct (as it allows you to 'do' things with the frame).

@ghost (Author) commented Mar 20, 2013

For both cases or just dupes? I'm not comfortable merging this until I have a mental model of what's going on. I'd prefer to wait till 0.12 unless you're sure all exits are covered.

@jreback (Contributor) commented Mar 20, 2013

Unique indices act like my example (you have to use get_loc because the order could be different than the data).

For dups, the index throws an exception (see in _get_locs), so you can never use it, so order is positional by definition.

I created issue #3092 so we can explore this in 0.12....

Push for now, I guess.

@ghost (Author) commented Mar 20, 2013

Where's this exception raised by get_loc you describe? All I see is index.pyx:IndexEngine->get_loc()->_get_loc_duplicates()->_maybe_get_bool_indexer(), which does a linear scan and returns a mask.

@jreback (Contributor) commented Mar 20, 2013

ref_locs is the indexer into ref_items of the BlockManager. By definition it shouldn't work for dups.

This is only used in set_ref_items, and only when maybe_rename is True, which only occurs when there is a renaming operation.

In [12]: df = pd.DataFrame(np.random.randn(8,4))

In [13]: df._data.blocks[0].ref_locs
Out[13]: array([0, 1, 2, 3])

In [14]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [15]: df._data.blocks[0].ref_locs
---------------------------------------------------------------------------

/mnt/home/jreback/pandas/pandas/core/internals.py in ref_locs(self)
     52     def ref_locs(self):
     53         if self._ref_locs is None:
---> 54             indexer = self.ref_items.get_indexer(self.items)
     55             indexer = com._ensure_platform_int(indexer)
     56             if (indexer == -1).any():

/mnt/home/jreback/pandas/pandas/core/index.pyc in get_indexer(self, target, method, limit)
    835 
    836         if not self.is_unique:
--> 837             raise Exception('Reindexing only valid with uniquely valued Index '
    838                             'objects')
    839 

Exception: Reindexing only valid with uniquely valued Index objects

This is the root of all evil; this should raise the same as above (but doesn't, even if I consolidate)...

In [16]: df = pd.DataFrame(np.random.randn(8,4))

In [17]: df.columns = ['a']*4

In [18]: df._data.blocks[0].ref_locs
Out[18]: array([0, 1, 2, 3])
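The same guard is reachable from the public API: get_indexer works on a unique Index but refuses a non-unique one (checked against a recent pandas; the exception class has changed across versions, so the sketch catches broadly):

```python
import pandas as pd

unique = pd.Index(["a", "b", "c"])
assert list(unique.get_indexer(["c", "a"])) == [2, 0]

dup = pd.Index(["a"] * 4)
try:
    dup.get_indexer(dup)
    raised = False
except Exception:  # "Reindexing only valid with uniquely valued Index objects"
    raised = True
assert raised
```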

@jreback (Contributor) commented Mar 21, 2013

I think you can throw this under the bus (I mean in 0.11); can fix the dups stuff later

@ghost (Author) commented Mar 21, 2013

I lost sight of you 10 miles ago, so I won't merge this until I get a handle on what's going on backstage.

If you're sure this is correct, go ahead and merge it for 0.11 yourself.

@jreback (Contributor) commented Mar 21, 2013

ok... this was failing in 0.10.1 as well... pushing to 0.12... I think this can be fixed properly (e.g. by actually fixing the indexer), but it's too complicated for now.

@ghost (Author) commented Mar 21, 2013

agreed. it looked so innocent.

@ghost (Author) commented Mar 21, 2013

If we're pushing it back to 0.12, at least we should put in a check + warning. No data is better than wrong data.

@jreback (Contributor) commented Mar 21, 2013

it probably is......but just making sure

@jreback (Contributor) commented Mar 21, 2013

I am going to reverse: I think this is ok. If you look at the interleave method in internals, this is the same way Wes handled dups.

@ghost (Author) commented Mar 21, 2013

I'm way behind you on this. take it away as you see fit.

@jreback (Contributor) commented Mar 21, 2013

Moving back to 0.12. I see another issue that's also broken like this: #1943.

@ghost (Author) commented Mar 31, 2013

Disabled applymap for frames with dupe columns until this gets fixed properly.
7916e76
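The shape of such a guard can be sketched as follows (hypothetical safe_applymap helper for illustration; the actual commit's wording and placement inside pandas may differ):

```python
import numpy as np
import pandas as pd

def safe_applymap(df, func):
    """Refuse duplicate columns rather than risk silently mixed-up data."""
    if not df.columns.is_unique:
        raise ValueError("applymap with duplicate columns is not supported")
    return df.apply(lambda col: col.map(func))

# Unique columns: works as usual
out = safe_applymap(pd.DataFrame({"x": [1, 2]}), lambda v: v * 2)
assert out["x"].tolist() == [2, 4]

# Duplicate columns: no data is better than wrong data
df = pd.DataFrame(np.ones((2, 2)), columns=["a", "a"])
try:
    safe_applymap(df, lambda v: v + 1)
    raised = False
except ValueError:
    raised = True
assert raised
```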

@jreback (Contributor) commented Apr 1, 2013

Great... though technically this is only a problem if the columns are non-unique within a single block, IIRC.

@ghost (Author) commented Apr 1, 2013

I can live with that; as far as UX goes, it's probably not the worst thing in the world if it always works or always doesn't. You're planning to tackle this in 0.12, right? Or is this too hairy to deal with without a root canal?

@jreback (Contributor) commented Apr 1, 2013

hopefully can do it in 0.12 w/o being too radical I think....just a crown maybe

@ghost ghost closed this Apr 1, 2013
This pull request was closed.
Labels: Bug, Indexing