
#2786: applymap fails with dupe columns, ObjectBlock convert() method bombs #3102

Closed
wants to merge 1 commit

Conversation

@ghost commented Mar 20, 2013

@jreback, does this look ok? Travis is still running.
#2786

@jreback (Contributor) commented Mar 20, 2013

If you delete and then insert a column in the same block at a different position, then try this, you will get mixed-up columns.

I would only do this on a non-unique index (e.g. leave the unique case as is), but I am not sure of the guarantees w.r.t. the ordering of the items in a block when there are dups.

I think blocks need some method to deal with the location-mapping problem.

Push to 0.12?

@wesm any thoughts?

@ghost (Author) commented Mar 20, 2013

I can't believe these two are not equivalent.

for i in range(len(self.items)):
    values = self.values[i]

and

for i, c in enumerate(self.items):
    values = self.get(c)

That would mean the order of items returned via enumerate(items) is different from that of the values in the ndarray. That would be very strange, and in the dupe case even more so: what else could serve as the mapping from one to the other? Not the label, certainly.
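A user-level sanity check (public API only, not Block internals) of why a duplicated label can't serve as the mapping:

```python
import numpy as np
import pandas as pd

# Toy frame with duplicated column labels
df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["a", "a"])

# Positional access distinguishes the two columns
by_position = [df.iloc[:, i].tolist() for i in range(df.shape[1])]
assert by_position[0] == [0, 2, 4, 6]
assert by_position[1] == [1, 3, 5, 7]

# Label access cannot: 'a' is ambiguous and returns both columns at once
assert df["a"].shape == (4, 2)
```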

@jreback (Contributor) commented Mar 20, 2013

get uses get_loc on the items indexer. This is normally equivalent to range, except after an insertion/deletion operation, when it could change (to avoid actually moving the data around).

@ghost (Author) commented Mar 20, 2013

I'm looking at get_loc right now.

@ghost (Author) commented Mar 20, 2013

For monotonic it does a binary search; for duplicates it does a linear scan over self._get_index_values(); for unique, non-monotonic it does self.mapping.get_item(val), which is an implementation detail of khash. So maybe, maybe not.

Ok, will limit to dupes.
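The three code paths are still observable through the return type of the public Index.get_loc (a sketch checked against a recent pandas, not the 0.11-era internals):

```python
import numpy as np
import pandas as pd

# Unique index: get_loc returns a single integer position
assert pd.Index(["a", "b", "c"]).get_loc("b") == 1

# Monotonic with duplicates: a slice (found via binary search)
assert pd.Index(["a", "a", "b"]).get_loc("a") == slice(0, 2)

# Non-monotonic with duplicates: a boolean mask (linear scan)
mask = pd.Index(["b", "a", "a", "b"]).get_loc("b")
assert np.array_equal(mask, [True, False, False, True])
```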

@jreback (Contributor) commented Mar 20, 2013

in the dup case the indexer doesn't exist

@ghost (Author) commented Mar 20, 2013

which I guess is the reason why get_loc delegates to get_loc_duplicates, which does a linear scan.

This, btw, could be made more efficient by hash chaining, couldn't it? That's linear in the worst case, but this is linear in every case for duplicated columns, and O(nindex) rather than O(ndupeX).
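The hash-chaining idea can be sketched outside pandas: one O(nindex) pass builds label-to-positions lists, after which each lookup is O(1) amortized instead of a fresh linear scan (hypothetical helper, not pandas code):

```python
from collections import defaultdict

def build_locs(labels):
    """One O(n) pass: map each label to the list of its positions."""
    locs = defaultdict(list)
    for i, label in enumerate(labels):
        locs[label].append(i)
    return locs

locs = build_locs(["a", "b", "a", "c", "a"])
assert locs["a"] == [0, 2, 4]   # duplicate label chains to all its slots
assert locs["b"] == [1]
```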

@jreback (Contributor) commented Mar 20, 2013

fyi

In [3]: df = pd.DataFrame(np.random.rand(8,3),columns=list('abc'))
In [4]: df
Out[4]: 
          a         b         c
0  0.112553  0.612335  0.154043
1  0.962762  0.433515  0.028732
2  0.886075  0.268969  0.448379
3  0.116026  0.160916  0.350559
4  0.457144  0.815492  0.648910
5  0.555528  0.866252  0.681121
6  0.777283  0.963508  0.630396
7  0.111525  0.926707  0.904593

In [6]: df._data.blocks[0].items
Out[6]: Index([a, b, c], dtype=object)

In [7]: s = df.pop('b')

In [8]: df._data.blocks[0].items
Out[8]: Index([a], dtype=object)

In [9]: df._data.blocks
Out[9]: [FloatBlock: [a], 1 x 8, dtype float64, FloatBlock: [c], 1 x 8, dtype float64]

In [10]: df._consolidate_inplace()

In [11]: df._data.blocks
Out[11]: [FloatBlock: [a, c], 2 x 8, dtype float64]

In [12]: df['b'] = s

In [13]: df._data.blocks
Out[13]: 
[FloatBlock: [a, c], 2 x 8, dtype float64,
 FloatBlock: [b], 1 x 8, dtype float64]

In [14]: df._consolidate_inplace()

In [15]: df._data.blocks
Out[15]: [FloatBlock: [a, c, b], 3 x 8, dtype float64]

In [16]: df._data.blocks[0].items
Out[16]: Index([a, c, b], dtype=object)
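The session shows block items ending up as [a, c, b] after the pop/re-add. A quick check with only the public API (a sketch against a recent pandas) that label-based lookup stays correct even though the physical order changed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))
orig_b = df["b"].copy()

s = df.pop("b")
df["b"] = s          # re-added at the end, so column order is now a, c, b

assert list(df.columns) == ["a", "c", "b"]
# label-based lookup still finds the right data despite the reorder
assert df["b"].equals(orig_b)
```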

@ghost (Author) commented Mar 20, 2013

got it.

@ghost (Author) commented Mar 20, 2013

The test case in the issue puts you in ObjectBlock.convert() when self is an instance of FloatBlock. There's a comment suggesting that shouldn't happen. Is that as it should be, or is convert() exactly where the block gets converted into an ObjectBlock?

@jreback (Contributor) commented Mar 20, 2013

I looked at your fix, looks fine to me... I was just pointing out above that it could possibly be a problem when the block is mutated.

The above comment is testing something else; I was passing convert_numeric=True by default (which is wrong), so it was force-converting string-like numbers, e.g. '2.0', to floats.

@ghost (Author) commented Mar 20, 2013

It's clear to me I don't have the full picture, holding off till I do.

@jreback (Contributor) commented Mar 20, 2013

I just realized something... you can't delete a single item from a frame by position (because the block doesn't allow it). I think Wes punted on the whole issue and just made it so that you can't mutate when you have a non-unique index, so to do anything useful you have to reindex anyhow...

Your fix is correct (as it allows you to 'do' things with the frame).

@ghost (Author) commented Mar 20, 2013

For both cases or just dupes? I'm not comfortable merging this until I have a mental model of what's going on. I'd prefer to wait till 0.12 unless you're sure all exits are covered.

@jreback (Contributor) commented Mar 20, 2013

Unique indices act like my example (you have to use get_loc because the order could be different than the data).

For dups, the index throws an exception (see in _get_locs), so you can never use it, so order is positional by definition.

I created issue #3092 so we can explore this in 0.12....

Push for now, I guess.

@ghost (Author) commented Mar 20, 2013

Where's this exception raised by get_loc you describe? All I see is index.pyx:IndexEngine->get_loc()->_get_loc_duplicates()->_maybe_get_bool_indexer(), which does a linear scan and returns a mask.

@jreback (Contributor) commented Mar 20, 2013

ref_locs is the indexer into ref_items of the BlockManager. By definition it shouldn't work for dups.

This is only used in set_ref_items, and only when maybe_rename is True, which only occurs when there is a renaming operation.

In [12]: df = pd.DataFrame(np.random.randn(8,4))

In [13]: df._data.blocks[0].ref_locs
Out[13]: array([0, 1, 2, 3])

In [14]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [15]: df._data.blocks[0].ref_locs
---------------------------------------------------------------------------

/mnt/home/jreback/pandas/pandas/core/internals.py in ref_locs(self)
     52     def ref_locs(self):
     53         if self._ref_locs is None:
---> 54             indexer = self.ref_items.get_indexer(self.items)
     55             indexer = com._ensure_platform_int(indexer)
     56             if (indexer == -1).any():

/mnt/home/jreback/pandas/pandas/core/index.pyc in get_indexer(self, target, method, limit)
    835 
    836         if not self.is_unique:
--> 837             raise Exception('Reindexing only valid with uniquely valued Index '
    838                             'objects')
    839 

Exception: Reindexing only valid with uniquely valued Index objects

This is the root of all evil; this should raise the same as above (but doesn't, even if I consolidate)...

In [16]: df = pd.DataFrame(np.random.randn(8,4))

In [17]: df.columns = ['a']*4

In [18]: df._data.blocks[0].ref_locs
Out[18]: array([0, 1, 2, 3])
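The same guard is reachable from the public API: get_indexer works on a unique Index but refuses a non-unique one (checked against a recent pandas; the exception class has changed across versions, so the sketch catches broadly):

```python
import pandas as pd

unique = pd.Index(["a", "b", "c"])
assert list(unique.get_indexer(["c", "a"])) == [2, 0]

dup = pd.Index(["a"] * 4)
try:
    dup.get_indexer(dup)
    raised = False
except Exception:  # "Reindexing only valid with uniquely valued Index objects"
    raised = True
assert raised
```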

@jreback (Contributor) commented Mar 21, 2013

I think you can throw this under the bus (I mean in 0.11); can fix the dups stuff later

@ghost (Author) commented Mar 21, 2013

I lost sight of you 10 miles ago, so I won't merge this until I get a handle on what's going on backstage.

If you're sure this is correct, go ahead and merge it for 0.11 yourself.

@jreback (Contributor) commented Mar 21, 2013

ok... this was failing in 0.10.1 as well... pushing to 0.12... I think this can be fixed properly (e.g. by actually fixing the indexer), but it's too complicated for now.

@ghost (Author) commented Mar 21, 2013

agreed. it looked so innocent.

@ghost (Author) commented Mar 21, 2013

If we're pushing it back to 0.12, at least we should put in a check + warning. No data is better than wrong data.

@jreback (Contributor) commented Mar 21, 2013

it probably is......but just making sure

@jreback (Contributor) commented Mar 21, 2013

I am going to reverse: I think this is ok. If you look at the interleave method in internals, this is the same way Wes handled dups.

@ghost (Author) commented Mar 21, 2013

I'm way behind you on this. take it away as you see fit.

@jreback (Contributor) commented Mar 21, 2013

Moving back to 0.12. I see another issue that's also broken like this: #1943.

@ghost (Author) commented Mar 31, 2013

Disabled applymap for frames with dupe columns until this gets fixed properly.
7916e76
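The shape of such a guard can be sketched as follows (hypothetical safe_applymap helper for illustration; the actual commit's wording and placement inside pandas may differ):

```python
import numpy as np
import pandas as pd

def safe_applymap(df, func):
    """Refuse duplicate columns rather than risk silently mixed-up data."""
    if not df.columns.is_unique:
        raise ValueError("applymap with duplicate columns is not supported")
    return df.apply(lambda col: col.map(func))

# Unique columns: works as usual
out = safe_applymap(pd.DataFrame({"x": [1, 2]}), lambda v: v * 2)
assert out["x"].tolist() == [2, 4]

# Duplicate columns: no data is better than wrong data
df = pd.DataFrame(np.ones((2, 2)), columns=["a", "a"])
try:
    safe_applymap(df, lambda v: v + 1)
    raised = False
except ValueError:
    raised = True
assert raised
```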

@jreback (Contributor) commented Apr 1, 2013

Great... though technically this is only a problem if the columns are non-unique within a single block, IIRC.

@ghost (Author) commented Apr 1, 2013

I can live with that; as far as UX goes, it's probably not the worst thing in the world if it always works or always doesn't. You're planning to tackle this in 0.12, right? Or is this too hairy to deal with without a root canal?

@jreback (Contributor) commented Apr 1, 2013

hopefully can do it in 0.12 w/o being too radical I think....just a crown maybe

@ghost ghost closed this Apr 1, 2013
This pull request was closed.
Labels: Bug, Indexing