
Enable easier transformations of multiple columns in DataFrame #342

Closed
wesm opened this issue Nov 7, 2011 · 4 comments

wesm commented Nov 7, 2011

things like

df[cols] = transform(df[cols])

should be possible on a mixed-type DataFrame, per the mailing list discussion.
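For context, this pattern does work in current pandas; a minimal sketch with an illustrative mixed-type frame (the data and multiplier are made up, not from the thread):

```python
import pandas as pd

# Illustrative mixed-type frame: one object column plus numeric columns
df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
})

cols = ["x", "y"]
# The requested pattern: transform a block of columns and assign it back
df[cols] = df[cols] * 10
print(df)
```

The object column is untouched; only the selected numeric block is replaced.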

@hatmatrix

Thanks Wes! I didn't realize you'd posted this, but was actually coming to the mailing list to suggest a transform function (much like in R).

import numpy as np
import pandas

np.random.seed(1)

dd = {
    'category':['a','a','b'],
    'x':np.random.rand(3),
    'y':np.random.rand(3),
    'z':np.random.rand(3)
    }

pd = pandas.DataFrame(dd)

In [81]: pd
Out[81]: 
   category  x          y        z     
0  a         0.417      0.3023   0.1863
1  a         0.7203     0.1468   0.3456
2  b         0.0001144  0.09234  0.3968

def scale(x):
    centred = x-x.mean()
    return centred/x.std()

print(pd[['x','y','z']].apply(scale))

   x       y       z     
0  0.1047  1.118  -1.123 
1  0.9435 -0.3094  0.3282
2 -1.048  -0.8087  0.7946

Reassignment could be implemented in several ways that I can think of:

(1)

pd[['x','y','z']] = pd[['x','y','z']].apply(scale)

or a transform method, where transform could accept arguments similar to the DataFrame constructor, e.g.:

a.

pd = pd.transform(pd[['x','y','z']].apply(scale))

b.

nparr = pd[['x','y','z']].apply(scale).values
pd = pd.transform(nparr, columns=['x','y','z'])

c.

dct = pd[['x','y','z']].apply(scale).to_dict()
pd = pd.transform(dct)
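A `transform` method with these signatures never landed under that name, but pandas later gained `DataFrame.assign`, which behaves much like R's `transform()` by returning a new frame with the given columns replaced. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"],
                   "x": [1.0, 2.0, 3.0]})

def scale(s):
    # Centre and normalise a Series (pandas .std() uses ddof=1)
    return (s - s.mean()) / s.std()

# assign returns a new frame with column 'x' replaced, leaving df untouched,
# much like R's transform()
out = df.assign(x=scale(df["x"]))
print(out["x"].tolist())  # → [-1.0, 0.0, 1.0]
```

This keeps the functional, copy-returning style discussed below, at the cost of copying the frame.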

Depending on the implementation, though, (1) may be better. In R, I believe any replacement of values in a subset copies/modifies the entire data frame and reassigns the result to the original symbol, which leads to its inefficiency... but in that case something like

pd[,c("x","y","z")] <- lapply(pd[,c("x","y","z")],base::scale)

is effectively the same as writing

pd <- do.call(transform,c(list(x),lapply(pd[,c("x","y","z")],base::scale)))

which is a convenient way of writing

pd <- transform(pd,x=newx,y=newy,z=newz)

and so on.

But if in pandas individual columns, rather than the entire DataFrame, can be modified, then reassigning the entire pd DataFrame might not be the best idea. A (1)-type implementation could also be general enough to work around the limitation "setting on mixed-type frames only allowed with scalar values", which R does not have. I'm not sure whether disallowing this was a deliberate decision on your part, but if not, it could be useful in certain situations, for instance permitting operations like

pd.ix[:1,'category'] = ['c','d']
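As an aside, `.ix` was later deprecated and removed; with the label-based `.loc` indexer this kind of partial assignment on a mixed-type frame does work in current pandas (the frame here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"],
                   "x": [0.1, 0.2, 0.3]})

# Label-based slices include both endpoints, so :1 covers rows 0 and 1
df.loc[:1, "category"] = ["c", "d"]
print(df["category"].tolist())  # → ['c', 'd', 'b']
```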

Though, to be honest I've caught a bit of the functional-style bug so I'm a bit biased against partial reassignment over returning new values from functions, but I guess reassignment and rebinding is generally the way to go with large data sets... (and it would provide a consistent experience for R users).

@wesm
Member Author

wesm commented Dec 19, 2011

I implemented option #1

pd[['x','y','z']] = pd[['x','y','z']].apply(scale)

in the above referenced commit. Wasn't very difficult in the end. Can address other kinds of transformations if we want at a later time.
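The implemented option (1) can be checked end-to-end on a mixed-type frame; a minimal sketch assuming a recent pandas, with data as in the comment above:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"category": ["a", "a", "b"],
                   "x": np.random.rand(3),
                   "y": np.random.rand(3),
                   "z": np.random.rand(3)})

def scale(s):
    # Centre and normalise a Series
    return (s - s.mean()) / s.std()

cols = ["x", "y", "z"]
# Option (1): transform the numeric block and assign it back in place
df[cols] = df[cols].apply(scale)

# Each scaled column should now have mean ~0 and std ~1,
# while the object column survives untouched
print(df[cols].mean().round(6).tolist())
print(df["category"].tolist())
```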

@wesm wesm closed this as completed Dec 19, 2011
@hatmatrix

Thanks Wes - sorry for my extremely delayed response. But this is fantastic
news!


@erichamers

The problem I have now is that I don't have the option to set dtypes when reading data from a SQL query, so it would be good if I could cast multiple columns to different data types.

I don't know if something like this has been implemented yet, but it would look something like this:

DataFrame.transform({'Column A': 'type A', 'Column B': 'type B', 'Column C': 'type C'})
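Per-column casting did land in pandas, though as `DataFrame.astype` with a `{column: dtype}` mapping rather than `transform`; a sketch with illustrative column names and dtypes:

```python
import pandas as pd

df = pd.DataFrame({"Column A": ["1", "2"],
                   "Column B": ["1.5", "2.5"],
                   "Column C": [1, 0]})

# astype accepts a {column: dtype} mapping and casts each column independently
df = df.astype({"Column A": "int64",
                "Column B": "float64",
                "Column C": "bool"})
print(df.dtypes.astype(str).tolist())  # → ['int64', 'float64', 'bool']
```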
