
Enable easier transformations of multiple columns in DataFrame #342

Closed
wesm opened this issue Nov 7, 2011 · 4 comments

wesm commented Nov 7, 2011

things like

df[cols] = transform(df[cols])

should be possible on a mixed-type DataFrame, per the mailing list discussion.
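For context, this pattern does work in current pandas; a minimal sketch with an illustrative mixed-type frame (the data and multiplier are made up, not from the thread):

```python
import pandas as pd

# Illustrative mixed-type frame: one object column plus numeric columns
df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
})

cols = ["x", "y"]
# The requested pattern: transform a block of columns and assign it back
df[cols] = df[cols] * 10
print(df)
```

The object column is untouched; only the selected numeric block is replaced.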

@hatmatrix

Thanks Wes! I didn't realize you'd posted this, but was actually coming to the mailing list to suggest a transform function (much like in R).

import numpy as np
import pandas

np.random.seed(1)

dd = {
    'category':['a','a','b'],
    'x':np.random.rand(3),
    'y':np.random.rand(3),
    'z':np.random.rand(3)
    }

pd = pandas.DataFrame(dd)

In [81]: pd
Out[81]: 
   category  x          y        z     
0  a         0.417      0.3023   0.1863
1  a         0.7203     0.1468   0.3456
2  b         0.0001144  0.09234  0.3968

def scale(x):
    centred = x-x.mean()
    return centred/x.std()

print(pd[['x','y','z']].apply(scale))

   x       y       z     
0  0.1047  1.118  -1.123 
1  0.9435 -0.3094  0.3282
2 -1.048  -0.8087  0.7946

Reassignment could be implemented in several ways that I can think of:

(1)

pd[['x','y','z']] = pd[['x','y','z']].apply(scale)

or a transform method, where transform could accept arguments similar to the DataFrame constructor, e.g.:

a.

pd = pd.transform(pd[['x','y','z']].apply(scale))

b.

nparr = pd[['x','y','z']].apply(scale).values
pd = pd.transform(nparr, columns=['x','y','z'])

c.

dct = pd[['x','y','z']].apply(scale).to_dict()
pd = pd.transform(dct)
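A `transform` method with these signatures never landed under that name, but pandas later gained `DataFrame.assign`, which behaves much like R's `transform()` by returning a new frame with the given columns replaced. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"],
                   "x": [1.0, 2.0, 3.0]})

def scale(s):
    # Centre and normalise a Series (pandas .std() uses ddof=1)
    return (s - s.mean()) / s.std()

# assign returns a new frame with column 'x' replaced, leaving df untouched,
# much like R's transform()
out = df.assign(x=scale(df["x"]))
print(out["x"].tolist())  # → [-1.0, 0.0, 1.0]
```

This keeps the functional, copy-returning style discussed below, at the cost of copying the frame.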

Depending on the implementation, though, (1) may be better. In R, I believe any replacement of values in a subset copies/modifies the entire data frame and reassigns the result to the original symbol, which leads to its inefficiency... but in that case something like

pd[,c("x","y","z")] <- lapply(pd[,c("x","y","z")],base::scale)

is effectively the same as writing

pd <- do.call(transform,c(list(x),lapply(pd[,c("x","y","z")],base::scale)))

which is a convenient way of writing

pd <- transform(pd,x=newx,y=newy,z=newz)

and so on.

But if in pandas individual columns, rather than the entire DataFrame, can be modified, then reassigning the entire pd DataFrame might not be the best idea. A (1)-type implementation could also be general enough to work around the limitation "setting on mixed-type frames only allowed with scalar values", which R does not have. I'm not sure whether disallowing this was a deliberate decision on your part, but if not, it could be useful in certain situations, for instance permitting operations like

pd.ix[:1,'category'] = ['c','d']
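As an aside, `.ix` was later deprecated and removed; with the label-based `.loc` indexer this kind of partial assignment on a mixed-type frame does work in current pandas (the frame here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"],
                   "x": [0.1, 0.2, 0.3]})

# Label-based slices include both endpoints, so :1 covers rows 0 and 1
df.loc[:1, "category"] = ["c", "d"]
print(df["category"].tolist())  # → ['c', 'd', 'b']
```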

Though, to be honest I've caught a bit of the functional-style bug so I'm a bit biased against partial reassignment over returning new values from functions, but I guess reassignment and rebinding is generally the way to go with large data sets... (and it would provide a consistent experience for R users).

@wesm
Member Author

wesm commented Dec 19, 2011

I implemented option #1

pd[['x','y','z']] = pd[['x','y','z']].apply(scale)

in the above referenced commit. Wasn't very difficult in the end. Can address other kinds of transformations if we want at a later time.
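The implemented option (1) can be checked end-to-end on a mixed-type frame; a minimal sketch assuming a recent pandas, with data as in the comment above:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"category": ["a", "a", "b"],
                   "x": np.random.rand(3),
                   "y": np.random.rand(3),
                   "z": np.random.rand(3)})

def scale(s):
    # Centre and normalise a Series
    return (s - s.mean()) / s.std()

cols = ["x", "y", "z"]
# Option (1): transform the numeric block and assign it back in place
df[cols] = df[cols].apply(scale)

# Each scaled column should now have mean ~0 and std ~1,
# while the object column survives untouched
print(df[cols].mean().round(6).tolist())
print(df["category"].tolist())
```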

@wesm wesm closed this as completed Dec 19, 2011
@hatmatrix

Thanks Wes - sorry for my extremely delayed response. But this is fantastic
news!


@erichamers

The problem I have now is that I don't have the option to set dtypes when reading data from a SQL query, so it would be good if I could cast multiple columns to different data types.

I don't know if something like this has been implemented yet, but it would look something like this:

DataFrame.transform({'Column A': 'type A', 'Column B': 'type B', 'Column C': 'type C'})
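Per-column casting did land in pandas, though as `DataFrame.astype` with a `{column: dtype}` mapping rather than `transform`; a sketch with illustrative column names and dtypes:

```python
import pandas as pd

df = pd.DataFrame({"Column A": ["1", "2"],
                   "Column B": ["1.5", "2.5"],
                   "Column C": [1, 0]})

# astype accepts a {column: dtype} mapping and casts each column independently
df = df.astype({"Column A": "int64",
                "Column B": "float64",
                "Column C": "bool"})
print(df.dtypes.astype(str).tolist())  # → ['int64', 'float64', 'bool']
```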
