pandas.DataFrame.query to allow column name with space #6508
Comments
Hm. Many datasets do have this issue, but consider the amount of code needed to rename columns versus the amount of code needed to parse that new syntax (or something similar). Renaming columns is straightforward:
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, str) else x)
df.columns = cols
This is well tested and easy to debug. For more complicated replacements you can use regular expressions. The things that go into parsing are markedly less straightforward:
Something that might be useful as a happy medium is a |
I am not sure about the implementation. Maybe we can use normal brackets instead like
Anyway, the |
I think if you put quotes around the column name it might work on master; the fix for allowing & and | makes these be treated like single tokens. |
It will be treated as a string, which is then turned into an internal temporary so that won't work. |
|
Maybe allowing the column to be referenced by its clean version?
I do this cleanup for autocompletion whenever it makes sense. |
Column names with spaces, dots, brackets and other invalid characters may be optionally auto-replaced by equivalent valid characters, such as underscore. This is also very handy for accessing columns as members of dataframe with dot syntax. |
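The "clean version" idea from the two comments above can be sketched without touching the dataframe itself. This is a minimal illustration, not pandas code: `clean_name` and `clean_lookup` are hypothetical helpers, and the choice of underscore as the replacement character is an assumption.

```python
import keyword
import re


def clean_name(name):
    """Map an arbitrary column label to a valid Python identifier (hypothetical helper)."""
    # Replace every character that is not a letter, digit or underscore.
    cleaned = re.sub(r"\W", "_", str(name))
    # Identifiers may not be empty, start with a digit, or be a keyword.
    if cleaned == "" or cleaned[0].isdigit() or keyword.iskeyword(cleaned):
        cleaned = "_" + cleaned
    return cleaned


def clean_lookup(columns):
    """Return {clean name: original label}, refusing ambiguous mappings."""
    lookup = {}
    for col in columns:
        key = clean_name(col)
        if key in lookup:
            raise ValueError(f"{col!r} and {lookup[key]!r} both clean to {key!r}")
        lookup[key] = col
    return lookup
```

The lookup keeps the original labels intact, so a clean name can be translated back to the real column at access time, for example `df[clean_lookup(df.columns)["column_name"]]`.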
Hello, It is a very important issue for us. There are cases where we cannot change the column names because they need to be preserved. That pandas puts an arbitrary requirement on column names is, IMHO, a bad design decision and bad programming practice. Sorry for complaining, but I really think that pandas should fix this issue properly. Please consider it. Thank you. |
@dgua you are welcome to submit a pull-request to fix. Note that in reality In any event, the recommended 'main' way of indexing has always been:
|
jreback, I agree with you. However, we do have people who use |
@dgua as I said, a pull-request from the community would get this done. I simply don't have time. |
R and dplyr use backtick( `` ) to quote column names with space and special characters. For example:
I wonder how hard it is to incorporate this into the pandas parser, so that functions like |
I like zhiruiwang's proposal and, judging by the upvotes, others do too. So I have looked into the code a bit: basically, pandas does some preprocessing on the expression and passes it to numexpr. Along with it, a localdict is passed that contains, among other things, the names of the dataframe columns.
I was thinking of simply altering the expression by replacing every space within backticks with something else (like "_", or something less common to prevent name clashes) and removing the backticks. This creates a valid expression for numexpr. The same replacement is then applied to spaces in the column names when they are passed to the resolvers that eventually make up the localdict, so that numexpr can still find the correct names.
The code could look something like this, but I have not tested it. I would first like to hear what others think of the idea.
# Don't know if "_" is a good choice and don't know where to place this variable,
# since it has to be constant in two different files and ideally is only defined once.
SEPERATOR_REPLACING_SPACES = "_"

# Replace spaces in variables surrounded by backticks:
# pandas/pandas/core/computation/expr.py
import re
...
# new function
def _replace_spaces_backtickvariables(source):
    return re.sub(r'`(.*?)`',
                  lambda m: m.group(1).replace(" ", SEPERATOR_REPLACING_SPACES),
                  source)
...
# adjusted function
def _preparse(source, f=compose(_replace_locals, _replace_booleans,
                                _rewrite_assign), g=lambda x: x):
    ...
    g : callable
        This takes a source string and returns an altered one
    ...
    assert callable(g), 'g must be callable'
    source = g(source)
    ...
# adjusted class
class PandasExprVisitor(BaseExprVisitor):
    def __init__(self, env, engine, parser,
                 preparser=partial(_preparse,
                                   f=compose(_replace_locals, _replace_booleans),
                                   g=_replace_spaces_backtickvariables)):
        ...

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py
# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v)
                     for k, v in self.iteritems()), index_resolvers
EDIT: fixed |
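The two halves of this proposal (rewrite the expression, rewrite the resolver names the same way) can be tried in isolation. The sketch below is a standalone approximation, not pandas internals: `SEPARATOR`, `replace_backtick_spaces` and `resolver_names` are hypothetical names, and a plain dict stands in for the localdict.

```python
import re

SEPARATOR = "_"  # assumption: same replacement character as in the comment above


def replace_backtick_spaces(source):
    """Drop the backticks and replace spaces inside them (hypothetical helper)."""
    return re.sub(r"`(.*?)`",
                  lambda m: m.group(1).replace(" ", SEPARATOR),
                  source)


def resolver_names(columns):
    """Apply the same replacement to column labels so lookups still match."""
    return {col.replace(" ", SEPARATOR): col for col in columns}


expr = replace_backtick_spaces("`column name` > 3 and name < 5")
names = resolver_names(["column name", "name"])
```

Because both sides use the same replacement, the rewritten expression refers only to keys that exist in `names`. One limitation of this simple regex is that it would also rewrite backticks that appear inside string literals in the query.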
@zhiruiwang The |
@jreback You seem to be the one who knows most about this. What do you think of my approach as explained in the previous comment? Alternatively, we could solve it as dalejung proposed: allow "dirty" names to be referred to by their "clean" names ("this column name" can be referred to as "this_column_name" without the column actually being renamed) and not use the `` encapsulation at all. This would then only require this single line to be changed. (If I understand the code correctly; not tested.)
# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py
# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v)
                     for k, v in self.iteritems()), index_resolvers
What do you think? Which approach do you think is best? Then I can try to make a pull request for it. |
This feature would be nice, but I work around it with commands like these:
Replace White Space
df.rename(columns={k: k.replace(' ','_') for k in df.columns if k.count(' ')>0}, inplace=True)
Starts with Numeric Value
df.rename(columns={k: '_'+k for k in df.columns if k[0].isdigit()}, inplace=True) |
Sorry, but that’s unsatisfactory. How do I get back to the original columns? What if ‘_’ is also used as part of a column name?
Users shouldn’t have to do this and it’s a serious problem of pandas.
DG
|
in sql, you just use square brackets and that's nice. |
@dgua do you have time to submit a PR? @zhiruiwang's suggestion of using backticks seems reasonable, if it can be implemented (we couldn't use them before, since they already have a meaning in Python 2). |
@TomAugspurger I would have some time after my exams, and already took a look into the code for the implementation (see my previous comments). I still have some questions:
|
I'm not familiar with this code, so you may be the expert here :)
Python 3.0 - 3.2? We require 3.5+ |
I think the backtick idea is rather good now, since in the For the plain Python context, the new names could have a prefix or suffix to prevent collisions (e.g. they all end with an underscore). |
@beojan yes, but maybe you can see those "collisions" as a feature. So you can refer to |
You can use If you really have two colliding columns, I don't see how that's a feature though. |
Well, if I have a dataframe:
  "column name"  "name"
1       4           5
2       2           1
With the feature implemented, without measures against collisions, I can now say:
df.query("column_name > 3")
And pandas would automatically refer to "column name" in this query. This was also suggested earlier by dalejung. You could then also leave out the support for backticks. I also don't think you would see any dataframes in the wild that look like:
  "column name"  "name"  "column_name"
1       3           5         6
2       2           1         9
in which the collisions would cause a problem. So, in my view, it won't cause any collisions and gives an extra way to refer to the column. |
I don't think we'd be interested in partially supporting this. The point of the PR would be to completely avoid ambiguity, so the last example should work.
|
@TomAugspurger Okay, that is clear. Do we then go with a suffix like "_" to prevent collisions, or do we replace the space in the column names with a more complex string, so that we can rule out even more accidental collisions? |
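The difference between the two options in this exchange can be made concrete with the three-column example from the earlier comment. This is only a sketch: `SUFFIX` is a hypothetical marker string chosen for illustration, and `naive_key` and `suffixed_key` are not pandas functions.

```python
SUFFIX = "_BACKTICK_QUOTED"  # hypothetical marker, chosen only for this sketch


def naive_key(label):
    """Plain space-to-underscore replacement; can collide with real names."""
    return label.replace(" ", "_")


def suffixed_key(label):
    """Only labels that needed cleaning get the marker appended."""
    if " " not in label:
        return label
    return label.replace(" ", "_") + SUFFIX


columns = ["column name", "name", "column_name"]
naive = {naive_key(c) for c in columns}        # two labels share one key
suffixed = {suffixed_key(c) for c in columns}  # every label keeps its own key
```

A marker does not make collisions impossible (a column could literally be named `column_name_BACKTICK_QUOTED`), it only makes them much less likely than with a bare underscore.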
this might work directly in the AST |
This only works for parser=pandas and engine=numexpr. It works by replacing any backtick-quoted variables with a clean version; see pandas/core/common.py::clean_column_name_with_spaces. This happens before the query is passed on, and the names in the localdict are changed in the same way before it is passed to numexpr.
Hi everyone, is there a timeline for when this feature will be released? |
I think it's done in master, given the commit mentioned above. |
this will be in 0.25 - prob in a month or 2 |
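For readers who want to try the feature, here is a small usage sketch. It assumes pandas 0.25 or later is installed; the data values are made up for illustration and mirror the colliding column names discussed earlier in the thread.

```python
import pandas as pd

# Illustrative data: a label with a space next to its "clean" twin.
df = pd.DataFrame({
    "column name": [4, 2],
    "name": [5, 1],
    "column_name": [6, 9],
})

# Backticks select the label that contains a space.
by_backtick = df.query("`column name` > 3")

# The unquoted name still refers to the separate "column_name" column.
by_plain = df.query("column_name > 3")
```

The two queries address different columns, so a frame that contains both labels stays unambiguous.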
Don't want to ruin the fun, but to prevent any disappointments: Pandas 0.25 will only be available for Python3. Also see https://pandas-docs.github.io/pandas-docs-travis/install.html#install-dropping-27 |
This would have been a good opportunity to allow column names that contain dots. Why was this not included? |
I implemented this and we didn't think of it. You can open a new issue to bring it up again. There might be a good reason, which I am not aware of but the maintainers are, why this is best not allowed. I think I can implement it again. It won't have the same solution as the space, however. The reason is that the query string is parsed as Python source code. You have to apply workarounds if you want certain syntax to be interpreted differently. And a custom parser probably won't be implemented for this function. |
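The point about the query string being parsed as Python source can be checked with the standard `ast` module alone. This sketch only shows how Python itself reads such expressions; `parses` is a hypothetical helper and nothing here calls into pandas.

```python
import ast


def parses(expr):
    """Return True if expr is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


space_ok = parses("column name > 3")   # a bare space is a syntax error
clean_ok = parses("column_name > 3")   # a cleaned name parses fine

# A dot is valid syntax, but it already means attribute access,
# so it cannot simply be treated as part of a name.
dotted = ast.parse("column.name > 3", mode="eval").body.left
```

A space fails outright, which is why it can be patched up before parsing, while a dot parses successfully into something with a different meaning. That is one way to see why dots would need a different workaround than spaces.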
@hwalinga HI
I saw the changes you made in 0.25.0.
Use a function to correspond to input and output special symbols when parsing |
@zhaohongqiangsoliva I don't think I understand what you are trying to say. Can you elaborate a bit more on the problem you want to solve? |
@hwalinga Sorry, my English is not good. My problem is that the issue of spaces in column names is now solved, but other symbols are still not supported, for example:
. / \
My proposed change is to use hex for these characters. |
I second this. ANY character should be allowed in column names.
Thanks.
|
@dgua @zhaohongqiangsoliva The problem is that the query has to become a valid Python expression. Using hex for disallowed characters won't solve this problem. Allowing spaces in the name is already based on hacking around the tokenize function ( If you really want this, you are of course free to open a new issue addressing this, and if you tag me in the issue, I will explain my solution to the maintainers. |
Being able to do something like this would be nice
I have come across many external data files that have spaces in the column names. It would be nice to be able to do quick analysis on the data without first renaming the columns.