pandas.DataFrame.query to allow column name with space #6508

socheon · 2014-02-28T17:09:23Z

Being able to do something like this would be nice:

df.query('[col with space] < col')

I came across many external data files which have spaces in the column names. It would be nice to be able to do quick analysis on the data without first renaming the columns.


cpcloud · 2014-02-28T18:14:39Z

Hm. Many datasets do have this issue, but consider the amount of code needed to rename columns versus the amount of code needed to parse that new syntax (or something similar).

Renaming columns is straightforward:

cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str, unicode)) else x)
df.columns = cols

This is well tested and easy to debug. For more complicated replacement you can use regular expressions.

The things that go into parsing are markedly less straightforward:

How would this tokenize? (This isn't so straightforward to me, there might be an easy way to do it, but it seems like you'd have to create a new token to disambiguate a syntax error from list construction and then map the whole thing to a valid Python identifier).
How would this parse? (not too hard, it parses as a ColumnNode or some such object which gets looked up as a column in the frame)

Something that might be useful as a happy medium is a df = pd.clean_columns(df) function that will rename your columns to be valid Python identifiers so that you don't have to think too hard about it.
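
A minimal sketch of what such a helper might look like. `clean_columns` here is hypothetical (no function of that name exists in pandas); it works on a plain list of column labels, and the rules for replacing characters, leading digits, and keywords are assumptions made for illustration.

```python
import keyword
import re


def clean_columns(names):
    """Return identifier-safe versions of column names (hypothetical helper)."""
    cleaned = []
    for name in names:
        if not isinstance(name, str):
            # Leave non-string labels (e.g. integers) untouched.
            cleaned.append(name)
            continue
        # Replace every run of non-word characters with one underscore.
        new = re.sub(r"\W+", "_", name)
        # Identifiers cannot be empty, start with a digit, or be a keyword.
        if not new or new[0].isdigit() or keyword.iskeyword(new):
            new = "_" + new
        cleaned.append(new)
    return cleaned


result = clean_columns(["col with space", "2nd col", "class", 7])
print(result)
```

With a DataFrame, the same idea could be applied as `df.columns = clean_columns(df.columns)`. Note that this sketch does not guard against two different names cleaning to the same result.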

socheon · 2014-02-28T19:02:54Z

I am not sure about the implementation. Maybe we can use normal brackets instead like

df.query('(col with space) < col')

Anyway, the clean_columns method sounds like a good idea to me. Maybe it could also be a keyword argument in the read_csv function.

cpcloud · 2014-02-28T19:08:34Z

@socheon Let's try to keep our discussion to a single issue (I tend to get off topic too!). If you think there should be a clean_columns argument to read_csv, please open another issue.

@jreback What do you think about a top-level clean_columns function?

jreback · 2014-02-28T19:10:35Z

I think if you put quotes around the column name it might work on master.

The fix for allowing & and | makes these be treated like single tokens.

cpcloud · 2014-02-28T19:11:38Z

It will be treated as a string, which is then turned into an internal temporary so that won't work.

cpcloud · 2014-02-28T19:13:50Z

df.query("'a column with a space' > 2") -> df.query("tmp_var_str_some_hex_value > 2")
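
To illustrate why quoting does not help: to Python's parser, a quoted name is a string literal rather than a variable reference. The check below uses only the standard `ast` module (Python 3.8 or later assumed); it shows the parse, not pandas' internal temporary-variable step.

```python
import ast

tree = ast.parse("'a column with a space' > 2", mode="eval")
left = tree.body.left  # left operand of the comparison

# The quoted name is a string constant, not a variable (Name) node.
print(type(left).__name__, repr(left.value))
```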

dalejung · 2014-02-28T20:45:37Z

Maybe allowing the column to be referenced by its clean version?

df.query("column_with_space > 2")

I do this cleanup for autocompletion whenever it makes sense.

den-run-ai · 2015-04-09T17:27:32Z

Column names with spaces, dots, brackets and other invalid characters may be optionally auto-replaced by equivalent valid characters, such as underscore. This is also very handy for accessing columns as members of dataframe with dot syntax.

dgua · 2017-02-28T21:43:48Z

Hello,
I want to make a plea for a resolution of this issue.

It is a very important issue for us. There are cases where we cannot change the column names because they need to be preserved.

That pandas puts an arbitrary requirement on column names is, IMHO, a bad design decision and bad programming practice.

Sorry for complaining, but I really think that pandas should fix this issue properly. Please consider it. Thank you.

jreback · 2017-02-28T22:28:05Z

@dgua you are welcome to submit a pull-request to fix. Note that in reality .query is just a nice-to-have interface; in fact it has very specific guarantees, meaning it's meant to parse like a query language, and not a fully general interface.

In any event, the recommended 'main' way of indexing has always been:

In [10]: df = DataFrame({'foo bar': [1, 2, 3, 4]})

In [11]: df[df['foo bar'] > 2]
Out[11]: 
   foo bar
2        3
3        4

dgua · 2017-02-28T22:32:06Z

jreback, I agree with you. However, we do have people who use .query extensively in their code because it's easier/more readable, so now we are in a conundrum to either change all that code or rename the columns, neither of which is desirable...

jreback · 2017-02-28T22:44:51Z

@dgua as I said, a pull-request from the community would get this done. I simply don't have time.

zhiruiwang · 2018-06-21T20:40:07Z

R and dplyr use backtick( `` ) to quote column names with space and special characters. For example:

df$`column with space`
df %>%
    mutate(`column with space` = 1)

I wonder how hard it is to incorporate this into the pandas parser, so that functions like query, eval or even df. dot syntax can deal with more general column names?

hwalinga · 2018-10-26T12:22:00Z

I like the proposal of zhiruiwang and, looking at his upvotes, others do too.

So I have looked into the code a bit and basically pandas does some preprocessing on the expression and passes it to numexpr. Next to that a localdict containing among others the names of the dataframe columns is passed with it.

I was thinking to just alter the expression by replacing every space within backticks to something else (like "_" or something less used to prevent name clashes) and remove the backticks. This creates a valid expression for numexpr. Next to that do the same to spaces in the names of columns when they are passed to the resolvers which eventually make up the localdict, so that the correct names can still be found by numexpr.

Maybe the code for this could look like the following, but I have not tested it. I would first like to hear what others think of the idea.

# Don't know if "_" is a good choice and don't know where to place this variable, 
# since it has to be constant in two different files and ideally is only defined once.
SEPERATOR_REPLACING_SPACES = "_"

# Replace spaces in variables surrounded by backticks:
# pandas/pandas/core/computation/expr.py 

import re

...

# new function
def _replace_spaces_backtickvariables(source):
    return re.sub(r'`(.*?)`', 
                  lambda m: m.group(1).replace(" ", SEPERATOR_REPLACING_SPACES), 
                  source)

...

# adjusted function
def _preparse(source, f=compose(_replace_locals, _replace_booleans,
                                _rewrite_assign), g=lambda x: x):
    ...
    g : callable
        This takes a source string and returns an altered one
    ...
    assert callable(g), 'g must be callable'
    source = g(source)

...

# adjusted class
class PandasExprVisitor(BaseExprVisitor):
    
    def __init__(self, env, engine, parser,
                 preparser=partial(_preparse,
                                   f=compose(_replace_locals, _replace_booleans),
                                   g=_replace_spaces_backtickvariables)):
        ...

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py

# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v) 
                     for k, v in self.iteritems()), index_resolvers

EDIT: fixed _replace_spaces_backtickvariables (regex should've been lazy)
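
A standalone version of the substitution step, runnable outside pandas, to show what the lazy regex does to an expression. This is only the string rewrite from the proposal above, not the code that was eventually merged; the function name and the sample expression are made up for illustration.

```python
import re

# Same placeholder name as in the proposal; "_" is just one possible choice.
SEPERATOR_REPLACING_SPACES = "_"


def replace_spaces_in_backticks(source):
    """Strip backticks and replace spaces inside each backtick-quoted name."""
    return re.sub(
        r"`(.*?)`",  # lazy match, so each backtick pair is handled separately
        lambda m: m.group(1).replace(" ", SEPERATOR_REPLACING_SPACES),
        source,
    )


expr = "`col with space` < col and `other col` > 2"
rewritten = replace_spaces_in_backticks(expr)
print(rewritten)
```

With a greedy `.*` the two quoted names would be merged into one match, which is why the laziness matters.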

beojan · 2018-11-08T15:02:42Z

@zhiruiwang The df. syntax won't be able to use this since it's parsed by Python, but query and eval would work.
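
A quick way to see this with the standard `ast` module: attribute access is parsed by Python itself, so there is no hook where a backtick-quoted name could be accepted. The helper `parses` is only for this illustration.

```python
import ast


def parses(source):
    """Return True if Python can parse `source` as an expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


print(parses("df.column_with_space"))      # plain attribute access
print(parses("df.`column with space`"))    # backticks are not Python 3 syntax
print(parses("df['column with space']"))   # item access takes any string
```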

hwalinga · 2018-11-13T17:57:53Z

@jreback You seem to be the one who knows the most about this.

What do you think of my approach as explained in the previous comment?

Instead, we could also decide to solve it as dalejung proposed. So allow "dirty" names to be referred to by their "clean" names ("this column name" can be referred to by "this_column_name" without the column actually changing its name) and don't use the `` encapsulation at all.

This would then only require this single line to be changed. (If I understand the code correctly; not tested.)

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py

# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v) 
                     for k, v in self.iteritems()), index_resolvers

What do you think? Which approach do you think is best?

Then I can try to make a pull request for it.
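
The "clean name" lookup can be sketched without pandas. In the snippet below a plain dict stands in for the DataFrame and Python's built-in `eval` stands in for the expression engine; both are assumptions for illustration, not how pandas evaluates queries.

```python
SEPERATOR_REPLACING_SPACES = "_"

# Stand-in for a DataFrame: column name -> list of values.
columns = {"column name": [4, 2], "name": [5, 1]}

# Build the lookup table with "clean" keys; the data itself is not renamed.
resolvers = {
    key.replace(" ", SEPERATOR_REPLACING_SPACES): values
    for key, values in columns.items()
}

# The expression refers to the clean name and is resolved via the table.
values = eval("column_name", {}, resolvers)
mask = [v > 3 for v in values]
print(mask)
```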

bscully27 · 2019-01-19T19:48:55Z

This feature would be nice but I resolve with commands like these:

Replace White Space

df.rename(columns={k: k.replace(' ','_') for k in df.columns if k.count(' ')>0}, inplace=True)

Starts with Numeric Value

df.rename(columns={k: '_'+k for k in df.columns if k[0].isdigit()}, inplace=True)

dgua · 2019-01-20T02:18:19Z

Sorry, but that’s unsatisfactory. How do I get back to the original columns? What if ‘_’ is also used as part of a column name? Users shouldn’t have to do this and it’s a serious problem of pandas. DG


polidore · 2019-01-22T21:55:20Z

In SQL, you just use square brackets, and that's nice.

TomAugspurger · 2019-01-23T03:53:20Z

@dgua do you have time to submit a PR? @zhiruiwang's suggestion of using backticks seems reasonable, if it can be implemented (we couldn't use them before, since they already have a meaning in Python 2).

hwalinga · 2019-01-23T10:34:59Z

@TomAugspurger I would have some time after my exams, and already took a look into the code for the implementation (see my previous comments). I still have some questions:

What should SEPERATOR_REPLACING_SPACES be? If we pick just "_", instead of a more complex replacement string, it might interfere with already existing other columns, but if we choose "_" it has the extra feature that you can refer to this column name by `this column name` and this_column_name. (We could even drop the backticks feature.)
SEPERATOR_REPLACING_SPACES has to be consistent over two files. So, should it be declared somewhere else and imported into these files to make sure they are the same, or would a simple comment mentioning this suffice?
In the code I will alter, callable() is used. callable() is not available in 3.0-3.2, so should I also fix this at the same time?

TomAugspurger · 2019-01-23T12:01:24Z

I'm not familiar with this code, so you may be the expert here :)

In the code I will alter, callable() is used. callable() is not available in 3.0-3.2

Python 3.0 - 3.2? We require 3.5+

beojan · 2019-01-23T12:08:11Z

I think the backtick idea is rather good now, since in the eval or query context you wouldn't need to worry about how things are implemented. I wouldn't want to drop it.

For the plain Python context, the new names could have a prefix or suffix to prevent collisions (e.g. they all end with an underscore).

hwalinga · 2019-01-23T14:19:17Z

have a prefix or suffix to prevent collisions

@beojan yes, but maybe you can see those "collisions" as a feature. So you can refer to this column name by `this column name` and this_column_name. I also think it is a bit silly if you have two column names, this column name and this_column_name, but you never know.

beojan · 2019-01-23T14:22:45Z

You can use this_column_name_.

If you really have two colliding columns, I don't see how that's a feature though.

hwalinga · 2019-01-23T16:56:21Z

Well, if I have a dataframe:

    "column name" "name"
1   4              5
2   2              1

With the feature implemented, without measures for colliding, I can now say:

df.query("column_name > 3")

And pandas would automatically refer to "column name" in this query. This was also suggested earlier by dalejung. You can now also leave the support for backticks out.

I also don't think you would see any dataframes in the wild that look like:

    "column name" "name" "column_name"
1   3              5     6
2   2              1     9

In which the collisions would cause a problem.

So, in my view, it won't cause any collisions and gives an extra way to refer to the column.

TomAugspurger · 2019-01-23T17:00:53Z

I don't think we'd be interested in partially supporting this. The point of the PR would be to completely avoid ambiguity, so the last example should work.
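
The ambiguity in that last example can be made concrete with a few lines of plain Python. `find_collisions` is a hypothetical helper, not part of pandas; it only shows that a bare space-to-underscore mapping sends two different columns to the same name.

```python
from collections import defaultdict


def find_collisions(names, separator="_"):
    """Group original names that map to the same cleaned name."""
    groups = defaultdict(list)
    for name in names:
        groups[name.replace(" ", separator)].append(name)
    # Keep only cleaned names claimed by more than one original column.
    return {clean: originals for clean, originals in groups.items()
            if len(originals) > 1}


collisions = find_collisions(["column name", "name", "column_name"])
print(collisions)
```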


hwalinga · 2019-01-23T17:58:47Z

@TomAugspurger Okay, that is clear.

Do we then go with a suffix like "_" to prevent collisions, or do we go with replacing the space between the column names with a complex string, so we can rule out even more accidental collisions?

jreback · 2019-01-23T18:01:19Z

this might work directly in the AST
I think these actually parse correctly but we don’t handle the reassembly correctly

This only works for parser=pandas and engine=numexpr. It works by replacing any backtick-quoted variables with a clean version; for this, see pandas/core/common.py::clean_column_name_with_spaces. This is done to the query before it is passed on, and to the names in the localdict before it is passed to numexpr.

palnabarun · 2019-04-05T14:52:39Z

Hi everyone,

Is there a timeline for when this feature will be released?

beojan · 2019-04-05T14:55:12Z

I think it's done in master, given the commit mentioned above.

jreback · 2019-04-05T16:10:00Z

this will be in 0.25 - prob in a month or 2
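
For reference, a minimal sketch of the backtick syntax discussed in this thread, assuming pandas 0.25 or later is installed. The column names and data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"foo bar": [1, 2, 3, 4], "baz": [4, 3, 2, 1]})

# Backticks let query() refer to a column whose name contains a space.
result = df.query("`foo bar` > baz")
print(result)
```

The column keeps its original name; only the query string uses the backtick-quoted form.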

hwalinga · 2019-04-05T19:40:20Z

Don't want to ruin the fun, but to prevent any disappointments: Pandas 0.25 will only be available for Python3. Also see https://pandas-docs.github.io/pandas-docs-travis/install.html#install-dropping-27

palnabarun · 2019-04-06T04:42:00Z

@jreback Thanks for the information. I saw that the 0.25.0 milestone is due for May 1.

@hwalinga No worries. This will be used in Python3 only codebase.

danielhrisca · 2019-05-13T07:47:17Z

This would have been a good opportunity to allow column names that contain dots. Why was this not included?

hwalinga · 2019-05-13T10:26:26Z

@danielhrisca

I implemented this, and we didn't think of it. You can open a new issue to bring it up again. There might be a good reason, one the maintainers are aware of but I am not, why this is best not allowed.

I think I can implement it again. It won't have the same solution as the spaces, however. The reason is that the query string is parsed as Python source code. You have to apply workarounds if you want certain syntax to be interpreted differently. And a custom parser probably won't be implemented for this function.
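
To make the point about Python parsing concrete: a dot inside a name is valid syntax, but it means attribute access, so it cannot simply be treated as part of a column name. A small check with the standard `ast` module; the name `col.name` is made up for illustration.

```python
import ast

tree = ast.parse("col.name > 2", mode="eval")
left = tree.body.left

# The dot makes Python read this as attribute `name` of the variable `col`.
print(type(left).__name__, left.value.id, left.attr)
```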

solivehong · 2019-06-22T04:26:04Z

@hwalinga Hi,
I have encountered the same problem, but my column name contains a dot:

snv_df.query('Gene.refGene in ["MSH2","MSH3","MLH1","MLH3","MSH6","PMS2","PMS3"]')

I saw the changes you made in 0.25.0.
I have a suggestion.
When I read the source code of annovar, a Perl program, I found that it can use hexadecimal to represent special symbols,
e.g.

comsic=comsic\x3dxxxxxxxxxxxxxxxxxxx

A function could map special symbols to and from this form when parsing.

hwalinga · 2019-06-22T19:31:50Z

@zhaohongqiangsoliva

I don't think I understand what you are trying to say. Can you elaborate a bit more on the problem you want to solve?

solivehong · 2019-06-23T01:00:48Z

@hwalinga
Sorry, my English is not good. My point is that spaces in column names are now handled, but other symbols are still not, for example

.      /           \

My proposed change is to use hex.

dgua · 2019-06-23T03:42:56Z

I second this. ANY character should be allowed in column names. Thanks.


hwalinga · 2019-06-24T11:49:19Z

@dgua @zhaohongqiangsoliva

The problem is that the query has to become a valid Python expression. Using hex for disallowed characters won't solve this problem. Allowing spaces in the name is already based on hacking around the tokenize function (from tokenize import generate_tokens). I don't think it is impossible to allow more characters in the name, but it will be based on hacking around the tokenize function again. (Pandas won't roll its own Python parser.)

If you really want this, you are of course free to open a new issue addressing this, and if you tag me in the issue, I will explain my solution to the maintainers.
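
A small illustration of what the tokenizer sees, using only the standard library. It shows why an unquoted name with spaces cannot survive as one token; it is not the workaround used in pandas, and `name_tokens` is a hypothetical helper.

```python
import io
import tokenize


def name_tokens(source):
    """Return the NAME tokens Python's tokenizer finds in `source`."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.NAME]


# A column name with spaces is split into three separate NAME tokens.
names = name_tokens("my col name > 2")
print(names)
```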

cpcloud added the API Design label Feb 28, 2014

jreback added this to the Someday milestone Mar 6, 2014

jorisvandenbossche mentioned this issue Apr 11, 2016

Query function gives error if column names have front slash #12858

Closed

jreback added the Difficulty Advanced, Indexing (related to indexing on series/frames, not to indexes themselves), and Reshaping (concat, merge/join, stack/unstack, explode) labels Feb 28, 2017

jreback mentioned this issue Nov 12, 2017

DOC: Error msg using Python keyword in numexpr query #18221 #18248

Merged

4 tasks

hwalinga added a commit to hwalinga/pandas that referenced this issue Jan 26, 2019

TST: Add tests for backtick quoting (pandas-dev#6508)

3dcc953

hwalinga mentioned this issue Jan 26, 2019

ENH: Quoting column names containing spaces with backticks to use them in query and eval. #24955

Merged

4 tasks

hwalinga added a commit to hwalinga/pandas that referenced this issue Feb 16, 2019

TST: Add tests for backtick quoting (pandas-dev#6508)

ff463ca

jreback modified the milestones: Someday, 0.25.0 Mar 10, 2019

jreback closed this as completed in #24955 Mar 20, 2019

hwalinga mentioned this issue Jul 5, 2019

The input column name in query contains special characters #27017

Closed
