ENH: Quoting column names containing spaces with backticks to use them in query and eval. #24955

hwalinga · 2019-01-26T22:55:13Z

- closes #6508 (pandas.DataFrame.query to allow column name with space)
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

codecov · 2019-01-27T00:49:43Z

Codecov Report

Merging #24955 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24955      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52398    52406       +8     
==========================================
+ Hits        48406    48414       +8     
  Misses       3992     3992

Flag	Coverage Δ
#multiple	`90.8% <100%> (ø)`	⬆️
#single	`42.89% <40%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.92% <100%> (ø)`	⬆️
pandas/core/common.py	`98.44% <100%> (+0.03%)`	⬆️
pandas/core/computation/expr.py	`88.77% <100%> (+0.08%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 95f8dca...114b2c2.

codecov · 2019-01-27T00:49:43Z

Codecov Report

Merging #24955 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24955      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       52982    53003      +21     
==========================================
+ Hits        48356    48376      +20     
- Misses       4626     4627       +1

Flag	Coverage Δ
#multiple	`89.83% <100%> (ø)`	⬆️
#single	`41.76% <60%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/computation/common.py	`89.47% <100%> (+3.75%)`	⬆️
pandas/core/frame.py	`96.79% <100%> (ø)`	⬆️
pandas/core/generic.py	`93.52% <100%> (ø)`	⬆️
pandas/core/computation/expr.py	`88.52% <100%> (+0.35%)`	⬆️
pandas/util/testing.py	`89.3% <0%> (-0.11%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4663951...192c093.

hwalinga · 2019-01-27T01:34:15Z

Okay, apparently syntax is really important for the release notes. (Sorry, this is the first time I am doing such a thing.) So, before I make another attempt: is this correct?

 - You can now refer to a column name with spaces by quoting it in backticks for the functions :func:`DataFrame.query` and :func:`DataFrame.eval`. (:issue:`6508`)

One line, no linebreaks.

TomAugspurger

Could you add tests where your name cleaning would cause clashes? e.g. (I think)

df = pd.DataFrame({"A A": [1, 2], "A_A_": [3, 4]})

we would want `A A` to unambiguously refer to the column "A A". So we need to test cases where the user provides

- `A A`
- A_A_
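
To make the concern concrete, here is a minimal sketch of how such a clash can arise. The cleaning rule below (each space becomes an underscore, plus a trailing underscore) is an assumption chosen to match the example above, and `naive_clean` is a hypothetical helper, not the function from this PR.

```python
def naive_clean(name):
    """Hypothetical rule: spaces become underscores, plus a trailing underscore."""
    if " " not in name:
        return name
    return name.replace(" ", "_") + "_"


columns = ["A A", "A_A_"]
cleaned = [naive_clean(c) for c in columns]

# Both columns now share one identifier, so a resolver keyed on the
# cleaned names could not tell them apart.
has_clash = len(set(cleaned)) < len(cleaned)
```

With these two columns, both names are cleaned to the same string, which is exactly the case the requested tests should pin down.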

hwalinga · 2019-01-28T17:36:46Z

@TomAugspurger

Yes, that will clash. But is it really that problematic that it clashes? Maybe the user can even use it to their advantage, by using that name instead if they don't like backtick quoting.

Anyway, there are a few ways we can do something about it.

Make clash less likely

So, if we use a suffix like _SPECIAL_BACKTICK_QUOTED_SUFFIX, I think it is unlikely that this will cause any problems in the real world. It is also the easiest and most elegant option to implement: I only need to change the function in common.py.
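
A rough sketch of the suffix idea, under the assumption that spaces are replaced by underscores before the suffix is appended. `clean_with_suffix` is a hypothetical name used only for illustration; the suffix string is the one proposed in the comment.

```python
SUFFIX = "_SPECIAL_BACKTICK_QUOTED_SUFFIX"  # suffix proposed in the comment above


def clean_with_suffix(name):
    """Hypothetical helper: only names containing spaces are rewritten."""
    if " " not in name:
        return name
    return name.replace(" ", "_") + SUFFIX


cleaned = clean_with_suffix("A A")
untouched = clean_with_suffix("A_A_")
```

This does not rule out a clash entirely (a column could literally be named with the suffix), it only makes one far less likely.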

The problem with really solving this, that is, removing the ambiguity completely, is that once we are in the parser we no longer have access to the column names. The column names are somewhere in a DeepChainMap in a Scope object, and I don't know whether or how we can extract only the column names from this DeepChainMap, since a lot more has been added to it. I could try to reverse engineer this process, but it would break as soon as the process is altered. Not worth it. So we can do one of the following:

Bring the parser logic to the API level
Only at the eval function of the DataFrame do we have proper access to the column names, so only there can we perform a non-ambiguous replacement of the backtick-quoted string.

Bring a class with the column names into the parser
We can pass a class containing the column names all the way down to the parser.

jreback

I believe this patch needs to tokenize based on a quoted expression; otherwise this is quite error prone.

pandas/tests/frame/test_query_eval.py

hwalinga · 2019-01-29T10:38:22Z

I think it isn't that big of a problem to catch backtick quoting with a simple regex. Backticks are not likely to occur normally in those queries.
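
For reference, a minimal sketch of what the regex approach could look like. The pattern and the `replace_backticks` helper are assumptions for illustration, not the code in this PR.

```python
import re

# Match a backtick, one or more non-backtick characters, and a closing backtick.
BACKTICK_RE = re.compile(r"`([^`]+)`")


def replace_backticks(expr, clean):
    """Replace every backtick-quoted span in expr with clean(span)."""
    return BACKTICK_RE.sub(lambda m: clean(m.group(1)), expr)


def underscore(name):
    return name.replace(" ", "_")


result = replace_backticks("`A A` > 1 and B == 2", underscore)

# A regex cannot see string literals, so backticks inside one are rewritten too.
inside_literal = replace_backticks('s == "`x y`"', underscore)
```

The second call shows the kind of edge case a purely textual replacement cannot distinguish: the backticks sit inside a string literal but are still rewritten.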

But if we want it to be based on tokenization of a quoted expression, it is a bit more tricky. Tokenization is done by Python's own tokenization function, and that will split the backtick quoted string into multiple tokens.

There are two ways to still make this happen.

- Write some logic around Python's tokenization that catches a backtick and appends all following tokens to it, up to the closing backtick. Still a bit of a hack.
- Fork Python's tokenization method and add support for backtick quoted strings. This shouldn't be too difficult to adjust, since it can already detect normally quoted strings, but you pull in a lot of code, and all changes to that method in the Python core then have to be manually applied to the pandas tokenization method. Also, I don't know how simple it is to make that Python 2 compatible, but maybe I don't have to?
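
The first option could be sketched roughly as follows, using only the standard library `tokenize` module. This is an illustrative wrapper under stated assumptions (single-line expressions, Python 3), and `merge_backtick_tokens` is a hypothetical name, not the function that was eventually merged.

```python
import io
import tokenize


def merge_backtick_tokens(source):
    """Yield (kind, text) pairs, treating each backtick-quoted span as one item.

    Assumes a single-line expression, as is typical for query strings.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.string == "`":
            start_col = tok.start[1] + 1  # first character after the opening backtick
            for inner in tokens:
                if inner.string == "`":
                    # Slice the original text so the inner spacing is preserved.
                    yield ("BACKTICK", source[start_col:inner.start[1]])
                    break
            else:
                raise SyntaxError("unclosed backtick in {!r}".format(source))
        elif tok.string.strip():
            # Skip whitespace-only and empty tokens (NEWLINE, ENDMARKER).
            yield ("OTHER", tok.string)


merged = list(merge_backtick_tokens("`A A` > 1"))
```

The wrapper compares token text rather than token type, since the type the tokenizer assigns to a lone backtick is not something the sketch should depend on.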

I can change the test to use the fixtures and add some more mixed cases, but what does IOW mean?

jreback · 2019-01-29T12:42:08Z

@hwalinga .query is just a user convenience function; it is not meant to fully replicate the API at all. You can always and unambiguously do df[df['columns with spaces']>1] to get complete API functionality. I am not averse to using a uuid in the column name (this is how you guarantee that you never have clashes), but I still prefer to tokenize properly (and of course I don't mean adding our own custom tokenizer).
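
A small sketch comparing the two spellings. The boolean-indexing form works for any column name; the backtick form assumes pandas 0.25.0 or later, the version this PR is milestoned for. The frame and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A A": [1, 2, 3], "B": [4, 5, 6]})

# Plain boolean indexing: always available, no parsing involved.
by_mask = df[df["A A"] > 1]

# Backtick quoting inside query (assumes pandas >= 0.25.0).
by_query = df.query("`A A` > 1")
```

Both expressions select the same rows, so the query form is purely a convenience.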

hwalinga · 2019-01-29T15:24:21Z

@jreback If it is just a convenience function, why not make it even more convenient? :)

But let's focus on what has to be done first. One by one.

I think I can write some kind of wrapper around Python's tokenizer to catch backtick quoted strings.

I don't think a uuid will help. Eventually numexpr needs to receive valid Python identifiers, so I think a more complex suffix will be enough to prevent clashes. This would then be the same as how @ is replaced by local variables.
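
The identifier constraint can be checked with `str.isidentifier`. The sketch below only illustrates that constraint; the underscore-prefixed naming scheme is a hypothetical choice, not what the PR does.

```python
import uuid

# The canonical string form of a uuid contains hyphens.
raw = str(uuid.uuid4())

# Hypothetical scheme: an underscore prefix followed by the 32 hex digits.
prefixed = "_" + uuid.uuid4().hex

raw_ok = raw.isidentifier()
prefixed_ok = prefixed.isidentifier()
```

A raw uuid string is not usable as a name, while a prefixed hex form is, so any uuid-based scheme would still need this kind of massaging.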

Since switching to the tokenizer is a different approach, is it okay if I open a separate pull request for that?

hwalinga · 2019-02-16T12:25:22Z

@jreback I changed the tokenize function so that backtick quoted strings are now a single token.

PS. The failed test on macOS is an off-by-one error in a datetime index test. It seems unrelated to my changes.

pandas/core/frame.py

hwalinga · 2019-03-09T21:35:03Z

@jreback Can you see if this is now ready to be included?

WillAyd · 2019-03-09T22:03:20Z

@hwalinga looks like you have a merge conflict in the whatsnew that needs to be resolved

pandas/core/computation/common.py

pandas/core/computation/expr.py

pandas/core/frame.py

jreback

pls respond to the questions

pandas/compat/__init__.py

pandas/core/computation/expr.py

jreback

small comments, otherwise lgtm. ping on green.

jreback · 2019-03-10T22:41:56Z

pandas/core/frame.py

@@ -2967,6 +2967,15 @@ def query(self, expr, inplace=False, **kwargs):
            The query string to evaluate.  You can refer to variables
            in the environment by prefixing them with an '@' character like
            ``@a + b``.
+
+            .. versionadded:: 0.25.0


can you add an example in the Examples section as well

Done, but don't know what this means:

1 Warnings found:
No extended summary found
Docstring for "pandas.DataFrame.query" correct. :)

pandas/core/frame.py

jreback · 2019-03-10T22:42:43Z

pandas/core/generic.py

@@ -38,6 +38,7 @@
 import pandas.core.algorithms as algos
 from pandas.core.base import PandasObject, SelectionMixin
 import pandas.core.common as com
+from pandas.core.computation.common import _remove_spaces_column_name


import this locally in the function (as we have some import restrictions around computation)

jreback · 2019-03-20T01:56:11Z

can you merge master

hwalinga · 2019-03-20T08:09:25Z

@jreback Merge conflicts resolved.

jreback · 2019-03-20T12:24:22Z

thanks @hwalinga nice patch!

* upstream/master: (55 commits) PERF: Improve performance of StataReader (pandas-dev#25780) Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784) BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588) BUG-24971 copying blocks also considers ndim (pandas-dev#25521) CLN: Panel reference from documentation (pandas-dev#25649) ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955) BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769) DOC: clean bug fix section in whatsnew (pandas-dev#25792) DOC: Fixed PeriodArray api ref (pandas-dev#25526) Move locale code out of tm, into _config (pandas-dev#25757) Unpin pycodestyle (pandas-dev#25789) Add test for rdivmod on EA array (GH23287) (pandas-dev#24047) ENH: Support datetime.timezone objects (pandas-dev#25065) Cython language level 3 (pandas-dev#24538) API: concat on sparse values (pandas-dev#25719) TST: assert_produces_warning works with filterwarnings (pandas-dev#25721) make core.config self-contained (pandas-dev#25613) CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721) TST: Check pytables<3.5.1 when skipping (pandas-dev#25773) DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770) ...

hwalinga changed the title ~~Quoting column names containing spaces with backticks to use them in query and eval.~~ ENH: Quoting column names containing spaces with backticks to use them in query and eval. Jan 27, 2019

TomAugspurger reviewed Jan 28, 2019

jreback requested changes Jan 28, 2019

pandas/tests/frame/test_query_eval.py (outdated)

pandas/tests/frame/test_query_eval.py (outdated)

gfyoung added the API Design label Jan 30, 2019

hwalinga added 4 commits February 15, 2019 22:43

TST: Add tests for backtick quoting (pandas-dev#6508)

ff463ca

Update docstring query about quoting backtick variables

db9c769

Fixed whatsnew entry

22686fd

Backtick quotes are now tokenized. More tests and pytest fixtures

bfebb9d

hwalinga force-pushed the quoting-names-backticks branch from 2f0b462 to bfebb9d Compare February 16, 2019 00:05

hwalinga added 2 commits February 16, 2019 02:04

Use compat.map; No import alias (operator) to prevent name shadowing

a65f5a5

Fix import order; Remove debug print;

da60955

jreback requested changes Feb 16, 2019

pandas/core/frame.py (outdated)

pandas/core/frame.py (outdated)

pandas/core/frame.py (outdated)

hwalinga added 2 commits February 25, 2019 00:02

Add 'versionadded' and move column resolvers logic to common.py.

2125068

Solve merge conflict whatsnew doc.

5200b0c

jreback requested changes Mar 10, 2019

pandas/core/computation/common.py

pandas/core/computation/expr.py

pandas/core/frame.py (outdated)

pandas/core/frame.py

hwalinga added 3 commits March 10, 2019 19:13

More clarity in comments; Moved column resolver to class; Use uuid

63c25bf

Merge conflict

b104766

uuid3 python2/3 compatible function added

e496671

jreback requested changes Mar 10, 2019

pandas/compat/__init__.py (outdated)

pandas/core/computation/expr.py

Reverted uuid3

d3877d1

jreback requested changes Mar 10, 2019

jreback added this to the 0.25.0 milestone Mar 10, 2019

Local import for computation.common; Added example in query

bb62d73

Solve merge conflict whatsnew doc.

192c093

jreback approved these changes Mar 20, 2019

jreback merged commit 02ada08 into pandas-dev:master Mar 20, 2019

hwalinga mentioned this pull request Jul 5, 2019

The input column name in query contains special characters #27017

Closed

hwalinga mentioned this pull request Aug 29, 2019

Add function to clean up column names with special characters #28215

Merged

5 tasks
