ENH: Quoting column names containing spaces with backticks to use them in query and eval. #24955

Merged
merged 14 commits into from
Mar 20, 2019

Conversation

hwalinga
Contributor

@codecov

codecov bot commented Jan 27, 2019

Codecov Report

Merging #24955 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24955      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52398    52406       +8     
==========================================
+ Hits        48406    48414       +8     
  Misses       3992     3992

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | 90.8% <100%> (ø) | ⬆️ |
| #single | 42.89% <40%> (-0.01%) | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/frame.py | 96.92% <100%> (ø) | ⬆️ |
| pandas/core/common.py | 98.44% <100%> (+0.03%) | ⬆️ |
| pandas/core/computation/expr.py | 88.77% <100%> (+0.08%) | ⬆️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95f8dca...114b2c2. Read the comment docs.

@codecov

codecov bot commented Jan 27, 2019

Codecov Report

Merging #24955 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24955      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       52982    53003      +21     
==========================================
+ Hits        48356    48376      +20     
- Misses       4626     4627       +1

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | 89.83% <100%> (ø) | ⬆️ |
| #single | 41.76% <60%> (+0.01%) | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/computation/common.py | 89.47% <100%> (+3.75%) | ⬆️ |
| pandas/core/frame.py | 96.79% <100%> (ø) | ⬆️ |
| pandas/core/generic.py | 93.52% <100%> (ø) | ⬆️ |
| pandas/core/computation/expr.py | 88.52% <100%> (+0.35%) | ⬆️ |
| pandas/util/testing.py | 89.3% <0%> (-0.11%) | ⬇️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...192c093. Read the comment docs.

@hwalinga
Contributor Author

Okay, apparently syntax is really important for the release notes. (Sorry, this is the first time I am doing such a thing.) So, before I make another attempt: is this correct?

 - You can now refer to a column name with spaces by quoting it in backticks for the functions :func:`DataFrame.query` and :func:`DataFrame.eval`. (:issue:`6508`)

One line, no linebreaks.

@hwalinga hwalinga changed the title from "Quoting column names containing spaces with backticks to use them in query and eval." to "ENH: Quoting column names containing spaces with backticks to use them in query and eval." on Jan 27, 2019
Contributor

@TomAugspurger TomAugspurger left a comment

Could you add tests where your name cleaning would cause clashes? e.g. (I think)

df = pd.DataFrame({"A A": [1, 2], "A_A_": [3, 4]})

we would want `A A` to unambiguously refer to the column "A A". So we need to test cases where the user provides

- `A A`
- A_A_
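
For illustration, a minimal sketch of the clash described above. The cleaning rule used here (`naive_clean`, spaces replaced by underscores plus a trailing underscore) is an assumption made for this example; the thread does not show the exact rule the patch used at this point.

```python
def naive_clean(name: str) -> str:
    """Hypothetical cleaning rule: spaces become underscores, plus a trailing underscore."""
    return name.replace(" ", "_") + "_"


columns = ["A A", "A_A_"]
cleaned = {col: naive_clean(col) for col in columns}

# A cleaned name clashes when it is identical to a different, real column name.
clashes = [col for col, new in cleaned.items() if new != col and new in columns]
print(clashes)  # ['A A']
```

Under this assumed rule, the cleaned form of "A A" is indistinguishable from the existing column "A_A_", which is the ambiguity the requested tests are meant to cover.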

@hwalinga
Contributor Author

hwalinga commented Jan 28, 2019

@TomAugspurger

Yes, that will clash. But is it really that problematic? Users who don't like backtick quoting could even take advantage of it and use the cleaned name instead.

Anyway, there are a few ways we can do something about it.

  • Make clash less likely

So, if we use a suffix like _SPECIAL_BACKTICK_QUOTED_SUFFIX, I think it is unlikely to cause any problems in the real world. It is also the easiest and most elegant option to implement: I only need to change the function in common.py.
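
A rough sketch of the suffix idea, using the suffix named in this comment. The helper name `clean_with_suffix` and the replace-spaces-with-underscores step are assumptions for illustration, not the code from the patch.

```python
SUFFIX = "_SPECIAL_BACKTICK_QUOTED_SUFFIX"  # suffix proposed in the comment


def clean_with_suffix(name: str) -> str:
    """Hypothetical helper: make a name with spaces usable as an identifier."""
    return name.replace(" ", "_") + SUFFIX


columns = ["A A", "A_A_"]
cleaned = clean_with_suffix("A A")
print(cleaned)  # A_A_SPECIAL_BACKTICK_QUOTED_SUFFIX
print(cleaned.isidentifier(), cleaned in columns)  # True False
```

A clash is still possible if a real column happens to carry the same suffix, so this makes collisions unlikely rather than impossible.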


The problem with really solving this, that is, removing the ambiguity completely, is that once we are in the parser we no longer have access to the column names. The column names are somewhere in a DeepChainMap in a Scope object, and I don't know whether or how we can extract only the column names from this DeepChainMap, since a lot more has been added to it. I could try to reverse engineer this process, but it would break as soon as the process is altered. Not worth it. So we can do one of the following:

  • Bring the parser logic to the api level
    Only at the eval function of the DataFrame do we have proper access to the column names, so only there can we perform an unambiguous replacement of the backtick-quoted string.

  • Bring a class with the column names all into the parser
    We can bring a class containing the column names all the way to the parser.
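
As a sketch of what an unambiguous replacement could look like when the column names are available, the following picks an identifier that is guaranteed not to collide with any existing column. The function `resolve_backticks` and its behaviour are hypothetical; it is not part of pandas and does not guard against Python keywords.

```python
import re


def resolve_backticks(expr, columns):
    """Return (rewritten_expr, mapping) with one unused identifier per quoted name."""
    taken = set(columns)
    mapping = {}

    def pick_identifier(match):
        name = match.group(1)
        if name not in mapping:
            # Replace every non-word character, then make sure it is an identifier.
            candidate = re.sub(r"\W", "_", name)
            if not candidate.isidentifier():
                candidate = "_" + candidate
            # Append underscores until the name is unused.
            while candidate in taken:
                candidate += "_"
            taken.add(candidate)
            mapping[name] = candidate
        return mapping[name]

    return re.sub(r"`([^`]*)`", pick_identifier, expr), mapping


new_expr, mapping = resolve_backticks("`A A` + A_A_", ["A A", "A_A_", "A_A"])
print(new_expr)  # A_A__ + A_A_
print(mapping)   # {'A A': 'A_A__'}
```

The caller would then need to expose the column under the mapped name when evaluating the rewritten expression.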

Contributor

@jreback jreback left a comment

I believe this patch needs to tokenize based on a quoted expression; otherwise this is quite error prone.

@hwalinga
Contributor Author

I think it isn't that big of a problem to catch backtick quoting with a simple regex. Backticks are not likely to occur normally in those queries.
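
A minimal sketch of the regex approach, assuming a hypothetical suffix and helper name (`SUFFIX`, `replace_backticks`). The second call shows one case a plain regex cannot distinguish: backticks that appear inside a string literal are rewritten too.

```python
import re

SUFFIX = "_BACKTICK_QUOTED"  # hypothetical suffix, chosen for this sketch


def replace_backticks(expr: str) -> str:
    """Rewrite `some name` as some_name_BACKTICK_QUOTED using one regex."""
    return re.sub(
        r"`([^`]*)`",
        lambda m: m.group(1).replace(" ", "_") + SUFFIX,
        expr,
    )


print(replace_backticks("`A A` > 1"))
# A_A_BACKTICK_QUOTED > 1
print(replace_backticks("label == '`A A`'"))
# label == 'A_A_BACKTICK_QUOTED'
```

Whether the second case matters in practice depends on how often backticks occur in real query strings.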

But if we want it to be based on tokenization of a quoted expression, it is a bit more tricky. Tokenization is done by Python's own tokenize function, and that will split the backtick-quoted string into multiple tokens.
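
The splitting can be seen with the standard library `tokenize` module. This sketch only looks at the token strings, since the token type reported for a backtick differs between Python versions.

```python
import io
import tokenize

expr = "`A A` > 1"
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(expr).readline)
    if tok.string.strip()  # drop the empty NEWLINE/ENDMARKER strings
]
print(tokens)  # ['`', 'A', 'A', '`', '>', '1']
```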

There are two ways to still make this happen.

  • Write some logic around Python's tokenization that catches a backtick and appends all following tokens to it. Still a bit of a hack.

  • Fork Python's tokenization method and add support for backtick-quoted strings. This shouldn't be too difficult, since it can already detect normally quoted strings, but it pulls in a lot of code, and all later changes to that method in the Python core then have to be applied manually to the pandas copy. Also, I don't know how simple it is to make that Python 2 compatible, but maybe I don't have to?
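
The first option could be sketched roughly as follows. The names `tokenize_with_backticks`, `BACKTICK_QUOTED` and `PLAIN` are made up for this example, and this is not necessarily how the merged patch does it.

```python
import io
import itertools
import tokenize

BACKTICK_QUOTED = "BACKTICK_QUOTED"  # hypothetical marker for a merged token
PLAIN = "PLAIN"                      # hypothetical marker for every other token


def tokenize_with_backticks(source):
    """Yield (kind, text) pairs, merging everything between two backticks."""
    token_strings = (
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    )
    for text in token_strings:
        if text == "`":
            # Consume tokens up to and including the closing backtick.
            inner = itertools.takewhile(lambda s: s != "`", token_strings)
            yield BACKTICK_QUOTED, " ".join(s for s in inner if s.strip())
        elif text.strip():
            yield PLAIN, text


print(list(tokenize_with_backticks("`A A` > 1")))
# [('BACKTICK_QUOTED', 'A A'), ('PLAIN', '>'), ('PLAIN', '1')]
```

Because the inner tokens are re-joined with single spaces, a name containing several consecutive spaces would not round-trip in this sketch.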


I can change the test to use the fixtures and add some more mixed cases, but what does IOW mean?

@jreback
Contributor

jreback commented Jan 29, 2019

@hwalinga .query is just a user convenience function; it is not meant to fully replicate the API. You can always, and unambiguously, do df[df['columns with spaces']>1] to get complete API functionality. I am not averse to using a uuid in the column name (that is how you guarantee that you never have clashes), but I still prefer to tokenize properly (and of course I don't mean adding our own custom tokenizer).
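
For comparison, a short sketch of plain boolean indexing next to the backtick form this PR adds. It assumes pandas 0.25 or later; the column names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"column with spaces": [1, 2, 3], "b": [10, 20, 30]})

# Boolean indexing works for any column name.
by_mask = df[df["column with spaces"] > 1]

# Backtick quoting inside query (assumes pandas >= 0.25).
by_query = df.query("`column with spaces` > 1")

print(by_mask.equals(by_query))  # True
```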

@hwalinga
Contributor Author

@jreback If it is just a convenience function, why not make it even more convenient? :)

But let's focus on what has to be done first. One by one.


I think I can write some kind of wrapper around Python's tokenizer to catch backtick quoted strings.


I don't think a uuid will help. Eventually numexpr needs to receive valid Python identifiers, so I think a more complex suffix will be enough to prevent clashes. This would then be the same as how @ is replaced by local variables.
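
The identifier constraint can be checked with `str.isidentifier`. In this sketch the underscore-prefixed hex form is only one assumed way a uuid could be adapted; it is not something proposed in the thread.

```python
import uuid

raw = str(uuid.uuid4())
print(raw.isidentifier())  # False: the canonical form contains hyphens

# One possible adaptation (an assumption): hex digits behind a leading underscore.
as_name = "_" + uuid.uuid4().hex
print(as_name.isidentifier())  # True

print("A_A_SPECIAL_BACKTICK_QUOTED_SUFFIX".isidentifier())  # True
```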


Since switching to the tokenizer is a different approach, is it okay if I open a separate pull request for that?

@hwalinga hwalinga force-pushed the quoting-names-backticks branch from 2f0b462 to bfebb9d on February 16, 2019 00:05
@hwalinga
Contributor Author

@jreback I changed the tokenize function so that backtick quoted strings are now a single token.

PS. The failed test on macOS is an off-by-one error in a datetime index test. It seems unrelated to my changes.

@hwalinga
Contributor Author

hwalinga commented Mar 9, 2019

@jreback Can you see if this is now ready to be included?

@WillAyd
Member

WillAyd commented Mar 9, 2019

@hwalinga looks like you have a merge conflict in the whatsnew that needs to be resolved

Contributor

@jreback jreback left a comment

pls respond to the questions

Contributor

@jreback jreback left a comment

small comments, otherwise lgtm. ping on green.

@@ -2967,6 +2967,15 @@ def query(self, expr, inplace=False, **kwargs):
The query string to evaluate. You can refer to variables
in the environment by prefixing them with an '@' character like
``@a + b``.

.. versionadded:: 0.25.0
Contributor

can you add an example in the Examples section as well

Contributor Author

Done, but don't know what this means:

1 Warnings found:
No extended summary found
Docstring for "pandas.DataFrame.query" correct. :)

@@ -38,6 +38,7 @@
import pandas.core.algorithms as algos
from pandas.core.base import PandasObject, SelectionMixin
import pandas.core.common as com
from pandas.core.computation.common import _remove_spaces_column_name
Contributor

import this locally in the function (as we have some import restrictions around computation)

Contributor Author

Done

@jreback jreback added this to the 0.25.0 milestone Mar 10, 2019
@jreback
Contributor

jreback commented Mar 20, 2019

can you merge master

@hwalinga
Contributor Author

@jreback Merge conflicts resolved.

@jreback jreback merged commit 02ada08 into pandas-dev:master Mar 20, 2019
@jreback
Contributor

jreback commented Mar 20, 2019

thanks @hwalinga nice patch!

thoo added a commit to thoo/pandas that referenced this pull request Mar 20, 2019
* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage  (pandas-dev#25770)
  ...

Successfully merging this pull request may close these issues.

pandas.DataFrame.query to allow column name with space
5 participants