-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Quoting column names containing spaces with backticks to use them in query and eval. #24955
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24955 +/- ##
==========================================
+ Coverage 92.38% 92.38% +<.01%
==========================================
Files 166 166
Lines 52398 52406 +8
==========================================
+ Hits 48406 48414 +8
Misses 3992 3992
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24955 +/- ##
==========================================
+ Coverage 91.26% 91.27% +<.01%
==========================================
Files 173 173
Lines 52982 53003 +21
==========================================
+ Hits 48356 48376 +20
- Misses 4626 4627 +1
Continue to review full report at Codecov.
|
Okay, apparently syntax is really important for the release notes. (Sorry, first time I am doing such a thing.) So, before I make another attempt. Is this correct?
One line, no linebreaks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests where your name cleaning would cause clashes? e.g. (I think)
df = pd.DataFrame({"A A": [1, 2], "A_A_": [3, 4]})
we would want `A A` to unambiguously refer to the column "A A". So we need to test cases where the user provides
- `A A`
- A_A_
Yes, that will clash. But is it really that problematic that it clashes? Maybe it can be used as an advantage to the user, to use that name instead, if he/she doesn't like backtick quoting. Anyway, there are a few ways we can do something about it.
So, if we use a suffix like: The problem with really solving is, that is complete ambiguity, is that when we are in the parser, we don't have access to the column names anymore. The column names are somewhere in a DeepChainMap in a Scope object and I don't know how and if we can extract only the column names from this DeepChainMap, since a lot more has been added to it. I can try to reverse engineer this process, but if the process is altered, it will break. Not worth it. So we can do:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this patch need to tokenize based on a quoted expression, else as this is quite error prone.
I think it isn't that big of a problem to catch backtick quoting with a simple regex. Backticks are not likely to occur normally in those queries. But if we want it to be based on tokenization of a quoted expression, it is a bit more tricky. Tokenization is done by python's own tokenization function and that will split the backtick quoted string into multiple tokens. There are two ways to still make this happen.
I can change the test to use the fixtures and add some more mixed cases, but what does IOW mean? |
@hwalinga |
@jreback If it is just a convenience function, why not make it even more convenient? :) But let's focus on what has to be done first. One by one. I think I can write some kind of wrapper around Python's tokenizer to catch backtick quoted strings. I don't think a uuid will help. Eventually numexpr needs to receive valid Python identifiers, so I think a more complex suffix will be enough to prevent clashes. This would than be the same as how @ is replaced by local variables. Since changing the approach to the tokenizer instead is a different approach, is it okay that I make a different pull request for that? |
2f0b462
to
bfebb9d
Compare
@jreback I changed the tokenize function so that backtick quoted strings are now a single token. PS. The failed test on macOS is an off by one error in a datetime index test. Seems unrelated to my changes. |
@jreback Can you see if this is now ready to be included? |
@hwalinga looks like you have a merge conflict in the whatsnew that needs to be resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls respond to the questions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments, otherwise lgtm. ping on green.
@@ -2967,6 +2967,15 @@ def query(self, expr, inplace=False, **kwargs): | |||
The query string to evaluate. You can refer to variables | |||
in the environment by prefixing them with an '@' character like | |||
``@a + b``. | |||
|
|||
.. versionadded:: 0.25.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example in the Examples section as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but don't know what this means:
1 Warnings found:
No extended summary found
Docstring for "pandas.DataFrame.query" correct. :)
pandas/core/generic.py
Outdated
@@ -38,6 +38,7 @@ | |||
import pandas.core.algorithms as algos | |||
from pandas.core.base import PandasObject, SelectionMixin | |||
import pandas.core.common as com | |||
from pandas.core.computation.common import _remove_spaces_column_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import this locally in the function (as we have some restricted import about computation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
can you merge master |
@jreback Merge conflicts resolved. |
thanks @hwalinga nice patch! |
* upstream/master: (55 commits) PERF: Improve performance of StataReader (pandas-dev#25780) Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784) BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588) BUG-24971 copying blocks also considers ndim (pandas-dev#25521) CLN: Panel reference from documentation (pandas-dev#25649) ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955) BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769) DOC: clean bug fix section in whatsnew (pandas-dev#25792) DOC: Fixed PeriodArray api ref (pandas-dev#25526) Move locale code out of tm, into _config (pandas-dev#25757) Unpin pycodestyle (pandas-dev#25789) Add test for rdivmod on EA array (GH23287) (pandas-dev#24047) ENH: Support datetime.timezone objects (pandas-dev#25065) Cython language level 3 (pandas-dev#24538) API: concat on sparse values (pandas-dev#25719) TST: assert_produces_warning works with filterwarnings (pandas-dev#25721) make core.config self-contained (pandas-dev#25613) CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721) TST: Check pytables<3.5.1 when skipping (pandas-dev#25773) DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770) ...
git diff upstream/master -u -- "*.py" | flake8 --diff