-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warn that evaluate() should not be used on user input #442
Comments
Yeah developers seem to be using NumExpr more and more as a parser for handling input from a UI. It would be much safer if we could use
Sanitization could be enabled by default but give the user the means to turn it off. |
Alright I implemented a check in 4b2d89c that forbids certain operators used in the expression. Namely: If we want to ban Regardless I think this is a good step forward and we're at the point where we should do a new release. |
Looks good to me! |
Ok, made some more improvements. I can ban attribute access to everything but
I also strip the string of all whitespace beforehand. |
Closing with release of 2.8.5. |
Hi @robbmcleod, just wanted to point out that you can still access other attributes because Python translates some utf chars into ascii chars automatically (mainly greek alphabet used in math), thus:
Results in:
It might be better to whitelist |
@jan-kubena ok thanks for the heads up. The trouble here is that decimal needs to work, and numbers can appear in variable names, but not at the start. Do you happen to know where the documentation for which character sets are mangled is? The really proper way to do it would be to back-port the |
Apparently the I'm of two minds about this. I could regex against "[a-zA-Z0-9_]+" instead of just "__". But for the security conscious, it does feel to me that NumExpr shouldn't be able to access private variables in the |
Hi folks, I have a simpler example which also breaks in the new version: import pandas as pd
name = "Mass (kg)"
df = pd.DataFrame({name: [200, 300, 400]})
df.query(f"`{name}` < 300")
I understand |
Hi folks, import numexpr
numexpr.evaluate('a__b / 2', {'a__b': 4}) Or a oneliner:
In numexpr==2.8.4 that one works perfectly. Is it an intentional feature or something that can be workedaround easily (sending some kwarg to ignore the error)? Many thanks! |
@nicoddemus and @zorion, just waiting to hear back from Pandas' devs on the original report: pandas-dev/pandas#54449 before I do anything. I've tried in the past to establish some sort of line of communication with them and gotten crickets. |
Hi, thanks for your reply. I think we are having a legit use of dunder but it is not allowed in a breaking change from 2.8.4 to 2.8.5 so we have to fix our version to 2.8.4 (or lower) and this is an ok workaround for now. Many thanks in advance! |
Hi, one of the pandas devs here. I don't maintain the eval code (I don't think anyone still here does anymore), so not really the best person to comment on this. Is there a way to gate this stricter checking behind an option? Thanks, |
@lithomas1 and company, I see a few approaches here.
In order to avoid forcing everyone to do an emergency release, we probably need to do both, assuming the option defaults to sanitize. |
Thanks, this sounds good to me. It would be nice to actually fix pandas, however, I'm not too sure what the future of |
There's actually two changes in numexpr 2.8.5 that fail pandas tests - this, and a change to integer overflow behaviour (possibly a result of the negative-powers changes, but I'm not sure yet) pandas-dev/pandas#54546 |
@rebecca-palmer You can make a new issue if you want, but NumExpr could never cover |
@zorion, @nicoddemus, @lithomas1 I made a push in 397cc98 that should hopefully fix these issues, if you could please test?
If you have a chance please test and provide me with any feedback you may have. |
Hi @robbmcleod, Thanks for attempting a fix. My original example now passes, but this one breaks (reduced from the actual code): import pandas as pd
name = "II (MM)"
df = pd.DataFrame({name: [200, 300, 400]})
df.query(f"`{name}` >= 3.1e-05")
The problem in this case seems to be the scientific notation, if I change import pandas as pd
name = "II (MM)"
df = pd.DataFrame({name: [200, 300, 400]})
df.query(f"`{name}` >= 0.000031")
|
@nicoddemus @robbmcleod I think changing _attr_pat to r'.\b(?!(real|imag|\d+e?)\b)' (i.e. adding 'e?') fixes that, but I haven't actually tested it. |
Perhaps a more reliable approach would be to use ast.parse, and reject the tree if we find statements, lambdas, dunder import, etc? |
@robbmcleod I've added some comments in the commit - do those notify anyone when it's already on the main branch? |
It needs to match I definitely don't get notifications on commits. |
@nicoddemus I did use |
FWIW, NumExpr being a legacy piece of code seems to be in fairly widespread use in other legacy systems (e.g. the Pandas There were definitely some growing pains with writing the sanitizer, but to me that was expected. It does seem to work now. It's very hard for me to see any way to bypass it and execute malicious code. I did try and write a whitelist, but that's considerably harder to regex. Regarding the choice to default to I've been thinking if we did want to default to not sanitizing the input string, we could instead show a warning to the user. This warning could be suppressed by setting an environment variable, such as |
I have a workaround that involves running an external parser based on pyparsing to pre-validate expressions before passing them on to numexpr. This is not fully tested, but I am considering this for my own use. I realize it might not be possible to add an additional package requirement to numexpr, but maybe a similar approach would be practical as opposed to a sanitation approach? This would admittedly cause a performance hit but maybe not huge for the typical numexpr use cases. I suppose the fact that python eval is even involved might mean there are edge cases and a pre-parser might break some expected functionality of numexpr? I think it's fine for my use case though. from pyparsing import (
infix_notation,
one_of,
OpAssoc,
Literal,
Forward,
Group,
Suppress,
Optional,
delimited_list,
ParserElement,
)
from numexpr.necompiler import vml_functions
from pyparsing.common import pyparsing_common
ParserElement.enablePackrat()
LPAREN, RPAREN = map(Suppress, "()")
NUMEXPR_FUNCS = vml_functions + ["where"]
def get_parser():
integer = pyparsing_common.integer
real = pyparsing_common.real | pyparsing_common.sci_real
imaginary = (real | integer) + one_of("j J")
arith_expr = Forward()
fn_call = Group(
one_of(NUMEXPR_FUNCS)
+ LPAREN
- Group(Optional(delimited_list(arith_expr)))
+ RPAREN
)
operand = (
fn_call | imaginary | real | integer | pyparsing_common.identifier
)
bitwise_operators = one_of("& | ~ ^")
comparison_operators = one_of("< <= == != >= >")
unary_arithmetic = one_of("-")
binary_arithmetic = one_of("+ - * / ** % << >>")
arith_expr << infix_notation(
operand,
[
(bitwise_operators, 2, OpAssoc.LEFT, None),
(comparison_operators, 2, OpAssoc.LEFT, None),
(unary_arithmetic, 1, OpAssoc.RIGHT, None),
(binary_arithmetic, 2, OpAssoc.LEFT, None),
],
)
return arith_expr
def test_passing_expressions():
parser = numexpr_expression_parser.get_parser()
result, parse_results = parser.runTests([
"where(a) == 1+2e6j",
"1 + 2.0 + _abc + sin(o)",
"1 + 2.0 + __abc", # __abc is a valid identifier
"1 + 2.0 + _abc + sin(o)",
])
assert result
def test_failing_expressions():
parser = numexpr_expression_parser.get_parser()
result, parse_results = parser.runTests([
"eval(123)"
])
assert not result |
That would be very welcome. pip-audit is now failing my builds due to PYSEC-2023-163 (this issue). |
I added the means to turn the
Generally speaking I think this shouldn't be any more so a security hole than allowing people to pass I tested with Otherwise everything is good to release 2.8.6. I'll give everyone a day to comment. |
@smorken we could consider adding that as a code snippet to the documentation? Perhaps some section titled "Using NumExpr for evaluating user inputs?" |
Sure, by all means, it's mostly just cobbled together from |
2.8.6 has been released, we'll see if there are any further troubles. ::crosses fingers:: @smorken if you want to write a gist I can link to it? |
The 2.8.6 version was just flagged by the same org that flagged 2.8.5: https://vulners.com/osv/OSV:PYSEC-2023-163 I suspect that the real problem is that the LangChain code is the real vulnerability and that numexpr is just exposing what Python itself exposes--an eval that can execute arbitrary code. In my view, the library itself should not be tagged unless it can be exploited by means other than using its standard API in normal ways. But any application that exposes |
I came across this from a test failure in my X-ray data analysis codes that uses a library (pyFAI) that uses I am shocked to learn that numexpr uses Trying to parse Python expressions yourself is foolhardy, especially since Python exposes its own parser with With |
I spent a couple of hours packaging and testing that snippet I posted here before. It's a strict infix pre-parser that supports (much of) the |
@smorken. Well, that would definitely preserve the errant behavior of #449:
Look I am an outsider here, so feel free to cast me as a bad guy I do not mean any ill-will to anyone. But, I am just shocked to learn that I am dismayed at the attempts discussed here to try to create a pre-parser so that Please do not use |
@newville thanks for taking the time to try it out. Added a fix here, to do with the pyparsing objects. I really don't want to distract from the issue here at hand any further with those scripts though, and to be clear I am not proposing that they be added into numexpr as a robust solution. As |
Alright, I'm going to try and provide a bit of history of this project because we have so many people coming here without context. NumExpr (NE) was originally written by David Cooke in the late 2000s for Python 2.5 as a way to accelerate NumPy calculations. He's no longer involved in open source and no one knows where he is or how to contact him. Because NE predates many things in modern Python, the source has a lot of technical debt and it implements its own Abstract Syntax Tree (AST), because the Incidentally the https://greentreesnakes.readthedocs.io/en/latest/ , if you want to understand how it actually works. Francesc Alted took over maintenance in that void because PyTables used NumExpr to do queries. I was using NumExpr 2015-16 to accelerate some scientific calcs without having to implement customer functions in the C-API all the time. After I quit my Swiss post-doc and moved back to Canada I started on a project to make "NumExpr 3.0" which had the potential to fix all the shortfalls in 2.0. However, it was an (overly) ambitious project, and I got a paying job, and my free time evaporated. The 3.0 branch does use the My thinking at that time was that people would stop using NE and switch to Numba. I saw NE3 as having a potentially niche for "write once; execute on big data" scripts where the user didn't want to unravel their vectorized calculations to use Numba. Little did I know at that time that a lot of people were using NE2 for a purpose it was never (IMO) intended to be used for: parsing user inputs. This is very clear if you look at the myriad of reported issues on this repo where people are using NumExpr to parse singletons (and hence running into issues with the NE AST), whereas in my opinion NE was intended to be a blocked-calculation virtual machine to avoid being calculation limited by memory bandwidth. NE is extremely inefficient for parsing singleton inputs. CPython itself is more efficient. I've tried repeatedly in the Issues tracker to discourage people from using NE in this way, to no avail. I personally do not have the bandwidth to implement a new AST in the package. In 2017 yes, but not in 2023. Now, if someone wants to take over maintenance of the project, I'd be thrilled to hand it over. It's possible money could be sourced from one of the open-source funding agencies. The adjacent packages: NumPy, Pandas, and PyTables are all funded. Francesc has approached me in the past asking if I wanted to be part of one of the grant applications, and I said no. I've also been asked to consult for companies using NumExpr internally and again I said no (although this has dried up over the past couple of years). For me, it's not about the money, it's about my personal time. If I ask someone, "can you please write a unit test for this edge case?" and I get a |
Thanks @robbmcleod for maintaining this project! As this issue highlights, numexpr is a very much used piece of code. Since the effort of refactoring/rewriting numexpr is too huge, a disclaimer in the documentation about security issues as initially proposed sounds more affordable. Regarding the issue with scientific notation in v2.8.6, PR #451 proposes both a test and a fix for it. |
If memory serves, It is OK to play the "this is a very old codebase" card, but the ast module is hardly new. At many points in the process, the developers apparently decided to stick with "home-built" instead of "standard library". That's OK, if they are able and willing to maintain it. FWIW, the origins of
Yes, the But also: all of the well-meaning suggestions and bugfixes here about regular expressions, "dunder" names, and using Most of what is dangerous about
Well, definitely Thanks to you and Fransesc (and David Cooke) for doing that -- it is much appreciated.
Yeah, I understand that...
I understand that too. The numexpr devels might just decide that replacing
Well, I think that many of us will understand trying to maintain software, especially on lunch breaks ;). It looks like you were updating and releasing versions until fairly recently, but maybe I am not understanding some things. Is someone else maintaining this? |
As previously noted, dunder_pat is still blocking some things that aren't dunders. Hence, pandas still fails a test. |
@robbmcleod what, if anything, blocks the above two fixes from being applied, to at least fix the known unnecessary breakage? (See #452 if you prefer a proper pull request.) In the longer term, it looks to me like everyone here agrees that moving to I may be interested in becoming a maintainer and/or contributing to that work, but this is not a promise at this point. |
Yes, please merge #452 and #451 (both fix literals using scientific notation, #452 adds better checking for dunder names, while #451 adds a test for numeric literals using scientific notation). As it stands, downstream packages must give a specific and not-the-latest version for |
One can also take the opportunity to produce wheels for forthcoming Python 3.12. Although 3.12 is not final yet (Oct, 2nd is the tentative date), Python folks will not be introducing ABI changes after existing 3.12rc2, so extensions built on it should work well with the forthcoming 3.12 final. Also, the NumPy team is already producing wheels for 3.12, so this dependency should be ready too. @robbmcleod I'm willing to help in doing the release in case you don't have lots of time right now. BTW, thanks for all the time that you have put in the project so far; you have done a most excellent job in maintaining the project. |
I am in the process to release 2.8.7, with the suggestions here. If you want to test how the candidate looks like, please go to #453 and give it a try. My plan is to do a release as soon as possible (hopefully by tomorrow). Also, and after talking with @robbmcleod , I have added an advert in the README where it is said that the project is looking for (much needed) new maintainers. If anyone here is ready for tackling that, please speak. Thanks! |
Maybe this issue can be closed now, right? |
I'm a bit late to the show but I only noticed yesterday that numexpr also fails with simple complex numbers like 1.0j which affects my software https://github.com/chipmuenk/pyfda. I think parsing complex numbers should be legit use case for numexpr. |
@chipmuenk: yes, that sounds like a bug, sorry. Untested fix: |
@rebecca-palmer could you open a PR adding a test for the new complex case too? thanks in advance! |
There has been an issue for this since 2018... #323 |
The evaluate() function eventually calls eval() on the data provided. eval() is extremely dangerous when supplied with user input and to my knowledge it isn't mentioned that the function does this. I would add a warning in the documentation about this. As a proof-of-concept, the following code should execute the command 'echo verybad' on your computer when ran:
The text was updated successfully, but these errors were encountered: