Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call PySys_SetArgv when initializing interpreter. #2341

Merged
merged 33 commits into from
Aug 26, 2021

Conversation

drmoose
Copy link
Contributor

@drmoose drmoose commented Jul 29, 2020

This PR is to suggest automatically calling PySys_SetArgvEx (or an equivalent function) upon initialization of a pybind11::scoped_interpreter.

Summary of Changes

  • Both initialize_interpreter and scoped_interpreter now have three additional arguments:
    • int argc = 0 and char** argv = nullptr -- argument count and values, meant to be passed in directly from the calling program's int main().
    • bool add_current_dir_to_path
  • When add_current_dir_to_path is true (the default, to try to maintain some semblance of backwards compatibility), the directory that gets added to the path will be dirname(argv[0]) rather than just .. If all arguments are supplied their default values (or argv[0] == '') this is equivalent to the current behavior.

Rationale

  1. sys.argv is an AttributeError unless PySys_SetArgvEx gets called. This creates problems for GUI python code (such as matplotlib) since there's code that reads sys.argv in the Debian/Ubuntu version of both tkinter and gdk. In both cases, there's no reason the value supplied needs to be int main()s real argv, the property just needs to exist.

  2. The existing initialize_interpreter code adds the current working directory to sys.path, which is very similar to CVE-2008-5983. cpython itself solved this problem by introducing the PySys_SetArgvEx function which takes an extra argument that specifies whether to modify the path.

  3. There's a significant amount of work involved in getting the arguments in the right format and figuring out exactly which overload to call. It took me a day of reading through the cpython git history to find all of these edge cases, so I'd like to save others the effort. In particular,

    • PySys_SetArgvEx itself only exists in python 2.6.6-2.7.x and python >= 3.1.3. For other versions, one must call the old PySys_SetArgv and then correct the CVE-2008-5983 behavior by hand.
    • PySys_SetArgvEx only supports null arguments (special-casing them to sys.argv=[""]) in python 3.8+
    • In python 3.x, PySys_SetArgv and PySys_SetArgvEx take a wchar_t** instead of a char** as their parameter, and the conversion is nontrivial
    • Python itself provides a conversion function (Py_DecodeLocale) but only in 3.5+
    • mbstowcs(nullptr, ...) isn't reliable on BSD
    • Without reading Python's source code, it's difficult to tell who owns what memory when passing a converted value.

Suggested changelog entry:

* ``pybind11::scoped_interpreter`` and ``initialize_interpreter`` have new arguments to allow ``sys.argv`` initialization.

@drmoose drmoose force-pushed the set_interpreter_argv branch from e6e1a1d to 1801811 Compare July 29, 2020 23:51
}
#if PY_MAJOR_VERSION >= 3
// SetArgv* on python 3 takes wchar_t, so we have to convert.
wchar_t** widened_argv = new wchar_t*[static_cast<unsigned>(argc)];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why static_cast<unsigned>? Is it to avoid sign conversion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Although I couldn't reproduce it on any of my local compilers, github's clang gave an error there. argc is guaranteed to be positive by the if guard above, so the cast is safe.

# if PY_MINOR_VERSION >= 5
// From Python 3.5 onwards, we're supposed to use Py_DecodeLocale to
// generate the wchar_t version of argv.
widened_argv[ii] = Py_DecodeLocale(safe_argv[ii], nullptr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're trying to avoid sign conversions, ii should be unsigned as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but that would make the loop condition a signed comparison error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be cleaner if I did

size_t argc_unsigned = static_cast<unsigned>(argc)

at the top of the function and then just used size_ts throughout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 ...nope. If I do that I get narrowing conversion errors on the PySys_SetArgv calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the impossible things to get completely right without using some casts. The array indices are size_t, yet argv is an int.

I'm not going to nag about this, but there is some inconsistency in this patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved slightly (ii is now size_t) but still a little awkward.

Copy link
Collaborator

@bstaletic bstaletic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the initial two comments outside the review.

There's a lot of manual memory management going on here. There are even some non default deletes, like PyMem_RawFree(). Then there's checking against nullptr after a new expression, which is pointless, since new throws std::bad_alloc. This should all be fixed with liberal use of std::unique_ptr.

#else
// python 2.x
# if PY_MINOR_VERSION < 6 || (PY_MINOR_VERSION == 6 && PY_MICRO_VERSION < 6)
# define NEED_PYRUN_TO_SANITIZE_PATH 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a macro?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. My worry was that if I made it a bool, a smart optimizer would be able to see that the PyRun_SimpleString call is unreachable code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really, really don't like using macros for flow control.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed how this code is gated in the rewrite; it's still gated by the PY_*_VERSION macros (because I'm not sure how to not do that), but I avoid the #define...#undef, which I agree was hideous.

// From Python 3.5 onwards, we're supposed to use Py_DecodeLocale to
// generate the wchar_t version of argv.
widened_argv[ii] = Py_DecodeLocale(safe_argv[ii], nullptr);
# define FREE_WIDENED_ARG(X) PyMem_RawFree(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be PyMem_RawFree() or PyMem_Free()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyMem_RawFree() was recommended specifically by the Py_DecodeLocale docs

@drmoose drmoose changed the title Call PySys_SetArgv when initializing interpreter. WIP: Call PySys_SetArgv when initializing interpreter. Jul 30, 2020
@drmoose
Copy link
Contributor Author

drmoose commented Jul 30, 2020

Apologies for the initial two comments outside the review.

I'm not sure what this means, but I appreciate all the comments, especially this one:

There's a lot of manual memory management going on here. There are even some non default deletes, like PyMem_RawFree(). Then there's checking against nullptr after a new expression, which is pointless, since new throws std::bad_alloc. This should all be fixed with liberal use of std::unique_ptr.

The superfluous null checking was just a dumb habit on my part (because everything in the python API needs to be null checked, I was sort-of on autopilot). The manual deletion was unbelievably ugly but it was there because I didn't have a better idea. I've rewritten the function to avoid it. Even though I'm not confident I've done this correctly (having learned just now, from you that there's an array-special-case unique_ptr<T[]>), I'm much happier with how this looks now.

};
std::vector< std::unique_ptr<wchar_t[], pymem_rawfree_deleter> > widened_argv_entries;
# else
std::vector< std::unique_ptr<wchar_t[]> > widened_argv_entries;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bstaletic is there a way to avoid the std::vector I've added here and still get smart-pointer deletion?

@drmoose drmoose changed the title WIP: Call PySys_SetArgv when initializing interpreter. Call PySys_SetArgv when initializing interpreter. Jul 30, 2020
@drmoose
Copy link
Contributor Author

drmoose commented Jul 30, 2020

I'm not sure what's going on with the CI... the mac stage changed from "skipped" to "failed", but there aren't any errors shown when I click details. Everything seems fine on my test mac.

Pushing a pointless amend to trigger a retry....

@drmoose drmoose force-pushed the set_interpreter_argv branch from 6c6598a to 0379418 Compare July 30, 2020 04:33
@drmoose drmoose force-pushed the set_interpreter_argv branch from 0379418 to 3ff967d Compare July 30, 2020 04:42
@drmoose drmoose requested a review from bstaletic July 30, 2020 05:46
@bstaletic
Copy link
Collaborator

@drmoose Sorry for the late reply. Check this out: https://gist.github.com/bstaletic/ba0afff792396ea438937948b2d80582

It's still not completely clean, but I believe it is a lot more readable.

@drmoose
Copy link
Contributor Author

drmoose commented Aug 28, 2020

Sorry it took me a while to get back to this. I've converted the changes from your gist into commits and pushed to a set_argv_bstaletic branch on my fork. I've also cherry-picked some (but not all) of those changes to this PR. Here's the commits and my rationale:

These three are unambiguously cleaner than what I had before. Although I had to do a little bit of subsequent tweaking to get something that compiles, this was great.

This one scares me a bit, since it always uses the obsolete approach. Since Python have added PySys_SetArgvEx I feel like using it where possible will be more future-proof. However, the multiline #if-gate I have picking between PySys_SetArgv and PySys_SetArgvEx is ridiculous and unreadable (which I imagine is why you decided to remove it in your version). 1d7f2b3 is an attempt at a more-readable version of the check.

Here, I suspect we just don't agree about whether auto is a good thing. I've left this one off the PR because in my view it adds more lines of code without really clarifying what's going on. If you're just vehemently opposed to auto, though, I don't really have any strong objection.

@drmoose
Copy link
Contributor Author

drmoose commented Aug 17, 2021

@rwgk It's meant to be compatible with the signature of int main(int argc, char** argv), because in order for the paths to work correctly, argv[0] must be set, and the only reliable, cross-platform way I know to get that value is to save the one passed into main when the program begins.

So, most of the time, the person trying to use pybind embed starts with a char**. If the API changed to vector<string> they'd need to convert, but converting from char** to vector<string> is nontrivial because of locales (argv may be in the end user's local codepage, std::string is expected to be UTF-8).

@rwgk
Copy link
Collaborator

rwgk commented Aug 17, 2021

Sounds good, thanks! Would it still work for your purposes if we added const as below (tested with c++11, ..., c++20)?

#include <iostream>

void foo(int argc, const char *const cargv[]) {
    for (int i = 0; i < argc; i++) {
        std::cout << cargv[i] << std::endl;
    }
}

int main(int argc, char *argv[]) {
    foo(argc, argv);
    const char *const cargv[]{"hello", "world"};
    foo(2, cargv);
    return 0;
}

@Skylion007 Skylion007 requested a review from henryiii August 17, 2021 15:02
char** safe_argv = argv;
std::unique_ptr<char*[]> argv_guard;
std::unique_ptr<char[]> argv_inner_guard;
if (nullptr == argv || argc <= 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (nullptr == argv || argc <= 0) {
if (argv == nullptr || argc <= 0) {

I prefer the variable first (such as in the second half of this line, in fact).

if (nullptr == argv || argc <= 0) {
argv_guard.reset(safe_argv = new char*[1]);
argv_inner_guard.reset(safe_argv[0] = new char[1]);
safe_argv[0][0] = '\0';
Copy link
Collaborator

@henryiii henryiii Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function really modify the char** it is passed? Here it's modifying a new one, for example. It doesn't seem directly obvious why this is not const. I found const char* const* preferable in some cases in CLI11, IIRC.

You can always be more strict; if you have a char**, you can call a const char* const* on it. But you can't (without feelling really bad about yourself about casting, anyway) go the other way.

@henryiii
Copy link
Collaborator

This patch adds const, okay to commit?

diff --git a/include/pybind11/embed.h b/include/pybind11/embed.h
index 432185e4..246122c3 100644
--- a/include/pybind11/embed.h
+++ b/include/pybind11/embed.h
@@ -96,7 +96,7 @@ struct wide_char_arg_deleter {
     }
 };

-inline wchar_t* widen_chars(char* safe_arg) {
+inline wchar_t* widen_chars(const char* safe_arg) {
 #if PY_VERSION_HEX >= 0x030500f0
     wchar_t* widened_arg = Py_DecodeLocale(safe_arg, nullptr);
 #else
@@ -115,18 +115,16 @@ inline wchar_t* widen_chars(char* safe_arg) {
 }

 /// Python 2.x/3.x-compatible version of `PySys_SetArgv`
-inline void set_interpreter_argv(int argc, char** argv, bool add_program_dir_to_path) {
+inline void set_interpreter_argv(int argc, const char* const* argv, bool add_program_dir_to_path) {
     // Before it was special-cased in python 3.8, passing an empty or null argv
     // caused a segfault, so we have to reimplement the special case ourselves.
-    char** safe_argv = argv;
-    std::unique_ptr<char*[]> argv_guard;
-    std::unique_ptr<char[]> argv_inner_guard;
-    if (nullptr == argv || argc <= 0) {
-        argv_guard.reset(safe_argv = new char*[1]);
-        argv_inner_guard.reset(safe_argv[0] = new char[1]);
-        safe_argv[0][0] = '\0';
+    bool special_case = (nullptr == argv || argc <= 0);
+
+    const char* const empty_argv[] {"\0"};
+    const char* const* safe_argv = special_case ? empty_argv : argv;
+    if(special_case)
         argc = 1;
-    }
+
 #if PY_MAJOR_VERSION >= 3
     auto argv_size = static_cast<size_t>(argc);
     // SetArgv* on python 3 takes wchar_t, so we have to convert.
@@ -174,7 +172,7 @@ PYBIND11_NAMESPACE_END(detail)
  \endrst */
 inline void initialize_interpreter(bool init_signal_handlers = true,
                                    int argc = 0,
-                                   char** argv = nullptr,
+                                   const char* const* argv = nullptr,
                                    bool add_program_dir_to_path = true) {
     if (Py_IsInitialized() != 0)
         pybind11_fail("The interpreter is already running");
@@ -258,7 +256,7 @@ class scoped_interpreter {
 public:
     scoped_interpreter(bool init_signal_handlers = true,
                        int argc = 0,
-                       char** argv = nullptr,
+                       const char* const* argv = nullptr,
                        bool add_program_dir_to_path = true) {
         initialize_interpreter(init_signal_handlers, argc, argv, add_program_dir_to_path);
     }

@Skylion007
Copy link
Collaborator

bump @drmoose, if no objections will commit the const patch.

@henryiii
Copy link
Collaborator

I'll commit the patch, and we can roll back if there are objections. But you can always be more strict, and can't be less, so I can't see why this would do anything but help.

@henryiii henryiii force-pushed the set_interpreter_argv branch 2 times, most recently from 67d8767 to fc9421e Compare August 21, 2021 18:47
@henryiii henryiii force-pushed the set_interpreter_argv branch from fc9421e to 318495e Compare August 23, 2021 03:50
@drmoose
Copy link
Contributor Author

drmoose commented Aug 23, 2021

@henryiii Sorry I didn't reply earlier; I wanted to get back to a computer before commenting because I wanted to make sure that there was a way to add const to the signature without having to const_cast it away (potential UB) for the Python 2 code path. It seems in the meantime you've solved that problem yourself with 318495e so 👍 and thank you.

/// Python 2.x/3.x-compatible version of `PySys_SetArgv`
inline void set_interpreter_argv(int argc, const char* const* argv, bool add_program_dir_to_path) {
// Before it was special-cased in python 3.8, passing an empty or null argv
// caused a segfault, so we have to reimplement the special case ourselves.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: shouldn't we add an debug assert that argc >= 0? We are casting a signed int to an unsigned size_t

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's covered by the argc <= 0 below, combined with argc = 1.

// python 2.x
std::vector<std::string> strings{safe_argv, safe_argv+argv_size};
std::vector<char*> char_strings{argv_size};
for (std::size_t i=0; i<argv_size; ++i)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this doesn't look like it's clang-formatted properly @henryiii

Ralf W. Grosse-Kunstleve added 2 commits August 26, 2021 12:35
The only manual change is an added empty line between pybind11 and system `#include`s.

```
git diff -U0 --no-color master | python3 $HOME/clone/llvm-project/clang/tools/clang-format/clang-format-diff.py -p1 -style=file -i
```
@rwgk
Copy link
Collaborator

rwgk commented Aug 26, 2021

Thanks @drmoose and all reviewers! Merging.

@rwgk rwgk merged commit 930bb16 into pybind:master Aug 26, 2021
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Aug 26, 2021
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants