Undefined C behavior going beyond end of struct via a [1] arrays (C99 flexible arrays) #84301
Comments
The correct C99 way to do this is using a char[]. PyBytesObject and unicode's struct encoding_map both do this. Unclear to me if we should backport this to earlier versions or not (because PyBytesObject may be exposed?) Probably, but I also doubt it is a big deal as compilers are generally not _yet_ making use of this detail AFAIK. |
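As a minimal sketch of the C99 form mentioned above: the struct below is a hypothetical stand-in (`blob_t`, `blob_new` are illustration names, not CPython's real layout or API). It shows that with a flexible array member the allocation size can be computed from `offsetof`, with no `- 1` fixup as in the legacy `[1]` idiom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a bytes-like object; NOT the real PyBytesObject. */
typedef struct {
    size_t size;   /* number of payload bytes, excluding the trailing NUL */
    char data[];   /* C99 flexible array member: must be the last member */
} blob_t;

/* Allocate a blob holding a copy of `len` bytes from `src` plus a NUL.
   Sketch only: no overflow check on `len + 1`. */
blob_t *blob_new(const char *src, size_t len)
{
    assert(src != NULL);
    /* offsetof() is the exact start of the payload inside the allocation. */
    blob_t *b = malloc(offsetof(blob_t, data) + len + 1);
    if (b == NULL) {
        return NULL;
    }
    b->size = len;
    memcpy(b->data, src, len);
    b->data[len] = '\0';
    return b;
}
```

A caller would use `blob_new("OK", 2)` and release the result with `free()`. Indexing `b->data[i]` for `i <= b->size` stays within the declared object, which is the point of preferring `[]` over `[1]` in C.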
From the PR comment thread (as I opened that first):
"""Well, there was no other choice in ISO C89 than using char ob_sval[1];, no? Is char ob_sval[]; supported by the C compilers that CPython supports, such as Visual Studio, GCC, clang and xlc (AIX)? (I don't think that we officially support xlc, but it's more "best effort" support.) You can use the new buildbot label to test your change on more platforms.""" - vstinner
Per https://www.python.org/dev/peps/pep-0007/ we require some C99 features as of CPython 3.6. It does not currently list flexible array members. I'll be very surprised if we find any compiler that does not support this. I'll run this through the buildbot testing as you suggested and, assuming nothing important falls out, see that we add this to the C99 required feature list. |
It may be worth considering C-API extensions written in C++. Flexible array members are not part of the C++ standard, although GCC, Clang, and MSVC support them as an extension. GCC and Clang will issue warnings with Note that GCC also explicitly supports trailing one-element arrays (the current pattern) as an extension. https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html |
By the way, the current trend is to make more and more structures opaque in the C API. See for example: |
""" PyObject *f_localsplus[1]; in PyFrameObject Do we need to update all of them? Do you want to update them all? I believe we probably do. But I suggest multiple PRs. I'll update the issue title. I'm also going to ask clang/llvm folks that prompted me to look into this for comments. |
I concur with Sam that we should keep compatibility with C++ in header files. |
Isn't the only actual way for a C .h file to be compatible with C++ via extern "C" { which all of our non-meta include files appear to have already? |
The following C++ code fails to build:
#ifdef __cplusplus
# include <cstdlib>
#else
# include <stdlib.h>
#endif
#ifdef __cplusplus
extern "C" {
#endif
typedef struct {
    int x;
    int y;
    char array[];
} mystruct_t;
#ifdef __cplusplus
}
#endif
int main()
{
    size_t size = 2;
    /* A flexible array member adds nothing to sizeof, so no "- 1" is needed. */
    mystruct_t *obj = (mystruct_t *)malloc(sizeof(mystruct_t) + size);
    obj->array[0] = 'O';
    obj->array[1] = 'K';
    free(obj);
    return 0;
}
Error:
$ LANG= g++ -pedantic -Werror x.cpp
x.c:14:10: error: ISO C++ forbids flexible array member 'array' [-Werror=pedantic]
14 | char array[];
| ^~~~~
cc1plus: all warnings being treated as errors |
Modules/hashtable.c and Modules/hashtable.h use a different approach. The variable-size data is *not* part of the structure:
typedef struct {
    /* used by _Py_hashtable_t.buckets to link entries */
    _Py_slist_item_t _Py_slist_item;
    Py_uhash_t key_hash;
} _Py_hashtable_entry_t;
In memory, we have: [_Py_slist_item, key_hash, key, data], where the key size is table->key_size bytes (stored only in the table, not in each entry). Pointers to the key and data are computed with these macros:
#define _Py_HASHTABLE_ENTRY_PKEY(ENTRY) \
        ((const void *)((char *)(ENTRY) \
                        + sizeof(_Py_hashtable_entry_t)))
#define _Py_HASHTABLE_ENTRY_PDATA(TABLE, ENTRY) \
        ((const void *)((char *)(ENTRY) \
                        + sizeof(_Py_hashtable_entry_t) \
                        + (TABLE)->key_size))
But this approach is more annoying to use: it requires pointer arithmetic and ugly macros like these. |
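The layout described above, a fixed header followed by payload bytes that are not a struct member, can be sketched roughly as follows. All names here (`entry_t`, `entry_key`, `entry_data`, `entry_new`) are hypothetical, and unlike hashtable.c this sketch stores the key size in each entry to stay self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size header; the variable-size payload is NOT a member. */
typedef struct {
    size_t key_size;
    size_t data_size;
} entry_t;

/* Memory layout of one allocation: [entry_t][key bytes][data bytes]. */
void *entry_key(entry_t *e)
{
    return (char *)e + sizeof(entry_t);
}

void *entry_data(entry_t *e)
{
    return (char *)e + sizeof(entry_t) + e->key_size;
}

/* Allocate header and payload together, then copy key and data in. */
entry_t *entry_new(const void *key, size_t key_size,
                   const void *data, size_t data_size)
{
    assert(key != NULL && data != NULL);
    entry_t *e = malloc(sizeof(entry_t) + key_size + data_size);
    if (e == NULL) {
        return NULL;
    }
    e->key_size = key_size;
    e->data_size = data_size;
    memcpy(entry_key(e), key, key_size);
    memcpy(entry_data(e), data, data_size);
    return e;
}
```

Because the data bytes may start at an unaligned address, callers of this sketch should copy values out with `memcpy` rather than dereference a cast pointer. The struct itself contains no array member, so it is accepted as-is by a C++ compiler.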
How is it an undefined C behavior? It works well in practice, no? |
AFAIK extern "C" only affects mangling of function names. Because of overloading in C++ you can have several functions with the same name, and to distinguish "int abs(int)" from "float abs(float)" the C++ compiler mangles function names, that makes them incompatible with C. |
Famous last words ;) |
What I'm hearing from talking to our C++ compiler team is unfortunately sad: the C++ standard deliberately does not support flexible array member syntax because it leads to problems specific to C++ (e.g. what do "new" and "delete" do?). So some compilers will reject such code (just as some accept it, treating it as C99 does). That means we can't do this in any public header file.
One workaround would indeed be to do something similar to that hashtable code, but it is quite annoying, and I don't know that we could actually change the definition of PyBytesObject that way, as its internals could be referenced externally. (Though all the bytes should line up regardless, so even macros compiled before and after such a change would be compatible?)
Within our internal private pure C code we could move to use this feature; things in .h files are the cross-language issue.
Anyway, I'm following up internally to better understand the motivation for wanting code to not use the "it's worked forever", technically undefined behavior of the trailing [1] member and out-of-bounds access. Pondering, I wonder if this could turn into a "-fwrapv" style of situation: we depend on that behavior working, so we adopted the compiler flag when compilers started to care. So at most we might some day need to pass another compiler flag to ensure it stays? We'll see.
I'm inclined not to move forward with my PRs for now. |
For me, the sanest option is to make structures opaque in the C API, and then use flexible array members. I did something similar for atomic types. First, we got tons of build issues with various C compilers and then with C++ compilers. I moved the header to our "internal C API", so basically I removed it from the public C API. Since then, we have stopped getting bug reports about pyatomic.h :-) |
agreed, being opaque seems ideal. |
PyBytesObject could become a struct that contains a single opaque internal struct. There is some code out there that references PyBytesObject's internals by field name, but my searches across a broad swath of code so far suggest that it is so rare that this change may be plausible (more research needed). People are using the PyBytes_ macros and APIs. Yay! |
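The opaque-struct idea can be sketched in plain C as below. The names (`opaque_bytes` and its functions) are hypothetical and only illustrate the pattern; they are not a proposed CPython API. The first part is what a public header would expose, and the struct definition with its flexible array member would live in a private .c file that C++ callers never compile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* --- Public header part: an incomplete type, valid in both C and C++ --- */
typedef struct opaque_bytes opaque_bytes;
opaque_bytes *opaque_bytes_new(const char *src, size_t len);
const char *opaque_bytes_as_string(const opaque_bytes *b);
size_t opaque_bytes_size(const opaque_bytes *b);
void opaque_bytes_free(opaque_bytes *b);

/* --- Private C file part: the layout is free to use a C99 FAM --- */
struct opaque_bytes {
    size_t size;
    char data[];   /* never visible to code that includes only the header */
};

opaque_bytes *opaque_bytes_new(const char *src, size_t len)
{
    assert(src != NULL);
    opaque_bytes *b = malloc(offsetof(opaque_bytes, data) + len + 1);
    if (b == NULL) {
        return NULL;
    }
    b->size = len;
    memcpy(b->data, src, len);
    b->data[len] = '\0';
    return b;
}

const char *opaque_bytes_as_string(const opaque_bytes *b) { return b->data; }
size_t opaque_bytes_size(const opaque_bytes *b) { return b->size; }
void opaque_bytes_free(opaque_bytes *b) { free(b); }
```

The trade-off in this sketch is that every field access becomes a real function call instead of an inlined macro or member access, so hot paths may pay a small cost in exchange for a header that no longer exposes the layout.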
Another possibility yet would be:
typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;
    char ob_sval;
} PyBytesObject;
#define PyBytes_AS_STRING(op) (assert(PyBytes_Check(op)), \
                               &(((PyBytesObject *)(op))->ob_sval))
Not sure whether that would be UB... |
How about:
#ifdef __cplusplus
    char array[1];
#else
    char array[];
#endif
? |
#ifdef __cplusplus
    char array[1];
#else
    char array[];
#endif
Does it change the size of the structure between C and C++, i.e. sizeof(PyBytesObject)? Also, does the size of the structure matter? :-) I guess that the important part is the ABI: the offsets of the members. |
off the top of my head that might actually work as I _think_ "empty" things are required to consume an unused byte of size no matter what meaning sizeof shouldn't change? Some testing and standards perusing for C99 is in order to confirm that though. |
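One way to test the question rather than guess is a small layout check. The two structs below are hypothetical stand-ins mirroring the earlier `mystruct_t` example; the sketch compares the offset of the trailing array and the overall `sizeof` for the `[1]` and `[]` variants. As far as I understand C99 6.7.2.1, `sizeof` ignores the flexible array member apart from padding, so the sizes can differ even when the offsets match, which is why checking is worthwhile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two variants under discussion. */
typedef struct { int x; int y; char array[1]; } with_one_t;  /* legacy idiom */
typedef struct { int x; int y; char array[];  } with_fam_t;  /* C99 FAM */

/* Returns 1 when the trailing array starts at the same offset in both. */
int same_array_offset(void)
{
    return offsetof(with_one_t, array) == offsetof(with_fam_t, array);
}

/* Returns 1 when both variants report the same sizeof. */
int same_sizeof(void)
{
    return sizeof(with_one_t) == sizeof(with_fam_t);
}

/* Aborts if the member offsets, which are what the ABI relies on, differ. */
void check_layout(void)
{
    assert(same_array_offset());
}
```

On common ABIs with a 4-byte `int`, I would expect the offsets to match while `sizeof` differs (the `[1]` variant includes the element plus padding). Code that sizes allocations from `sizeof` would therefore compute different sizes under the two definitions, whereas code using `offsetof` would not.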
See also issue #93585 "PyCode_Type.tp_basictype change in Python 3.11 broke Cython". |
(Coming from #94250 which was marked as a duplicate) Bump.
The g++ #ifdef __GNUC__
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpedantic"
#elif defined(_MSC_VER)
#pragma warning(push)
#pragma warning(disable:4200) // nonstandard extension used: zero-sized array in struct/union
#endif
typedef struct {
int x;
int y;
char array[];
} mystruct_t;
#ifdef __GNUC__
#pragma GCC diagnostic pop
#elif defined(_MSC_VER)
#pragma warning(pop)
#endif clang |
GCC |
In Python 3.12, there are many bytes objects which are initialized statically in Include/internal/pycore_runtime_init.h. Problem: if PyBytesObjects ends with One option is to declare a different but similar structure ending with It seems that such a cast is not undefined behavior. Full example:
// clang x.c -o x -O2 -fsanitize=undefined && ./x
// gcc x.c -o x -O2 -fsanitize=undefined && ./x
#include <stdio.h>
typedef struct {
    int size;
    char ob_sval[];
} PyBytesObject;
int main() {
    struct {
        int size;
        char ob_sval[3];
    } raw_abc = { .size = 1, .ob_sval = { 'a', 'b', 'c' } };
    PyBytesObject *abc = (PyBytesObject *)&raw_abc;
    printf("bytes: \"%c%c%c\"\n", abc->ob_sval[0], abc->ob_sval[1], abc->ob_sval[2]);
    return 0;
}
Test with gcc (GCC) 12.1.1 and clang version 14.0.0:
Note: GCC has an extension allowing to initialize |
This is already done for statically initializing ascii strings (identifiers) and it works. |
LPC 2022 "Where are we on security features?" is related to this. https://www.youtube.com/watch?v=L2Ydq3iFwsY&t=5550s mentions a |
The Linux kernel has a similar issue with flexible arrays: https://lwn.net/Articles/908817/ I proposed PR #97017 to convert the PyBytes_AS_STRING() static inline function to a regular function. That would make it possible to make the PyBytesObject structure opaque and would avoid the issue, at least in the public C API. But this PR might be rejected because of its impact on performance. |