gh-116738: Make _json module safe in the free-threading build #119438

eendebakpt · 2024-05-22T21:00:21Z

(updated description)

Writing JSON files (or encoding to a string) is not thread-safe in the sense that, when data is encoded to JSON while another thread is mutating it, the result is not well-defined (this is true for both the normal and the free-threading build). In addition, the free-threading build can crash the interpreter while writing JSON because of the use of functions like PySequence_Fast_GET_ITEM. In this PR we make the free-threading build safe by adding locks in three places in the JSON encoder.

Reading from a JSON file is safe: objects constructed are only known to the executing thread. Encoding data to JSON needs a bit more care: mutable Python objects such as a list or a dict could be modified by another thread during encoding.
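As a small illustration of why decoding is the easy case, the sketch below decodes the same string from several threads. The names PAYLOAD and decode are only for this example. Every call to json.loads builds fresh containers, so there is nothing shared for another thread to mutate.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Illustrative payload; any valid JSON text would do.
PAYLOAD = '{"items": [1, 2, 3], "nested": {"ok": true}}'

def decode(_):
    # Each call builds brand-new dict/list objects that only the calling
    # thread holds a reference to.
    return json.loads(PAYLOAD)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(decode, range(8)))

# Equal in value, but no two results share the same container object.
all_equal = all(r == results[0] for r in results)
distinct_objects = len({id(r["items"]) for r in results}) == len(results)
print(all_equal, distinct_objects)
```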

When encoding a list, we use Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST to protect the list against mutation.
When encoding a dict, we use a critical section for iteration over exact dicts (PyDict_Next is used there). The non-exact dicts use PyMapping_Items to create a list of tuples. PyMapping_Items itself is assumed to be thread safe, but the resulting list is not a copy and can be mutated.
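The locks described above live inside the C encoder and only prevent crashes. Code that also needs a well-defined result still has to synchronize on its own side. A minimal sketch of that caller-side pattern, assuming an application-level lock (data_lock, writer and snapshot are hypothetical names, not part of this PR):

```python
import json
import threading

data = []
data_lock = threading.Lock()  # hypothetical application-level lock

def writer(n):
    for i in range(n):
        with data_lock:
            # Two appends form one logical update; a reader holding the
            # lock never observes only the first of them.
            data.append(i)
            data.append(i)

def snapshot():
    # Encode while holding the same lock the writers use.
    with data_lock:
        return json.dumps(data)

threads = [threading.Thread(target=writer, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
dumps = [snapshot() for _ in range(50)]
for t in threads:
    t.join()
dumps.append(snapshot())

# Every encoded list has an even length, so no update was seen half-done.
lengths = [len(json.loads(s)) for s in dumps]
consistent = all(n % 2 == 0 for n in lengths)
print(consistent, lengths[-1])
```

The price is that writers block for the duration of every dump, which is the kind of performance cost the encoder itself does not impose.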

Issue: Audit all built-in modules for thread safety #116738

The script below was used to test the free-threading implementation. Similar code was added to the tests.

Test script

import json
from threading import Thread
import time

class JsonThreadingTest:
    
    def __init__(self, number_of_threads=4, number_of_json_dumps=10):
    
        self.data = [ [], [], {}, {}, {}]
        self.json = {str(ii): d for ii, d in enumerate(self.data)}
        self.results =[]
        self.number_of_threads=number_of_threads
        self.number_of_json_dumps =number_of_json_dumps
            
    def modify(self, index):
        while self.continue_thread:
            for d in self.data:
                if isinstance(d, list ):
                    if len(d)>20:
                        d.clear()
                    else:
                        d.append(index)
                else:
                    if len(d)>20:
                        try:
                            d.pop(list(d)[0])
                        except KeyError:
                            pass
                    else:
                        if index%2:                            
                            d[index] = index
                        else:
                            d[bytes(index)] = bytes(index)
                    
    def test(self):
        self.continue_thread = True
        self.modifying_threads = []
        for ii in range(self.number_of_threads):
            t = Thread(target=self.modify, args=[ii])
            self.modifying_threads.append(t)

        self.results.clear()
        for t in self.modifying_threads:
            print(f'start {t}')
            t.start()
            
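        for ii in range(self.number_of_json_dumps):
            print(f'dump {ii}')
            time.sleep(0.01)

            indent = ii if ii%3==0 else None
            # Every fifth dump skips the non-serializable bytes keys; the other
            # dumps raise TypeError as soon as such a key is present.
            skipkeys = ii%5==0
            try:
                j = json.dumps(self.data, indent=indent, skipkeys=skipkeys)
            except TypeError:
                continue
            self.results.append(j)
        self.continue_thread= False
        # Wait for the modifying threads to finish before reporting.
        for t in self.modifying_threads:
            t.join()

        print([hash(r) for r in self.results])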
            


t=JsonThreadingTest(number_of_json_dumps=102, number_of_threads=8)
t0=time.time()
t.test()
dt=time.time()-t0
print(t.results[-1])        
print(f'Done: {dt:.2f}')

With t=JsonThreadingTest(number_of_json_dumps=102, number_of_threads=8), the test script is a factor of 25 faster with free-threading. Nice!

nineteendo · 2024-05-22T21:50:01Z

You need to include the file that defines that macro.

Modules/_json.c

nineteendo

Revert newlines

Modules/_json.c

Co-authored-by: Nice Zombies <[email protected]>

Include/internal/pycore_critical_section.h

nineteendo

Looks good, maybe add a comment why we don't lock when using PyMapping_Items.
Should we also make the Python implementation thread safe?

On a side note, how would this be ported to a fork of the _json module?

eendebakpt · 2024-08-15T18:59:56Z

Looks good, maybe add a comment why we don't lock when using PyMapping_Items. Should we also make the Python implementation thread safe?

On a side note, how would this be ported to a fork of the _json module?

@nineteendo Thanks for the questions. The result of PyMapping_Items (a list of tuples) can still be mutated from other threads (it is not a copy of the items), so it needs to be protected by a lock. I have added this to the PR.
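To make the "not a copy" point concrete, here is a Python-level sketch. CachedItemsDict is a hypothetical class written for this example. As far as I understand, for a non-exact dict PyMapping_Items calls the object's items() method, so a subclass like this one can hand the encoder a list that other code still holds a reference to.

```python
class CachedItemsDict(dict):
    """Hypothetical dict subclass whose items() hands out an internal list."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._items = [(k, v) for k, v in dict.items(self)]

    def items(self):
        # Returns the same list object on every call, not a fresh copy.
        return self._items

d = CachedItemsDict(a=1)
first = d.items()
second = d.items()
same_object = first is second

# Any holder of the list (for example another thread) can change it
# while a consumer is still working with it.
second.append(("b", 2))
visible_to_first = first == [("a", 1), ("b", 2)]
print(same_object, visible_to_first)
```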

It could very well be that the Python implementation of the JSON encoder is already safe to use under free threading (in the sense that the interpreter does not crash); to my knowledge, most Python statements and builtins have already been made thread-safe. If not, that should be addressed in a separate PR.

I do not fully understand the question about porting. Is it not up to the person who forked _json to decide if and how to port any changes?

nineteendo · 2024-08-16T07:46:09Z

It could very well be that the Python implementation of the JSON encoder is already safe to use

I was thinking about race conditions: we first check if the container is empty and only iterate later over the items.
So, we could probably end up with this instead of []:

[
    
]

Is it not up to the person who forked _json to decide if and how to port any changes?

Well, I'm that person. See https://github.com/nineteendo/jsonyx. The main logic is mostly untouched. Can I use the public API for the critical sections?

eendebakpt · 2024-08-16T10:57:30Z

@nineteendo The implementation of free-threading (see PEP 703) is still a work in progress, so things may change. Currently, the critical-section macros such as Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST are in the internal API (see https://github.com/python/cpython/tree/main/Include#the-python-c-api), so they are not part of the public API. If you want to keep supporting the fork under free-threading, I think it is best to ask for advice on Discourse.

About the race condition: I think there are no guarantees for the result of the JSON encoder when the data to be encoded is mutated. So yes, race conditions can occur, and depending on how the data is mutated the JSON output may differ. But this is accepted behaviour. The goal of this PR is to prevent the interpreter from crashing.

nineteendo · 2024-08-17T07:29:21Z

I think there are no guarantees for the result of the json encoder when the data to be encoded is mutated.

Shouldn't the empty list always be on a single line, that is, without indentation? You can test this by overriding dict.__len__() and list.__len__():

import io
import json

class BadDict(dict):
    def __len__(self) -> int:
        return 1

class BadList(list):
    def __len__(self) -> int:
        return 1

fp = io.StringIO()
json.dump([BadDict(), BadList()], fp, indent=4)
print(fp.getvalue())

Oh well, I managed to output invalid JSON. Some assumptions shouldn't be made.

eendebakpt · 2024-08-19T21:48:35Z

@nineteendo Interesting example. Note that for recent Python versions (in particular the current main branch) the JSON output depends on whether dump or dumps is used:

import io
import json

class BadDict(dict):
    def __len__(self) -> int:
        return 1

class BadList(list):
    def __len__(self) -> int:
        return 1

s=json.dumps([ BadList([])],  indent=4)
print(s)

print('--')
fp = io.StringIO()
json.dump([BadList()], fp, indent=4)
print(fp.getvalue())

has output

[
    []
]
--
[

    ]
]

This has something to do with the _one_shot parameter being passed around and json.dump not using the C encoder. I will investigate why this is.

The good thing is that while your example produces funny results, the interpreter does not crash (I checked the C code to make sure the bad list and bad dict are handled correctly).
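For reference, a short sketch of the two code paths involved. json.dumps goes through JSONEncoder.encode, which requests the one-shot mode, while json.dump writes the chunks produced by JSONEncoder.iterencode, which runs in streaming mode. The data value is only an example. For well-behaved containers the two paths agree, and the difference shows up only with objects like BadList.

```python
import io
import json

data = {"a": [1, 2, []], "b": {}}

# json.dumps() goes through JSONEncoder.encode(), the one-shot path.
one_shot = json.dumps(data, indent=4)

# json.dump() writes the chunks from JSONEncoder.iterencode() one by one.
fp = io.StringIO()
json.dump(data, fp, indent=4)
streamed = fp.getvalue()

# The same chunks, obtained directly from the public iterencode() method.
chunks = list(json.JSONEncoder(indent=4).iterencode(data))

same_output = one_shot == streamed == "".join(chunks)
print(same_output, len(chunks))
```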

nineteendo · 2024-08-20T07:47:37Z

This has something to do with the _one_shot parameter being passed around and json.dump not using the C encoder. I will investigate why this is.

It doesn't use the C encoder because it uses more memory than streaming to the file, but the Python implementation is 4x as slow. I thought about rewriting the C code to use streaming, but it will probably be slower as that would wrap _PyUnicodeWriter instead of using it directly. I opted to always use the C encoder.

eendebakpt · 2024-08-20T07:59:35Z

This has something to do with the _one_shot parameter being passed around and json.dump not using the C encoder. I will investigate why this is.

It doesn't use the C encoder because it uses more memory than streaming to the file, but the Python implementation is 4x as slow. I thought about rewriting the C code to use streaming, but it will probably be slower as that would wrap _PyUnicodeWriter instead of using it directly. I opted to always use the C encoder.

You are right. Using the C encoder works (I tried locally and it passes all the tests), but it would indeed use more memory. It would be nice to rewrite the C code so that it can work in streaming mode, but that is for another PR.

nineteendo · 2024-08-20T08:17:43Z

I think we should create a separate issue. Do we fix the race condition or just subclasses (like float and int)?
Fixing the race condition would create a shallow copy of the base class, while fixing just subclasses would use base class methods.

eendebakpt · 2024-08-20T17:14:10Z

I think we should create a separate issue. Do we fix the race condition or just subclasses (like float and int)? Fixing the race condition would create a shallow copy of the base class, while fixing just subclasses would use base class methods.

I am a bit lost here. Which race condition do you mean?

eendebakpt · 2024-08-20T19:07:51Z

The Python implementation first checks whether the list is empty and then iterates over it, instead of making a shallow copy of the list, checking the length of the copy, and iterating over that. A different thread could probably empty the list between these two statements (which is what the subclass simulates).
if not lst:
    yield "[]"
    return

time.sleep(10) # allow thread to modify the list
for value in lst:
    ...
My question is: do we fix just the broken subclass or also this?

In my opinion there is nothing to fix: when different threads are mutating the underlying data, we give no guarantees about the output, but we do guarantee that the Python interpreter will not crash. The Python implementation will not crash (since all individual Python statements are safe). In this PR we modify the C implementation so that no crashes can occur. On the C side we want to make sure that, if the underlying list is emptied, we do not index into deallocated memory (which would crash the interpreter). (Note: for the JSON encoder, the unsafe C function for list access is PyList_GET_ITEM.)

There are some other PRs addressing safety under the free-threading builds, and the feedback there was similar: address the crashes, but don't guarantee correct output, since that would come at the cost of performance. See #120496 for example.
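For completeness, the shallow-copy alternative raised earlier in this thread could look roughly like the sketch below. It is not what this PR does. iterencode_flat_list is a hypothetical helper that only handles a flat list of JSON scalars, and BadList is the subclass from the earlier example. Because the emptiness check and the iteration both use the copy, a lying __len__ or a concurrent clear() can no longer split the two decisions.

```python
import json

def iterencode_flat_list(lst, indent=4):
    """Hypothetical encoder for a flat list of JSON scalars."""
    items = list(lst)  # shallow copy: later mutation of lst cannot affect us
    if not items:      # emptiness is decided on the copy, not on lst
        yield "[]"
        return
    pad = " " * indent
    yield "[\n"
    for i, value in enumerate(items):
        yield pad + json.dumps(value) + (",\n" if i < len(items) - 1 else "\n")
    yield "]"

class BadList(list):
    def __len__(self):
        return 1

empty = "".join(iterencode_flat_list(BadList()))
full = "".join(iterencode_flat_list([1, "a", None]))
print(empty)
print(full)
```

The copy costs extra time and memory for every container, which is the performance trade-off mentioned above.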

nineteendo · 2024-08-20T19:55:29Z

There's a precedent for guarding against a broken int.__repr__() and float.__repr__(), so I've created an issue: #123183.

Make the _json module thread safe

da0e917

bedevere-app bot added the awaiting review label May 22, 2024

bedevere-app bot mentioned this pull request May 22, 2024

Audit all built-in modules for thread safety #116738

Open

eendebakpt commented May 22, 2024

Modules/_json.c

eendebakpt added 2 commits May 23, 2024 00:10

Update Modules/_json.c

3797dfa

handle goto and return statements

366654c

eendebakpt mentioned this pull request May 24, 2024

PySequence_Fast needs new macros to be safe in a nogil world #119247

Open

nineteendo reviewed May 25, 2024

Modules/_json.c

Modules/_json.c

Apply suggestions from code review

5b72cdf

Co-authored-by: Nice Zombies <[email protected]>

eendebakpt commented May 25, 2024

Include/internal/pycore_critical_section.h

eendebakpt added 5 commits May 25, 2024 12:47

Update Include/internal/pycore_critical_section.h

c4c24c3

rename macro

370191b

Merge branch 'main' into json_ft

93c4466

fix typo

eafd3c1

Merge branch 'json_ft' of github.com:eendebakpt/cpython into json_ft

daeec46

eendebakpt changed the title from "Draft: gh-116738: Make _json module thread-safe #117530" to "gh-116738: Make _json module thread-safe #117530" on May 31, 2024

eendebakpt changed the title from "gh-116738: Make _json module thread-safe #117530" to "gh-116738: Make _json module thread-safe" on May 31, 2024

eendebakpt and others added 4 commits June 4, 2024 15:26

fix missing to exit critical section

d54baf2

revert changes to tests

e5fa305

📜🤖 Added by blurb_it.

d4ddf5d

Merge branch 'main' into json_ft

67d942f

eendebakpt mentioned this pull request Jun 16, 2024

gh-120496: Make enum_iter thread safe #120591

Open

eendebakpt changed the title from "gh-116738: Make _json module thread-safe" to "gh-116738: Make _json module safe in the free-threading build" on Aug 14, 2024

eendebakpt added 6 commits August 14, 2024 22:30

Merge branch 'main' into json_ft

4ffc1b2

sync with main

384ca59

sync with main

64e20aa

update news entry

e6ce9c9

fix normal build

34885a0

Merge branch 'main' into json_ft

2fe760b

eendebakpt requested a review from nineteendo August 14, 2024 21:00

nineteendo reviewed Aug 15, 2024

eendebakpt added 3 commits August 15, 2024 17:42

add lock around result of PyMapping_Items

eebccac

add tests

db8947c

fix argument of Py_END_CRITICAL_SECTION_SEQUENCE_FAST

c19ad14

nineteendo mentioned this pull request Sep 5, 2024

Make c accelerator thread safe when GIL is disabled nineteendo/jsonyx#7

Open
