Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential regression in Python 3.11 (multiprocess shutdown?) #97641

Closed
carltongibson opened this issue Sep 29, 2022 · 24 comments
Closed

Potential regression in Python 3.11 (multiprocess shutdown?) #97641

carltongibson opened this issue Sep 29, 2022 · 24 comments
Assignees
Labels
3.11 only security fixes 3.12 bugs and security fixes extension-modules C modules in the Modules dir OS-windows topic-sqlite3

Comments

@carltongibson
Copy link
Contributor

Running the Django test suite against the Python 3.11 pre-releases, we have hit
a potential regression.

Steps to reproduce

git clone https://github.com/django/django.git
python3.11 -m venv env
.\env\Scripts\Activate.ps1
python -m pip install -e django
cd django\tests
python runtests.py --parallel

Bug report

On Python 3.8, 3.9, and 3.10 this runs without problem.

On Python 3.11 the following error in seen after the test suite completes,
during shutdown:

Traceback (most recent call last):
  File "C:\Users\carlt\src\django\tests\runtests.py", line 773, in <module>
    failures = django_tests(
               ^^^^^^^^^^^^^
  File "C:\Users\carlt\src\django\tests\runtests.py", line 432, in django_tests
    failures = test_runner.run_tests(test_labels)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlt\src\django\django\test\runner.py", line 1057, in run_tests
    self.teardown_databases(old_config)
  File "C:\Users\carlt\src\django\django\test\runner.py", line 980, in teardown_databases
    _teardown_databases(
  File "C:\Users\carlt\src\django\django\test\utils.py", line 378, in teardown_databases
    connection.creation.destroy_test_db(
  File "C:\Users\carlt\src\django\django\db\backends\base\creation.py", line 307, in destroy_test_db
    self._destroy_test_db(test_database_name, verbosity)
  File "C:\Users\carlt\src\django\django\db\backends\sqlite3\creation.py", line 111, in _destroy_test_db
    os.remove(test_database_name)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'other_1.sqlite3'

This looked similar to us to open issue #95027 but we were asked to report it
separately.

I've tested all the way back to a1 where (along with other issues now resolved)
this error still occurs:

(django-py311) PS C:\Users\carlt\src\django\tests> python --version
Python 3.11.0rc2
...
(django-py311) PS C:\Users\carlt\src\django\tests> python --version
Python 3.11.0a1

This is somewhat frustrating as we've tried to test on all platforms since the
first releases.
Our test suite would only run on Windows with Python 3.11 very recently as
there was a third-party dependency that was not compatible with Windows. We
will try to adjust to test without dependencies as well on Windows for future
versions. (Sorry about that.)

Please do let us know if we can provide further info. I imagine though the
easiest thing is for you to run this yourself.

Thanks.

//cc @felixxm

Your environment

  • CPython versions tested on: 3.11 pre-releases from a1 to rc2.
  • Operating system and architecture: Windows 11.
@carltongibson carltongibson added the type-bug An unexpected behavior, bug, or error label Sep 29, 2022
@sobolevn
Copy link
Member

sobolevn commented Sep 29, 2022

Based on C:\Users\carlt\src\django\django\db\backends\sqlite3\creation.py I think it might be sqlite related 🤔

@mdboom mdboom added the 3.11 only security fixes label Sep 29, 2022
@corona10
Copy link
Member

corona10 commented Oct 2, 2022

cc @erlend-aasland

@erlend-aasland
Copy link
Contributor

I don't have access to a PC with Windows 11, so unfortunately I am unable debug this :( If someone with Windows 11 could do a bisect, that would be super helpful.

A lot of changes happened between 3.10 entered the beta phase and the first 3.11 alpha was released:

$ git log --oneline --until=2021-10-05 --since=2021-05-03 Modules/_sqlite | wc -l
56

@carltongibson
Copy link
Contributor Author

If someone with Windows 11 could do a bisect, that would be super helpful.

Hi @erlend-aasland I've given this a try but I'm hitting various (A, for earlier commits) tool-chain and (B, for later commits) linker errors trying to build previous versions that (frankly) I'm not familiar enough with Windows, or building CPython for that matter, to be able to progress with in a reasonable timeframe.

I was able to reproduce the issue at 0474d06, but I'm getting build errors by a3c11ce. A Windows expert would likely have more success.

(Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.)

@carltongibson
Copy link
Contributor Author

carltongibson commented Oct 4, 2022

If someone with Windows 11 could do a bisect, that would be super helpful.

Hi @erlend-aasland I've given this a try but I'm hitting various (A, for earlier commits) tool-chain and (B, for later commits) linker errors trying to build previous versions that I'm not familiar enough with Windows, or building CPython for that matter, to be able to progress with in a reasonable timeframe.

I was able to reproduce the issue at 0474d06, but I'm getting build errors by a3c11ce. A Windows expert would likely have more success.

I'm happy to keep trying, but I've reached a bit of a roadblock with building Python failing

(Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.)

@felixxm
Copy link
Contributor

felixxm commented Oct 5, 2022

Can someone add OS-Windows label? 🪟

@felixxm
Copy link
Contributor

felixxm commented Oct 5, 2022

I'm not sure if it's related 🤔 but after forcing spawn on Linux, I'm getting:

Exception ignored in: <function Pool.__del__ at 0x7f5548c9d580>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 268, in __del__
ResourceWarning: unclosed running multiprocessing pool <multiprocessing.pool.Pool state=RUN pool_size=8>

/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@tkanhe
Copy link

tkanhe commented Oct 5, 2022

Just checked on following PC specifications:
Edition: Windows 11 Home Single Language
Version: 21H2
OS build: 22000.1042

with Python 3.11.0rc2, Django-4.2.dev20221005120449

Output:

`PS C:\Users\TK\Desktop> cd django\testsPS C:\Users\TK\Desktop\django\tests> python runtests.py --parallel
Testing against Django installed in 'C:\Users\TK\Desktop\django\django' with up to 16 processes
Found 16023 test(s).
Creating test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Creating test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
System check identified no issues (17 silenced).
ssssssssssssssssssssssssssssssssssssssssssssss.................................................................ss.........s.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................s..............................................s..................................................................................................................s.....................s.....sss..ss.....................................................................................................................................................................................................................................................................................................................................................................................................s.....................................................................................................................................................................................s.................................................................s................sss...s.ss.....s..ss.sss..s......ss.s.................s........s.....................ss.....................s..s....s...s.....................s..............s......s...s...ss............s.....................

Ran 16021 tests in 74.720s

OK (skipped=1828, expected failures=5)
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...`

*truncked some of .......sss........ output.

Let me know any other info/tests needed; I will be happy to help in the future regarding testing on Windows 11.

@stic
Copy link

stic commented Oct 5, 2022

Couldn't replicate on auto (16 cores laptop).
python runtests.py --parallel
Bumping it up to ~40 started hitting the sharing violation on the removal of the 'other_1.sqlite3' (weirdly only the 1st i.e. other_1).
python runtests.py --parallel 40

So confirmed on latest 3.11.0rc2.

I've disabled Windows Defender on the folder under test (django) as well as %TEMP% to make sure Windows Defender it's not causing issues. That's due to fact that I've got quite a few issues with pickling (i.e. cleaning after that) e.g.

  File "c:\users\<username>\AppData\Local\Programs\Python\Python311\Lib\shutil.py", line 623, in _rmtree_unsafe
    os.rmdir(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<username>\\AppData\\Local\\Temp\\django_jpnyol3z\\django_16co2dfg\\tmpcaxkg2x2\\test_project'

Not 100% sure if that's a good troubleshooting approach, but a few observations:
Pass that passes manages to call CloseFile from the first process, that is the process creating the other_1.sqlite3 and it's the same process that will try to delete the db file (clean-up). Crashing test run doesn't make such CloseFile call, last syscalls seen on system level are to unlock the file. In itself that shouldn't be a problem I guess (as it's the same PID trying to close / remove the file later on). However, there is quite a gap (up to 2sec on my box) between syscalls to unlock the file and close. Can't really figure it out yet (first time with Django test suit), but if you compare vs. a run on 3.10 there is a visible wait whilst: Creating test database for alias 'other'...
Working theory - we are coming back on the same PID (process) to clean-up a file, that we are still keeping open (for close operation) elsewhere?

@erlend-aasland
Copy link
Contributor

FTR, I'm in the process of installing a Windows 11 development environment now. Will hopefully be up and running by this evening, or at least tomorrow. I'll need to delve into how to debug CPython on Windows using Visual Studio, since I have little experience with both Windows and Visual Studio.

@erlend-aasland
Copy link
Contributor

cc. release-manager @pablogsal

@erlend-aasland
Copy link
Contributor

Couldn't replicate on auto (16 cores laptop). python runtests.py --parallel

Ditto (Win 11, 16 cores)

Bumping it up to ~40 started hitting the sharing violation on the removal of the 'other_1.sqlite3' (weirdly only the 1st i.e. other_1). python runtests.py --parallel 40

So confirmed on latest 3.11.0rc2.

Ditto. Bumped to 40, hit the PermissionError for other_1.sqlite3.

@erlend-aasland
Copy link
Contributor

Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself.

@erlend-aasland erlend-aasland added pending The issue will be closed if no feedback is provided and removed topic-sqlite3 type-bug An unexpected behavior, bug, or error labels Oct 18, 2022
@erlend-aasland
Copy link
Contributor

I'm for some reason unable to install Visual Studio 2015, so I'm unable to compile 3.10.0 beta1 at the moment. Thus, I'm once again unable to bisect this :(

@carltongibson
Copy link
Contributor Author

This still looks to be failing with 3.11.0, so still looks like a regression from 3.10.

@erlend-aasland erlend-aasland removed the pending The issue will be closed if no feedback is provided label Oct 25, 2022
@felixxm
Copy link
Contributor

felixxm commented Oct 26, 2022

I was able to fix this by adding an explicit connection close() (see django/django#16225). A similar issue occurred in sqlite-utils, see simonw/sqlite-utils#503. I'm not sure why it started showing up in Python 3.11 but I don't mind closing the issue.

@eryksun
Copy link
Contributor

eryksun commented Oct 26, 2022

Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself.

Sorry, that comment wasn't meant for this issue. I cannot reproduce this issue to diagnose it. For me, there's a significant window of time between when the worker processes access the "*.sqlite3" files and when the main process deletes them at shutdown. If something else in the main process sometimes has the file open without delete sharing, I haven't been able to reproduce it in 3.11. (Note that any locking/unlocking of the database file that may be observed in monitoring tools has nothing to do with a sharing violation. Byte-range locks can lead to locking violations for read and write operations, but not to a sharing violation.)

That said, for projects that are using multiprocessing for parallel testing, I think it would help if multiprocessing introduced optional support for process groups and, in Windows, job objects, in order to do a better job ensuring that the entire process tree has been given a chance to gracefully exit, and subsequently to forcefully kill any remaining processes.

@eryksun
Copy link
Contributor

eryksun commented Oct 26, 2022

I'm not sure why it started showing up in Python 3.11

3.11 switched to using functools.lru_cache for the statement cache. See #87028. This introduces a reference cycle involving the connection object, so the connection object no longer gets immediately deallocated, not until, for example, gc.collect() is called to break the cycle.

@eryksun eryksun added extension-modules C modules in the Modules dir 3.12 bugs and security fixes topic-sqlite3 labels Oct 26, 2022
@erlend-aasland
Copy link
Contributor

I'm not sure why it started showing up in Python 3.11

3.11 switched to using functools.lru_cache for the statement cache. See #87028. This introduces a reference cycle involving the connection object, so the connection object no longer gets immediately deallocated, not until, for example, gc.collect() is called to break the cycle.

Good point. I remember the Windows CI was unhappy in the first draft implementations of this enhancement. I'm not sure it is a bug that the connection object might live longer. Explicit resource control is advocated in the sqlite3 docs; closing connections explicitly is good practice.

@eryksun
Copy link
Contributor

eryksun commented Oct 27, 2022

From what I can tell, the old implementation avoids the circular reference by manually decrementing the reference count of the connection object:

/* By default, the Cache class INCREFs the factory in its initializer, and
* decrefs it in its deallocator method. Since this would create a circular
* reference here, we're breaking it by decrementing self, and telling the
* cache class to not decref the factory (self) in its deallocator.
*/
self->statement_cache->decref_factory = 0;
Py_DECREF(self);

I suppose you could have kept the statement cache type as a skeleton that uses an lru_cache.

@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 28, 2022

From what I can tell, the old implementation avoids the circular reference by manually decrementing the reference count of the connection object

True, it used that hack. Part of my motivation for getting rid of the (duplicate) LRU cache implementation in the _sqlite extension module was getting rid of this hack.

I suppose you could have kept the statement cache type as a skeleton that uses an lru_cache.

That would have been possible, but you'd keep the GC hack. I'm not sure that is a very good idea.

skshetry added a commit to skshetry/dvc that referenced this issue Nov 10, 2022
There seems to be a regression in Python 3.11, where the sqlite
connections are not deallocated, due to some internal changes in
Python 3.11, where they are now using LRU cache. They are not deallocated
until `gc.collect()` is not called.

See python/cpython#97641.
This affects only Windows, because when we try to remove the tempdir for
the exp run, the sqlite connection is open which prevents us from
deleting that folder.

Although this may happen in real scenario in `exp run`, I am only fixing
the tests by mocking `dvc.close()` and extending it by calling
`gc.collect()` after it. We could also mock `State.close()` but didnot
want to mock something that is not in dvc itself.

The `diskcache` uses threadlocal for connections, so they are expected
to be garbage collected, and therefore does not provide a good way to
close the connections. The only API it offers is `self.close()` and that
only closes main thread's connection. If we had access to connection,
an easier way would have been to explicitly call `conn.close()`.
But we don''t have such option at the moment.

Related: iterative#8404 (comment)
GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
skshetry added a commit to skshetry/dvc that referenced this issue Nov 11, 2022
There seems to be a regression in Python 3.11, where the sqlite
connections are not deallocated, due to some internal changes in
Python 3.11, where they are now using LRU cache. They are not deallocated
until `gc.collect()` is not called.

See python/cpython#97641.
This affects only Windows, because when we try to remove the tempdir for
the exp run, the sqlite connection is open which prevents us from
deleting that folder.

Although this may happen in real scenario in `exp run`, I am only fixing
the tests by mocking `dvc.close()` and extending it by calling
`gc.collect()` after it. We could also mock `State.close()` but didnot
want to mock something that is not in dvc itself.

The `diskcache` uses threadlocal for connections, so they are expected
to be garbage collected, and therefore does not provide a good way to
close the connections. The only API it offers is `self.close()` and that
only closes main thread's connection. If we had access to connection,
an easier way would have been to explicitly call `conn.close()`.
But we don''t have such option at the moment.

Related: iterative#8404 (comment)
GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
skshetry added a commit to iterative/dvc that referenced this issue Nov 16, 2022
There seems to be a regression in Python 3.11, where the sqlite
connections are not deallocated, due to some internal changes in
Python 3.11, where they are now using LRU cache. They are not deallocated
until `gc.collect()` is not called.

See python/cpython#97641.
This affects only Windows, because when we try to remove the tempdir for
the exp run, the sqlite connection is open which prevents us from
deleting that folder.

Although this may happen in real scenario in `exp run`, I am only fixing
the tests by mocking `dvc.close()` and extending it by calling
`gc.collect()` after it. We could also mock `State.close()` but didnot
want to mock something that is not in dvc itself.

The `diskcache` uses threadlocal for connections, so they are expected
to be garbage collected, and therefore does not provide a good way to
close the connections. The only API it offers is `self.close()` and that
only closes main thread's connection. If we had access to connection,
an easier way would have been to explicitly call `conn.close()`.
But we don''t have such option at the moment.

Related: #8404 (comment)
GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
altendky added a commit to Chia-Network/chia-blockchain that referenced this issue Feb 4, 2023
wallentx pushed a commit to Chia-Network/chia-blockchain that referenced this issue Feb 4, 2023
altendky added a commit to Chia-Network/chia-blockchain that referenced this issue Feb 28, 2023
wallentx pushed a commit to Chia-Network/chia-blockchain that referenced this issue Feb 28, 2023
drnlm pushed a commit to Sutekh-VtES/sutekh-mirror that referenced this issue Jul 31, 2023
Python 3.11 tweaked how the sqlite module worked, which
changed the implicit cleanup behavior, which can cause
problems on windows when temporary files are involved.
See the discussion in python/cpython#97641

The cursor juggling code in the database upgrade test implicitly
relied on this, so we need to tweak the logic so we explicitly
close the connection so the tests pass on Windows again.
@erlend-aasland
Copy link
Contributor

erlend-aasland commented Aug 17, 2023

I was able to fix this by adding an explicit connection close() (see django/django#16225). A similar issue occurred in sqlite-utils, see simonw/sqlite-utils#503. I'm not sure why it started showing up in Python 3.11 but I don't mind closing the issue.

Following gh-108015, sqlite3 will now emit a ResourceWarning if connection objects are closed implicitly. Explicit resource management is recommended. cc. @eryksun

@erlend-aasland
Copy link
Contributor

gh-108015 has now landed; Python 3.13 will emit a ResourceWarning if a connection object is closed implicitly. Explicit resource management is now recommended. @felixxm has reported that Django already started to clean up their test suite (django/django#17178).

It seems to me the problems mentioned in this issue can all be handled by improved resource management. Suggesting to close this.

@erlend-aasland
Copy link
Contributor

Closing as per #97641 (comment).

@erlend-aasland erlend-aasland closed this as not planned Won't fix, can't repro, duplicate, stale Jan 12, 2024
@erlend-aasland erlend-aasland removed the pending The issue will be closed if no feedback is provided label Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes extension-modules C modules in the Modules dir OS-windows topic-sqlite3
Projects
None yet
Development

No branches or pull requests

9 participants