Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for SplitBodyCache #1971

Merged
merged 3 commits into from
Jun 5, 2023

Conversation

tgolsson
Copy link
Contributor

@tgolsson tgolsson commented Jun 2, 2023

Pull Request Checklist

  • A news fragment is added in news/ describing what is new.
  • Test cases added for changed code.

Describe what you have changed in this PR.

Supersedes #1498, rather than rebasing. This adds support for using a split-body. In the current single-body cache, each file in the HTTP cache contains two different fragments: the serialized HTTP response, followed by the HTTP body data. This means that when response headers change (including timestamps), the whole cache object is rewritten to disk.

With this change the body is not written again, reducing disk pressure.

This change is compatible with existing caches, to the best of my ability to test it. However; it completely prevents running an older PDM against a new cache. In that case, the body in the file is "None" and the deserialization breaks. Not sure if there's a great workaround.

@tgolsson tgolsson force-pushed the ts/split-body-cache-redo branch from b85c418 to 3e2ba47 Compare June 2, 2023 19:49
Copy link
Collaborator

@frostming frostming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However; it completely prevents running an older PDM against a new cache.

I don't think we need to consider this issue.

@frostming frostming merged commit 06fd648 into pdm-project:main Jun 5, 2023
@sanmai-NL
Copy link
Contributor

@tgolsson Why does caches.get_body() return a BinaryIO rather than binary data? This easily leads to file handle leaks.

@frostming
Copy link
Collaborator

@sanmai-NL It's based on https://github.com/psf/cachecontrol, maybe wrong place for asking.

@tgolsson
Copy link
Contributor Author

@sanmai-NL Why would it lead to file handle leaks?

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 12, 2023

@tgolsson Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak. That's why I asked why this function needs to return BinaryIO rather than plain binary data.

@sanmai-NL
Copy link
Contributor

@sanmai-NL It's based on https://github.com/psf/cachecontrol, maybe wrong place for asking.

I failed to find references to BytesIO in there.

@frostming
Copy link
Collaborator

@frostming
Copy link
Collaborator

frostming commented Jun 12, 2023

Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak

The io object will be passed to create an HTTPResponse object which will close it when consumed. I don't see why it will lead to leaks, every time you get a response it contains an open fp with it. It is not different from we open one ourselves and pass it to response.

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 12, 2023

Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak

The io object will be passed to create an HTTPResponse object which will close it when consumed. I don't see why it will lead to leaks, every time you get a response it contains an open fp with it. It is not different from we open one ourselves and pass it to response.

I noticed that CacheControl is (exclusively?) meant to be used with urllib3. urllib3 moves the IO object upon construction and closes HTTP responses after reading all bytes, rather than upon destruction. I see you designed get_body() to work with file handles rather than data, to conserve memory. Stepping aside the current urllib3 dependency, can you comment why a binary data generator wouldn't work? That could be context managed, unlike the bare IO object being passed around.

@tgolsson
Copy link
Contributor Author

tgolsson commented Jun 12, 2023

@tgolsson Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak. That's why I asked why this function needs to return BinaryIO rather than plain binary data.

Which Python version? Python file handles from open are closed when they get garbage collected, in every case I know of.

Stepping aside the current urllib3 dependency, can you comment why a binary data generator wouldn't work? That could be context managed, unlike the bare IO object being passed around.

There was actually some issues with returning a non-file in some situations, though the original PR is many months old at this point as cachecontrol was quite dead, so I can't say if it'd still repro or was related to issues in that specific version. I think in general it's nicer to avoid loading multiple gigabytes into memory if it can be avoided, especially if it's not certain it'll be needed. As I recall in some cases it'll actually just get streamed out again into the build directory, so we never need to keep it all in memory.

@sanmai-NL
Copy link
Contributor

@tgolsson I'll readily believe it wasn't a major defect with PDM. But please read this: https://realpython.com/why-close-file-python/

#2005

@frostming frostming mentioned this pull request Jun 12, 2023
2 tasks
@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 12, 2023

Re #2005 (comment)

Not sure what you mean with manually close? Do you mean, somehow enforcing that close() is called? I proposed a redesign to that effect, so that urllib3 will lazily load a cached file based on its path, without a file descriptor having to be passed between functions. I'm not sure how to solve it since I'm not so deeply involved in this.

@frostming
Copy link
Collaborator

using a global context manager that will closed after CLI handling, like what pip is doing

@tgolsson
Copy link
Contributor Author

tgolsson commented Jun 12, 2023

urllib3 will always close the file (when the response is GC'd); so I don't see why we'd need to manually close them. I guess if we had some malformed response data we could potentially leak them on non-CPython interpreters as we'd not be creating a HTTPResponse object.

@sanmai-NL
Copy link
Contributor

When will garbage collection happen? If PDM keeps running and accumulates responses to cache that aren't being read, the number of open files will increase to no limit. Right? Perhaps pointing out the actual source code paths will keep us focused.

@frostming
Copy link
Collaborator

frostming commented Jun 13, 2023

When will garbage collection happen? If PDM keeps running and accumulates responses to cache that aren't being read, the number of open files will increase to no limit. Right? Perhaps pointing out the actual source code paths will keep us focused.

Keep arguing with two philosophies seems meaningless. Can you quantify how many files are opened and how many are not closed when running a PDM command?

@sanmai-NL
Copy link
Contributor

If you explain the main code path from actual requests done by PDM into this caching layer, then I'll look into measurement.

@tgolsson
Copy link
Contributor Author

tgolsson commented Jun 13, 2023

To quote the docs (which you can find here):

.. c:function:: void Py_DECREF(PyObject *o)

   Decrement the reference count for object *o*.

   If the reference count reaches zero, the object's type's deallocation
   function (which must not be ``NULL``) is invoked.

   This function is usually used to delete a :term:`strong reference` before
   exiting its scope.

   The object must not be ``NULL``; if you aren't sure that it isn't ``NULL``,
   use :c:func:`Py_XDECREF`.

   .. warning::

      The deallocation function can cause arbitrary Python code to be invoked (e.g.
      when a class instance with a :meth:`__del__` method is deallocated).  While
      exceptions in such code are not propagated, the executed code has free access to
      all Python global variables.  This means that any object that is reachable from
      a global variable should be in a consistent state before :c:func:`Py_DECREF` is
      invoked.  For example, code to delete an object from a list should copy a
      reference to the deleted object in a temporary variable, update the list data
      structure, and then call :c:func:`Py_DECREF` for the temporary variable.

So the object is to be garbage-collected immediately when the refcount reaches 0, i.e, when it becomes unreachable from regular Python code. That is to say; unless there are islands of self-referential objects, it will be cleaned up immediately. If it does exist in an island it'll be handled during the sweep-and-prune GC phase.

If it isn't at 0, and isn't in an island, someone is keeping it around and I'd say forcing it to be closed would be a potential future bug.

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 13, 2023

@tgolsson To be clear, I knew that. The reference counting behavior is CPython specific, by the way. And even there, a deallocation does not guarantee that operating system resources related to the freed object, such as file handles, are released as well. See https://stackoverflow.com/questions/49512990/does-python-gc-close-files-too.

I'm just putting a question mark at offering an API that increases the likelihood of leaking memory or other resources. So I asked, are there no equally performant but more robust alternatives? The scenario I envisage is that a programmer calls get_body() many times in a frame that remains active, and so the refcount does not reach 0. Like PDM, doing a lot of get_body() calls, processing the results in a function that goes out of scope only after everything is done (seems typical for a CLI invocation). That would result in a leak if some IO object isn't closed due to a defect, or in any case a waste of resources at runtime. Indeed, in the container space I do have actual experience with resource-constrained operating system instances that run out of file descriptors at a peak in activity.

But I'm keen to measure this as asked by @frostming. Just would like more explaination about how the caching layer is actually (to be) used.

@tgolsson
Copy link
Contributor Author

tgolsson commented Jun 13, 2023

@tgolsson To be clear, I knew that. The reference counting behavior is CPython specific, by the way. And even there, a deallocation does not guarantee that operating system resources related to the freed object, such as file handles, are released as well. See https://stackoverflow.com/questions/49512990/does-python-gc-close-files-too.

I'm aware that it is CPython specific. It is after all in the previous link you told me to read. Perhaps you forgot? And it's also quite obvious from the fact that I link the CPython documentation.

And as I've written before; the HTTPResponse does close the file explicitly when dropped, via __del__. So the issue occurs if/when the body for some reason doesn't get constructed, which would happen if we switch request headers, and a few potential other cases:

https://github.com/psf/cachecontrol/blob/ffdfe8dd65715573911925a5d231bc530f2db4e9/cachecontrol/serialize.py#L100-L150

Doesn't mean there's anything wrong with the code in PDM or this PR. For what it's worth, the caching layer is used to construct the session used by unearth when calling PyPi. This code is here:

https://github.com/pdm-project/pdm/blob/4b06c6a40f0b3026c2283a1362d44cf168dc3dae/src/pdm/models/session.py#L38C1-L60

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 13, 2023

Please calm down sir. I'll respond to your comments soon, but I would prefer if you keep it businesslike.

@tgolsson
Copy link
Contributor Author

My apologies; I wasn't trying to be rude. But reciting back what I'm saying and explaining it to me is quite condescending, as is telling me to 'go read up' by just pointing out other things to read (which is why I quote + reference instead what's important in the docs!). Gets quite frustrating, which is why I called it out.

With that out of the way I'll try to be better. Back on topic: I hooked the FileCache struct to validate my hypothesis that files get closed.

image

This isn't an error or fake graph; that's actually two lines almost perfectly super-imposed.

image

This is the wrapper:

class Wrap:
    _open = []
    _close = []
    _files_open = 0
    _files_open_now = []

    def __init__(self, inner):
        self.buffer = inner
        Wrap._open.append(time.perf_counter_ns())
        Wrap._files_open += 1
        Wrap._files_open_now.append(Wrap._files_open)

    def close(self):
        self.buffer.close()
        Wrap._files_open -= 1
        Wrap._close.append(time.perf_counter_ns())

    def __getattr__(self, name):
        if name in ("close",):
            return self.__dict__[name]

        return getattr(self.buffer, name)

and then simply wrapping the file in get_body like so:

    def get_body(self, key: str) -> BinaryIO | None:
        path = self._get_cache_path(key)
        with contextlib.suppress(OSError):
            return cast(BinaryIO, Wrap(open(f"{path}.body", "rb")))

        return None

With pdm lock on PDM itself; I see 168 files opened and closed within the ~10s that takes, and never at any point is there more than 1 file open. Same with pdm sync, pdm install, etc, with varying timestamps and setups.

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 16, 2023

@tgolsson I think you've misunderstood my inquiry.

@tgolsson Why does caches.get_body() return a BinaryIO rather than binary data? This easily leads to file handle leaks.

@tgolsson Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak. That's why I asked why this function needs to return BinaryIO rather than plain binary data.

You finally perform some experiment. Thanks for that, it adds some information.
But, this experiment only gives evidence that suggests this current implementation in PDM hardly leaks file handles or any resource through the caching layer. Was that then what I claimed? I only asked about the motivation for this design, compared to an alternative design that guarantees proper closing of file handles and other resources and does not rely on careful usage of the API. I expected you or others who have something to say, to respond with a discussion on design alternatives, not in some sort of defense.

Most of the following statements of fact you seem to believe in, I think, based on your defensive response. But they are disconnected from my own input in this thread:

  • The current implementation in PDM (this PR) leaks file handles.
  • The current implementation is designed by @tgolsson and is not the responsibility of the cachecontrol architecture.
  • The current implementation leaks file handles and this is unacceptable to me.
  • I think that @tgolsson should provide evidence that the current implementation as used with PDM does not leak file handles.
  • An implementation that's only correct under CPython is acceptable for PDM.

Lastly, you write you understand CPython vs non-CPython interacts with the design. I gave you pointers, you responded with information that is already known, apparently to both of us. I hope you now see why that doesn't help the discussion.

  1. Instead, I expected you to motivate your own concession that under non-CPython, leaks may occur in more types of cases. (Please mind, I'm not claiming your concession in this referenced message covers all cases or is correct itself.)
  2. Moreover, I expected you to analyze the cases I described in add support for SplitBodyCache #1971 (comment).

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Jun 16, 2023

And as I've written before; the HTTPResponse does close the file explicitly when dropped, via __del__. So the issue occurs if/when the body for some reason doesn't get constructed, which would happen if we switch request headers, and a few potential other cases:

https://github.com/psf/cachecontrol/blob/ffdfe8dd65715573911925a5d231bc530f2db4e9/cachecontrol/serialize.py#L100-L150

Tracing back from HTTPResponsein cachecontrol, urllib3 and its superclasses starting with your reference, I didn't find a __del__ implementation that closes the file handle. In C source code of CPython's standard library garbage collection is tied (to some extent) to releasing resources like file handles. Please kindly point out what code I missed. Surely I will have missed the destructor method you referred to, in navigating quickly across GitHub.

@tgolsson
Copy link
Contributor Author

Sorry for wall of text.

[...] I expected you or others who have something to say, to respond with a discussion on design alternatives, not in some sort of defense.

You did ask, but you also claimed that the design leads to leaking file handles. So if you expect discussion, I could as easily say I expect proof - otherwise it's just theorycrafting about file handles.

But, I also don't agree with that sentiment, because in the intended usage and under regular (i.e, happy path) operation the leaks cannot occur. The API is designed in an upstream crate, and a proper usage in-place is safe. It isn't a general-purpose cache utility, and I don't think it has to be. So I'm more interested in proving whether or not it has leaks as is, as otherwise any change would be churn.

Re your statements, or claims to my thoughts; I hope my above reasoning makes sense. Code only needs to be fit for the purpose it's intended for; nothing more. The design could be improved - like any design can be - but if it works I think it doesn't matter. So proving it doesn't work is the most tractable path so we know what to change. So that leaves the following point:

An implementation that's only correct under CPython is acceptable for PDM.

And while I do not have any say as I'm merely a passing contributor; I do think that the opposite can be argued: An implementation that does not support X is acceptable for PDM. What X is can be argued. X might be Cython, Jython, or a solo project by John Doe. For example; I know Jython has IOBase, and IOBase as defined by the Python documentation shall hold a certain behavior. I'm arguing this behaviour guarantees correctness on the happy path.

For sanity, we thusly have to find an implementation of Python that is correct by the standard, but which does not guarantee good behaviour in some situation. This - and this is a leap of faith - could be any non-CPython interpreter, iff we also do not construct the HTTPResponse in cachecontrol (because the happy path is correct).

Sadly, I do not have every interpreter available at hand. As a remedy I've gone through some online and in containers using this snippet:

import os

def foo():
    print(len(os.listdir('/proc/%s/fd' % os.getpid())))
    x = open('x.txt', 'w')
    print(len(os.listdir('/proc/%s/fd' % os.getpid())))


print(len(os.listdir('/proc/%s/fd' % os.getpid())))
foo()
print(len(os.listdir('/proc/%s/fd' % os.getpid())))

Which ouputs something like 3 3 4 3 on a "good" interpreter and 3 3 4 4 on a bad. Doing this I can say that Cython and Stackless Python behave exactly like CPython. PyPy behaves different, but in a way that I think is still passable: since it doesn't GC in the same way CPython does; it takes a bit longer. If I explicitly gc.collect() it also closes the fd. That - as far as I can tell - also means that io.IOBase cannot be reliably used without calling gc.collect() - if that's fine, then the behavior of file handles is also fine. As a proof; I used my above Wrap, and its __del__ method never gets called either.

Ergo; if the behavior here leads to leaking file handles then using io.IOBase leads to leaking file handles under PyPy. The extension of that reasoning is that PyPy isn't a correctly implemented interpreter under those specific rules. I.e,

x = WrapWithDel(open(...))

and

x = IOBaseDerivative(open(...))

would both leak the file handle, even if both are guaranteed to close it if __del__ is called.

Tracing back from HTTPResponsein cachecontrol, urllib3 and its superclasses starting with your reference, I didn't find a del implementation that closes the file handle. In C source code of CPython's standard library garbage collection is tied (to some extent) to releasing resources like file handles. Please kindly point out what code I missed. Surely I will have missed the destructor method you referred to, in navigating quickly across GitHub.

It's in the code you link to - Py_tp_finalize corresponds to __del__. The binding occurs on line 882, and it's defined on line 296. The call to close occurs on line 321. Since that invokes it through self, it'd find the most derived close to call; i.e., the one on HTTPResponse, which in turns calls the close on the file just like called it on Wrap.close in my experiment.

I'm just putting a question mark at offering an API that increases the likelihood of leaking memory or other resources. So I asked, are there no equally performant but more robust alternatives? The scenario I envisage is that a programmer calls get_body() many times in a frame that remains active, and so the refcount does not reach 0. Like PDM, doing a lot of get_body() calls, processing the results in a function that goes out of scope only after everything is done (seems typical for a CLI invocation).

A programmer does not call get_body many times in a frame, because this isn't a general purpose API and is only used for the purpose of caching urllib3. Chainsaws are dangerous in the hands of a fool; that isn't a flaw in their design if you use them for their intended purpose. Given the API here - and those in the upstream - changing those would be breaking changes and have to be dealt with as such, whether planned or not. The other case of course is adding new code in PDM, in which case whoever did it would have to design their code in such a way that file handles are closed, which also applies to the correct-by-definition happy path.

That would result in a leak if some IO object isn't closed due to a defect, or in any case a waste of resources at runtime. Indeed, in the container space I do have actual experience with resource-constrained operating system instances that run out of file descriptors at a peak in activity.

I agree that if an IO object isn't closed due to a defect, that's a bug to be fixed. ;-) And if your resource-constrained operating system is doing pdm install in containers you probably have an architectural problem. It's just not an environment where most build tools - whether it's PDM, Pants, Bazel, Cargo, or CMake would thrive - or be designed for.

@sanmai-NL
Copy link
Contributor

Thanks for the analysis. No problem about it being more or less text. Who doesn't want to read it can skip the topic.

My general response is: building robust interfaces matters. We can argue that there's no problem now, but what I'm looking for is that there's no problem tomorrow as well. The criterion for me is how much of an investment it had been to implement an alternative design that's more robust.

I agree that if an IO object isn't closed due to a defect, that's a bug to be fixed. ;-) And if your resource-constrained operating system is doing pdm install in containers you probably have an architectural problem. It's just not an environment where most build tools - whether it's PDM, Pants, Bazel, Cargo, or CMake would thrive - or be designed for.

The resource constrained OS resides within the container. The max number of file descriptors can sometimes be low by default (e.g., 1024). I hope you agree that building software within containers isn't unusual.

@j178 j178 mentioned this pull request Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants