-
-
Notifications
You must be signed in to change notification settings - Fork 420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for SplitBodyCache #1971
Conversation
b85c418
to
3e2ba47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However; it completely prevents running an older PDM against a new cache.
I don't think we need to consider this issue.
@tgolsson Why does |
@sanmai-NL It's based on https://github.com/psf/cachecontrol, maybe wrong place for asking. |
@sanmai-NL Why would it lead to file handle leaks? |
@tgolsson Because every caller of |
I failed to find references to |
|
The io object will be passed to create an |
I noticed that |
Which Python version? Python file handles from
There was actually some issues with returning a non-file in some situations, though the original PR is many months old at this point as cachecontrol was quite dead, so I can't say if it'd still repro or was related to issues in that specific version. I think in general it's nicer to avoid loading multiple gigabytes into memory if it can be avoided, especially if it's not certain it'll be needed. As I recall in some cases it'll actually just get streamed out again into the build directory, so we never need to keep it all in memory. |
@tgolsson I'll readily believe it wasn't a major defect with PDM. But please read this: https://realpython.com/why-close-file-python/ |
Not sure what you mean with manually close? Do you mean, somehow enforcing that |
using a global context manager that will closed after CLI handling, like what pip is doing |
urllib3 will always close the file (when the response is GC'd); so I don't see why we'd need to manually close them. I guess if we had some malformed response data we could potentially leak them on non-CPython interpreters as we'd not be creating a HTTPResponse object. |
When will garbage collection happen? If PDM keeps running and accumulates responses to cache that aren't being read, the number of open files will increase to no limit. Right? Perhaps pointing out the actual source code paths will keep us focused. |
Keep arguing with two philosophies seems meaningless. Can you quantify how many files are opened and how many are not closed when running a PDM command? |
If you explain the main code path from actual requests done by PDM into this caching layer, then I'll look into measurement. |
To quote the docs (which you can find here):
So the object is to be garbage-collected immediately when the refcount reaches 0, i.e, when it becomes unreachable from regular Python code. That is to say; unless there are islands of self-referential objects, it will be cleaned up immediately. If it does exist in an island it'll be handled during the sweep-and-prune GC phase. If it isn't at 0, and isn't in an island, someone is keeping it around and I'd say forcing it to be closed would be a potential future bug. |
@tgolsson To be clear, I knew that. The reference counting behavior is CPython specific, by the way. And even there, a deallocation does not guarantee that operating system resources related to the freed object, such as file handles, are released as well. See https://stackoverflow.com/questions/49512990/does-python-gc-close-files-too. I'm just putting a question mark at offering an API that increases the likelihood of leaking memory or other resources. So I asked, are there no equally performant but more robust alternatives? The scenario I envisage is that a programmer calls But I'm keen to measure this as asked by @frostming. Just would like more explaination about how the caching layer is actually (to be) used. |
I'm aware that it is CPython specific. It is after all in the previous link you told me to read. Perhaps you forgot? And it's also quite obvious from the fact that I link the CPython documentation. And as I've written before; the HTTPResponse does close the file explicitly when dropped, via Doesn't mean there's anything wrong with the code in PDM or this PR. For what it's worth, the caching layer is used to construct the session used by |
Please calm down sir. I'll respond to your comments soon, but I would prefer if you keep it businesslike. |
@tgolsson I think you've misunderstood my inquiry.
You finally perform some experiment. Thanks for that, it adds some information. Most of the following statements of fact you seem to believe in, I think, based on your defensive response. But they are disconnected from my own input in this thread:
Lastly, you write you understand CPython vs non-CPython interacts with the design. I gave you pointers, you responded with information that is already known, apparently to both of us. I hope you now see why that doesn't help the discussion.
|
Tracing back from |
Sorry for wall of text.
You did ask, but you also claimed that the design leads to leaking file handles. So if you expect discussion, I could as easily say I expect proof - otherwise it's just theorycrafting about file handles. But, I also don't agree with that sentiment, because in the intended usage and under regular (i.e, happy path) operation the leaks cannot occur. The API is designed in an upstream crate, and a proper usage in-place is safe. It isn't a general-purpose cache utility, and I don't think it has to be. So I'm more interested in proving whether or not it has leaks as is, as otherwise any change would be churn. Re your statements, or claims to my thoughts; I hope my above reasoning makes sense. Code only needs to be fit for the purpose it's intended for; nothing more. The design could be improved - like any design can be - but if it works I think it doesn't matter. So proving it doesn't work is the most tractable path so we know what to change. So that leaves the following point:
And while I do not have any say as I'm merely a passing contributor; I do think that the opposite can be argued: An implementation that does not support X is acceptable for PDM. What X is can be argued. X might be Cython, Jython, or a solo project by John Doe. For example; I know Jython has IOBase, and IOBase as defined by the Python documentation shall hold a certain behavior. I'm arguing this behaviour guarantees correctness on the happy path. For sanity, we thusly have to find an implementation of Python that is correct by the standard, but which does not guarantee good behaviour in some situation. This - and this is a leap of faith - could be any non-CPython interpreter, iff we also do not construct the HTTPResponse in cachecontrol (because the happy path is correct). Sadly, I do not have every interpreter available at hand. As a remedy I've gone through some online and in containers using this snippet: import os
def foo():
print(len(os.listdir('/proc/%s/fd' % os.getpid())))
x = open('x.txt', 'w')
print(len(os.listdir('/proc/%s/fd' % os.getpid())))
print(len(os.listdir('/proc/%s/fd' % os.getpid())))
foo()
print(len(os.listdir('/proc/%s/fd' % os.getpid()))) Which ouputs something like Ergo; if the behavior here leads to leaking file handles then using x = WrapWithDel(open(...)) and x = IOBaseDerivative(open(...)) would both leak the file handle, even if both are guaranteed to close it if
It's in the code you link to -
A programmer does not call
I agree that if an IO object isn't closed due to a defect, that's a bug to be fixed. ;-) And if your resource-constrained operating system is doing |
Thanks for the analysis. No problem about it being more or less text. Who doesn't want to read it can skip the topic. My general response is: building robust interfaces matters. We can argue that there's no problem now, but what I'm looking for is that there's no problem tomorrow as well. The criterion for me is how much of an investment it had been to implement an alternative design that's more robust.
The resource constrained OS resides within the container. The max number of file descriptors can sometimes be low by default (e.g., 1024). I hope you agree that building software within containers isn't unusual. |
Pull Request Checklist
news/
describing what is new.Describe what you have changed in this PR.
Supersedes #1498, rather than rebasing. This adds support for using a split-body. In the current single-body cache, each file in the HTTP cache contains two different fragments: the serialized HTTP response, followed by the HTTP body data. This means that when response headers change (including timestamps), the whole cache object is rewritten to disk.
With this change the body is not written again, reducing disk pressure.
This change is compatible with existing caches, to the best of my ability to test it. However; it completely prevents running an older PDM against a new cache. In that case, the body in the file is "None" and the deserialization breaks. Not sure if there's a great workaround.