ARROW-6667: [Python] remove cyclical object references in pyarrow.parquet #5476
Why do you care about
As the CI shows, this PR breaks the unit tests. I recommend you run the unit tests using In this case, the weakref breaks pickling. I would suggest trying to untangle the mess that is the |
Hi, sorry for not catching the test failures in the first push.
The new patch creates a metadata storage object for the purpose of breaking the cycle in
I'm analyzing my codebase to identify why the garbage collector runs so frequently during calculations, and that analysis led me to this reference cycle. I would be uncomfortable raising my garbage collection threshold or disabling it outright for performance gains unless my analysis shows that I never collect any garbage, and the
Hmm, I'd suggest taking some steps back :-) The Python garbage collector is running frequently by design. However, there are different kinds of garbage collections: incremental and full garbage collections. Only the full garbage collections can be costly, and even then, in most applications they have a minimal cost. You may use gc.callbacks if you want to find out more information about this. Generally, it's expected for object-oriented code (and/or code using closures, etc.) to have reference cycles. As long as those reference cycles don't keep a costly object alive (such as an open file or a large array), it shouldn't be much of a problem. |
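To make the `gc.callbacks` suggestion concrete, here is a minimal sketch of how one might count collections per generation. The helper names `count_collections` and `make_cycles` are hypothetical and only serve the illustration; `gc.callbacks` and the `phase`/`info` arguments are the standard library hook.

```python
import gc
from collections import Counter


def count_collections(workload):
    """Run workload() and tally GC runs and collected objects per generation."""
    stats = Counter()

    def on_gc(phase, info):
        # The interpreter calls this with phase "start" or "stop"; on "stop",
        # info carries "generation", "collected" and "uncollectable".
        if phase == "stop":
            gen = info["generation"]
            stats["runs", gen] += 1
            stats["collected", gen] += info["collected"]

    gc.callbacks.append(on_gc)
    try:
        workload()
        gc.collect()  # force one full collection so generation 2 shows up
    finally:
        gc.callbacks.remove(on_gc)
    return stats


def make_cycles(n=1000):
    """Hypothetical workload: create n small two-object reference cycles."""
    class Node:
        pass

    for _ in range(n):
        a, b = Node(), Node()
        a.other, b.other = b, a


stats = count_collections(make_cycles)
for (kind, gen), value in sorted(stats.items()):
    print(kind, "generation", gen, "=", value)
```

Comparing the run counts for generation 2 against the younger generations gives a rough idea of how often the potentially costly full collections happen.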
Looks like Python 2.7 can't pickle bound methods. |
…quet Refactor _build_nested_path to no longer have a reference cycle. Move some metadata attributes from ParquetDataset to another sub-object to avoid a reference cycle.
BTW @pitrou just realized you've had to come in after my garbage collection changes before... tornadoweb/tornado#1782 Small world isn't it? 😃 This time you managed to stop me before I added weakrefs instead of after. |
Looks like this is passing save for lint issues. I'm going to fix your lint issues in hopes that @pitrou can merge this in the morning before a 0.15.0 release candidate is cut |
Please make pull requests using branches in the future; it's tedious and error-prone for us to push commits to your master branch
I can't rebase/force-push, otherwise GitHub will force the PR closed
LGTM. I'll wait for Travis-CI to pass and then merge.
Codecov Report
@@ Coverage Diff @@
## master #5476 +/- ##
=========================================
Coverage ? 66.35%
=========================================
Files ? 508
Lines ? 70129
Branches ? 0
=========================================
Hits ? 46533
Misses ? 23596
Partials ? 0
Continue to review full report at Codecov.
_build_nested_path has a reference cycle because its recursive inner function refers to a closure cell in the enclosing scope, and that cell in turn refers back to the inner function. Address this by clearing the cell's reference to the function before returning.
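The cycle and the fix can be reproduced outside pyarrow. The following is a sketch only: `build_paths` and `visit` are hypothetical stand-ins and not the actual pyarrow code, and the returned weakref probe exists purely to observe whether the inner function is freed.

```python
import gc
import weakref


def build_paths(paths, break_cycle):
    """Group column indices by dotted prefix using a recursive inner function."""
    result = {}

    def visit(i, key, rest):
        result.setdefault(key, []).append(i)
        if rest:
            # The recursive call looks up `visit` through a closure cell,
            # so the function object and the cell refer to each other.
            visit(i, key + "." + rest[0], rest[1:])

    for i, path in enumerate(paths):
        parts = path.split(".")
        visit(i, parts[0], parts[1:])

    probe = weakref.ref(visit)  # observation only, does not keep visit alive
    if break_cycle:
        visit = None  # clears the cell, so the function is freed right away
    return result, probe


was_enabled = gc.isenabled()
gc.disable()  # make sure only reference counting can free objects here
try:
    result, leaked = build_paths(["a.b", "a.c"], break_cycle=False)
    leaked_alive_before_gc = leaked() is not None
    gc.collect()
    leaked_alive_after_gc = leaked() is not None
    _, cleared = build_paths(["a.b", "a.c"], break_cycle=True)
    cleared_alive = cleared() is not None
finally:
    if was_enabled:
        gc.enable()

print(leaked_alive_before_gc, leaked_alive_after_gc, cleared_alive)
```

Without the reset, the inner function survives until the cyclic collector runs; with it, plain reference counting reclaims the function as soon as the cell is cleared.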
open_dataset_file is wrapped in a partial with self inside the ParquetFile class, so the object ends up holding a reference to itself. Avoid this by using a weakref instead.
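A rough illustration of the difference between the two approaches, using hypothetical classes (`StrongDataset`, `WeakDataset`) and helpers rather than the real pyarrow objects:

```python
import gc
import weakref
from functools import partial


def _open_file(owner, path):
    # Stand-in for opening a file relative to the owner's root.
    return (owner.root, path)


def _open_file_weak(owner_ref, path):
    owner = owner_ref()
    if owner is None:
        raise ReferenceError("owner has already been freed")
    return _open_file(owner, path)


class StrongDataset:
    def __init__(self, root):
        self.root = root
        # self -> partial -> self: a reference cycle
        self.open_file = partial(_open_file, self)


class WeakDataset:
    def __init__(self, root):
        self.root = root
        # The partial only holds a weak reference back to self: no cycle
        self.open_file = partial(_open_file_weak, weakref.ref(self))


def freed_without_gc(cls):
    """Return True if an instance is freed by reference counting alone."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        obj = cls("/data")
        assert obj.open_file("part-0") == ("/data", "part-0")
        probe = weakref.ref(obj)
        del obj
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind by the strong variant


print(freed_without_gc(StrongDataset), freed_without_gc(WeakDataset))
```

As the review comment in this thread points out, the weakref variant broke pickling in this PR, which is why the later patch moved the metadata into a separate object instead.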