Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6667: [Python] remove cyclical object references in pyarrow.parquet #5476

Closed
wants to merge 2 commits into from

Conversation

AaronOpfer
Copy link
Contributor

_build_nested_path has a reference cycle because the closured function refers to the parent cell which also refers to the closured function again. Address this by clearing the reference to the function from the parent cell before returning.

open_dataset_file is partialed with self inside the ParquetFile class. Prevent this by using a weakref instead.

@pitrou
Copy link
Member

pitrou commented Sep 23, 2019

Why do you care about _build_nested_path exactly? It's just a small function, it's not a problem if there's a reference cycle.

_open_dataset looks more serious, since it may keep alive an entire dataset.

python/pyarrow/parquet.py Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Sep 23, 2019

As the CI shows, this PR breaks the unit tests. I recommend you run the unit tests using pytest before pushing changes, since it will allow you to iterate more quickly.

In this case, the weakref breaks pickling. I would suggest trying to untangle the mess that is the _open_dataset_file indirection, perhaps by moving some functionality in a dedicated DatasetFileOpener (or DatasetParams?) class that would hold the necessary parameters.

@AaronOpfer
Copy link
Contributor Author

Hi, sorry for not catching the test failures in the first push.

I would suggest trying to untangle the mess that is the _open_dataset_file indirection, perhaps by moving some functionality in a dedicated DatasetFileOpener (or DatasetParams?) class that would hold the necessary parameters.

The new patch creates a metadata storage object for the purpose of breaking the cycle in ParquetDataset. The implementation of which I'm not super pleased about, so I'm interested in what your feedback is there.

Why do you care about _build_nested_path exactly? It's just a small function, it's not a problem if there's a reference cycle.

I'm analyzing my codebase trying to identify why my garbage collector is running very frequently during calculations and my analysis had me find this reference cycle. I would be uncomfortable raising my garbage collection threshold or disabling it outright for performance gains unless my analysis actually shows that I never actually collect any garbage, and the _build_nested_path stackframes are swept on every collection. If I can read between the lines a bit, I agree with you that it is a minor problem when it comes to actual bytes of memory wasted.

@pitrou
Copy link
Member

pitrou commented Sep 24, 2019

I'm analyzing my codebase trying to identify why my garbage collector is running very frequently during calculations and my analysis had me find this reference cycle.

Hmm, I'd suggest taking some steps back :-) The Python garbage collector is running frequently by design. However, there are different kinds of garbage collections: incremental and full garbage collections. Only the full garbage collections can be costly, and even then, in most applications they have a minimal cost. You may use gc.callbacks if you want to find out more information about this.

Generally, it's expected for object-oriented code (and/or code using closures, etc.) to have reference cycles. As long as those reference cycles don't keep a costly object alive (such as an open file or a large array), it shouldn't be much of a problem.

python/pyarrow/parquet.py Outdated Show resolved Hide resolved
python/pyarrow/parquet.py Outdated Show resolved Hide resolved
@AaronOpfer
Copy link
Contributor Author

Looks like Python 2.7 can't pickle bound methods. ☹️ I'll go ahead and make the code uglier to fix it.

…quet

Refactor _build_nested_path to no longer have a reference cycle.

Move some metadata attributes from ParquetDataset to another sub-object to avoid a reference cycle.
@AaronOpfer
Copy link
Contributor Author

BTW @pitrou just realized you've had to come in after my garbage collection changes before...

tornadoweb/tornado#1782
tornadoweb/tornado#2230

Small world isn't it? 😃 This time you managed to stop me before I added weakrefs instead of after.

@wesm
Copy link
Member

wesm commented Sep 25, 2019

Looks like this is passing save for lint issues. I'm going to fix your lint issues in hopes that @pitrou can merge this in the morning before a 0.15.0 release candidate is cut

@wesm
Copy link
Member

wesm commented Sep 25, 2019

Please make pull requests using branches in the future, it's tedious and error prone for us to push commits to your master branch

@wesm
Copy link
Member

wesm commented Sep 25, 2019

I can't rebase/force-push otherwise GitHub will force the PR closed

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll wait for Travis-CI to pass and then merge.

@pitrou pitrou closed this in 502865d Sep 25, 2019
@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@c6faaed). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5476   +/-   ##
=========================================
  Coverage          ?   66.35%           
=========================================
  Files             ?      508           
  Lines             ?    70129           
  Branches          ?        0           
=========================================
  Hits              ?    46533           
  Misses            ?    23596           
  Partials          ?        0
Impacted Files Coverage Δ
python/pyarrow/parquet.py 92.5% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6faaed...f4909e0. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants