
[Python] pyarrow MemoryMappedFile.close() does not release memory #34423

Open
cwang9208 opened this issue Mar 3, 2023 · 4 comments
Labels
Component: Python · Type: usage (Issue is a user question)

Comments

cwang9208 commented Mar 3, 2023

Describe the usage question you have. Please include as many useful details as possible.

path = ""
files = os.listdir(path)
expr = pc.field("l_shipdate") <= datetime.date(1998, 12, 1)

for file in files:
    source = pa.memory_map(path + file)
    table = pa.ipc.RecordBatchFileReader(source).read_all().filter(expr)
    source.close()

I have a simple program to test the memory usage of MemoryMappedFile. In the test above, I loop over hundreds of 7 GB files. I find that system memory usage keeps increasing until it is exhausted, even though I call source.close().

Is there anything wrong with my code, or is this a bug?

Thanks in advance for your help.

Component(s)

Python

cwang9208 added the "Type: usage (Issue is a user question)" label on Mar 3, 2023
cwang9208 changed the title from "pyarrow MemoryMappedFile close does not release memory" to "pyarrow MemoryMappedFile.close() does not release memory" on Mar 3, 2023
kou changed the title from "pyarrow MemoryMappedFile.close() does not release memory" to "[Python] pyarrow MemoryMappedFile.close() does not release memory" on Mar 3, 2023
westonpace (Member) commented

Assuming you aren't saving the Table or references to its data somewhere, then yes, I would expect the memory to be released. A memory-mapped file should call munmap once the file is closed and all references to the mapped memory are released (e.g. the table is destroyed).
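A minimal sketch of that lifetime rule, using a hypothetical example.arrow file (the file name and the read_buffer() call are illustrative, not from the report): the mapped pages can only be returned to the OS once the handle is closed and every buffer viewing the mapping is gone.

import pyarrow as pa

# Hypothetical file name, for illustration only.
source = pa.memory_map("example.arrow")

# read_buffer() returns a zero-copy pyarrow.Buffer backed by the mapping.
buf = source.read_buffer()

# Closing the handle alone is not enough: `buf` still references the
# mapped pages, so munmap cannot run yet.
source.close()

# Dropping the last reference to the mapped memory is what finally
# allows the mapping to be unmapped.
del buf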

cwang9208 (Author) commented

@westonpace Hi, thanks for your reply. So you mean my code is correct and this should be a bug, right?

westonpace (Member) commented

Probably. What happens if you put a manual gc run in the loop?

import gc
...
for file in files:
    source = pa.memory_map(path + file)
    table = pa.ipc.RecordBatchFileReader(source).read_all().filter(expr)
    source.close()
    gc.collect()

I'm wondering if some dangling Python object (e.g. maybe the pa.ipc.RecordBatchFileReader) is hanging around and keeping references to the buffers.
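A sketch of the reference chain being suspected here, again with an assumed example.arrow file: even after the table is deleted and the file is closed, a still-live reader can hold buffer references that pin the mapping.

import gc
import pyarrow as pa

# Assumed file name, for illustration only.
source = pa.memory_map("example.arrow")
reader = pa.ipc.RecordBatchFileReader(source)
table = reader.read_all()

del table
source.close()
# `reader` is still alive here and may still hold references to buffers
# backed by the mapping; it must be dropped too before munmap can happen.
del reader
gc.collect()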

cwang9208 (Author) commented

@westonpace I already tried that (code below), but memory is still exhausted (watching with watch -n 1 free -mh).

import datetime
import gc
import os

import pyarrow as pa
import pyarrow.compute as pc

path = ""
files = os.listdir(path)
expr = pc.field("l_shipdate") <= datetime.date(1998, 12, 1)

for file in files:
    source = pa.memory_map(path + file)
    reader = pa.ipc.RecordBatchFileReader(source)
    table = reader.read_all().filter(expr)
    del table
    del reader
    source.close()
    gc.collect()
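One caveat with watch -n 1 free -mh: it reports system-wide memory, and reading memory-mapped files also fills the (reclaimable) page cache, which shows up in free's buff/cache column. A sketch of measuring this process's own RSS per iteration instead, using psutil (the psutil dependency is an assumption, not part of the original report):

import datetime
import gc
import os

import psutil
import pyarrow as pa
import pyarrow.compute as pc

path = ""  # directory containing the IPC files (left blank in the report)
files = os.listdir(path)
expr = pc.field("l_shipdate") <= datetime.date(1998, 12, 1)
proc = psutil.Process()

for file in files:
    source = pa.memory_map(path + file)
    table = pa.ipc.RecordBatchFileReader(source).read_all().filter(expr)
    del table
    source.close()
    gc.collect()
    # RSS of this process only: steady RSS growth would suggest a real leak,
    # while growth only in system-wide buff/cache may just be page cache.
    print(f"{file}: rss={proc.memory_info().rss / 2**20:.0f} MiB")

If RSS stays flat while free shows memory filling up, the growth is likely page cache rather than a pyarrow leak.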
