[Python] How to parallelize RecordBatch reading? #38275
How long does this actually take:
Would the bottleneck be on opening and concatenating the files rather than on extracting the record batches? If your bottleneck is reading/deserialization or I/O, perhaps the thread API would help.
@mapleFU Thank you for the reply! Actually, in my scenario I have a very long list of index lists, so I will need to repeat the I/O operation many times. The major performance concern is extracting record batches from disk. Is there anything that could block a multi-threaded RecordBatch take operation? And could you please provide more details on the recommended way to parallelize reading multiple RecordBatches, including how I should use the thread API to accelerate it? Thank you
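For what it's worth, here is a minimal sketch, not from this thread, of how the take calls could be fanned out over a standard-library thread pool. The names batches, index_lists, and max_workers are assumptions for illustration, and the speedup depends on the take kernel releasing the GIL, which Arrow's compute functions generally do.

```python
# Sketch only: parallelize RecordBatch.take across batches with threads.
# `batches` is assumed to be a list of pyarrow.RecordBatch and `index_lists`
# a list of index sequences of the same length.
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa


def take_one(batch: pa.RecordBatch, indices) -> pa.RecordBatch:
    # Delegate to the Arrow take kernel for this batch's indices.
    return batch.take(pa.array(indices, type=pa.int64()))


def parallel_take(batches, index_lists, max_workers=8):
    # Map (batch, indices) pairs over a thread pool and collect the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(take_one, batches, index_lists))
```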
What's the version of Arrow you are using?
Hi, I am using 13.0.0. I have some confusion here: I think the actual data reading only happens when I execute the take operation for each RecordBatch. Thank you! @mapleFU
https://stackoverflow.com/questions/18883414/evaluation-of-list-comprehensions-in-python
From the link above: I haven't written Python in a long time, but I remember that list comprehensions are evaluated eagerly. Besides, read_all is implemented as:

def read_all(self):
    """
    Read all record batches as a pyarrow.Table.

    Returns
    -------
    Table
    """
    cdef shared_ptr[CTable] table
    with nogil:
        check_status(self.reader.get().ToTable().Value(&table))
    return pyarrow_wrap_table(table)

I guess it will materialize them all at once.
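As an aside: if materializing everything at once is the concern, the stream reader can also be consumed incrementally instead of calling read_all(). A minimal sketch, with a placeholder path:

```python
import pyarrow as pa

# Iterate the stream reader so record batches are produced one at a time
# instead of being materialized into a single Table by read_all().
with pa.memory_map("data.arrows", "r") as source:  # placeholder path
    reader = pa.ipc.open_stream(source)
    for batch in reader:
        ...  # process each pyarrow.RecordBatch as it arrives
```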
@mapleFU Hi, thank you for the reply. The memory-mapped files are pretty large and cannot fit into CPU RAM at once, I believe. BTW, what if I convert all those files from the Arrow streaming format to the Arrow file (random access) format? I learned this approach from: https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data
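For reference, a minimal sketch of such a stream-to-file conversion, with placeholder paths, writing one batch at a time so the whole table never has to be resident in memory:

```python
import pyarrow as pa

# Rewrite an Arrow streaming-format file as an Arrow file (random access)
# format, batch by batch.
with pa.memory_map("data.arrows", "r") as source, \
        pa.OSFile("data.arrow", "wb") as sink:
    reader = pa.ipc.open_stream(source)
    with pa.ipc.new_file(sink, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)
```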
Oh, that's interesting. Can you report the OS and disk you're actually using? I'm a bit busy on work days, but I'll try to reproduce this when I have spare time. Also cc @kou, do you have any advice on this?
Hi @mapleFU. Thank you for the reply! Here is my OS:
The underlying filesystem is actually a remote storage system, but I have reproduced this on my personal laptop, which runs Ubuntu 22.04 with a local NVMe disk, so I don't think it matters whether the disk is remote storage. Sorry, but what do you mean by that?
With mmap, Arrow will create a memory-mapped file [1]. The system has a limited amount of memory; when memory is not enough, it will "swap out" the mmapped pages to block storage, and the next time that part of the data is visited it may be reloaded from block storage. I'm not sure this is the problem, but I guess you can try it and see.
Or you can profile where the time is spent with a flamegraph. It will make things clearer.
[1] https://arrow.apache.org/docs/cpp/api/io.html#_CPPv4N5arrow2io16MemoryMappedFileE
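If a flamegraph tool isn't at hand, the standard-library profiler is a rough substitute for seeing where the Python-side time goes. A sketch, with read_random_batches standing in for the actual read loop:

```python
import cProfile
import pstats


def read_random_batches():
    ...  # placeholder for the code that takes/reads the record batches


# Dump profile stats and print the calls with the largest cumulative time.
# Note this only covers Python-level frames; C++/page-fault time needs a
# sampling profiler that can produce flamegraphs.
cProfile.run("read_random_batches()", "read_profile.stats")
pstats.Stats("read_profile.stats").sort_stats("cumulative").print_stats(20)
```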
You expected that the reads would run in parallel, but I think that you can't do that here. (What is your real problem? Do you want to parallelize Apache Arrow data reads?)
@kou Hi, thank you for the reply! Yes, essentially I would like to parallelize Apache Arrow data reads, and I hope to understand the mechanism better too. Sorry for the confusion, but let me describe my problem in more detail. I have 20 Arrow files and each of them is 55 GB. Since they cannot be loaded directly into RAM at once, they are memory-mapped:

mmap_files = [pa.memory_map(os.path.join(dir_path, file_name), 'r') for file_name in file_names]
mmap_tables = [pa.ipc.open_stream(memory_mapped_stream).read_all() for memory_mapped_stream in mmap_files]

The real concern is that when the file size is large, reading multiple random indices of a RecordBatch seems to become slow, which does not seem to be an issue for smaller Arrow files (even when the nbytes of the RecordBatch is almost the same). Is this expected? Initially I was thinking that maybe when the file size is large, reading multiple non-contiguous RecordBatches results in page faults that block the reading. So I was trying to parallelize the different RecordBatch reads so that they would not block each other. As @mapleFU mentioned, maybe this is caused by "swap".
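For context, a rough sketch of how the per-read start_time/end_time could be recorded; everything except mmap_tables is a hypothetical name:

```python
import time

# Time each random RecordBatch access to see whether individual reads stall
# (e.g. on page faults) as the file gets larger.
batches = mmap_tables[0].to_batches()    # batches of the first mmapped table
timings = []
for i in batch_indices:                  # hypothetical list of batch indices
    start_time = time.perf_counter()
    _ = batches[i].take(row_indices[i])  # hypothetical per-batch row indices
    end_time = time.perf_counter()
    timings.append((start_time, end_time))
timings.sort()
```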
Thanks for providing additional information. In your use case (random access), the File format may be better than the Stream format, because we need to read all data from the beginning with the Stream format but we don't need to do that with the File format. If you don't need all of the data in your Arrow files, the File format lets you avoid loading some (possibly much) of the data from disk into memory. Could you try the File format?
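A minimal sketch of random access with the File format, with a placeholder path and hypothetical batch indices; RecordBatchFileReader exposes num_record_batches and get_batch(i), so only the requested batches have to be touched:

```python
import pyarrow as pa

# Open an Arrow file (random access format) over a memory map and read only
# the record batches that are actually needed.
with pa.memory_map("data.arrow", "r") as source:  # placeholder path
    reader = pa.ipc.open_file(source)
    wanted = [0, 7, 42]                           # hypothetical batch indices
    batches = [reader.get_batch(i) for i in wanted
               if i < reader.num_record_batches]
```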
@kou Hi, thank you for the advice. Yes, this helps a lot; it now runs with satisfactory performance. I am curious about why, though: with the stream format, after performing read_all() the data should already be memory-mapped, shouldn't it?
We need to load all data from the beginning with the Stream format. I hope that this explanation helps you. (You may want to use the File format for random access like this.)
Can we close this?
Oh, I forgot about the File format. Just a question: in this case it seems the data is still mmapped, but the Stream format needs to parse the whole file's schema and map the regions into Arrow batches. When the user wants to load all the data as memory-mapped RecordBatches, how does this benefit the reading time?
You mean, is the Stream format still useful with mmap, right?
Got it, thanks!
@kou Thank you for your explanation! I think I can close this issue now.
Describe the usage question you have. Please include as many useful details as possible.
Currently I have some Arrow streaming-format files, which I read into a list of lists of RecordBatch,
and I have indices of the RecordBatches to read; I would like to read them in parallel for efficiency.
However, it seems that they are not well parallelized, based on the recorded timing data.
I sorted the start_time and end_time to show this:
Component(s)
Python