Change location of tmp output #150
Comments
To add here: the matrix written to /tmp had a temporary size of ~900 MB; after merging and copying to the real location it is ~130 MB. I still need to do some checks, but my first impression is that 0.8.2 is also slower than 0.7.11.
I have done some tests. Maybe I did something wrong when I updated the API call from 0.7.11 to 0.8.2. My source to write a matrix out in 0.7.11 is:

```python
split_factor = 1
if len(self.matrix.data) > 1e7:
    split_factor = 1e4
    matrix_data_frame = np.array_split(matrix_data_frame, split_factor)
cooler.io.create(cool_uri=pFileName,
                 bins=bins_data_frame,
                 pixels=matrix_data_frame,
                 append=self.appendData,
                 dtype=dtype_pixel)
```

And in 0.8.2 it is:

```python
if len(self.matrix.data) > 1e7:
    split_factor = 1e4
    matrix_data_frame = np.array_split(matrix_data_frame, split_factor)
if self.appendData:
    self.appendData = 'a'
else:
    self.appendData = 'w'
cooler.create_cooler(cool_uri=pFileName,
                     bins=bins_data_frame,
                     pixels=matrix_data_frame,
                     mode=self.appendData,
                     dtypes=dtype_pixel)
```

I tested how long it takes to open a cool file, apply correction factors, and write it back, both with 10000 chunks on Rao 2014 data. Runtime with 0.7.11 is 6 minutes and 9 seconds in total.
The run with 0.8.2 takes longer. The two files have different sizes in the end, 354 MB vs. 345 MB, but a check of the content shows they are equal.

Cooler 0.7.11 writes everything directly to the given location, whereas, as mentioned above, 0.8.2 writes to tmp first. There it creates two files: one multi-cooler file with a size of 3.3 GB and a second one (I guess for merging) with 375 MB. That is an overhead of roughly a factor of 10. I don't know what causes this massive overhead, but from what I see here I hope you can bring the performance back to the level of the 0.7.11 version. If I can support or help you somehow to achieve this, please contact me.

Best, Joachim
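For reference, a minimal sketch of the kind of round-trip benchmark described in the comment above (open a cooler, apply its correction weights to the counts, write the result back out). The function name, the chunk size, and the assumption that the file already carries a `weight` column are illustrative, not the poster's actual HiCExplorer code:

```python
import cooler
import numpy as np

def roundtrip_with_correction(in_uri, out_uri, chunksize=10_000_000):
    """Open a cooler, apply its balancing weights to the counts,
    and write the corrected matrix to a new cooler file."""
    clr = cooler.Cooler(in_uri)
    bins = clr.bins()[:]              # bin table; assumed to include a 'weight' column
    weights = bins['weight'].values
    nnz = clr.info['nnz']             # total number of pixels

    def corrected_chunks():
        # Stream the pixel table in large, already-sorted chunks.
        for lo in range(0, nnz, chunksize):
            chunk = clr.pixels()[lo:lo + chunksize]
            w = weights[chunk['bin1_id'].values] * weights[chunk['bin2_id'].values]
            chunk['count'] = chunk['count'] * w
            yield chunk

    cooler.create_cooler(out_uri,
                         bins=bins,
                         pixels=corrected_chunks(),
                         dtypes={'count': np.float64})
```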
Hi Joachim,
The 0.7.11 behavior wasn't removed! If the input chunks are provided in the right order (I believe this is the case for you), pass the flag that marks them as ordered and the output is written directly in a single step. Otherwise, it is assumed that the chunks can be in any order, so they are written as a series of partial coolers and then merged (two steps).

The merge step is going to be very slow if your chunks are small, because there will be too many of them (and if there are more than 200 chunks it will do a 2-pass recursive merge!). If you can buffer the chunks into much larger ones, there shouldn't be so much overhead. A simple wrapper generator like the one sketched below can work. Though this seems to be tripping up users, so we may have to factor such buffering into the function.

Thanks for the benchmarking. Let me know the timing with that flag.
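The generator the comment refers to is not reproduced in this thread; the following is a minimal sketch of what such a re-batching wrapper could look like. The name `buffered_chunks` and the `min_rows` threshold are illustrative and not part of cooler's API:

```python
import pandas as pd

def buffered_chunks(chunks, min_rows=10_000_000):
    """Concatenate an iterable of small pixel DataFrames into much larger
    ones, so that create_cooler produces far fewer partial coolers to merge."""
    buf, n = [], 0
    for chunk in chunks:
        buf.append(chunk)
        n += len(chunk)
        if n >= min_rows:
            yield pd.concat(buf, ignore_index=True)
            buf, n = [], 0
    if buf:
        yield pd.concat(buf, ignore_index=True)
```

In the 0.8.2 snippet quoted earlier, this would be passed as `pixels=buffered_chunks(matrix_data_frame)`, since `np.array_split` already yields a list of DataFrames.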
Re: the original issue: good point. It would probably be a more sensible default. Thoughts/any objections, @mimakaev, @golobor? Note that, in case you need it, there is already a keyword argument for overriding where the temporary files are written.
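For completeness, a sketch of how the two options discussed in this reply might be passed. The keyword names `ordered` and `temp_dir` are taken from the cooler 0.8 documentation and are assumptions with respect to this thread, and the wrapper function `write_cool` is hypothetical:

```python
import os
import cooler

def write_cool(pFileName, bins_data_frame, matrix_chunks, dtype_pixel, append=False):
    """Write chunks that already arrive in sorted order in a single pass,
    keeping any temporary files next to the output file."""
    cooler.create_cooler(
        cool_uri=pFileName,
        bins=bins_data_frame,
        pixels=matrix_chunks,
        dtypes=dtype_pixel,
        mode='a' if append else 'w',
        ordered=True,                                # skip the partial-cooler + merge path
        temp_dir=os.path.dirname(pFileName) or '.',  # only used if a merge is still needed
    )
```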
Yes, the size difference can be attributed to an addition in the newer file format.
Hi, thanks for your response; after setting the option you suggested … Best, Joachim
Temp files are now created in the output location by default.
Hi,
can you change the intermediate location where a cool file is written? In the current version it goes to /tmp, /var/tmp, or /usr/tmp, or to whatever is set in the environment variables TMPDIR, TMP, or TEMP; see https://docs.python.org/2/library/tempfile.html#tempfile.tempdir. I think many non-developer users don't know about this and usually have a small root partition. Quite likely they wonder why they run out of disk space even though the location they defined as the output location is in /home, where there is enough space.
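To illustrate the behavior described above: Python's tempfile module decides where these intermediates land, and it can be redirected to a roomier location today via the environment. The paths here are placeholders, and the assumption is that the writer creates its temporaries through the tempfile module, as the linked docs suggest:

```python
import os
import tempfile

# Where intermediate files end up by default:
# typically /tmp, unless TMPDIR/TEMP/TMP points somewhere else.
print(tempfile.gettempdir())

# Workaround available today: export TMPDIR before running the tool
# (TMPDIR=/data/tmp <your command>), or override it for the current process:
os.environ['TMPDIR'] = '/data/tmp'   # placeholder path with enough free space
tempfile.tempdir = None              # force gettempdir() to re-read the environment
print(tempfile.gettempdir())
```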
It would be good if the temp file you create were created in the location given by `cool_uri`.

Best,
Joachim