scn.save_dataset() slows down gradually during a loop #2964
P.S. I also tried […]
Without testing anything, one thing that should help at least a bit is to add […]. Also the […]. I'm not sure (again, without testing) how the […]. I don't know what is causing the increase in processing time, though. I remember we had some memory accumulation happening in Trollflow/Trollflow2 before we switched to separate processes for consecutive runs.
Thanks @pnuu! I tried these options and they do help a lot to improve the baseline, but they can't stop the time from creeping up little by little (not as much as before, though). Here's the result of my 142 datasets. If the data list is long enough this could still be a problem. I wonder if there are other things wrong here, but I can't figure them out...

import glob
import logging
import os
import time as time_calc
import warnings
import numpy as np
import gc
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("output.log"),
        logging.StreamHandler()
    ]
)
os.environ.setdefault("DASK_ARRAY__CHUNK_SIZE", "16MiB")
os.environ.setdefault("DASK_NUM_WORKERS", "12")
os.environ.setdefault("OMP_NUM_THREADS", "8")
os.environ.setdefault("GDAL_NUM_THREADS", "8")
os.environ.setdefault("PSP_CONFIG_FILE", "D:/satpy_config/pyspectral/pyspectral.yaml")
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)
from satpy import Scene, find_files_and_readers, config, utils
from satpy.writers import compute_writer_results
utils.debug_on()
config.set(config_path=["D:/satpy_config"])
def satpy_sat_to_float32(folder, sat_name, scan_area, reader_name, band_image_list):
    tmp_folder = f"{folder}/work_tmp"
    start_load = time_calc.perf_counter()
    files = find_files_and_readers(base_dir=folder, reader=reader_name)
    scn = Scene(filenames=files)
    scn.load(band_image_list)
    end_load = time_calc.perf_counter()
    print(f"Load sat dataset: {(end_load - start_load): .6f}")
    start_save = time_calc.perf_counter()
    res_list = []
    for band_image in band_image_list:
        output = (f"{sat_name}_{scn[band_image].attrs['sensor']}_{band_image}_"
                  f"{scn.start_time.strftime('%Y%m%d%H%M%S')}_{scan_area}")
        output_filename = f"{output}.tif"
        res = scn.save_dataset(band_image, filename=output_filename, writer="geotiff", tiled=True,
                               blockxsize=512, blockysize=512, base_dir=tmp_folder, num_threads=8,
                               compress=None, enhance=False, dtype=np.float32, compute=False,
                               fill_value=np.nan)
        res_list.append(res)
    compute_writer_results(res_list)
    end_save = time_calc.perf_counter()
    print(f"Save to float32 geotiff: {(end_save - start_save): .6f}")
    del scn
    gc.collect()
folders = glob.glob("C:/Users/45107/Downloads/Sat/Geo/H09*FLDK")
for folder in folders:
    print(folder)
    satpy_sat_to_float32(folder, "H09", "FLDK", "ahi_hsd",
                         ["Visible004_1000", "Visible005_1000",
                          "Visible006_500", "Visible008_1000",
                          "NearInfraRed016_2000", "NearInfraRed022_2000",
                          "InfraRed038_2000", "InfraRed063_2000",
                          "InfraRed069_2000", "InfraRed073_2000", "InfraRed087_2000",
                          "InfraRed096_2000", "InfraRed105_2000", "InfraRed112_2000",
                          "InfraRed123_2000", "InfraRed133_2000"])

Debug output of the first and last dataset:
@yukaribbba do you see the same trend in memory usage?
@mraspaud I'm also guessing this is about memory, garbage collection or things like that. My memory remained normal during the workflow. Actually this is part of a QThread, but I do them in batch mode. Looks like I need to split all those data folders one by one like you said: start the thread, do the job, quit the thread, move to the next folder and start it again. Maybe this can avoid the problem? I'll give it a try.
Threading might work, but I would recommend going all the way and using multiprocessing instead. Memory can be shared in threads, but in principle not with separate processes.
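For illustration, a minimal sketch of that suggestion, assuming the `satpy_sat_to_float32` function and band list from the script above (the module name `my_satpy_script` is a placeholder): each folder is handled in its own process, so anything accumulated during a run is released when that process exits.

```python
# Sketch only: run each folder in a separate process so memory/state dies with it.
# `my_satpy_script`, `satpy_sat_to_float32` and BAND_IMAGE_LIST are placeholders for
# the user's own script shown earlier in this thread.
import glob
import multiprocessing as mp

from my_satpy_script import satpy_sat_to_float32, BAND_IMAGE_LIST  # hypothetical module


def run_one(folder):
    # Everything satpy/dask/GDAL allocates lives only in this worker process.
    satpy_sat_to_float32(folder, "H09", "FLDK", "ahi_hsd", BAND_IMAGE_LIST)


if __name__ == "__main__":
    for folder in glob.glob("C:/Users/45107/Downloads/Sat/Geo/H09*FLDK"):
        proc = mp.Process(target=run_one, args=(folder,))
        proc.start()
        proc.join()  # wait; all memory is freed when the worker process exits
```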
At the time, the hardest part was to find a minimal example to reproduce the error... It would be great if that could be done, so we can investigate further!
I don't recall the outcome of everything we tried last time this came up, but there are a few things I think we could try: […]
Ok, I have two minimal examples here, one for png:

import glob
import os
import time as time_calc
import warnings
import numpy as np
os.environ.setdefault("DASK_ARRAY__CHUNK_SIZE", "16MiB")
os.environ.setdefault("DASK_NUM_WORKERS", "12")
os.environ.setdefault("OMP_NUM_THREADS", "8")
os.environ.setdefault("GDAL_NUM_THREADS", "8")
os.environ.setdefault("PSP_CONFIG_FILE", "D:/satpy_config/pyspectral/pyspectral.yaml")
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)
from satpy import Scene, find_files_and_readers, config
config.set(config_path=["D:/satpy_config"])
def satpy_sat_to_float32(folder, reader_name, band_image_list):
    tmp_folder = f"{folder}/work_tmp"
    start_load = time_calc.perf_counter()
    files = find_files_and_readers(base_dir=folder, reader=reader_name)
    scn = Scene(filenames=files)
    scn.load(band_image_list)
    end_load = time_calc.perf_counter()
    print(f"Load sat dataset: {(end_load - start_load): .6f}")
    start_save = time_calc.perf_counter()
    scn.save_datasets(writer="simple_image",
                      filename="{platform_name}_{sensor}_{name}_{start_time:%Y%m%d%H%M%S}_{area.area_id}.png",
                      base_dir=tmp_folder, enhance=False, compress_level=0)
    end_save = time_calc.perf_counter()
    print(f"Save to int8 png: {(end_save - start_save): .6f}")
    del scn
folders = glob.glob("C:/Users/45107/Downloads/Sat/Geo/H09*FLDK")
results = []
for folder in folders:
    print(folder)
    satpy_sat_to_float32(folder, "ahi_hsd",
                         [
                             "Visible004_1000", "Visible005_1000", "Visible006_500", "Visible008_1000",
                             "NearInfraRed016_2000", "NearInfraRed022_2000", "InfraRed038_2000", "InfraRed063_2000",
                             "InfraRed069_2000", "InfraRed073_2000", "InfraRed087_2000", "InfraRed096_2000",
                             "InfraRed105_2000", "InfraRed112_2000", "InfraRed123_2000", "InfraRed133_2000"
                         ])

another for geotiff:

......
scn.save_datasets(writer="geotiff",
filename="{platform_name}_{sensor}_{name}_{start_time:%Y%m%d%H%M%S}_{area.area_id}.tif",
base_dir=tmp_folder, blockxsize=512, blockysize=512, num_threads=8, compress=None,
enhance=False, dtype=np.float32, fill_value=np.nan)
...... |
Ok, new thing to try: Remove the […]
Sorry, I'm not available till Saturday. I'll check that once I'm free again.
As for geotiff, I don't have much space on my SSD to hold the output of 284 datasets, so I did it on a hard drive and the whole process became slower, but you can still see the trend clearly. @djhoese
@yukaribbba that's really interesting! So that means the image saving is what's leaking?
@mraspaud It looks like that. I feel like something isn't cleaned up in the writer part, so it accumulates little by little.
Have you tried a garbage collection (gc.collect()) after the […]?

Edit: Oops, fat-fingered the close button instead of comment.
I could imagine that being either: […]
After the writer's save_datasets method is called, I believe the Scene throws away (allows to be garbage collected) the Writer object. Overall I don't think geotiff should be used for performance testing of this sort because there are too many unknowns in rasterio/gdal's caching to really tie down what's happening. PNG, although slower, should be "dumber" as far as background threads and shared caches go.
How do you force GDAL to clear its cache?
Could it be similar to this? OSGeo/gdal#7908
Oh boy, that tcmalloc "solution" is quite the hack. You're basically telling the dynamic library loader to use a different dependency library. I don't see it in graph form in this thread, but has memory usage been tracked over the execution time? Preferably with some profiling tool. If you lowered the […]
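(For reference, a sketch of how GDAL's raster block cache could be bounded while experimenting. GDAL_CACHEMAX is a standard GDAL configuration option; whether it has any effect on the slowdown discussed here is untested.)

```python
# Sketch only: cap GDAL's raster block cache so it cannot grow unbounded during the
# loop. A bare number for GDAL_CACHEMAX is interpreted as megabytes. Whether this
# influences the slowdown reported in this issue has not been verified.
import os

os.environ["GDAL_CACHEMAX"] = "256"  # must be set before GDAL/rasterio is first used

# Alternatively, scope the setting to a block of code with rasterio's env manager:
import rasterio

with rasterio.Env(GDAL_CACHEMAX=256):
    # open / save GeoTIFFs here; the cache limit applies only inside this context
    pass
```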
OK, several things newly found: […]
Put more timers in […]

Edit: memory usage seems to be normal. In every loop its maximum reaches 16-17 GB and then drops quickly.
Sorry for not responding earlier. Out with a sick kid, so I'll be a little slow for at least today. This is expected and is a downside of geotiff writing with rasterio and dask as of when we wrote this. First, we have to tell dask to use a lock between threads so they don't write in parallel, because rasterio doesn't (didn't?) support parallel geotiff writing. Dask also doesn't have a way to say "this file is related to these dask tasks and should be closed when they are completed", so we keep the file-like object open and have the user close it. I believe rioxarray has better handling of this, but none of the core trollimage developers have had a chance to migrate this code.
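For comparison only (this is not what Satpy or trollimage currently do), a rough sketch of the rioxarray-style write mentioned above: a lock serializes the threaded chunk writes, and the file handle is closed when the call returns. The array shape, coordinates and file name are made up for the example.

```python
# Rough sketch of the rioxarray approach mentioned above; not Satpy's implementation.
# The array, coordinates and output path are invented for illustration.
import threading

import dask.array as da
import numpy as np
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)
import xarray as xr

ny, nx = 2000, 2000
data = xr.DataArray(
    da.random.random((ny, nx), chunks=(512, 512)).astype(np.float32),
    dims=("y", "x"),
    coords={"y": np.linspace(50.0, 40.0, ny), "x": np.linspace(100.0, 110.0, nx)},
).rio.write_crs("EPSG:4326")

# `lock` makes the dask chunks take turns writing (rasterio can't write one GeoTIFF
# from several threads at once); the file is closed when to_raster() returns.
data.rio.to_raster("example_float32.tif", tiled=True, lock=threading.Lock())
```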
OK, I see. So is there any way to close it myself?
That's one of the things […]
Very interesting! I have nothing against using GDAL directly as long as the functionality is preserved. However, I'm wondering if we are doing something wrong in riosave, or if the issue is in the rasterio library... I will try to check today what happens if we use rioxarray for saving instead of rasterio.
Describe the bug
The results I need are some raw bands in float32 geotiffs, without any corrections or enhancements. During the process, CPU/memory usage and system temps remained at a reasonable level. I got what I wanted and no errors popped up. It's just that the processing time gets longer and longer; the more datasets and products involved, the more significant this becomes. And it happens with several readers like ami, abi or fci, not just ahi (I haven't tested others yet).

To Reproduce
Expected behavior
Time consumption should be more stable.
Actual results
A part of the debug output:
debug.txt
Environment Info: