Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented batch processing for the indexed raster class. #433

Merged
merged 27 commits into from
Sep 23, 2024

Conversation

elidwa
Copy link
Contributor

@elidwa elidwa commented Sep 20, 2024

I am using the spatial filter recommended by @dshean. After identifying all unique rasters within the AOI and the points that intersect each raster, batch-threaded readers process each raster for all its points. No raster is sampled by more than one thread, which greatly improves GDAL's performance.

Through extensive testing on an 8-core machine, I found that 20 reader threads is the optimal number. Adding more threads slows down performance, likely due to GDAL's global cache locks.

Prior to this change, sampling 64k points in 3DEP rasters took 140 seconds. Now it takes 35. Similarly, processing 525k points used to take 22 minutes, but now takes 4 minutes.

There is still a lot of room for further improvements, such as replacing vectors of features with a GEOS R-tree. I will continue working on these optimizations, but the current implementation is ready for use. I will also add more tests.

Note: These optimizations only apply to indexed class datasets. For single index files, such as VRTs, the code remains unchanged. We create multiple sampler objects and allow them to read the same VRT in parallel, each processing a different range of points. Once all the threads complete, the results are combined and returned. This part of the code has not been optimized yet.

I also resolved issue #430 by replacing the SR dictionary with std::unordered_map<std::string, bool> bandsDict. This fixed the problem, and the mutex is no longer required.

@jpswinski
Copy link
Member

@elidwa I did not realize that std::unordered_map<std::string, bool> was thread safe. Can you confirm that?

@elidwa
Copy link
Contributor Author

elidwa commented Sep 23, 2024

@jpswinski, std::unordered_map is not thread-safe when multiple threads are modifying it. However, in my context, it is safe because only the main thread populates the map, and the worker threads only read from it. This read-only operation is thread-safe.

The SR dictionary was intended to behave the same way, but it did not. Without a mutex protecting access to the dictionary, the application was crashing, which did not make sense to me since the threads were not modifying the dictionary.

@jpswinski
Copy link
Member

@elidwa That is good to know, thanks! Independent of this PR, as you mentioned, the intended behavior of the SR dictionary is that reads would be thread safe. So I need to investigate why that is not the case (because it is certainly assumed elsewhere).

@jpswinski jpswinski merged commit da9e4ae into main Sep 23, 2024
4 of 5 checks passed
@jpswinski jpswinski deleted the new_parquet_sampler branch September 23, 2024 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants