Commit

refactor: reduce memory usage of geoparquet file saving (#123)
* refactor: reduce memory usage of geoparquet file saving

* chore: apply refurb suggestions

* chore: refine empty columns test

* chore: add no cover pragmas
RaczeQ authored Jun 3, 2024
1 parent 38fc7ba commit 26f71e8
Showing 10 changed files with 610 additions and 395 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Test for parquet multiprocessing logic
- Test for new intersection step
- Option to pass URL directly as PBF path [#114](https://github.com/kraina-ai/quackosm/issues/114)
+- Dedicated `MultiprocessingRuntimeError` for multiprocessing errors

### Changed

@@ -22,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `PbfFileReader`'s internal `geometry_filter` is additionally clipped by PBF extract geometry to speed up intersections [#116](https://github.com/kraina-ai/quackosm/issues/116)
- `OsmTagsFilter` and `GroupedOsmTagsFilter` type from `dict` to `Mapping` to make it covariant
- Tqdm's `disable` parameter for non-TTY environments from `None` to `False`
+- Refactored final GeoParquet file saving logic to greatly reduce memory usage
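
The `dict` to `Mapping` entry in this list matters mostly to static type checkers: `dict` is invariant in its value type, while `Mapping` is covariant, so a mapping with a narrower value type is accepted where a broader one is declared. A minimal sketch of that effect follows; the `TagsFilter` alias and the `count_keys` function are hypothetical names used only for illustration and are not taken from QuackOSM's source.

```python
from collections.abc import Mapping
from typing import Union

# Hypothetical alias for illustration only; not QuackOSM's actual definition.
TagsFilter = Mapping[str, Union[bool, str, list[str]]]


def count_keys(tags_filter: TagsFilter) -> int:
    """Accept any read-only mapping of tag keys to filter values."""
    return len(tags_filter)


# A type checker accepts dict[str, bool] here because Mapping is covariant in
# its value type. If the parameter were annotated as
# dict[str, Union[bool, str, list[str]]], the same call would be flagged,
# since dict is invariant.
only_bools: dict[str, bool] = {"building": True, "highway": True}
result = count_keys(only_bools)
```

At runtime both annotations behave the same; the difference shows up only when a checker such as mypy analyses the call.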

## [0.8.1] - 2024-05-11

2 changes: 2 additions & 0 deletions README.md
@@ -78,6 +78,8 @@ Required:

- `geoarrow-pyarrow (>=0.1.2)`: For GeoParquet IO operations

+- `geoarrow-pandas (>=0.1.1)`: For GeoParquet integration with GeoPandas

- `geopandas (>=0.6)`: For returning GeoDataFrames and reading Geo files

- `shapely (>=2.0)`: For parsing WKT and GeoJSON strings and fixing geometries
12 changes: 8 additions & 4 deletions dev/generate_resources_usage_plot.ipynb
@@ -134,7 +134,7 @@
"outputs": [],
"source": [
"def _execute_example(pbf_path, working_directory) -> tuple[Path, dict]:\n",
-" path = qosm.convert_pbf_to_gpq(\n",
+" path = qosm.convert_pbf_to_parquet(\n",
" pbf_path=pbf_path,\n",
" working_directory=working_directory,\n",
" ignore_cache=True,\n",
@@ -200,10 +200,14 @@
" if ts >= (operation_start_time - 0.1)\n",
" ]\n",
" cpu_values_adjusted = [\n",
-" (val, ts - operation_start_time) for val, ts in cpu_values if ts >= (operation_start_time - 0.1)\n",
+" (val, ts - operation_start_time)\n",
+" for val, ts in cpu_values\n",
+" if ts >= (operation_start_time - 0.1)\n",
" ]\n",
" disk_values_adjusted = [\n",
-" (val, ts - operation_start_time) for val, ts in disk_values if ts >= (operation_start_time - 0.1)\n",
+" (val, ts - operation_start_time)\n",
+" for val, ts in disk_values\n",
+" if ts >= (operation_start_time - 0.1)\n",
" ]\n",
"\n",
" fig = plt.figure(figsize=(20, 10))\n",
@@ -360,7 +364,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.11.7"
+"version": "3.10.12"
}
},
"nbformat": 4,
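
The reformatted comprehensions in the notebook hunks above all do the same two things: drop samples recorded more than 0.1 s before the operation started, and re-base the remaining timestamps so the operation begins at zero. A small standalone sketch of that step, assuming samples are `(value, timestamp)` tuples as in the notebook; the helper name `shift_to_operation_start` and the sample data are made up for illustration.

```python
def shift_to_operation_start(samples, operation_start_time, tolerance=0.1):
    """Keep samples taken no earlier than `tolerance` seconds before the
    operation start, with timestamps expressed relative to that start."""
    return [
        (val, ts - operation_start_time)
        for val, ts in samples
        if ts >= (operation_start_time - tolerance)
    ]


# Made-up CPU samples as (value, timestamp) pairs.
cpu_values = [(12.0, 99.5), (35.0, 99.9375), (80.0, 100.5), (60.0, 101.0)]
adjusted = shift_to_operation_start(cpu_values, operation_start_time=100.0)
# The first sample is dropped; the second keeps a small negative offset
# because it falls inside the 0.1 s tolerance window.
```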
555 changes: 285 additions & 270 deletions pdm.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -9,6 +9,7 @@ dependencies = [
"pyarrow>=14.0.0",
"duckdb>=0.10.2",
"geoarrow-pyarrow>=0.1.2",
+"geoarrow-pandas>=0.1.1",
"typeguard>=3.0.0",
"psutil>=5.6.2",
"pooch>=1.6.0",
3 changes: 3 additions & 0 deletions quackosm/_exceptions.py
@@ -8,3 +8,6 @@ class GeometryNotCoveredError(Exception): ...


class InvalidGeometryFilter(Exception): ...


+class MultiprocessingRuntimeError(RuntimeError): ...
3 changes: 2 additions & 1 deletion quackosm/_parquet_multiprocessing.py
@@ -8,6 +8,7 @@
import pyarrow as pa
import pyarrow.parquet as pq

+from quackosm._exceptions import MultiprocessingRuntimeError
from quackosm._rich_progress import TaskProgressBar # type: ignore[attr-defined]

# Using `spawn` method to enable integration with Polars and probably other Rust-based libraries
@@ -51,7 +52,7 @@ def _job(
f"Error in worker (PID: {current_pid},"
f" Parquet: {file_name}, Row group: {row_group_index})"
)
-raise RuntimeError(msg) from ex
+raise MultiprocessingRuntimeError(msg) from ex

if writer:
writer.close()
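
Because `MultiprocessingRuntimeError` subclasses `RuntimeError`, callers that already catch `RuntimeError` should keep working after this change, and `raise ... from ex` preserves the original failure as `__cause__`. The sketch below illustrates both points in isolation; the `run_job` function and its error message are hypothetical stand-ins, not the library's `_job` implementation.

```python
class MultiprocessingRuntimeError(RuntimeError): ...


def run_job(row_group_index: int) -> None:
    """Hypothetical worker that fails and wraps the failure."""
    try:
        raise ValueError("corrupted row group")
    except Exception as ex:
        msg = f"Error in worker (Row group: {row_group_index})"
        raise MultiprocessingRuntimeError(msg) from ex


caught_as_dedicated_error = False
original_cause = None
try:
    run_job(3)
except RuntimeError as err:  # a handler written before the change still matches
    caught_as_dedicated_error = isinstance(err, MultiprocessingRuntimeError)
    original_cause = err.__cause__  # the wrapped ValueError
```

New code can catch `MultiprocessingRuntimeError` directly to distinguish worker failures from other runtime errors.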