-
I think identifying a separate column with simplified geometry information amenable to column statistics will probably be useful for quite some time, and it is independent of the geoarrow encoding (although I will try to get the geoarrow encoding PR up before our next sync call anyway). Maybe as an example:

    {
      "columns": {
        "geometry": {
          "encoding": "WKB",
          "covering": {"geometry_bbox": {"encoding": "box"}}
        }
      }
    }

...where the `geometry_bbox` column holds the per-row bounding box information.
I like this ordering (because the indices of all of those are compile-time constants even if you add in Z or M), but elsewhere in the spec we use …
-
Having a spatially indexed Parquet file will be far more useful for the GIS crowd, especially in the context of cloud-native data. Using the MBR as an index, will the data be stored in a manner that makes HTTP range requests usable?

Parquet's real power is its columnar layout, which works well with massive datasets. Moving huge amounts of data for one-time use/storage is fine, but for Parquet as a cloud-native format, moving huge amounts of data across a network for immediate use by a client negates any performance advantage of the layout, even with a spatial index. Even an 8-byte quadkey can have performance implications on a massive cloud-native .parquet file, if it is usable at all. Have you considered an R-Tree? Perhaps not in a cloud context where spatial subsets of data are needed. What is the advantage of using a spatially indexed Parquet file over the GeoPackage format?

Edit: As you develop this, it would be cool to see a performance comparison against GeoPackage.
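A rough sketch of the access pattern in question, assuming an fsspec-backed HTTP reader and a hypothetical file URL: the Parquet footer is fetched with a small range request, and only the row groups of interest are downloaded afterwards.

```python
import fsspec
import pyarrow.parquet as pq

# Hypothetical remote GeoParquet file; fsspec's HTTP filesystem does random
# access via HTTP range requests (provided the server supports them).
URL = "https://example.com/buildings.geoparquet"

with fsspec.open(URL, "rb") as f:
    pf = pq.ParquetFile(f)  # only the footer metadata is read here
    # In practice you would pick the row groups whose bbox min/max statistics
    # overlap the query region; index 0 is just a placeholder.
    table = pf.read_row_group(0, columns=["geometry", "bbox"])
```

So with the MBR stored as ordinary columns, readers can serve spatial subsets with range requests rather than downloading whole files.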
-
@jwass @paleolimbot Thanks for the effort. I like the idea of having an explicit indexing column in addition to the current bbox info in the column metadata. In my opinion, this indexing column should be optional and recommended as a best practice. On a side note, the Apache Sedona GeoParquet reader is already leveraging the bbox in the column metadata to perform implicit geospatial filter pushdown.

Additional resolution field?
For the type of the spatial indexing system, I believe the …

Generic indexing system
We should keep the indexing system generic. In other words, we should allow users to choose any indexing system from S2/H3/GeoHash, because the libraries that can generate these indexes might only be available in certain languages. E.g., the core of H3 is in C with bindings to Java. We ran into a problem before: Snowflake does not allow arbitrary C code execution via Java JNI.

Consider geometries other than points
We should also consider geometries like polygons and linestrings. Unlike points, these geometries might yield more than one H3/S2/GeoHash id. If we just choose one of the produced ids as the index value used in Parquet files, filtering based on the index will give inaccurate filtering results.
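As a minimal sketch of the multi-cell point, assuming the h3-py v3 API (`h3.polyfill`; renamed in v4) and an illustrative polygon: a single polygon generally maps to many H3 cells at a given resolution, so a single cell id cannot represent it faithfully.

```python
import h3  # assumes h3-py 3.x

# Illustrative polygon roughly the size of a few city blocks (GeoJSON lon/lat).
polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-71.06, 42.35], [-71.05, 42.35], [-71.05, 42.36],
        [-71.06, 42.36], [-71.06, 42.35],
    ]],
}

cells = h3.polyfill(polygon, 11, geo_json_conformant=True)
# Many cells for one geometry: keeping only one of them as the index value
# would make index-based filtering miss parts of the polygon.
print(len(cells))
```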
-
@jwass thanks for opening the discussion! I am very much +1 on your proposal, standardizing a way to include such basic (but useful in many cases) bbox information at the geometry level, with @paleolimbot's suggestions on how to encode that information in the metadata such that we can easily expand it to other types of information in the future. I think this would be a very useful start.
-
Remark: for 2D point datasets, it is a bit of a pity to have the same coordinate encoded 3 times: in minx,miny, in maxx,maxy, and in the WKB blob. This is where a geoarrow encoding would make sense.

Regarding the bbox columns, if we want to optimize space a bit, we could suggest (require?) the data type to be FLOAT32 instead of FLOAT64. For example, the SQLite R-Tree uses Float32 to store bboxes, rounding down for minx,miny and rounding up for maxx,maxy.
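A minimal sketch of that directed rounding, assuming NumPy and float64 inputs (the helper name is illustrative): cast to float32, then nudge each value outward so the float32 box always contains the original one.

```python
import numpy as np

def bbox_to_float32(minx, miny, maxx, maxy):
    """Conservatively cast a float64 bbox to float32: round mins down and
    maxs up so the smaller type never shrinks the box."""
    mins = np.float32([minx, miny])
    maxs = np.float32([maxx, maxy])
    mins = np.where(mins > np.array([minx, miny]),
                    np.nextafter(mins, np.float32(-np.inf)), mins)
    maxs = np.where(maxs < np.array([maxx, maxy]),
                    np.nextafter(maxs, np.float32(np.inf)), maxs)
    return (*mins, *maxs)

print(bbox_to_float32(-71.057083, 42.361145, -71.056210, 42.361852))
```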
-
With a large dataset, say 150M features like the CONUS building footprints, what would the size of this metadata be? What would be the minimum bytes of metadata per group/row/block? Is it desirable or realistic to move MBs of metadata over a network?

This requires additional processing that cloud-native formats do not require. The benefits are not pronounced: downloading potentially huge amounts of metadata, processing it, and then, finally, moving the results. That is two extra steps cloud-native formats do not require.
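For a rough sense of scale (a back-of-envelope under assumed numbers, not a measurement): the bbox min/max statistics live in the Parquet footer and scale with the number of row groups, not the number of rows.

```python
# Assumed layout: ~100k rows per row group, four float64 bbox fields
# (minx, miny, maxx, maxy), each with one min and one max per row group.
# Real footers add Thrift framing overhead on top of this.
rows = 150_000_000
rows_per_group = 100_000
row_groups = rows // rows_per_group              # 1,500 row groups

stats_bytes = row_groups * 4 * 2 * 8             # fields * (min, max) * 8 bytes
print(f"{row_groups} row groups, ~{stats_bytes / 1024:.0f} KiB of bbox stats")
# -> on the order of ~100 KiB of footer statistics, i.e. kilobytes, not megabytes
```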
-
Out of curiosity: I think this …
-
It seems like there's enough momentum to push this forward. At the GeoParquet meetup today, @jatorre expressed some caution and asked that we first do our homework to ensure most systems will be able to properly take advantage of the bbox proposal. Our plan is that I will submit the PR to flesh this out. Simultaneously, we'll do a survey of BigQuery, Athena, Snowflake, duckdb, etc. to ensure they can take full advantage of the specific proposal prior to merging it in.
-
thanks jacob,
--
*Javier de la Torre*
CSO | CARTO <https://www.carto.com/>
-
From my earlier comment I went to put together the pull request and realized there's no consensus on how to define the per-row box column in the geoparquet column metadata. I figured it makes more sense to iterate here rather than write up the specifics in the documentation/schema only to overhaul it later. There were a few constraints we came up with:
I think there are two directions we landed on:

Coverings key
Create a …

Top-level
Add a fixed key to the top-level definition like this: …

I'm partial to just adding a new key in the top level. Since the keys will be fixed/enumerations (e.g. …).
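For illustration only, a hypothetical sketch of what the two directions might look like, loosely modeled on the covering example earlier in this thread; the key and field names here are placeholders, not the final spec.

```python
# Option 1 - "covering" key nested under the geometry column's metadata
# (modeled on the earlier example in this thread; names are illustrative).
coverings_key = {
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "covering": {"bbox": {"column": "bbox", "encoding": "box"}},
        }
    }
}

# Option 2 - a fixed key at the top level of the "geo" metadata pointing at
# the per-row bbox column (field name purely hypothetical).
top_level_key = {
    "version": "1.1.0",
    "primary_column": "geometry",
    "covering_columns": {"geometry": {"bbox": "bbox"}},
    "columns": {"geometry": {"encoding": "WKB"}},
}
```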
-
Associated PR for this discussion: #191
-
Looked through the PR and it looks great in general. The area I find missing is page-level indexing. The discussion so far only mentions row-group-level min/max stats, while page-level stats are also available in Parquet and supported by all the libraries I know (parquet-mr, parquet-cpp, and Apache Impala's own C++ scanner/writer, the project I work on).

This doesn't really affect how bounding boxes should be defined, as the current proposal with nested minx, maxx, miny, maxy would work perfectly well with page-level indexes, but it would be good to mention it, as "smart enough" Parquet readers and writers can benefit from it. In my understanding, the Parquet design provides the page index as its main solution for fine-grained min/max stat filtering. Meanwhile, creating smaller row groups is a valid workaround in many use cases, especially if the reader doesn't use page indexes efficiently. I saw some mentions of very small row group sizes, e.g. row_group_size=2000 in #183.
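As a sketch of the writer side (assuming a reasonably recent pyarrow; availability of the `write_page_index` option may vary by version), a file can keep large row groups while still exposing fine-grained page-level min/max stats:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny stand-in table; a real file would carry geometry plus the bbox struct.
table = pa.table({
    "id": pa.array(range(1_000)),
    "minx": pa.array([float(i) for i in range(1_000)]),
})

pq.write_table(
    table,
    "indexed.parquet",
    row_group_size=100_000,    # keep row groups large...
    data_page_size=64 * 1024,  # ...but pages small, so page stats stay selective
    write_page_index=True,     # emit the column/offset index (page-level min/max)
)
```

Readers that understand the page index can then skip individual pages within a row group instead of relying on tiny row groups.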
-
Now that #191 got merged and 1.1 is on the way, here's a quick look at how performance improved for spatial lookups on remote geoparquet files using duckdb. There are two main improvements: (1) readers gained row group filtering on nested/struct fields, and (2) the Overture data itself is now spatially partitioned.
We can look at the performance before/after each of these steps. (1) above isn't really a GeoParquet win other than our nudging some projects to implement row group filtering on struct fields. We tracked these in #191. The projects that now implement this (duckdb, gdal, arrow-cpp, sedona) should likely see similar performance improvements to what's documented here.

This will query the same small bounding box region in the Boston area of the Overture Maps buildings dataset. The entire dataset is 2.3 billion rows, but only 891 are in the region of interest.

Step 1 - Duckdb v0.9 on Overture Feb release (baseline)
Duckdb v0.9 did not have row group filtering of nested fields.
Data transferred: 63.4 GB

Step 2 - Duckdb v0.10 on Overture Feb release (row group filtering)
Data transferred: 1.1 GB (57x improvement)

Step 3 - Duckdb v0.10 on Overture March release (spatial partitioning)
Data transferred: 88.9 MiB (11.8x improvement over Step 2, 681x improvement over baseline)

Next steps
Future work:
Query profiles
Step 1 - Duckdb v0.9 on Overture February release
Step 2 - Duckdb v0.10.0 on Overture February release
Step 3 - Duckdb v0.10 on Overture March release
@paleolimbot You had asked for a copy of these results I mentioned today. |
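For reference, a sketch of the kind of bbox lookup measured above, via DuckDB's Python API; the dataset path is a placeholder and the Boston-area coordinates are only illustrative.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Placeholder path; the real Overture releases live under versioned S3 prefixes.
path = "s3://<overture-release>/theme=buildings/type=building/*"

# Standard bbox-intersection predicate on the struct column; with row group
# filtering, only row groups whose bbox stats overlap the box get fetched.
xmin, ymin, xmax, ymax = -71.10, 42.30, -71.00, 42.40
count = con.execute(f"""
    SELECT count(*)
    FROM read_parquet('{path}', hive_partitioning=1)
    WHERE bbox.minx <= {xmax} AND bbox.maxx >= {xmin}
      AND bbox.miny <= {ymax} AND bbox.maxy >= {ymin}
""").fetchone()
print(count)
```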
-
After the GeoParquet meetup last week, I said I'd kick off the discussion around spatial indexing and partitioning as part of the drive to 1.1.0
Background & Motivation
GeoParquet files can dramatically improve the performance of spatial queries by including every row's minimum bounding rectangle (MBR) coordinates. With the MBR coordinates stored as ordinary Parquet columns, the min/max summary statistics in the Parquet footer metadata will automatically store the MBR coordinates for each row group. Readers can execute spatial queries by first using the Parquet metadata to quickly determine which row groups have an MBR that intersects a region of interest, needing to process only those row groups any further and ignoring the rest of the dataset. This allows us to use standard Parquet capabilities to make Parquet files behave like a spatial index. The benefit is even more pronounced in remote/cloud-native environments where only subsets of row groups need to be transferred over the network to clients rather than entire files.
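As a sketch of that reader-side pruning, assuming pyarrow and an Overture-style `bbox` struct column (described below); the function name is illustrative. The footer statistics alone are enough to decide which row groups can possibly intersect the query box.

```python
import pyarrow.parquet as pq

def intersecting_row_groups(path, xmin, ymin, xmax, ymax):
    """Return indices of row groups whose bbox statistics overlap the query box.

    Assumes a struct column ``bbox`` with fields minx, miny, maxx, maxy,
    each of which Parquet stores as a leaf column with min/max statistics.
    """
    meta = pq.ParquetFile(path).metadata
    # Map flattened leaf-column paths (e.g. "bbox.minx") to column indices.
    col_idx = {meta.schema.column(i).path: i for i in range(meta.num_columns)}

    keep = []
    for rg in range(meta.num_row_groups):
        group = meta.row_group(rg)
        stat = lambda name: group.column(col_idx[name]).statistics
        if (stat("bbox.minx").min <= xmax and stat("bbox.maxx").max >= xmin and
                stat("bbox.miny").min <= ymax and stat("bbox.maxy").max >= ymin):
            keep.append(rg)
    return keep

# Only the returned row groups need to be read (or fetched over the network).
```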
Proposed approach and next steps
Define a standard representation for a row-level MBR
Overture Maps distributes large Parquet (soon to be GeoParquet) datasets. The Overture schema includes a bounding box column called `bbox`, defined as a Parquet struct with fields `minx`, `maxx`, `miny`, `maxy`. This struct column allows the Parquet schema to present the MBR as a single column, but underneath are 4 separate arrays that take advantage of the summary statistics as explained above. Should we start with this definition as a straw man to understand where it helps or is lacking?

Note: when the GeoArrow format is adopted and used, a separate MBR column may be entirely unnecessary.
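A minimal sketch of producing that struct column, assuming shapely 2.x and pyarrow (the helper and column names are illustrative, not part of the proposal):

```python
import pyarrow as pa
import shapely  # shapely >= 2.0

def add_bbox_column(table: pa.Table, geometry_col: str = "geometry") -> pa.Table:
    """Compute each row's MBR from WKB geometries and append it as a
    struct<minx, maxx, miny, maxy> column, mirroring the Overture layout."""
    geoms = shapely.from_wkb(table[geometry_col].to_pylist())
    minx, miny, maxx, maxy = shapely.bounds(geoms).T
    bbox = pa.StructArray.from_arrays(
        [pa.array(minx), pa.array(maxx), pa.array(miny), pa.array(maxy)],
        names=["minx", "maxx", "miny", "maxy"],
    )
    return table.append_column("bbox", bbox)
```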
Specify the MBR column in the GeoParquet Column Metadata
Add a new optional field for identifying the bounding box column of the geometry (`bbox_column_name`)?

Not In Scope
This definition for the MBR will not impose any requirement on how to store or sort spatial data within a GeoParquet file. Instead, we should explore different strategies and techniques and measure their performance. We can then recommend best practices and develop tooling to help data producers and consumers.
Prior discussions and exploration: