Make columnar-based access possible in addition to record-based model? #469
This is certainly possible from a technical perspective. I'd guess the biggest speed gains would come from copying the data from GDAL directly into a NumPy array, avoiding any intermediate Python objects. Fiona doesn't currently depend on NumPy for anything, so I think this would be best in an optional module (similar to what we've done with Shapely). This module could provide methods that wrap the collection iterator.
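A rough sketch of what such an optional NumPy helper could look like at the Python level (the records mimic Fiona's GeoJSON-like features; the `to_structured` helper and its signature are hypothetical, not part of Fiona's API, and the real gains would come from doing this at the C level):

```python
# Hypothetical optional-module helper: pack the `properties` of
# Fiona-style record dicts into one NumPy structured array,
# giving one named field (column) per property.
import numpy as np

def to_structured(records, dtypes):
    """records: iterable of {"geometry": ..., "properties": {...}} dicts.
    dtypes: list of (field_name, numpy_dtype) pairs."""
    rows = [tuple(r["properties"][name] for name, _ in dtypes) for r in records]
    return np.array(rows, dtype=dtypes)

records = [
    {"geometry": None, "properties": {"name": "a", "area": 1.5}},
    {"geometry": None, "properties": {"name": "b", "area": 2.5}},
]
arr = to_structured(records, [("name", "U10"), ("area", "f8")])
# arr["area"] is now a contiguous float64 column, no per-record dicts needed.
```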
@snorfalorpagus @jorisvandenbossche I think that you may be overlooking how deeply record-based vector formats are. Except for the ones backed by relational databases, there is no efficient alternative: we must iterate over the records of a shapefile.
With these optimizations, GeoPandas could make NumPy structured arrays from the new fiona collection iterator, yes? Or implement something like that itself?
Yes, but I think what @snorfalorpagus does is, while doing this iteration, directly fill in NumPy arrays at the C level (so you don't need to build up a list, in the Python sense, of values).
To be really sure about the need, I think we should try to do some benchmarking of the proof of concept @snorfalorpagus has now made, compared to the options you outline above.
Yes, that would be possible, and it would certainly also be faster than the current implementation, but how much faster is difficult to say. I would somewhat assume that the overhead of creating intermediate Python objects (whether the full feature dict or a flatter tuple) is the main bottleneck, but it might also be that the actual conversion from WKT to shapely/geopandas geometries takes the most time in the current implementation.
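The overhead being discussed can be illustrated with a synthetic micro-benchmark: building one Python dict per feature versus appending field values straight into per-column lists. This is a pure-Python stand-in for what a Cython inner loop would do; all data and function names here are invented for illustration:

```python
# Compare the cost of per-record dict creation vs. direct column filling.
import timeit

RAW = [("a", 1.5), ("b", 2.5)] * 1000  # synthetic stand-in for OGR feature data

def via_record_dicts():
    # Fiona-style: one dict per feature, then reshaped into columns.
    features = [{"name": n, "area": a} for n, a in RAW]
    names = [f["name"] for f in features]
    areas = [f["area"] for f in features]
    return names, areas

def via_columns():
    # Vectorized-style: fill column containers directly, no per-feature dict.
    names, areas = [], []
    for n, a in RAW:
        names.append(n)
        areas.append(a)
    return names, areas

# Both paths must agree on the data; only their overhead differs.
t_dicts = timeit.timeit(via_record_dicts, number=20)
t_cols = timeit.timeit(via_columns, number=20)
```

On typical CPython builds the dict path carries measurably more allocation overhead, though only profiling the real read path would settle the question raised above.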
I'm running an experiment with this idea over in pyogrio. Right now, it borrows heavily from the internals of Fiona.

It's still very early; I haven't even added write capability yet. But I thought it would be good to share some of the early results. The current API is intended as a direct read-to-memory function. I created quick initial benchmarks with Natural Earth data (Admin 0 and 1 at 10m and 110m) and recent versions of Fiona and GeoPandas, comparing against both. [benchmark results omitted]

(Note: this is not an apples-to-apples comparison; many things are conflated here, including the time to create geometry objects.) These results suggest that there are some speedups to be had in Fiona.
I have been reading up on this issue. Columnar access (reading) would be a major performance increase for applications of pygeos, and specifically some applications I will be working on in the near future. Are you still open to including this in Fiona? I will have time to work on this. There seem to be two parts to this issue:
Maybe some of you are willing to briefly discuss ideas in a video session? Or shall I write up an API plan that you can shoot at?
Apologies for not following up on this sooner. I'm definitely very interested in discussions around this in whatever form they take. I also am very interested in finding out if this is of interest to integrate into Fiona or to manage in a stand-alone project. Since my post above, I've added write capability to pyogrio.

If this isn't of interest to integrate into Fiona, there have been some discussions and interest around migrating it to a GeoPandas organization project on GitHub and integrating it as an optional dependency in GeoPandas (i.e., I'm not possessive of it, I just need the performance benefits). It does need a bit of work on the packaging side to make it more broadly accessible.

There are significant performance wins regardless of where or how this lives on. I think the speedups are largely attributable to using a vectorized approach and avoiding unnecessary geometry representation changes (esp. true for Fiona + GeoPandas).

I don't want to fracture the community, but I also see a few tradeoffs to where this might live that I'll try to outline below. In particular, I'm keenly aware that trying to merge a vectorized approach into Fiona could be a substantial undertaking, and I most definitely do not want to impose on the goodwill of the Fiona maintainers.

Overall approach:
Some challenges / tradeoffs to integrating this into Fiona:
Maintenance considerations
I'd really like to know what Fiona maintainers think about how we might approach this going forward: separate project or integrated directly into Fiona?
@brendan-ward I'm inclined to close this issue. Few GIS formats support column-based access well, and it seems like pyogrio has a great start on this and might be able to solve the core problem if it doesn't have to complicate itself with the weird GIS concerns of Fiona.
As @sgillies pointed out in #469 (comment), the biggest bottleneck in the fiona + geopandas use case is probably the conversion of geometries to Python dictionaries and then again to binary geometries. The second biggest bottleneck is probably that more fields are converted to Python datatypes than actually needed. The actual performance impact is hard to estimate without proper benchmarking, though.

I assume that for most data processing use cases the data I/O is a minor part of processing time, thus improvements in speed will in the real world probably only be noticeable for very large datasets. GDAL supports a broad variety of GIS formats, but I suspect only a really small number of them are really suitable for large datasets. As Fiona, as well as GDAL, is designed for row-based access to the data, I was thinking it might be worthwhile to directly use libraries optimized for such formats, e.g. SpatiaLite, to squeeze out the best performance.
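As a toy illustration of the column-at-once access a SQL-backed format offers: SpatiaLite is an extension of SQLite, so the stdlib `sqlite3` module can stand in for it here (the table and columns are invented for the example):

```python
# Fetch a whole attribute column in one query, with no per-record
# Python dicts built along the way.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE countries (name TEXT, area REAL)")
con.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("c", 3.0)],
)

# One round trip returns the entire column.
areas = [row[0] for row in con.execute("SELECT area FROM countries ORDER BY rowid")]
```

A real SpatiaLite-backed reader would additionally pull geometries as binary blobs in the same fashion, but the column-at-once shape of the query is the point here.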
I've found file I/O to be a major bottleneck for some of my projects, compared to other data processing operations.

I think it is important to note that in pyogrio, we are using the same common formats as here (shapefile, geopackage, etc.), and at the OGR level, we are using the same OGR operations. There is nothing there using optimized access for columnar formats; it uses a row-based inner loop. What is different is that each column is stored into its own array while reading, so that the return value from a read operation is a set of arrays. I'm realizing now that by stating "vectorized all the way down" I may have contributed to some confusion about this. Maybe a better way of saying it is that from a Python perspective it is vectorized; the loops in Cython account for the row-oriented structure of GIS data.

Here are some benchmarks that may be helpful (Natural Earth Countries, Admin 0, at 1:100M and 1:10M; read, and write to shapefile): [benchmark numbers omitted]
I think the differences are noticeable even for small datasets. Among other things, I think this hints at some possible optimizations in Fiona even without adopting a vectorized approach, especially for writing data. We'd need to do a bit more profiling to see where the hotspots are, but one difference may be the level of data validation that Fiona performs before writing, whereas pyogrio does none.
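The profiling mentioned above could start with the stdlib `cProfile`; here is a hedged sketch against a stand-in write loop (the `validate` and `write_all` functions are invented placeholders for Fiona's actual write path, not its real code):

```python
# Profile a synthetic write loop to see how much time per-record
# validation contributes relative to the loop itself.
import cProfile
import io
import pstats

def validate(record):
    # Stand-in for the per-record schema validation a writer might do.
    return isinstance(record.get("area"), float)

def write_all(records):
    # Stand-in for a write loop: count records that pass validation.
    return sum(1 for r in records if validate(r))

records = [{"name": "x", "area": 1.0}] * 10_000

profiler = cProfile.Profile()
profiler.enable()
written = write_all(records)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()  # human-readable table of hotspots
```

In a real investigation one would profile `Collection.write` on a large file and compare the cumulative time spent in validation against the OGR calls themselves.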
xref OSGeo/gdal#5830, adopted as RFC 86: Column-oriented read API for vector layers, targeting GDAL 3.6
Anyone have an idea what a nice Python API for this would look like? Would the usage be something like this?

```python
with fiona.open("example.shp") as collection:
    df = pandas.DataFrame.from_records(collection.column(fields=["name", "area"]))
```
Judging from the RFC, I'd expect a batch-oriented API. Then, to get an iterator of pandas DataFrames:

```python
import pyarrow as pa

with fiona.open("example.shp") as collection:
    for record_batch in collection.iter_batches(fields=["name", "area"]):
        df = pa.Table.from_batches([record_batch]).to_pandas()
```
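The batch-iteration pattern in a snippet like this can be mimicked without GDAL or Arrow; a plain generator yielding fixed-size batches of rows shows the shape of the proposed API (all names here are invented for illustration):

```python
# Minimal batch iterator: yield successive slices of at most
# `batch_size` rows, the way a RecordBatch reader hands out chunks.
def iter_batches(rows, batch_size):
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [("a", 1.5), ("b", 2.5), ("c", 3.0)]

batch_sizes = [len(batch) for batch in iter_batches(rows, 2)]
# Columns are assembled per batch by the caller, never all at once.
names = [name for batch in iter_batches(rows, 2) for name, _ in batch]
```

The appeal of this shape is bounded memory: only one batch of rows needs to exist in Python at a time, which matters for the very large datasets discussed above.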
@kylebarron thanks for the suggestion! That makes a lot of sense and is consistent with the RFC.
In #1121 I'm able to use OGR's new API with Cython, but I haven't made much progress towards a nice Python API yet. |
I've removed this from the 1.9 milestone. I'm thinking that Fiona 1.9 and 2.0 should stick to rows and let some other package take care of column-based vector data. |
And for future readers (that haven't read all of the above), one "other package" that we are developing for geopandas that focuses on columnar IO is pyogrio: https://github.com/geopandas/pyogrio/ (pyogrio also exposes GDAL's new RFC 86 column-oriented read API).
Related to geopandas/geopandas#491 (exploration of ways to make data ingestion faster in geopandas)
Currently `fiona` exposes a model where you access the data by records (e.g. by iterating over a collection, accessing one record at a time). When the goal is to load the full dataset (e.g. to put all records in a geopandas GeoDataFrame), this record-based access can add some performance overhead.

Therefore, I am wondering to what extent `fiona` would welcome additions to also make columnar-based access possible. With columnar-based access I mean that you could get the values of all records (so of the full collection) at once, in an array per property and geometry.
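To make the record-based versus columnar distinction concrete, here is a minimal pure-Python sketch (the records are shaped like Fiona's GeoJSON-like features; the converter function is hypothetical, not an existing API):

```python
# Transpose Fiona-style records into one list per property:
# record-based input, columnar-based output.
def columns_from_records(records):
    columns = {}
    for record in records:
        for key, value in record["properties"].items():
            columns.setdefault(key, []).append(value)
    return columns

records = [
    {"geometry": None, "properties": {"name": "a", "area": 1.5}},
    {"geometry": None, "properties": {"name": "b", "area": 2.5}},
]
cols = columns_from_records(records)
# cols now holds the full collection's values, one array-like per property.
```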