Tile stats #656
I prefer the output format with the least dependencies possible. I assume that is tsv.gz or NDJSON.
Sounds good to me. Any thoughts on the formats? Most recently I've been generating a single tsv.gz with columns:
You can either split them out into 2 tables (z,x,y->id, and id,layer->stats) or duplicate all of the z,x,y tile IDs into one denormalized table. The split version makes it easy to query tile-level stats, but duckdb struggles with the join needed to attach tile coordinates to layer-level stats (the table is too big to index). The denormalized version makes it easier to work with layer-level stats, but duckdb struggles to group by tile ID to get tile-level stats, and it also gets a little big (4GB for the planet vs. <2GB for split).
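For concreteness, a minimal sketch of the split layout and the join it requires (table and column names are my assumption, not from this PR):

```sql
-- Split layout (hypothetical names): one tile-level table plus one layer-level table
CREATE TABLE tiles (id BIGINT, z TINYINT, x INTEGER, y INTEGER);
CREATE TABLE tile_layers (tile_id BIGINT, layer VARCHAR, layer_bytes BIGINT);

-- Attaching tile coordinates to layer-level stats needs the join that
-- duckdb struggles with at planet scale:
SELECT t.z, t.x, t.y, l.layer, l.layer_bytes
FROM tile_layers l
JOIN tiles t ON l.tile_id = t.id
ORDER BY l.layer_bytes DESC
LIMIT 10;
```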
And in terms of outputting a high-level summary, the most useful ones have been:
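For example, a hypothetical per-zoom summary over the flat format (table and column names assumed from the examples later in this thread):

```sql
-- Hypothetical high-level summary: size distribution per zoom and layer
SELECT z, layer,
       count(*) AS num_rows,
       max(layer_bytes) AS max_layer_bytes,
       round(avg(layer_bytes)) AS avg_layer_bytes
FROM layerstats
GROUP BY z, layer
ORDER BY z, layer;
```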
Can we solve this via a DuckDB function definition instead of repeating information in the stats?
Yeah, it looks like that could work - would just need to put it in documentation somewhere for people to copy/paste:

```sql
create or replace macro lat(z, y) as round((180/pi())*atan(0.5*(exp(pi()-2*pi()*(y/(2**z)))-exp(-(pi()-2*pi()*(y/(2**z)))))), 5);
-- tile x converts to longitude with a linear mapping:
create or replace macro lon(z, x) as round((x/(2**z))*360 - 180, 5);
create or replace macro url(z, x, y) as concat('https://onthegomap.github.io/planetiler-demo/#', z+0.5, '/', lat(z, y+0.5), '/', lon(z, x+0.5));
```

Then:

```sql
select z, x, y, layer, round(layer_bytes/1000) as kbs, url(z, x, y)
from stats
order by layer_bytes desc
limit 10;
```

returns:
There's a benefit to having it persisted in a parquet file to allow efficient bbox filtering, but if you're going to go through duckdb to create a parquet file you could just add it there.
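For example, a hypothetical bbox filter built on the macros above (file name and bounds are made up); with lat/lon persisted in parquet the reader could skip row groups instead of computing this per row:

```sql
-- Compute lat/lon per row from tile coordinates, then filter to a bbox
SELECT z, x, y, layer, round(layer_bytes/1000) AS kbs
FROM 'stats.parquet'
WHERE lat(z, y + 0.5) BETWEEN 47.0 AND 48.0
  AND lon(z, x + 0.5) BETWEEN -123.0 AND -122.0
ORDER BY layer_bytes DESC
LIMIT 10;
```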
@bdon what do you think about the different format options?
so #4 would be great (though unclear to me what the TSV looks like), then #3, #1, #2
OK that makes sense. Would you omit the "deduped_tile_id" column as well? Or "hilbert"? I was thinking 4 would be something like:

Unfortunately it's hard to get tile-level stats from 3, since duckdb runs out of memory on the huge "group by" unless you limit to a small set of tiles first. One other route to go might be to have a row per tile and include nested data in it... for example, if you import a file:

```json
{"id":1,"layers":[{"id":"layer1","size":100}, {"id":"layer2","size":200}]}
{"id":2,"layers":[{"id":"layer2","size":300}, {"id":"layer3","size":400}]}
```

Then you can do:
I'm not sure how performant querying that would be though, might be worth a test.
I tried outputting newline-delimited json that looks like:

```json
{"z":0,"x":0,"y":0,"hilbert":0,"total_bytes":142374,"gzipped_bytes":82957,"layers":[{"name":"water","features":2,"total_bytes":8435,"attr_bytes":18,"attr_values":2},{"name":"landcover","features":11,"total_bytes":1578,"attr_bytes":27,"attr_values":2},{"name":"place","features":42,"total_bytes":115614,"attr_bytes":82249,"attr_values":4363},{"name":"water_name","features":6,"total_bytes":7042,"attr_bytes":5541,"attr_values":238},{"name":"boundary","features":32,"total_bytes":9689,"attr_bytes":807,"attr_values":37}]}
{"z":1,"x":0,"y":0,"hilbert":1,"total_bytes":103276,"gzipped_bytes":64171,"layers":[{"name":"water","features":2,"total_bytes":4139,"attr_bytes":18,"attr_values":2},{"name":"landcover","features":3,"total_bytes":468,"attr_bytes":27,"attr_values":2},{"name":"place","features":29,"total_bytes":68351,"attr_bytes":47302,"attr_values":2493},{"name":"water_name","features":7,"total_bytes":7136,"attr_bytes":5562,"attr_values":245},{"name":"boundary","features":13,"total_bytes":23165,"attr_bytes":277,"attr_values":14}]}
{"z":1,"x":0,"y":1,"hilbert":2,"total_bytes":38635,"gzipped_bytes":24721,"layers":[{"name":"water","features":2,"total_bytes":1340,"attr_bytes":18,"attr_values":2},{"name":"landcover","features":8,"total_bytes":982,"attr_bytes":27,"attr_values":2},{"name":"place","features":10,"total_bytes":25423,"attr_bytes":17164,"attr_values":915},{"name":"water_name","features":3,"total_bytes":3102,"attr_bytes":2431,"attr_values":97},{"name":"boundary","features":5,"total_bytes":7772,"attr_bytes":37,"attr_values":5}]}
{"z":1,"x":1,"y":1,"hilbert":3,"total_bytes":91365,"gzipped_bytes":52859,"layers":[{"name":"water","features":2,"total_bytes":2206,"attr_bytes":18,"attr_values":2},{"name":"landcover","features":1,"total_bytes":521,"attr_bytes":27,"attr_values":2},{"name":"place","features":23,"total_bytes":66179,"attr_bytes":48121,"attr_values":2174},{"name":"water_name","features":4,"total_bytes":6010,"attr_bytes":4716,"attr_values":196},{"name":"boundary","features":6,"total_bytes":16432,"attr_bytes":66,"attr_values":6}]}
{"z":1,"x":1,"y":0,"hilbert":4,"total_bytes":275506,"gzipped_bytes":166904,"layers":[{"name":"water","features":2,"total_bytes":3647,"attr_bytes":18,"attr_values":2},{"name":"place","features":76,"total_bytes":191026,"attr_bytes":129687,"attr_values":7421},{"name":"water_name","features":5,"total_bytes":5869,"attr_bytes":4640,"attr_values":193},{"name":"boundary","features":55,"total_bytes":74950,"attr_bytes":1212,"attr_values":55}]}
```

The raw json.gz file was 3.3GB, the duckdb file was 1.8GB, and the exported parquet file was 1.4GB. You can query tile-level stats instantly:

```sql
select z,x,y,gzipped_bytes from 'nested_stats.parquet' order by gzipped_bytes desc limit 10;
```

But a layer-level query takes 40 seconds:

```sql
with unnested as (select z,x,y,unnest(layers) as layer from stats)
select z,x,y,layer.name, layer.total_bytes from unnested order by layer.total_bytes desc limit 10;
```

vs. this query, which is instant with option 3:

```sql
select z,x,y,layer,layer_bytes from flat order by layer_bytes desc limit 10;
```

But if you know you are going to be doing layer-level analysis, you could create an unnested table once (~60 seconds), then use it for instant layer-level queries afterwards:

```sql
create table unnested as (select z,x,y,unnest(layers) layer from stats); -- ~60 seconds
select z,x,y,layer.name,layer.total_bytes from unnested order by layer.total_bytes desc limit 10; -- instant
```

This approach seems better if we're targeting duckdb, because both tile- and layer-level queries are possible from one output file (duckdb doesn't blow up), but nested json is a little less standard/easy to work with in other tools, and the structured json output is ~1.5x bigger than tsv.gz.
@msbarry I would omit both of those, and simply not include any tile that occurs more than once in the statistics (anything I'm missing here?)

My PoC is here: protomaps/go-pmtiles#75 - for now it just outputs a

next steps:
That sounds good to me - I could foresee wanting different flavors of stats, so maybe
Yeah, you need to parse raw protobuf. I'm pretty sure the tile is just a series of concatenated layers, and the java library at least lets you get the serialized byte length of a parsed proto struct. You could probably also decode it with a very simple schema that only separates the layers (protobuf just passes through all the unrecognized fields):

```proto
message Tile {
  message Layer {
    required string name = 1;
  }
  repeated Layer layers = 3;
}
```
Yeah, anywhere common would work, but it would definitely be useful to have some common queries listed out (especially the z/x/y -> lat/lon macro)
👍
Sounds good - after more testing I think layer-to-tile grouping should be fast enough. Just thinking: if we went down to attribute-level stats, that would be a lot bigger?
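For reference, the layer-to-tile grouping in question, assuming the flat one-row-per-layer-per-tile format from earlier in the thread:

```sql
-- Roll layer-level rows up to tile-level totals
SELECT z, x, y, sum(layer_bytes) AS tile_bytes
FROM layerstats
GROUP BY z, x, y
ORDER BY tile_bytes DESC
LIMIT 10;
```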
The benefit of naming
Gzipping takes ~3m for the planet, maybe less with a smaller format, so probably not a big enough deal to worry about...
For the actual stats format I think we definitely need:

then maybe:

then I think we can skip the

Are there any other of those "maybes" you think would be useful to include @bdon ?
more notes while implementing:
NOTE: this group by uses a lot of memory so you need to be running in file-backed mode `duckdb analysis.duckdb` (not in-memory mode)
is there a one-liner to change the `.tsv.gz` into a `.duckdb`?
I think it would be

```
duckdb analysis.duckdb -cmd "CREATE TABLE layerstats AS SELECT * FROM 'output.pmtiles.layerstats.tsv.gz';"
```

to drop you into a shell after importing the file, or `-c "create..."` to just create the file - given the shortcut's not much shorter than the individual steps, I'm inclined to just leave them as separate steps for clarity.
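For reference, the two separate steps inside a file-backed session (`duckdb analysis.duckdb`), with the table name taken from the command above:

```sql
-- step 1: import the stats file into a persistent table
CREATE TABLE layerstats AS SELECT * FROM 'output.pmtiles.layerstats.tsv.gz';
-- step 2: query it
SELECT count(*) FROM layerstats;
```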
Ran on the planet with these settings used by openstreetmap-americana. The output tsv.gz file was 2.8GB.
Expose more detailed tile size statistics from planetiler:

- `--output-layerstats` to output an extra `<output file>.layerstats.tsv.gz` file (~5% of the original archive size) with a row per layer per tile that can be analyzed using duckdb (see `layerstats/README.md`)
- `java -jar planetiler.jar stats --input=<pmtiles or mbtiles file> --output=layerstats.tsv.gz` on an existing archive to compute stats for it.
- `--tile-weights=weights.tsv.gz` to point planetiler to a file with z, x, y, loads columns to customize the weights used for weighted-average tile sizes (see the sketch after this list), or use `--download-osm-tile-weights` to download pre-computed top-1m tiles from openstreetmap.org traffic.
- `java -jar planetiler.jar top-osm-tiles --days=<# days to fetch> --top=<# tiles to include> --output=weights.tsv.gz`
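As a hypothetical sketch of how the weights file could be used (table and column names here are assumptions, not planetiler's actual implementation):

```sql
-- Weighted-average gzipped tile size, weighting each tile by its load count
SELECT sum(w.loads * s.gzipped_bytes) / sum(w.loads) AS weighted_avg_gzipped_bytes
FROM tilestats s
JOIN 'weights.tsv.gz' w USING (z, x, y);
```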
Fixes #391