DBR 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
Standard_DS13_v2 (driver and 2 workers)
Photon disabled
databricks-mosaic 0.4.1
GDAL init script installed in cluster
The dataset is a relatively large GeoTIFF raster file with a single layer. Each pixel is a uint16 value representing a land use or land cover category (forest, industrial, agricultural, and so on). Out-of-bounds values are masked with the largest uint16 value, 2^16 - 1 (65535).
Use case: I need to efficiently read and process a raster file stored in a projected coordinate system, then convert it to a grid index so it can be intersected with points stored in a geographic coordinate system (WGS84).
As far as I know the GeoTIFF file is fine; I've tested it with Rasterio and NumPy.
I'm trying to automate the process with Mosaic.
This example is very slow and seems to have a lot of data skew (it gets stuck on the last task):
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)
df = mos.read().format("raster_to_grid") \
    .option("resolution", "2") \
    .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()
When I use the "retile" option, it throws an error:
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)
df = mos.read().format("raster_to_grid") \
    .option("resolution", "2") \
    .option("retile", "true") \
    .option("tileSize", "1000") \
    .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.
Don't take this the wrong way: it is a pleasure to work with Shapely/Pygeos/GeoPandas and even Rasterio together with Spark and Pandas UDFs. However, navigating Databricks-Mosaic has been an absolute pain (the same happened with Sedona and GeoSpark).
carlosg-m changed the title from "raster_to_grid not working when using retile" to "raster_to_grid not working when using retile with GeoTIFF file" on May 3, 2024
Hi @carlosg-m, we are tracking the NetCDF issue with "raster_to_grid" (which also gets into separate bands for any multi-band file). A fix is coming with [WIP] PR #556, hopefully in a week or so.
Thank you for the response, @mjohns-databricks.
Are there any best practices to make the operation I described efficient, using the current version of Mosaic?
The workaround is to generate a table of "reading windows or bounding boxes" and with a UDF go through each one in parallel, loading each window with Rasterio and converting it to a "grid index".
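To make the workaround above concrete, here is a minimal sketch of the "reading windows" part. The names `ReadWindow`, `iter_windows` and `read_window` are hypothetical helpers invented for illustration, not part of Mosaic or Rasterio. The sketch assumes a single-band file (band 1) and a path that each worker can open locally. The conversion of each window to a grid index is left out, since it depends on the grid system you choose.

```python
from typing import Iterator, NamedTuple


class ReadWindow(NamedTuple):
    """Hypothetical record describing one pixel window of the raster."""
    col_off: int
    row_off: int
    width: int
    height: int


def iter_windows(raster_width: int, raster_height: int,
                 tile_size: int) -> Iterator[ReadWindow]:
    """Yield non-overlapping windows that cover the whole raster.

    Windows on the right and bottom edges are clipped so they never
    extend past the raster bounds.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for row_off in range(0, raster_height, tile_size):
        height = min(tile_size, raster_height - row_off)
        for col_off in range(0, raster_width, tile_size):
            width = min(tile_size, raster_width - col_off)
            yield ReadWindow(col_off, row_off, width, height)


def read_window(path: str, win: ReadWindow):
    """Read band 1 of a single window with Rasterio.

    Rasterio is imported lazily so the window generator above can be
    used without it. Assumes `path` is readable from the worker.
    """
    import rasterio
    from rasterio.windows import Window

    with rasterio.open(path) as src:
        return src.read(
            1, window=Window(win.col_off, win.row_off, win.width, win.height)
        )
```

The list of windows is small, so it can be turned into a Spark DataFrame (one row per window) and repartitioned so that each task reads only its own windows inside a UDF. Because the windows are roughly the same size, the work per task should be more even than reading the whole file in one task, though I have not benchmarked this.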