Ray implementation of maple inference pipeline #19

Merged
merged 32 commits into from
Jul 29, 2024
Changes from 26 commits
32 commits
7d38e10
Create ray conda environment
kaylahardie Mar 18, 2024
c89494f
Format file
kaylahardie Mar 18, 2024
156fdfb
Load geotiffs into ray dataset
kaylahardie Mar 18, 2024
de4a019
Calculate Water Mask
kaylahardie Mar 18, 2024
9eb36fd
Apply water mask and tile image
kaylahardie Mar 18, 2024
6793e73
Starting to rayify inference but there's a bug
kaylahardie Mar 19, 2024
d270836
Fix inference ray bug
kaylahardie Mar 19, 2024
fe30d57
Fixed creating image tiles -> needed to use deepcopy, also made chang…
kaylahardie Apr 2, 2024
c86e41d
Fix inference -> Ray throws type errors when we set a column value to…
kaylahardie Apr 2, 2024
f07e7d7
Remove old inference code
kaylahardie Apr 2, 2024
81c2b23
Stitching image tiles back together using Ray
kaylahardie Apr 2, 2024
a71cce7
Close gdal files after opening them fix and also added code to write …
kaylahardie Apr 5, 2024
0e3f5fe
Remove unnecessary code from stitch shapefile
kaylahardie Apr 5, 2024
833fcf2
fix gdal virtual file system issue - the gdal vfs only works in a sin…
kaylahardie Apr 8, 2024
e9ce986
Break logic up into different files and fix imports
kaylahardie Apr 11, 2024
89d5656
Rename files, add compare_unordered_shapefiles for testing purposes
kaylahardie Apr 12, 2024
c03addd
Add back original maple_workflow file
kaylahardie Apr 12, 2024
acfd23b
fix file name indexing for shapefile
kaylahardie Apr 12, 2024
68c1c95
Add documentation for compare_unordered_shapefiles
kaylahardie Apr 12, 2024
bdbe6fb
Adding more documentation
kaylahardie Apr 12, 2024
18dea96
More documentation
kaylahardie Apr 12, 2024
99f7b41
More comments
kaylahardie Apr 12, 2024
f43a4b6
Making it possible to read and write to gcs buckets when running ray_…
kaylahardie May 16, 2024
24523c6
add application default credentials directory as a flag so it isn't h…
kaylahardie May 16, 2024
5cdf5c0
Each run is now outputted under its own datetime directory
kaylahardie May 17, 2024
4fa1a8f
small fixes (trying to limit the changes on the original maple_workfl…
kaylahardie Jun 18, 2024
fe5caf5
addressing
kaylahardie Jul 8, 2024
156d209
addressing more comments, fixing delete ray shapefile dir
kaylahardie Jul 18, 2024
f170b34
Updated readme
kaylahardie Jul 29, 2024
9902b38
fix merge conflict with weight file
kaylahardie Jul 29, 2024
3c6a08a
updating readme again
kaylahardie Jul 29, 2024
785ecad
Merge branch 'main' into try-ray-inference
kaylahardie Jul 29, 2024
60 changes: 60 additions & 0 deletions compare_shapefile_features.py
@@ -0,0 +1,60 @@
"""
Compares the features in two shapefiles. The geometry and properties for each feature are compared.
The features don't have to be in the same order in the two files for them to be equal. This approach
assumes that each feature in a file is unique (i.e. it doesn't guarantee equality if there are
duplicate features in one of the files).
"""
import argparse
import sys

import fiona
from shapely.geometry import shape


def feature_to_hashable(feature):
"""Converts a feature into a hashable representation."""
geometry_repr = shape(feature['geometry']).wkt
attributes = tuple(sorted(feature['properties'].items()))
return (geometry_repr, attributes)


def compare_shapefile_features(file1, file2):
"""Compares shapefiles using sets. This approach doesn't work if there are duplicate features in one of the files."""
with fiona.open(file1) as src1, fiona.open(file2) as src2:
set1 = {feature_to_hashable(feature) for feature in src1}
set2 = {feature_to_hashable(feature) for feature in src2}

if len(set1) != len(set2):
print(
f"{file1} doesn't have the same number of elements: {len(set1)}, as {file2}: {len(set2)}")
return

if set1 == set2:
print("Shapefiles seem to have identical features")
return

diff1 = set1 - set2 # Features only in shapefile1
diff2 = set2 - set1 # Features only in shapefile2

print("Shapefiles have differences:")
print(f"{len(diff1)} features unique to {file1}")
print(f"{len(diff2)} features unique to {file2}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Compare the features in two shapefiles.")
parser.add_argument("file1", help="Path to the first shapefile")
parser.add_argument("file2", help="Path to the second shapefile")

args = parser.parse_args()

# Check if both file paths were provided
if not args.file1 or not args.file2:
print("Error: Please provide both shapefile paths.")
sys.exit(1)

compare_shapefile_features(args.file1, args.file2)

# Usage:
# python compare_shapefile_features.py <path to file1> <path to file2>
10 changes: 8 additions & 2 deletions environment_maple.yml
@@ -1,13 +1,19 @@
name: maple_py310
name: maple_py310_ray
channels:
- defaults
dependencies:
- python=3.10
- gdal
- fiona
- pyshp
- scikit-image
- shapely
- pip=23.3.1
- pip:
- tensorflow[and-cuda]
- gcsfs
- google-auth
- tensorflow[and-cuda]==2.15.1
- opencv-python
- ray
- pandas==2.1.4
- pyarrow
27 changes: 27 additions & 0 deletions gdal_virtual_file_system.py
@@ -0,0 +1,27 @@
"""
Class that manages opening and closing a single GDAL virtual (in-memory) file.
The pattern is to instantiate the class, call create_virtual_file to open the file, and then
call close_virtual_file to close it.
"""
import os
from osgeo import gdal

class GDALVirtualFileSystem:
def __init__(
self,
file_path: str,
file_bytes: bytes
):
vfs_dir_path = '/vsimem/vsidir/'
file_name = os.path.basename(file_path)
self.vfs_filename = os.path.join(vfs_dir_path, file_name)
self.file_bytes = file_bytes

def create_virtual_file(self) -> str:
dst = gdal.VSIFOpenL(self.vfs_filename, 'wb+')
gdal.VSIFWriteL(self.file_bytes, 1, len(self.file_bytes), dst)
gdal.VSIFCloseL(dst)
return self.vfs_filename

def close_virtual_file(self):
gdal.Unlink(self.vfs_filename)
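
For orientation, a minimal usage sketch of the class above (the file name is hypothetical, and reading from disk stands in for however the bytes actually arrive, e.g. from a Ray Dataset row):

# Usage sketch for GDALVirtualFileSystem (hypothetical file name; the bytes could
# come from disk, a GCS bucket, or a Ray Dataset row).
from osgeo import gdal

import gdal_virtual_file_system as gdal_vfs

with open("example_image.tif", "rb") as f:
    file_bytes = f.read()

vfs = gdal_vfs.GDALVirtualFileSystem(file_path="example_image.tif", file_bytes=file_bytes)
virtual_path = vfs.create_virtual_file()  # writes the bytes to /vsimem and returns the path
dataset = gdal.Open(virtual_path)         # GDAL reads the in-memory file like a regular file
print(dataset.RasterXSize, dataset.RasterYSize)
dataset = None                            # release GDAL's handle before unlinking
vfs.close_virtual_file()                  # removes the /vsimem entry
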
2 changes: 1 addition & 1 deletion maple_workflow.py
@@ -337,4 +337,4 @@ def stich_shapefile(config: MPL_Config, input_img_name: str):
)

# Once you are done you can check the output on ArcGIS (win) or else you can check in QGIS (nx) Add the image and the
# shp, shx, dbf as layers.
# shp, shx, dbf as layers.
22 changes: 21 additions & 1 deletion mpl_config.py
@@ -11,6 +11,9 @@

import os

import gcsfs
import google.auth

from config import Config


@@ -29,12 +32,13 @@ class MPL_Config(object):
def __init__(
self,
root_dir="",
adc_dir="",
weight_file="hyp_best_train_weights_final.h5",
logging=True,
crop_size=200,
num_gpus_per_core=1,
):
## Do not change this section
# Do not change this section
# Code depends on the relative locations indicated so should not change
# Code expects some of the locations to be available when executing.
# -----------------------------------------------------------------
@@ -48,6 +52,22 @@ def __init__(
self.TEMP_W_IMG_DIR = self.ROOT_DIR + r"/data/water_mask/temp"
self.OUTPUT_IMAGE_DIR = self.ROOT_DIR + r"/data/output_img"
self.WORKER_ROOT = self.ROOT_DIR + r"/data"
self.MODEL_DIR = self.ROOT_DIR + r"/local_dir/datasets/logs"
self.RAY_OUTPUT_SHAPEFILES_DIR = self.ROOT_DIR + r"/data/ray_output_shapefiles"

self.GCP_FILESYSTEM = None
if (self.ROOT_DIR.startswith(
"gcs://") or self.ROOT_DIR.startswith("gs://")):
if adc_dir:
print("You are using application default credentials to authenticate.")
creds, _ = google.auth.load_credentials_from_file(
adc_dir, scopes=["https://www.googleapis.com/auth/cloud-platform"])
self.GCP_FILESYSTEM = gcsfs.GCSFileSystem(
project="pdg-project-406720", token=creds)
else:
print("Please specify an application default credentials directory if you are running this code on your local computer.")
self.GCP_FILESYSTEM = gcsfs.GCSFileSystem(
project="pdg-project-406720")

# ADDED to include inference cleaning post-processing
self.CLEAN_DATA_DIR = self.ROOT_DIR + r"/data/cln_data"
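
To illustrate how the new adc_dir flag and the gs:// check fit together, a small configuration sketch; the bucket path and credentials file location are hypothetical:

# Sketch: pointing MPL_Config at a GCS bucket (hypothetical bucket and credentials paths).
from mpl_config import MPL_Config

config = MPL_Config(
    root_dir="gs://example-bucket/maple",  # a gs:// or gcs:// prefix triggers the GCS branch
    adc_dir="/home/user/.config/gcloud/application_default_credentials.json",
    logging=True,
)

# config.GCP_FILESYSTEM is now a gcsfs.GCSFileSystem authenticated with the given
# application default credentials, which the Ray workflow can use to read inputs
# from and write shapefiles to the bucket.
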
136 changes: 136 additions & 0 deletions ray_image_preprocessing.py
@@ -0,0 +1,136 @@
"""
Preprocessing helper functions for the MAPLE pipeline.

Project: Permafrost Discovery Gateway: Mapping Application for Arctic Permafrost Land Environment(MAPLE)
PI : Chandi Witharana
"""


import os
from pathlib import Path
import shutil
from typing import Dict, Any

import numpy as np
from osgeo import gdal
from skimage import filters, io
from skimage.morphology import disk

from mpl_config import MPL_Config
import gdal_virtual_file_system as gdal_vfs



def cal_water_mask(row: Dict[str, Any], config: MPL_Config) -> Dict[str, Any]:
"""
Calculates the water mask so that water-covered areas can be skipped during (inference) processing.
Uses gdal to transform the image into the required format.
Adds a "mask" column containing the water mask to each row and also writes the water mask
to config.WATER_MASK_DIR.
Parameters
----------
row : Row in Ray Dataset, there is one row per image.
config : Contains static configuration information regarding the workflow.
"""
worker_water_root = config.WATER_MASK_DIR
temp_water_root = (
config.TEMP_W_IMG_DIR
)

image_name = row["image_name"]
worker_water_subroot = os.path.join(worker_water_root, image_name)
temp_water_subroot = os.path.join(temp_water_root, image_name)
# Prepare to make directories to create the files
try:
shutil.rmtree(worker_water_subroot)
except:
print("directory deletion failed")
pass

try:
shutil.rmtree(temp_water_subroot)
except:
print("directory deletion failed")
pass

# check local storage for temporary storage
Path(worker_water_subroot).mkdir(parents=True, exist_ok=True)
Path(temp_water_subroot).mkdir(parents=True, exist_ok=True)

output_watermask = os.path.join(
worker_water_subroot, r"%s_watermask.tif" % image_name
)
output_tif_8b_file = os.path.join(
temp_water_subroot, r"%s_8bit.tif" % image_name
)
nir_band = 3 # set number of NIR band

# %% Median and Otsu
value = 5

# Create virtual file system file for image to use GDAL's file apis.
vfs = gdal_vfs.GDALVirtualFileSystem(
file_path=row["path"], file_bytes=row["bytes"])
virtual_file_path = vfs.create_virtual_file()

# UPDATED CODE - amal 01/11/2023
# Command line execution threw exceptions that could not be captured,
# so gdal.Translate is called through the Python API instead.
# The output file was checked against the cmd line gdal_translate.
gdal.UseExceptions() # Enable errors
try:
gdal.Translate(
destName=output_tif_8b_file,
srcDS=virtual_file_path,
format="GTiff",
outputType=gdal.GDT_Byte,
)
except RuntimeError:
print("gdal Translate failed with", gdal.GetLastErrorMsg())
pass

image = io.imread(
output_tif_8b_file
) # image[rows, columns, dimensions]-> image[:,:,3] is near Infrared
nir = image[:, :, nir_band]

bilat_img = filters.rank.median(nir, disk(value))

gtif = gdal.Open(virtual_file_path)
geotransform = gtif.GetGeoTransform()
sourceSR = gtif.GetProjection()
# Close the file.
gtif = None
vfs.close_virtual_file()

x = np.shape(image)[1]
y = np.shape(image)[0]

# Normalize and blur before thresholding. Usually instead of normalizing
# a rgb to greyscale transformation is applied. In this case, we are already
# dealing with a single channel so we divide all the pixels in the image by
# 255 to get a value between [0, 1].
normalized_bilat_img = bilat_img / 255
normalized_blurred_bilat_img = filters.gaussian(
normalized_bilat_img, sigma=2.0)

# Find the threshold to filter with. If all pixel values are the same, Otsu cannot
# find a threshold, so t falls back to 0.0.
try:
t = filters.threshold_otsu(normalized_blurred_bilat_img)
except:
t = 0.0
# perform inverse binary thresholding
mask = normalized_blurred_bilat_img > t

# output np array as GeoTiff
dst_ds = gdal.GetDriverByName("GTiff").Create(
output_watermask, x, y, 1, gdal.GDT_Byte, ["NBITS=1"]
)
dst_ds.GetRasterBand(1).WriteArray(mask)
dst_ds.SetGeoTransform(geotransform)
dst_ds.SetProjection(sourceSR)
dst_ds.FlushCache()
dst_ds = None
row["mask"] = mask
return row
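
As a rough sketch of how cal_water_mask slots into a Ray Data pipeline (the input directory and image-name derivation are hypothetical; the actual wiring lives in the Ray workflow code added in this PR):

# Sketch: applying cal_water_mask over a Ray Dataset of GeoTIFFs (hypothetical paths).
import os

import ray

from mpl_config import MPL_Config
from ray_image_preprocessing import cal_water_mask

config = MPL_Config(root_dir="/path/to/maple")

# read_binary_files yields one row per file with "bytes" and (with include_paths) "path".
ds = ray.data.read_binary_files("/path/to/maple/data/input_imgs", include_paths=True)

def add_image_name(row):
    # Derive the "image_name" column that cal_water_mask expects from the file path.
    row["image_name"] = os.path.splitext(os.path.basename(row["path"]))[0]
    return row

ds = ds.map(add_image_name)
ds = ds.map(lambda row: cal_water_mask(row, config))  # adds a "mask" column per row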