
Ray implementation of maple inference pipeline #19

Merged (32 commits into main, Jul 29, 2024)
Conversation

kaylahardie (Contributor)

To run the ray implementation of the maple inference pipeline:

  • Use the `environment_maple.yml` file to create a conda environment with ray: run `conda env create -f environment_maple.yml`, then `conda activate maple_py310_ray`.
  • Create the directory structure by running `mpl_workflow_create_dir_struct.py`.
  • Add a sample input image to the `data/input_img_local` directory. I used the sample image here: https://drive.google.com/file/d/1YwQiPc7Ow-oSyEHuCBD97RxRbnzJ4_dW/view?usp=drive_link
  • Run `python3 maple_workflow.py --gpus_per_core=0` to run the pipeline on a CPU. I haven't tried running it on more than one CPU, and I've only run it on one image; when running it on two images locally it ran out of memory.
  • The results from the pipeline should be in the `data/ray_shapefiles` directory. You can use `compare_shapefile_features.py` to compare the features in two shapefiles, or use `ogrinfo -so -al` on the command line to examine a shapefile.
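To illustrate the kind of check `compare_shapefile_features.py` performs, here is a hypothetical sketch (not the repo's actual script) that diffs two lists of per-feature attribute dicts, such as you might read from two shapefile layers:

```python
def diff_feature_attributes(features_a, features_b):
    """Report indices of features whose attribute dicts differ.

    Each argument is a list of per-feature attribute dicts, e.g. as read
    from a shapefile layer. Features are compared positionally.
    """
    diffs = []
    for i, (a, b) in enumerate(zip(features_a, features_b)):
        if a != b:
            diffs.append(i)
    if len(features_a) != len(features_b):
        diffs.append("feature count differs")
    return diffs

a = [{"class": "lake", "area": 12.5}, {"class": "pond", "area": 3.1}]
b = [{"class": "lake", "area": 12.5}, {"class": "pond", "area": 3.2}]
print(diff_feature_attributes(a, b))  # → [1]
```

The real script works on shapefiles directly (e.g. via GDAL/OGR); this sketch only shows the comparison step on already-extracted attributes.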

…es to how the dataclasses represented the dict
…gle process so instead of having the virtual file in the ray dataset, adapted the code to store the image bytes in the dataset and create the gdal virtual file locally when needed
…ardcoded, this is needed for service account impersonation if running the code on your local computer and want it to be able to access gcs buckets
@kaylahardie (Contributor, Author)

The most recent commit makes it possible to run the ray version of the maple pipeline using pdg's Google Cloud Storage bucket instead of your local filesystem.

Here are the instructions for running the ray version of the maple pipeline locally (i.e. on your local computer) with access to GCS storage buckets. To give the ray program running on your local computer access to the GCS storage bucket, it's best practice to use service account impersonation:

  1. Make sure the email that has access to the pdg-project also has the roles/iam.serviceAccountTokenCreator role.
  2. Run `gcloud auth application-default login --impersonate-service-account=pdg-sa-01@pdg-project-406720.iam.gserviceaccount.com --scopes="https://www.googleapis.com/auth/cloud-platform"`. It will ask you to sign in, and then you will get an application default credentials file. The terminal should print something like: `Credentials saved to file: [/usr/local/google/home/kaylahardie/.config/gcloud/application_default_credentials.json]`
  3. Run `python3 ray_maple_workflow.py --gpus_per_core=0 --root_dir="gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline" --adc_dir="/usr/local/google/home/kaylahardie/.config/gcloud/application_default_credentials.json"`, replacing the `--adc_dir` value with the credentials path you got from step 2.
  4. When the program finishes, the output shapefiles will be stored in the directory gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline/data/ray_output_shapefiles/

Right now each run just overwrites the old output files; we can add some versioning to the output file names to avoid overwriting them if desired. The pipeline takes as input the data in the gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline/data/input_img/ directory.

@kaylahardie (Contributor, Author)

The most recent commit makes it so each run doesn't overwrite the data from previous runs. Each run now creates a date directory, formatted as "%Y-%m-%d_%H-%M-%S", inside the gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline/data/ray_output_shapefiles/ directory, and the output shapefiles for that run are stored in that date directory.
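A minimal sketch of how such a versioned output path can be built; the helper name and bucket prefix here are illustrative, not the actual pipeline code:

```python
from datetime import datetime

def versioned_output_dir(base_dir: str, now: datetime) -> str:
    """Build a per-run output directory named with the run's timestamp."""
    run_name = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{base_dir.rstrip('/')}/{run_name}/"

# Example: a run started at 2024-06-18 16:51:00 writes to its own directory,
# so it never clobbers output from earlier runs.
print(versioned_output_dir(
    "gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline/data/ray_output_shapefiles",
    datetime(2024, 6, 18, 16, 51, 0),
))
# → gs://.../ray_output_shapefiles/2024-06-18_16-51-00/
```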

@kaylahardie kaylahardie marked this pull request as ready for review June 18, 2024 16:51
…ow.py). Changed the input img directory back to be the input_img_local (what it was originally)
@gugibugy (Contributor) left a comment

First pass, will take a look at the rest on Monday.

# Used to identify a specific predictor when multiple predictors are
# created to run inference in parallel. The counter is also used to
# know which GPU to use when multiple are available.
self.process_counter = 1  # TODO need to fix this process_counter
Contributor

This should be replaced by a call to `ray.get_gpu_ids`, which returns a list of the GPUs available to the worker. For now we could simply choose a value at random (if this becomes more complex we can readdress it).

Contributor Author

Wouldn't we want to keep track of which IDs have been used? If we just pick a random ID, we could randomly pick the same GPU twice, right?
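One way to address the concern above, sketched without the actual pipeline code: take the list of GPUs assigned to the worker (in the real pipeline this could come from `ray.get_gpu_ids()`) and hand them out round-robin with a counter instead of at random, so no GPU is repeated until all have been used. The class below is a hypothetical illustration, not code from this PR:

```python
from itertools import count

class GpuAssigner:
    """Round-robin over a fixed list of GPU IDs."""

    def __init__(self, gpu_ids):
        # Assumed input: e.g. the list returned by ray.get_gpu_ids().
        self._gpu_ids = list(gpu_ids)
        self._counter = count()

    def next_gpu(self):
        # Cycle through the available IDs; consecutive calls never repeat
        # the same GPU unless only one is available.
        return self._gpu_ids[next(self._counter) % len(self._gpu_ids)]

assigner = GpuAssigner([0, 1, 2])
print([assigner.next_gpu() for _ in range(5)])  # → [0, 1, 2, 0, 1]
```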

gugibugy's review comments
@tcnichol (Contributor)

This one looks good to me, I've been able to run everything.

@tcnichol (Contributor) left a comment

I'm going to mark this approved. It looks like all comments were addressed, it runs fine, and the environment works.

create_directory_if_not_exists(config.RAY_OUTPUT_SHAPEFILES_DIR)
shapefiles_dataset = data_per_image.map(
    fn=ray_write_shapefiles.WriteShapefiles,
    fn_constructor_kwargs={"config": config},
    concurrency=concurrency,
)
print("MAPLE Ray pipeline finished, done writing shapefiles", shapefiles_dataset.schema())
Contributor

You'll actually want to call materialize here instead of schema. materialize actually runs the pipeline for all the rows in the dataset, while my understanding is that schema only materializes a single row in order to infer the dataset's schema.

Contributor Author

How'd you find out that schema only materializes a single row? I updated the code; I'm just curious.

Contributor

When I was prototyping with running on more than one image, calling .schema() only produced results for a single row.
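The distinction discussed above can be sketched with plain Python lazy iteration; this is only an analogy for Ray Data's lazy execution, not its actual implementation. A lazy pipeline does only as much work as the consumer demands, so peeking at one row (as schema inference does) is far cheaper than consuming every row (as materialize does):

```python
processed = []

def process(row):
    # Stand-in for an expensive per-row pipeline stage.
    processed.append(row)
    return {"id": row}

# A lazy pipeline: nothing runs until rows are consumed.
pipeline = (process(row) for row in range(100))

# "schema-like" peek: pulls a single row to inspect its structure.
first = next(pipeline)
print(sorted(first), len(processed))  # → ['id'] 1

# "materialize-like" consumption: runs the stage for every remaining row.
rest = list(pipeline)
print(len(processed))  # → 100
```

This is why printing `.schema()` at the end of the workflow only exercised one image, while `.materialize()` forces the whole dataset through the pipeline.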

@kaylahardie (Contributor, Author)

I updated the README a little bit as well. The "Building Scalable ML Pipelines Using Ray" doc is also updated. Going to merge now.

@kaylahardie kaylahardie merged commit 2e6d1af into main Jul 29, 2024