Ray implementation of maple inference pipeline #19
Conversation
…es to how the dataclasses represented the dict
…gle process so instead of having the virtual file in the ray dataset, adapted the code to store the image bytes in the dataset and create the gdal virtual file locally when needed
…maple_workflow.py
…ardcoded, this is needed for service account impersonation if running the code on your local computer and want it to be able to access gcs buckets
The most recent commit makes it possible to run the ray version of the maple pipeline using pdg's google cloud storage bucket instead of your local filesystem. Here are the instructions for running the ray version of the maple pipeline locally (i.e. on your local computer) with access to gcs storage buckets:
Right now it just overwrites the old output files; we can add some versioning to the output file names to avoid overwriting the old files if desired. It takes as input the data in the gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline/data/input_img/ directory
The most recent commit makes it so each run doesn't overwrite the data from previous runs. Each run now creates a date directory in the format "%Y-%m-%d_%H-%M-%S" under the gs://pdg-storage-default/workflows_optimization/maple_ray_pipeline/data/ray_output_shapefiles/ directory, and the output shapefiles for that run are stored in that date directory.
…ow.py). Changed the input img directory back to be the input_img_local (what it was originally)
First pass, will take a look at the rest on Monday.
ray_infer_tiles.py
Outdated
# Used to identify a specific predictor when multiple predictors are
# created to run inference in parallel. The counter is also used to
# know which GPU to use when multiple are available.
self.process_counter = 1  # TODO need to fix this process_counter
This should be replaced by a call to ray.get_gpu_ids, which returns the list of GPU ids available to the worker. For now we could simply choose a value at random (if this becomes more complex we can re-address it).
Wouldn't we want to keep track of which ids have been used? If we just pick a random id, we could randomly pick the same GPU twice, right?
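One deterministic alternative to random selection is round-robin assignment over the available ids. The sketch below is hypothetical (the helper name and counter are mine, not the PR's code), and it only works within a single process: separate Ray actors each get their own copy of module state, which is why per-actor `ray.get_gpu_ids()` is the suggested fix in the review above.

```python
from itertools import count

# Process-local counter; each call to next_gpu advances it by one.
_counter = count()


def next_gpu(gpu_ids: list[int]) -> int:
    """Round-robin over the available GPU ids so that consecutive
    predictors created in this process never pick the same GPU until
    all GPUs have been handed out once.

    Hypothetical helper; in the real pipeline the ids would come from
    ray.get_gpu_ids() inside the actor.
    """
    return gpu_ids[next(_counter) % len(gpu_ids)]
```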
gugibugy's review comments
This one looks good to me, I've been able to run everything.
I'm going to mark this approved. It looks like all comments were addressed, it runs fine, and the environment works.
ray_maple_workflow.py
Outdated
create_directory_if_not_exists(config.RAY_OUTPUT_SHAPEFILES_DIR)
shapefiles_dataset = data_per_image.map(
    fn=ray_write_shapefiles.WriteShapefiles,
    fn_constructor_kwargs={"config": config},
    concurrency=concurrency,
)
print("MAPLE Ray pipeline finished, done writing shapefiles", shapefiles_dataset.schema())
You'll actually want to call materialize here instead of schema. materialize actually runs the pipeline for all the rows in the dataset, while my understanding is that schema only materializes a single row in order to infer the schema of the dataset.
How did you find out that schema only materializes a single row? I updated the code, I'm just curious.
When I was prototyping with more than one image, calling .schema() only produced results for a single row.
I updated the README a little as well. The Building Scalable ML Pipelines Using Ray doc is also updated. Going to merge now.
To run the ray implementation of the maple inference pipeline:
run
conda env create -f environment_maple.yml
and then
conda activate maple_py310_ray
then run
mpl_workflow_create_dir_struct.py
and finally
python3 maple_workflow.py --gpus_per_core=0
to run the pipeline on a CPU. I haven't tried running it on more than one CPU, and I've only tried running it on one image; when running it on two images locally it ran out of memory.