-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Worldcover embeddings conus #153
Conversation
0a949a3
to
88b03e3
Compare
be9df85
to
7a08dba
Compare
The Contiguous United states are fully processed for 2020 and 2021! 🇺🇸 🌎 🦅 🎉 🎆 The version that is complete for both years is
|
Focus on CONUS area.
ae498bc
to
b4193d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good, nice use of pack operations.
from skimage import io | ||
|
||
# Set working directory | ||
wd = "/home/usr/Desktop/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yellowcap, I know this is already merged, but can you avoid such absolute/hardcoded paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feeback @chuckwondo you are right, this isn't great, but is supposed to be a placeholder. Sometimes I use fake paths like /path/to/your/working/directory
, to show what this is supposed to be so that people running the script could replace it.
But I am very happy to learn about better ways to do this, what is your favorite solution for this kind of thing? In cases like this with scripts, maybe env vars could be an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to propose some ideas, but I need some context first. How is this script intended to be used? Is the intent to have the user first run the aws s3 sync
command shown in the code comment below this line, and then just directly call this script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly, the idea is that someone with access to the embeddings downloads them using aws s3 sync
to a local folder, and then runs the script pointing to the embedding files and to a folder where the lancedb data should be stored.
I.e. the script needs two folders
- A place to download the raw data
- A folder to create / store the lancedb data
I made many scripts like this where some local workding directories are needed. Never really found a very satisfactory way of handling this. Env vars seem a bit cluncky and are not always easy to set up. Constants work, but then the script requires a hard coded default value.
So if you have good ideas on how to approach this issue, they are most welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add command-line arguments. For very simple scripts like this one, simply use Python's standard argparse module, so you don't have to add any dependencies. In this case, it sounds like you might want to use argparse
to support a syntax like the following for running the script:
scripts/worldcover/embeddings_db.py --input-dir path/to/input/dir --db-dir path/to/db/dir
Both options should be required, and the script should also create the dir specified for --db-dir
, so the user doesn't have to do so manually beforehand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, just put a few questions - would love to hear your feedback, thank you!
# CKPT_PATH = "https://huggingface.co/made-with-clay/Clay/resolve/main/Clay_v0.1_epoch-24_val-loss-0.46.ckpt" | ||
VERSION = "002" | ||
BUCKET = "clay-worldcover-embeddings" | ||
URL = "https://esa-worldcover-s2.s3.amazonaws.com/rgbnir/{year}/N{yidx}/ESA_WorldCover_10m_{year}_v{version}_N{yidx}W{xidx}_S2RGBNIR.tif" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those "N" and "W" coordinate values should not be hard-coded here, took me a while to spot why the code was fetching wrong tif files
y_size = min(CHIP_SIZE, TILE_SIZE - y_local_off) | ||
y_another = y_size < CHIP_SIZE | ||
|
||
tile_id = f"N{y_tile_index}W{str(x_tile_index).zfill(3)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same issue here with the hard-coded "N" and "W" coordinates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the path creation should probably be generalized. It was hardcoded here because the source file is for the US which is in the NW quadrant.
Generalizing will take some tinkering to make sure that this still works even if you are looking at an area that crosses these regions.
yoff = 0 | ||
embeddings = [] | ||
all_bounds = [] | ||
while yoff < RASTER_Y_SIZE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this most likely is a non-issue but I was wondering why the code is not looping through the X axis as well? Is it because of how the worklow is designed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the script was ran in parallel for every "chip" in the x axis separately on AWS batch. The xoff value is computed based on a single index value that goes from left to right on the map until the area is completed. The index value is an env var for each separate job.
model/scripts/worldcover/run.py
Line 174 in 7046da7
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", 2)) |
model/scripts/worldcover/run.py
Line 187 in 7046da7
xoff = index * CHIP_SIZE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, @yellowcap
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", 2))
I assume the starting index is 0, correct?
Also, last last question: could you please also let me know what are the XORIGIN and YORIGIN values and why they differ slightly from the starting coordinates?
E_W_INDEX_START = 67
E_W_INDEX_END = 125
N_S_INDEX_START = 24
N_S_INDEX_END = 49
YORIGIN = 50.0
XORIGIN = -125.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the starting index is 0, correct?
yes
let me know what are the XORIGIN and YORIGIN values and why they differ slightly from the starting coordinates?
the E_W_INDEX_START, E_W_INDEX_END, N_S_INDEX_START, N_S_INDEX_END are the integer indexes
of the worlcover grid reference.
X and YORIGIN are in lat/lon coordinates
. They are the upper left corner of the area covered by the tiles in the US layer. I.e. the upper left corner of the worldcover tile with index (67, 24).
This is work to run embedding inference for all of the US Conus region using worldcover composites from 2020 and 2021.
I am currently pushing this into Batch to make it work there. Once the first jobs run through I will kickoff the processing at US level.
This uses partial inputs from RGBNIR, the 4 bands available from worldcover.