Worldcover embeddings conus #153
Merged
25 commits, all by yellowcap:

- `777c453` Add script to generate worldcover composite vrt files
- `9a924cd` Add initial version of batch run script
- `a3fe4ad` Intermediate
- `e1dbebc` Improve print statements
- `5456a85` Reduce batch size and fix array index usage
- `8cae928` Disable workers on datamodule to save memory
- `7bdf59b` Add script to explore embeddings using lancedb
- `0b626c7` Rename run.py file
- `e3a5811` Index based run file
- `9d461d2` Small fixes
- `bbabb7a` Add initial readme
- `577524e` Full array size, change mem requirements
- `14dd282` Remove scripts from previous attempt
- `6c0ea95` Improved docs
- `9884869` Use v002
- `1113fd7` Improved docs
- `2bf0e01` Improved docs
- `3b0116f` Improved docs
- `394f55f` Move worldcover readme into docs
- `f0422be` Make year a parameter
- `5bc7d42` Fix url formatting
- `5e37513` Fix url worldcover version by year
- `b4193d7` Use S3 uri for model checkpoint
- `20492ef` Merge branch 'main' into worldcover-embeddings-conus
- `7046da7` Merge branch 'main' into worldcover-embeddings-conus
# Running embeddings for Worldcover Sentinel-2 Composites

This package is made to generate embeddings from the [ESA Worldcover](https://esa-worldcover.org/en/data-access)
Sentinel-2 annual composites. The target region is the Contiguous United
States (CONUS).

We ran this script for 2020 and 2021.

## The algorithm

The `run.py` script will run through a column of image chips of 512x512 pixels.
Each run is a column that spans the Contiguous United States from north to
south. For each chip in that column, embeddings are generated and stored
together in one geoparquet file. These files are then uploaded to the
`clay-worldcover-embeddings` bucket on S3.

There are 1359 such columns to process in order to cover all of CONUS.
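The 1359 figure matches the array size formula used in the batch job further down. A small sketch of how it appears to be derived, assuming a CONUS longitude span of roughly 67°W to 125°W and 12000 pixels of 10 m imagery per degree:

```python
# Sketch of where the 1359 column count appears to come from; the
# values mirror the array size formula in the batch job below,
# int((125 - 67) * 12000 / 512).
degree_span = 125 - 67     # approx. longitude span of CONUS (67W to 125W)
pixels_per_degree = 12000  # assumed: 12000 pixels of 10 m data per degree
chip_width = 512           # chip size in pixels

columns = degree_span * pixels_per_degree // chip_width
print(columns)  # 1359
```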
The embeddings are stored alongside the bbox of the data chip used for
generating the embedding. To visualize the underlying data of an embedding,
the WMS and WMTS endpoints provided by the ESA Worldcover project can be used.

So the geoparquet files only have the following two columns:

| embeddings       | bbox         |
|------------------|--------------|
| [0.1, 0.4, ... ] | POLYGON(...) |
| [0.2, 0.5, ... ] | POLYGON(...) |
| [0.3, 0.6, ... ] | POLYGON(...) |
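To sketch the WMS visualization route, a GetMap URL can be assembled from a chip bbox. The endpoint and layer name below follow the ones used in the `embeddings_db.py` script further down; the helper function itself is hypothetical:

```python
# Hypothetical helper assembling a WMS GetMap URL for a chip bbox
# (EPSG:3857 coordinates), mirroring the Terrascope endpoint used
# in embeddings_db.py.
def wms_url(bbox, year=2021, size=512):
    params = {
        "SERVICE": "WMS",
        "version": "1.1.1",
        "REQUEST": "GetMap",
        "layers": f"WORLDCOVER_{year}_S2_TCC",
        "BBOX": ",".join(str(coord) for coord in bbox),
        "SRS": "EPSG:3857",
        "FORMAT": "image/png",
        "WIDTH": str(size),
        "HEIGHT": str(size),
    }
    query = "&".join(f"{key}={value}" for key, value in params.items())
    return f"https://services.terrascope.be/wms/v2?{query}"


print(wms_url((-8238310, 4970071, -8233190, 4975191)))
```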
## Exploring results

The `embeddings_db.py` script provides a way to locally explore the embeddings.
It will create a `lancedb` database and allow for search. The search results are
visualized by requesting the RGB image from the WMS endpoint for the bbox of
each search result.
## Running on Batch

### Upload package to fetch and run bucket

This snippet will create the zip package that is used for the fetch-and-run
instance in our ECR registry.

```bash
# Add clay src and scripts to zip file
zip -FSr batch-fetch-and-run-wc.zip src scripts -x *.pyc -x scripts/worldcover/wandb/**\*

# Add run to home dir, so that fetch-and-run can see it.
zip -uj batch-fetch-and-run-wc.zip scripts/worldcover/run.py

# Upload fetch-and-run package to S3
aws s3api put-object --bucket clay-fetch-and-run-packages --key "batch-fetch-and-run-wc.zip" --body "batch-fetch-and-run-wc.zip"
```
### Push array job

This command will send the array job to AWS Batch to run all of the
1359 jobs to cover the US.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")
year = 2020
job = {
    "jobName": f"worldcover-conus-{year}",
    "jobQueue": "fetch-and-run",
    "jobDefinition": "fetch-and-run",
    "containerOverrides": {
        "command": ["run.py"],
        "environment": [
            {"name": "BATCH_FILE_TYPE", "value": "zip"},
            {
                "name": "BATCH_FILE_S3_URL",
                "value": "s3://clay-fetch-and-run-packages/batch-fetch-and-run-wc.zip",
            },
            {"name": "YEAR", "value": f"{year}"},
        ],
        "resourceRequirements": [
            {"type": "MEMORY", "value": "7500"},
            {"type": "VCPU", "value": "4"},
            # {"type": "GPU", "value": "1"},
        ],
    },
    "arrayProperties": {"size": int((125 - 67) * 12000 / 512)},
    "retryStrategy": {
        "attempts": 5,
        "evaluateOnExit": [
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            {"onReason": "*", "action": "EXIT"},
        ],
    },
}

print(batch.submit_job(**job))
```
`embeddings_db.py`:

```python
from pathlib import Path

import geopandas as gpd
import lancedb
import matplotlib.pyplot as plt
from skimage import io

# Set working directory
wd = "/home/usr/Desktop/"

# To download the existing embeddings run aws s3 sync
# aws s3 sync s3://clay-worldcover-embeddings /my/dir/clay-worldcover-embeddings

vector_dir = Path(wd + "clay-worldcover-embeddings/v002/2021/")

# Create new DB structure or open existing
db = lancedb.connect(wd + "worldcoverembeddings_db")

# Read all vector embeddings into a list
data = []
for strip in vector_dir.glob("*.gpq"):
    print(strip)
    tile_df = gpd.read_parquet(strip).to_crs("epsg:3857")

    for _, row in tile_df.iterrows():
        data.append(
            {"vector": row["embeddings"], "year": 2021, "bbox": row.geometry.bounds}
        )

# Show table names
db.table_names()

# Drop existing table if it exists
db.drop_table("worldcover-2021-v001")

# Create embeddings table and insert the vector data
tbl = db.create_table("worldcover-2021-v001", data=data, mode="overwrite")


# Visualize some image chips
def plot(df, cols=10):
    fig, axs = plt.subplots(1, cols, figsize=(20, 10))

    for ax, (i, row) in zip(axs.flatten(), df.iterrows()):
        bbox = row["bbox"]
        url = f"https://services.terrascope.be/wms/v2?SERVICE=WMS&version=1.1.1&REQUEST=GetMap&layers=WORLDCOVER_2021_S2_TCC&BBOX={','.join([str(dat) for dat in bbox])}&SRS=EPSG:3857&FORMAT=image/png&WIDTH=512&HEIGHT=512"  # noqa: E501
        image = io.imread(url)
        ax.imshow(image)
        ax.set_axis_off()

    plt.tight_layout()
    plt.show()


# Select a vector by index, search for the 5 most similar embeddings, and plot
v = tbl.to_pandas()["vector"].values[10540]
result = tbl.search(query=v).limit(5).to_pandas()
plot(result, 5)
```
**@chuckwondo:** @yellowcap, I know this is already merged, but can you avoid such absolute/hardcoded paths?
**@yellowcap:** Thanks for the feedback @chuckwondo, you are right, this isn't great, but it is supposed to be a placeholder. Sometimes I use fake paths like `/path/to/your/working/directory` to show what this is supposed to be, so that people running the script can replace it.

But I am very happy to learn about better ways to do this. What is your favorite solution for this kind of thing? In cases like this with scripts, maybe env vars could be an option?
**@chuckwondo:** I'm happy to propose some ideas, but I need some context first. How is this script intended to be used? Is the intent to have the user first run the `aws s3 sync` command shown in the code comment, and then just directly call this script?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly, the idea is that someone with access to the embeddings downloads them using
aws s3 sync
to a local folder, and then runs the script pointing to the embedding files and to a folder where the lancedb data should be stored.I.e. the script needs two folders
I made many scripts like this where some local workding directories are needed. Never really found a very satisfactory way of handling this. Env vars seem a bit cluncky and are not always easy to set up. Constants work, but then the script requires a hard coded default value.
So if you have good ideas on how to approach this issue, they are most welcome!
**@chuckwondo:** Add command-line arguments. For very simple scripts like this one, simply use Python's standard argparse module, so you don't have to add any dependencies. In this case, it sounds like you might want to use `argparse` to support a syntax like the following for running the script.

Both options should be required, and the script should also create the dir specified for `--db-dir`, so the user doesn't have to do so manually beforehand.