Run data collection for Clay v0.2 #142

Closed
yellowcap opened this issue Jan 31, 2024 · 4 comments

@yellowcap
Member

yellowcap commented Jan 31, 2024

We can use the current pipeline, but probably with the following changes:

  • Reduce chip size to 256x256 pixels
  • Add more time steps (can do all available years)
  • Increase the number of MGRS tiles by 3-5 times.
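
As a rough sketch of what the smaller chip size means for sample counts, the arithmetic below counts whole, non-overlapping chips per tile. It assumes a 10980 x 10980 pixel raster (the size of a Sentinel-2 MGRS tile at 10 m resolution). The 512 value is only a comparison point, not necessarily the previous setting, and `chips_per_tile` is a hypothetical helper, not part of the pipeline.

```python
def chips_per_tile(tile_size_px: int, chip_size_px: int) -> int:
    """Count whole, non-overlapping square chips that fit in a square tile."""
    per_side = tile_size_px // chip_size_px  # partial chips at the edges are dropped
    return per_side * per_side


# Assumption: Sentinel-2 MGRS tile at 10 m resolution.
TILE_SIZE_PX = 10980

chips_256 = chips_per_tile(TILE_SIZE_PX, 256)  # 42 chips per side
chips_512 = chips_per_tile(TILE_SIZE_PX, 512)  # 21 chips per side
print(chips_256, chips_512)
```

Halving the chip edge roughly quadruples the number of chips per tile, before any no-data or cloud filtering.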

Regarding the MGRS tile increase, the question is whether we want to change the ratio of the inputs. I discussed with @srmsoumya yesterday that we should maybe increase the fraction of landcover classes with a human footprint, i.e. Urban and Agriculture. Presumably that is what users will be most interested in for search, so we could increase their share to give them more weight.
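
One possible way to shift the ratio is weighted sampling without replacement over the candidate tiles. The sketch below uses the Efraimidis-Spirakis keying trick; the function name, the class labels, the weight of 5 and the synthetic pool are all illustrative assumptions, not the sampling code used for Clay.

```python
import random


def sample_tiles(tiles, class_weights, k, seed=0):
    """Draw k tiles without replacement, favouring heavily weighted classes.

    tiles: list of (tile_id, landcover_class) pairs.
    class_weights: dict mapping class name to a positive weight (default 1.0).
    """
    rng = random.Random(seed)
    keyed = []
    for tile_id, cls in tiles:
        weight = class_weights.get(cls, 1.0)
        # Efraimidis-Spirakis key: larger weights push the key towards 1.
        keyed.append((rng.random() ** (1.0 / weight), tile_id, cls))
    keyed.sort(reverse=True)  # keep the k largest keys
    return [(tile_id, cls) for _, tile_id, cls in keyed[:k]]


# Hypothetical candidate pool: 1000 tiles, 10% labelled "urban".
pool = [(f"T{i:04d}", "urban" if i % 10 == 0 else "forest") for i in range(1000)]
picked = sample_tiles(pool, {"urban": 5.0}, k=100)
urban_share = sum(cls == "urban" for _, cls in picked) / len(picked)
print(f"urban share in sample: {urban_share:.2f}")
```

With a weight above 1, the urban share in the sample ends up higher than its 10% share in the pool, which is the effect discussed above.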

yellowcap self-assigned this on Jan 31, 2024
@brunosan
Member

brunosan commented Jan 31, 2024

I suspect we are also dropping a very substantial share of inputs due to a single no-data pixel invalidating the whole set.

model/scripts/tile.py

Lines 39 to 51 in ae70345

if int(tile.sel(band="B02").isin([NODATA]).sum()):
    print("Too much no-data in B02")
    return False
bands_to_check = ["vv", "vh", "dem"]
for band in bands_to_check:
    if int(np.isnan(tile.sel(band=band)).sum()):
        print(f"Too much no-data in {band}")
        return False
# Check for cloud coverage
cloudy_pixel_count = int(tile.sel(band="SCL").isin(SCL_FILTER).sum())
if cloudy_pixel_count / PIXELS_PER_TILE >= BAD_PIXEL_MAX_PERCENTAGE:
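
To illustrate the concern, here is a minimal NumPy-only sketch contrasting a strict check (any no-data pixel rejects the chip, as in the excerpt above) with a fraction-based tolerance. The helper names and the 1% threshold are hypothetical and are not taken from `tile.py`.

```python
import numpy as np


def nodata_fraction(band: np.ndarray) -> float:
    """Share of pixels in a band that are NaN."""
    return float(np.isnan(band).mean())


def passes_strict(band: np.ndarray) -> bool:
    """Reject the chip as soon as a single NaN pixel is present."""
    return int(np.isnan(band).sum()) == 0


def passes_with_tolerance(band: np.ndarray, max_fraction: float) -> bool:
    """Accept the chip while the NaN share stays at or below max_fraction."""
    return nodata_fraction(band) <= max_fraction


# A 256 x 256 chip with exactly one no-data pixel.
chip = np.ones((256, 256), dtype=np.float32)
chip[0, 0] = np.nan

print(passes_strict(chip))                # the strict check rejects it
print(passes_with_tolerance(chip, 0.01))  # a 1% tolerance (assumed value) keeps it
```

Whether a tolerance is acceptable depends on how the model handles the remaining no-data pixels, which this sketch does not address.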


import geopandas as gpd
import pandas as pd
from shapely.geometry import box

aoi = gpd.GeoDataFrame(
    pd.DataFrame(["CDL Test Region"], columns=["Region"]),
    crs="EPSG:4326",
    geometry=[box(-92.30926, 32.17581, -90.01114, 38.63658)],  # using lower left and upper right coordinates
)

See: https://github.com/Clay-foundation/office/issues/170#issuecomment-1914173261

@brunosan
Member

brunosan commented Feb 7, 2024

For the lat/lon coordinate embeddings to capture the intended global structure, I believe the training set must have full global coverage, which in my opinion means adding full coverage from MODIS, either as composites or as raw images from several dates.

Perhaps even train on MODIS only first, to warm up general lat/lon embeddings?
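
For context, one common way to feed lat/lon to a model is a cyclic sine/cosine encoding, so that longitudes on either side of the antimeridian map to nearby values. The sketch below shows that general technique only; whether Clay encodes coordinates exactly this way is an assumption, and `encode_latlon` is a hypothetical helper.

```python
import math


def encode_latlon(lat_deg: float, lon_deg: float) -> tuple:
    """Map latitude/longitude in degrees to (sin lat, cos lat, sin lon, cos lon)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (math.sin(lat), math.cos(lat), math.sin(lon), math.cos(lon))


# Longitudes 180 and -180 describe the same meridian and encode (almost) identically.
east = encode_latlon(10.0, 180.0)
west = encode_latlon(10.0, -180.0)
max_gap = max(abs(a - b) for a, b in zip(east, west))
print(max_gap)
```

An encoding like this is defined everywhere on the globe, but the model only learns useful structure for the regions it sees during training, which is the coverage point raised above.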

@yellowcap
Member Author

For Clay v0.2 we are not planning to change the input platforms. Adding MODIS would require changes to the architecture. The idea for v0.2 was to use the same data sources but with a much larger sample.

@yellowcap
Member Author

Ran data collection with the code from #173.

2535 MGRS tiles were processed successfully. The data sits in s3://clay-tiles-04-sample-v02
