Create a demcoreg_init.sh wrapper for all of the get_*.sh scripts #17

Open · dshean opened this issue Apr 19, 2020 · 14 comments

dshean commented Apr 19, 2020

Ideally, we would fetch all of these layers (e.g., NLCD, bareground) on the fly through services like the "Earth on AWS" registry: https://aws.amazon.com/earth/

At present, they still require local download, extraction and processing.

We should give the user the option to get all demcoreg layers in one shot, or provide instructions on how to run the necessary get_*.sh scripts. Right now, when a user runs dem_align.py, it starts with a bunch of downloads - no good.

Alternatively, we could include all auxiliary data in a docker image, or store it ourselves in the cloud. Should discuss with @scottyhq.
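One possible shape for such a demcoreg_init.sh wrapper is sketched below; it simply chains the existing per-dataset scripts. The script names follow the repo's get_*.sh convention, and the output directory handling is an assumption, not a final design.

#!/bin/bash
# demcoreg_init.sh - hypothetical wrapper to fetch all auxiliary layers in one shot
# Usage: demcoreg_init.sh [output_dir]
set -e
datadir=${1:-$HOME/data}
mkdir -p "$datadir"
cd "$datadir"
# Run each existing download/prep script (assumed to be on PATH after demcoreg install)
for s in get_nlcd.sh get_bareground.sh get_rgi.sh; do
    echo "Running $s"
    "$s"
done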

ShashankBice (Contributor) commented:

Based on a discussion with @scottyhq, here is a list of the data we fetch, whether it is available on "Earth on AWS", and, if not, where we currently download it from without a login.

  • Landcover dataset main site (https://www.mrlc.gov/data). For CONUS, the latest (2016) release can be downloaded from this link.
  • FTP site for the 2010 Global Fractional Bareground data. The data are divided into 10° by 10° tiles.
  • FTP site hosting the RGI polygons.

None of these is currently available on "Earth on AWS", so maybe including the auxiliary data in a docker image is the better option. Just to confirm, all this will come into play if we set up a binder hub for lightweight computations, right @dshean? Or is the thought here that the user can fetch the relevant compressed .tif files stored in the docker image directly?

Note: The repository does have shell scripts to fetch these locally.

dshean commented Apr 28, 2020

OK, thanks for checking. The idea was to have them ready to go "locally" in the docker image with all of the necessary dependencies.

Also, it's not just about downloading the files; there are also some processing steps in the shell scripts to prepare the data for dem_align.py or dem_mask.py. Some of these steps may be less relevant for newer versions of the products, and we could probably come up with a better solution for combining all of the relevant RGI region shapefiles.
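One option for that merge step (a sketch only; the shapefile glob below is illustrative of the per-region RGI 6.0 files, not a tested path) is to append all regional shapefiles into a single layer with ogr2ogr:

# Merge all downloaded RGI regional shapefiles into one global shapefile
out=rgi60_merge.shp
for shp in */*_rgi60_*.shp; do
    if [ ! -e "$out" ]; then
        # First region initializes the output layer
        ogr2ogr "$out" "$shp"
    else
        # Subsequent regions are appended to the same layer
        ogr2ogr -update -append "$out" "$shp"
    fi
done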

scottyhq commented:

After talking with @ShashankBice I spent a couple of hours with the ASP docker image and ran through the demcoreg beginners doc. To some extent we already have a nice solution for the preconfigured computing environment. I just tried it with the geohackweek tutorial contents:

dem_align.py -mode nuth tutorial_contents/raster/data/rainier/20080901_rainierlidar_10m-adj.tif tutorial_contents/raster/data/rainier/20150818_rainier_summer-tile-0.tif

and you can too ;)
(Binder launch badge)

Alternatively, we could include all auxiliary data in a docker image

Embedding in the image could be practical for data volumes < 1 GB, but it seems all these datasets could easily be 10 GB+. So my suggestion is to let users run get_*.sh as needed, or host "analysis-ready" data (unzipped, etc.) externally on S3 or elsewhere. Perhaps some code refactoring could allow streaming only portions of these global datasets from agency servers or FTP locations.
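As a quick illustration of that last idea, GDAL's /vsicurl/ driver can already read a spatial subset straight from a remote GeoTIFF over HTTP, provided the server supports range requests. The URL and window below are placeholders:

# Pull only the window of interest from a remote tile, without downloading the whole file
gdal_translate -projwin -122.0 47.0 -121.0 46.0 \
    /vsicurl/https://example.com/path/to/global_layer_tile.tif \
    subset.tif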

scottyhq commented:

I didn't try all the get_*.sh scripts (just nlcd, rgi, and bareground), and it looks like the bareground hosting URL changed:

Downloading bare2010.zip
--2020-04-29 04:27:35--  http://edcintl.cr.usgs.gov/downloads/sciweb1/shared/gtc/downloads/bare2010.zip
Resolving edcintl.cr.usgs.gov (edcintl.cr.usgs.gov)... 152.61.136.26, 2001:49c8:4000:122c::26
Connecting to edcintl.cr.usgs.gov (edcintl.cr.usgs.gov)|152.61.136.26|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://edcintl.cr.usgs.gov/downloads/sciweb1/shared/gtc/downloads/bare2010.zip [following]
--2020-04-29 04:27:35--  https://edcintl.cr.usgs.gov/downloads/sciweb1/shared/gtc/downloads/bare2010.zip
Connecting to edcintl.cr.usgs.gov (edcintl.cr.usgs.gov)|152.61.136.26|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-04-29 04:27:35 ERROR 404: Not Found.

Unzipping bare2010.zip
Archive:  bare2010.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of bare2010.zip or
        bare2010.zip.zip, and cannot find bare2010.zip.ZIP, period.
ls: cannot access 'bare2010/bare2010_v3/*bare2010_v3.tif': No such file or directory

dshean commented Apr 29, 2020

Thanks for taking a look. I remember this coming up in Jan 2020 in an email thread with @cmcneil-usgs. Here are my notes:

Hmmm. Yeah, looks like the USGS landcover site disappeared. I found the data on the UMD site: https://glad.umd.edu/dataset/global-2010-bare-ground-30-m. It looks like they’ve posted individual tif tiles here: https://glad.umd.edu/Potapov/Bare_2010/
If you want to update get_bareground.sh to download and clean up these tif tiles, that would be great!
As a stopgap, I pushed the original bare2010.zip file to Google Drive here: https://drive.google.com/file/d/1YDaaOm7aWG1URH8eviIYr69d-7ZwD8pj/view?usp=sharing
I see updated forest cover products here: http://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html. The bareground is just the thresholded inverse of the forest cover percentage, so it could also be useful to create a new script to download and process these data, which provide more flexibility with timestamps.
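For reference, the thresholding step could look something like the sketch below, run on a treecover2000 tile downloaded from the v1.6 page above. The input filename and the 80% threshold are illustrative, not values demcoreg currently uses:

# Bare ground as the thresholded inverse of percent tree cover
gdal_calc.py -A treecover2000_tile.tif \
    --calc="(100 - A) >= 80" --type=Byte --NoDataValue=0 \
    --outfile=bareground_mask.tif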

dshean commented Apr 29, 2020

Embedding in the image could be practical for data volumes < 1 GB, but it seems all these datasets could easily be 10 GB+. So my suggestion is to let users run get_*.sh as needed, or host "analysis-ready" data (unzipped, etc.) externally on S3 or elsewhere. Perhaps some code refactoring could allow streaming only portions of these global datasets from agency servers or FTP locations.

@scottyhq, I agree with all of these thoughts. The simplest solution is to maintain the get_*.sh scripts and have better documentation on initial setup, but then the user has to download and store the data locally. If we can prepare and host core data layers on S3, that would be great, especially if it falls under an existing project with credits. It would be nice if providers/agencies took the lead on this, but I'm not going to hold my breath.

For the datasets with tif tiles on the web (like the new link for the bareground dataset), I expect we could prepare and distribute a vrt that would do the trick. Anybody want to do a quick test with a few tiles?
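A quick test could look like the sketch below, with the vrt referencing the tiles in place over HTTP. The tile filenames (TILE_1.tif, TILE_2.tif) are placeholders; check the actual listing at the UMD link above:

# Build a small vrt that points at remote bareground tiles via /vsicurl/
gdalbuildvrt bare2010.vrt \
    /vsicurl/https://glad.umd.edu/Potapov/Bare_2010/TILE_1.tif \
    /vsicurl/https://glad.umd.edu/Potapov/Bare_2010/TILE_2.tif
# The resulting bare2010.vrt is tiny and could be shipped with demcoreg,
# so tools only read the pixels they actually need.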

dshean commented Apr 29, 2020

To some extent we already have a nice solution for the preconfigured computing environment. I just tried it with the geohackweek tutorial contents: dem_align.py -mode nuth tutorial_contents/raster/data/rainier/20080901_rainierlidar_10m-adj.tif tutorial_contents/raster/data/rainier/20150818_rainier_summer-tile-0.tif and you can too ;)

I launched the pangeo binder and, while the upload speed is not great, successfully uploaded a >100 MB DEM. Seems like we can recommend this for new users who have a one-off application, though we should disable RGI glacier masking by default (I'll create a separate issue).

Playing around with a fresh install, I successfully ran the Rainier DEM samples from the geohackweek raster tutorial (great idea!). Let's keep hacking on this and update the README/doc with a simple example...

dshean commented Apr 29, 2020

though we should disable RGI glacier masking by default (I'll create a separate issue).

Done in bd48b4f

ShashankBice (Contributor) commented:

I launched the pangeo binder and, while the upload speed is not great, successfully uploaded a >100 MB DEM. Seems like we can recommend this for new users who have a one-off application, though we should disable RGI glacier masking by default (I'll create a separate issue).

Playing around with a fresh install, I successfully ran the Rainier DEM samples from the geohackweek raster tutorial (great idea!). Let's keep hacking on this and update the README/doc with a simple example...

I will add this example as an extension to the ASP DEM tutorial over the weekend.

dshean commented Apr 29, 2020

Sounds great @ShashankBice! Probably best to keep it separate from the core ASP processing tutorial though - modular is good. What if we had a separate tutorial in demcoreg?

ShashankBice (Contributor) commented:

Sounds great @ShashankBice! Probably best to keep it separate from the core ASP processing tutorial though - modular is good. What if we had a separate tutorial in demcoreg?

makes sense :) !

dshean commented Apr 29, 2020

To some extent we already have a nice solution for the preconfigured computing environment

@scottyhq I think you're using https://github.com/uw-cryo/asp-binder-dev/blob/master/binder/postBuild

Looks like it pulls the latest source from github and does a dev install. Strangely, I'm not seeing the latest commits when launching via pangeo binder. Firing up a terminal and running dem_align.py -h still shows the default -mask_list ['glaciers']. Is this a caching issue?

scottyhq commented Apr 29, 2020

Good catch, it was a bit of a hack solution to try things out. Anything in postBuild (somewhat confusingly) is baked into the image at build time. BinderHub doesn't rebuild an image if the repo hasn't changed, so one solution is to edit the readme or add a comment to any file to trigger rebuilding.

I'm trying to move those pip install commands to the start script, which runs when the image launches; that is probably the easiest way to get the latest src code when testing things out.
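For reference, a minimal binder/start sketch along those lines, assuming the source checkout lives at /srv/dshean/demcoreg as in the prompt below (repo2docker expects start to exec the command it is passed):

#!/bin/bash
# binder/start - runs at every container launch, unlike postBuild,
# which is baked into the image at build time
# Pull the latest source and refresh the dev install before starting the server
git -C /srv/dshean/demcoreg pull
pip install -e /srv/dshean/demcoreg
exec "$@"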

That seems to work @dshean, you can keep using the same binder link and you'll have the latest from github:

jovyan@jupyter-uw-2dcryo-2dasp-2dbinder-2ddev-2d23hlp096:/srv/dshean/demcoreg$ git status
On branch master
Your branch is up to date with 'origin/master'.

dshean commented Apr 29, 2020

Nice! That makes sense, and seems like a good solution. Thanks for looking into it!
