Create a demcoreg_init.sh wrapper for all of the get_*.sh scripts #17
Comments
Based on discussion with @scottyhq, I am listing what data we fetch, whether they are available on "Earth on AWS", and, if not available, where we download them from without a login.
None of these is currently available on "Earth on AWS", so maybe including the auxiliary data in a docker image is the better option. Just to confirm, all this will come into play if we set up a binder hub for lightweight computations, right @dshean? Or is the thought here that the user can fetch the relevant compressed .tif files stored on the docker image directly? Note: the repository does have shell scripts to fetch these locally.
OK, thanks for checking. The idea was to have them ready to go "locally" in the docker image with all of the necessary dependencies. Also, it's not just about downloading the files; there are also some processing steps in the shell scripts to prepare for dem_align.py or dem_mask.py. Some of these steps may be less relevant for newer versions of the products, and we could probably come up with a better solution for combining all relevant RGI region shp files.
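For reference, the combining step mentioned above might look something like the following with GDAL's ogr2ogr. This is a minimal sketch, assuming the per-region RGI shapefiles have already been downloaded; the directory and filename pattern are hypothetical.

```bash
#!/bin/bash
# Sketch: merge per-region RGI shapefiles into a single layer with ogr2ogr.
# Assumes GDAL/OGR is installed; $rgi_dir and the glob pattern are made up.
rgi_dir=rgi60
out=rgi60_merged.shp
first=true
for shp in "$rgi_dir"/*rgi60*.shp; do
    if $first; then
        # Create the output layer from the first region
        ogr2ogr -f "ESRI Shapefile" "$out" "$shp"
        first=false
    else
        # Append each subsequent region to the merged layer
        ogr2ogr -f "ESRI Shapefile" -update -append "$out" "$shp" \
            -nln "$(basename "$out" .shp)"
    fi
done
```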
After talking with @ShashankBice, I spent a couple of hours with the ASP docker image and ran through the demcoreg beginners doc. To some extent we already have a nice solution for the preconfigured computing environment; I just tried it with the geohackweek tutorial contents.
Embedding in the image could be practical for data volumes < 1 GB, but it seems all these datasets could easily be 10 GB+. So my suggestion is to let users run get_X.sh as needed, or host "analysis-ready" data (unzipped, etc.) externally on S3 or elsewhere. Perhaps some code refactoring could allow streaming only portions of these global datasets from agency servers or FTP locations.
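As a rough illustration of the streaming idea, GDAL's /vsicurl/ driver can read just a subwindow of a remote GeoTIFF, provided the server supports HTTP range requests. The URL and bounds below are placeholders, not real hosting locations.

```bash
# Sketch: extract only a subwindow of a remote global GeoTIFF via /vsicurl/,
# avoiding a full download. URL and bounding box are hypothetical.
url="https://example.com/data/global_bareground.tif"
# -projwin takes ulx uly lrx lry in the raster's coordinate system
gdal_translate -projwin -122.5 47.5 -120.5 46.0 \
    "/vsicurl/$url" rainier_bareground_subset.tif
```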
I didn't try all the get_*.sh scripts (just nlcd, rgi, and bareground), and it looks like the bareground hosting URL changed:
Thanks for taking a look. I remember this coming up in Jan 2020 in an email thread with @cmcneil-usgs. Here are my notes:
@scottyhq, I agree with all of these thoughts. The simplest solution is to maintain the existing get_*.sh scripts. For the datasets with tif tiles on the web (like the new link for the bareground dataset), I expect we could prepare and distribute a VRT that would do the trick. Anybody want to do a quick test with a few tiles?
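Such a quick test might look like the following. This is only a sketch: the tile URLs are placeholders, and it assumes the tiles are plain GeoTIFFs reachable over HTTP.

```bash
# Sketch: build a VRT mosaic over a few remote tiles without downloading them.
# Tile URLs are placeholders for wherever the bareground tiles are hosted.
gdalbuildvrt bareground.vrt \
    /vsicurl/https://example.com/bareground/tile_N40W125.tif \
    /vsicurl/https://example.com/bareground/tile_N40W120.tif
# Any GDAL-based tool (e.g., dem_mask.py) can then read the VRT directly
gdalinfo bareground.vrt
```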
I launched Pangeo Binder and, while upload speed is not great, successfully uploaded a >100 MB DEM. Seems like we can recommend this for new users who have a one-off application, though we should disable RGI glacier masking by default (I'll create a separate issue). I played around with a fresh install and successfully ran the Rainier DEM samples from the geohackweek raster tutorial (great idea!). Let's keep hacking on this and update the README/doc with a simple example...
Done in bd48b4f
I will add this example as an extension to the ASP DEM tutorial over the weekend.
Sounds great @ShashankBice! Probably best to keep it separate from the core ASP processing tutorial though - modular is good. What if we had a separate tutorial in demcoreg?
makes sense :)!
@scottyhq I think you're using https://github.com/uw-cryo/asp-binder-dev/blob/master/binder/postBuild. Looks like it pulls the latest source from GitHub and does a dev install. Strangely, I'm not seeing the latest commits when launching via Pangeo Binder. Firing up a terminal and running ...
Good catch, it was a bit of a hack solution to try things out. Anything in ... I tried moving those pip install commands to the ... That seems to work @dshean; you can keep using the same Binder link and you'll have the latest from GitHub:
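The kind of postBuild arrangement being discussed might look like this. A sketch only: repo2docker runs binder/postBuild at image build time, and the package list here is an assumption, not the actual asp-binder-dev contents.

```bash
#!/bin/bash
# Illustrative binder/postBuild: install the latest source from GitHub at
# image build time, so a rebuilt Binder image carries current commits.
pip install git+https://github.com/dshean/demcoreg.git
pip install git+https://github.com/dshean/pygeotools.git
```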
Nice! That makes sense, and seems like a good solution. Thanks for looking into it!
Ideally, we would fetch all of these layers (e.g., NLCD, bareground) on the fly through services like the "Earth on AWS" registry: https://aws.amazon.com/earth/

At present, they still require local download, extraction, and processing.

We should give the user the option to get all demcoreg layers in one shot, or instructions on how to run the necessary get_*.sh script. Right now, when a user runs dem_align.py, it starts with a bunch of downloads - no good. Alternatively, we could include all auxiliary data in a docker image, or store it ourselves in the cloud. Should discuss with @scottyhq.