Install missing packages to pangeo #898

Closed
grallewellyn opened this issue Jan 23, 2024 · 13 comments

@grallewellyn
Collaborator

grallewellyn commented Jan 23, 2024

Pangeo packages to install, as suggested in the links from the original ticket:

adlfs
argopy
black
ciso
cmocean
cdsapi
cf_xarray
dask-ml
fastjmd95
fsspec
gcsfs
gh
gh-scoped-creds
git-lfs
gsw
line_profiler
memory_profiler
metpy
nb_conda_kernels
nbstripout
numbagg
numcodecs
python-graphviz
xarray-datatree
xarray_leaflet
xarray-spatial
xbatcher
xcape
xclim
xgboost
xgcm
xhistogram
xmip
xmitgcm
xpublish
xrft
xskillscore

Also, these packages were in the spreadsheet but are missing. We need to know from the UWG whether to put each one in DPS or ADE:

adlfs
argopy
fastjmd95
gcsfs
geogif (there is a note for this one being commented out in the environment.yml, but why?)
gh-scoped-creds
gsw
intake-esm
ipytree
jupyter-panel-proxy
jupyterlab-s3-browser (there is a note for this one being commented out in the environment.yml)
metpy
parcels
pop-tools
pycamhd
python-gist
python-graphviz
rise
satpy
snakeviz
tiledb-py
timezonefinder
xbatcher
xcape
xclim
xcube (there is a note for this one being commented out in the environment.yml)
xmip
xmitgcm
xrft
xskillscore

Update in these spreadsheets
Pangeo: https://docs.google.com/spreadsheets/d/1krnOZ1SFW-GA_jOiL-nWzhNAA3JKTNqFrG0IpObHsBg/edit?usp=sharing
Vanilla/ kinda isce3: https://docs.google.com/spreadsheets/d/18Orw1cZbqUdOPBy9hFwXm43Pb0MZL7n1KEJuWO9UuIY/edit?usp=sharing
R: https://docs.google.com/spreadsheets/d/1mrQ3gdcxZHZNTksUmLz6qqqNSNxhAoonB9znLU0c0pk/edit?usp=sharing
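
One way to re-check which of the packages listed above are actually absent from a workspace is to compare the wanted names against the output of `conda list --json`. This is a minimal sketch, not the script used for this ticket: the helper names `missing_packages` and `_normalize` are hypothetical, the sample JSON is illustrative, and treating `_` and `-` as interchangeable in package names is an assumption.

```python
import json


def _normalize(name: str) -> str:
    # Assumption: names differing only in case or "_" vs "-" refer to the same package.
    return name.lower().replace("_", "-")


def missing_packages(conda_list_json: str, wanted: list[str]) -> list[str]:
    """Return the wanted names that do not appear in `conda list --json` output."""
    installed = {_normalize(pkg["name"]) for pkg in json.loads(conda_list_json)}
    return [name for name in wanted if _normalize(name) not in installed]


# Illustrative input; in a workspace terminal you could produce the real thing with:
#   conda list --json > installed.json
sample = '[{"name": "fsspec"}, {"name": "cf_xarray"}]'
print(missing_packages(sample, ["fsspec", "cf-xarray", "xbatcher"]))  # ['xbatcher']
```

Running this against each image (base and jupyterlab) would show where a given package still needs to be added.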

@anilnatha

@wildintellect The packages listed above, notably those for Pangeo, were included in the original working list of packages we used to determine what to add to our workspace in our recent release.

However, columns F and G were not filled in for these, which led me to believe that they shouldn't be added to the Pangeo environment, unlike other packages that were marked to be added. Apologies if this was a mistake on my part. If these are in fact needed, can the data/users indicate which ones should be added? We want to keep the images lean: the more packages we add, the harder it becomes for us to manage the build dependency issues that @grallewellyn and I worked on extensively in the last release.

@wildintellect
Collaborator

Can we keep separate tickets for the different workspace types? R should be handled very separately.
For now, I don't think most UWG members are ready to provide feedback without taking the new workspaces for a test drive.

@grallewellyn grallewellyn changed the title from "Install missing packages to pangeo/ R" to "Install missing packages to pangeo" Jan 24, 2024
@grallewellyn
Collaborator Author

I agree. I opened a new issue for R here: #902, and updated this issue.

@grallewellyn grallewellyn removed this from the 3.1.5 milestone Jan 24, 2024
@anilnatha anilnatha added ADE Algorithm Development Environment Subsystem JPL JPL related issues labels Feb 1, 2024
@anilnatha anilnatha added this to the 3.1.5 milestone Feb 1, 2024
@wildintellect wildintellect pinned this issue Feb 5, 2024
@anilnatha

@wildintellect Any input from the user working group regarding which packages listed in this issue need to be added to the pangeo workspace?

@wildintellect
Collaborator

I've asked @jsignell to have a look at this. Nothing urgent has come up, so you can probably go ahead with your planned changes and we'll revisit in another update cycle if the users report needing additional libs.

@anilnatha

anilnatha commented Feb 28, 2024

Using comments and other information from the Google doc, I've compiled the list below of packages we should be able to omit from Pangeo, with the reason for each. If any of these should not be on the omit list, please point it out.

After omitting the packages above, the list of packages to potentially add to our Pangeo workspace is the following. We need clarification from @wildintellect and the user working group members about which of these packages should be added.

  • black
  • cdsapi
  • cf_xarray
  • fastjmd95
  • geogif - This package is no longer listed in Pangeo's environment.yml. At one point it was listed there, but was commented out.
  • geopy - "Geocoding (Address to Lat/Lon), unclear need at this time or which geocoders would be FAIR and OpenAccess principles" - Alex
  • gh-scoped-creds (https://github.com/jupyterhub/gh-scoped-creds)
  • intake-esm
  • ipytree
  • jupyter-panel-proxy
  • jupyterlab-s3-browser - No longer listed in the official Pangeo environment.yml.
  • line_profiler
  • memory_profiler
  • metpy - "weather data" - Alex
  • nb_conda_kernels - "unsure if necessary with newer versions of jupyter" - Alex
  • numbagg
  • pycamhd
  • python-gist - "coder nicety" - Alex
  • python-graphviz
  • rise - "turn jupyter notebook into slideshow" - Alex
  • satpy
  • snakeviz - "code performance profiler" - Alex
  • tiledb-py
  • timezonefinder
  • xbatcher
  • xcape
  • xclim - "climate modeling" - Alex
  • xcube - No longer listed in the official Pangeo environment.yml.
  • xgboost
  • xgcm - "climate specific" - Alex
  • xmip - "CMIP6 climate data access" - Alex
  • xmitgcm - "climate specific" - Alex
  • xpublish
  • xrft - "Fourier Transforms" - Alex
  • xskillscore - "Weather forecasting" - Alex

cc: @grallewellyn @sujen1412

@anilnatha

@wildintellect I saw your most recent message after I posted the list seen in the post above.

Before we proceed with this change we'll need confirmation from you (based on some of your comments in the original Google document) and from the user working group on which of these packages to add. We can't tell which are must-haves vs should-haves or nice-to-haves and we're trying to avoid bloating the workspace with unneeded packages.

cc: @jsignell

@jsignell

I'm a little confused about the goal of this ticket. My understanding is that the value of the pangeo-notebook environment is that:

  1. it saves infrastructure teams (aka us) from having to make decisions about what packages users are likely to want
  2. it saves "pangeo-like" users from having to customize the environment.
  3. it makes code more portable between hubs: the pangeo-notebook image on MAAP is exactly the same as the one on veda jupyterhub, or any other hub.

Is it potentially bloated? Sure. But I would argue that that doesn't actually matter too much.

@gchang
Collaborator

gchang commented Feb 29, 2024

We would like to streamline the image because the size of the container directly affects performance during data processing (each worker node downloads the container image at the start of a job). If the additional libraries add 5 GB to the image and we launch a cluster of 1000 nodes, that is an additional 5 TB of unnecessary data transfer (albeit free), plus the compute/wait time for that download.

@anilnatha Please quantify the size difference between an optimized build and an unoptimized build.
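
The arithmetic above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names are hypothetical, decimal units are used (1 TB = 1000 GB), and the 1 Gbit/s per-node throughput is a made-up placeholder rather than a measured figure for DPS.

```python
def extra_transfer_gb(extra_image_gb: float, nodes: int) -> float:
    """Total extra data pulled when every node downloads the larger image once."""
    return extra_image_gb * nodes


def pull_seconds(image_gb: float, gbit_per_s: float) -> float:
    """Rough per-node download time; ignores decompression and registry limits."""
    return image_gb * 8 / gbit_per_s


extra = extra_transfer_gb(5, 1000)
print(f"{extra:,.0f} GB (~{extra / 1000:g} TB)")  # 5,000 GB (~5 TB)

# Hypothetical throughput of 1 Gbit/s per node, purely for illustration.
print(f"{pull_seconds(5, 1.0):.0f} s of extra wait per node")
```

Plugging in the measured sizes of the optimized and unoptimized builds would turn this into the comparison requested here.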

@jsignell

jsignell commented Mar 1, 2024

You might have already done all this, but here are a bunch of other ways to shrink images without removing packages. It looks like the standard pangeo-notebook image already implements the suggestions from https://jcristharif.com/conda-docker-tips.html and ends up with a docker image that compresses to 1.92 GB (https://hub.docker.com/r/pangeo/pangeo-notebook/tags)

@anilnatha

anilnatha commented Mar 6, 2024

(fyi) I've issued a PR that takes care of adding the packages listed below. If additional packages are needed, we can try to squeeze them into this release; otherwise they will have to be added in the next release.

Added to base image (used in DPS)

  • cdsapi
  • cf_xarray
  • fastjmd95
  • geogif
  • geopy
  • intake-esm
  • line_profiler
  • memory_profiler
  • metpy
  • numbagg
  • pycamhd
  • satpy
  • tiledb-py
  • timezonefinder
  • xbatcher
  • xcape
  • xclim
  • xcube
  • xgboost
  • xgcm
  • xmip
  • xmitgcm
  • xpublish
  • xrft

Added to jupyterlab image (ADE)

  • black
  • gh-scoped-creds
  • ipytree
  • jupyter-panel-proxy
  • jupyterlab-s3-browser
  • nb_conda_kernels
  • python-gist
  • python-graphviz
  • rise
  • snakeviz

@wildintellect
Collaborator

FYI, something is wrong with the Python paths in pangeo:
the version of boto3/botocore that gets imported is not the one conda installed. You can test this in the terminal.

conda list | grep boto
boto3                     1.34.51            pyhd8ed1ab_0    conda-forge
botocore                  1.34.51         pyge310_1234567_0    conda-forge

python
>>> import botocore
>>> botocore.__version__
'1.24.32'
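
To narrow down a mismatch like this, it can help to see which file Python actually imports and what version the installed package metadata reports. The helper below is a hypothetical sketch, not something from the workspace images. It assumes the distribution name matches the module name, which holds for botocore and boto3 but not for every package.

```python
import importlib
import importlib.metadata
import sys


def describe_import(module_name: str) -> dict:
    """Report where a module is loaded from and which versions are visible."""
    module = importlib.import_module(module_name)
    try:
        # Version recorded in the installed distribution's metadata.
        metadata_version = importlib.metadata.version(module_name)
    except importlib.metadata.PackageNotFoundError:
        metadata_version = None
    return {
        "file": getattr(module, "__file__", None),  # path of the imported copy
        "module_version": getattr(module, "__version__", None),
        "metadata_version": metadata_version,
        "prefix": sys.prefix,  # environment the interpreter belongs to
    }


# In the affected workspace, run:
#   print(describe_import("botocore"))
```

If `file` points outside `prefix`, a second copy might be shadowing the conda one, for example a pip install in the user site directory or an entry on `PYTHONPATH`.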

@grallewellyn
Collaborator Author

@wildintellect I opened a new ticket for that problem here: #956, and commented with what I am seeing.
