[New Hub] LEAP Pangeo #1050

Closed
7 of 10 tasks
choldgraf opened this issue Mar 2, 2022 · 23 comments

@choldgraf
Member

choldgraf commented Mar 2, 2022

Hub Description

LEAP Pangeo is an extension of the Pangeo project to new communities around research and education with Machine Learning. The hub's environment will be nearly identical to the Pangeo Hubs, and run on GKE, though the setup might be slightly different and we should get clarifications from @rabernat.

Community Representative(s)

@rabernat

Not sure if there are others serving as leads on the project.

Important dates

  • Required start date: March 14th
  • Target start date: ASAP - they would like to get this running whenever we can get the hub set up
  • Any important dates for usage: Not that I know of

Hub Authentication Type

Other (may not be possible, please specify in comments)

Hub logo information

TODO: @rabernat does this look correct?

Hub user image

TODO: @rabernat can you advise here? Is this the Pangeo user image?

  • Repository for user image: { REPO LINK IF IT EXISTS }
  • User image registry: { REGISTRY IF ONE ALREADY EXISTS }
  • User image tag and name: { NAME AND TAG IF IT EXISTS }

Extra features you'd like to enable

TODO: @rabernat does it need to be in a specific data center?

  • Specific cloud provider or datacenter:
  • Dedicated Kubernetes cluster
  • Scalable Dask Cluster
  • GPUs available to users

Other relevant information

There is a GCP billing account with credits for this hub. It is under the 2i2c.org GCP organization. Here are the details:

  • Name: community-LEAP-NSF
  • ID: 01A164-923D17-3199D9

Hub URL

leap.pangeo.2i2c.cloud

Hub Type

daskhub

Tasks to deploy the hub

  • Engineer who will deploy the hub is assigned
  • Deploy information filled in above
  • Initial Hub deployment PR: Add LEAP hub #1074
  • Administrators able to log on
  • Community Representative satisfied with hub environment
  • Hub now in steady-state

Follow-up issues

@choldgraf
Member Author

Hey all - I put down some details for the new LEAP Pangeo hub that we're deploying for @rabernat . I think that we need to clarify some of the information above in order to know what kind of environment / hardware to set up. @rabernat could you take a look at the questions in the top comment and resolve them w/ answers or discussion?

@yuvipanda yuvipanda self-assigned this Mar 9, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 9, 2022
@yuvipanda yuvipanda mentioned this issue Mar 9, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 9, 2022
@yuvipanda
Member

yuvipanda commented Mar 9, 2022

@rabernat ok I've deployed a standard dask based hub at https://leap.2i2c.cloud! It's configured to be similar to the pangeo hub.

Next steps:

  • Create a GitHub team under https://github.com/leap-stc for people who will have access to this hub and let me know, so I can grant access to members of that team. Right now, everyone who has access to the pangeo hub also has access to this one.
  • Try it out and let me know what else needs to change.

I'll work on adding GPUs as well.

@rabernat
Contributor

rabernat commented Mar 9, 2022

Hi Folks! This is awesome. Sorry for not responding earlier to this issue. Somehow I missed the notification.


Not sure if there are others serving as leads on the project.

Just me for now. May add others later.

👍

  • Repository for user image: { REPO LINK IF IT EXISTS }
  • User image registry: { REGISTRY IF ONE ALREADY EXISTS }
  • User image tag and name: { NAME AND TAG IF IT EXISTS }

We would like to use the latest image from https://github.com/pangeo-data/pangeo-docker-images/tags, currently at 2022.02.04. However, that is probably not possible due to #1031, which is preventing us from updating to the latest image due to dask gateway incompatibilities.

We also want all of the different machine types to have the option to launch the ML-version of the image. However, the ML notebook image has a small 🐛 right now (see pangeo-data/pangeo-docker-images#294).

We would like to add a larger machine type, something equivalent to e2-standard-8, with 8 vcpus and 32 GB memory.

We will need an option to attach a GPU. I am not sure which one and would appreciate a rundown of the options / costs.

Going forward, it would be great to be able to select any of the tags from a dropdown, as part of a matrix of profile-list spawner options (see jupyterhub/kubespawner#307).
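As a rough illustration of the "matrix" idea, here is a minimal sketch that builds a KubeSpawner-style `profile_list` from every combination of image and machine size. The image names, size labels, and resource numbers are assumptions chosen for illustration (only the tag `2022.02.04` comes from the discussion above); this is not the hub's actual configuration.

```python
from itertools import product

# Hypothetical inputs: image names and sizes are illustrative, not the hub's real config.
IMAGES = {
    "Pangeo": "pangeo/pangeo-notebook:2022.02.04",
    "Pangeo ML": "pangeo/ml-notebook:2022.02.04",
}
SIZES = {
    "Small": {"cpu_limit": 2, "mem_limit": "8G"},
    "Large": {"cpu_limit": 8, "mem_limit": "32G"},  # roughly an e2-standard-8
}


def build_profile_list(images, sizes):
    """Return one profile per (image, size) pair, in KubeSpawner's profile_list shape."""
    profiles = []
    for (image_label, image), (size_label, resources) in product(
        images.items(), sizes.items()
    ):
        profiles.append(
            {
                "display_name": f"{image_label} / {size_label}",
                "kubespawner_override": {"image": image, **resources},
            }
        )
    return profiles


profile_list = build_profile_list(IMAGES, SIZES)
```

A flat list like this grows quickly as tags are added, which is part of why a true dropdown per dimension would be nicer than enumerating every pair.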

I have created the leap-stc/leap-pangeo-users group. Please DO NOT retain access to the broader pangeo group.

TODO: @rabernat does it need to be in a specific data center?

GCP us-central1 would probably be ideal, like the other cluster.

@yuvipanda
Member

yuvipanda commented Mar 10, 2022

@rabernat I've investigated #1031 (comment) and I think it's sorted out. The LEAP hub now has the latest pangeo image.

Next things to do:

  • Add an even larger instance
  • Investigate GPU options

@yuvipanda
Member

@rabernat I've redone the size options available:

image

These put one user per node as well, which I think is a better fit for research hubs.

@rabernat
Contributor

rabernat commented Mar 10, 2022

Has access been granted to the @leap-stc/leap-pangeo-users group? I just had a report that it was not working.

Also, where does this hub configuration live?

@yuvipanda
Member

@rabernat it's currently in this PR: #1074. Can you add me to that GitHub team, and I'll debug that?

@rabernat
Contributor

rabernat commented Mar 10, 2022

I actually just had a new report that it IS working, so I think we are good in terms of authorization.

@rabernat
Contributor

So I just got some good feedback from the LEAP Executive Committee about this hub. First off, everyone is very excited and happy to have the hub up! 🎉 So thanks for getting this off the ground. 🙏

Most of the feedback is from PIs who are very experienced at using HPC resources for supporting large research groups. I think these points will be quite universal for large and complex communities like LEAP, so I hope they can stimulate some useful discussion.

Onboarding Tutorials

As specified in our contract, 2i2c will provide onboarding training for the hubs.

Question for 2i2c: What is the timeframe and process for organizing these training sessions?

Offboarding

Over the 5-10 years of this project, people will exit the project. We need a sustainable approach to not only onboarding but offboarding.

Question for 2i2c: Beyond simply removing their access via the github group, what is the process for offboarding them and specifically purging their user data from storage so we don't continuously accumulate abandoned data?

Tiering of Access

There is a huge range of different types of participants in LEAP and users of LEAP-Pangeo: from high-schoolers who will participate in a hackathon for 1 day to senior faculty who will do cutting-edge research over many years. It seems inevitable that we will need different tiers of access. Specifically, we would like to limit certain profile-list options (e.g. GPUs) to certain user groups.

Question for 2i2c: is it possible to associate distinct profiles with different user sub-groups?

Metrics and Report

This one will be difficult I think, but I am stating it clearly here: it is important for LEAP to have user-level breakdowns of hub usage and costs. This is what PIs who work on HPC centers are used to and this is what they expect here. Specifically, I would like to do a query for a specific user (e.g. myself rabernat) and see, on a weekly, daily, or monthly basis:

  • Total CPU hours used and associated cost
  • Total GPU hours used and associated cost
  • Total storage used and associated cost

The sum of these individual user costs should roughly add up to the total hub cost.
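To make the requested report concrete, here is a minimal sketch of the kind of per-user rollup I have in mind. Everything in it is an assumption for illustration: the record format, the resource names, and the unit prices are made up, and real prices would depend on machine type, region, and discounts.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical unit prices in USD; not real GCP pricing.
PRICES = {"cpu_hours": 0.03, "gpu_hours": 0.35, "storage_gb_months": 0.02}


@dataclass
class UsageRecord:
    user: str
    resource: str  # one of the keys in PRICES
    amount: float  # hours, or GB-months for storage


def cost_by_user(records, prices=PRICES):
    """Sum usage per user and resource, then attach a cost at the given unit price."""
    usage = defaultdict(lambda: defaultdict(float))
    for rec in records:
        usage[rec.user][rec.resource] += rec.amount
    return {
        user: {
            resource: {"amount": amount, "cost": round(amount * prices[resource], 2)}
            for resource, amount in per_resource.items()
        }
        for user, per_resource in usage.items()
    }


records = [
    UsageRecord("rabernat", "cpu_hours", 100.0),
    UsageRecord("rabernat", "gpu_hours", 10.0),
    UsageRecord("other-user", "cpu_hours", 50.0),
]
report = cost_by_user(records)

# The per-user costs should add up to (roughly) the total hub cost.
total_cost = sum(item["cost"] for user in report.values() for item in user.values())
```

Filtering the records by week, day, or month before calling `cost_by_user` would give the time breakdowns described above.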

The reason for this is the PIs' years of experience on HPC, where a small number of users (sometimes maliciously) consume a disproportionate amount of resources. Identifying and diagnosing such situations is imperative.

Question for 2i2c: What technical developments are required to deliver this granularity of reporting? What is a reasonable time-frame for implementation?

@yuvipanda
Member

@rabernat Great questions! I opened 2i2c-org/features#8 to discuss offboarding. I'll let @choldgraf speak about some of the other questions. I also know we already have issues wrt reporting elsewhere...

@choldgraf
Member Author

Hey @rabernat - thanks for these follow-up questions and requests. For some of them, plans are already in the works; others will need more investigation and discussion before moving forward. I'll touch on each below:

What is the timeframe and process for organizing these training sessions?

Right now, we have a job position open for the person that will spearhead these efforts: https://2i2c.org/jobs/2022/product-community-lead/ . We expect to start reviewing applications in a week or so, and will hire somebody on a rolling basis once we find the right candidate. I expect that process to take another month at least.

In the meantime, I wonder how we can have the most impact with low-hanging fruit for the LEAP community. Can we discuss the most important things to focus on in the issues linked below? If there are specific needs that LEAP has right now, we can create focused issues for them.

Beyond simply removing their access via the github group, what is the process for offboarding them and specifically purging their user data from storage so we don't continuously accumulate abandoned data?

See the issue below where we're tracking this question. Semi-related: we have these offboarding docs but those are for an entire hub migrating off the service, not for the regular "churn" of users on a hub.

is it possible to associate distinct profiles with different user sub-groups?

I don't believe this is currently possible in JupyterHub. I looked around in KubeSpawner but didn't find anything about this specifically, so I've opened up the issue below to track and discuss:

What technical developments are required to deliver this granularity of reporting? What is a reasonable time-frame for implementation?

We are tracking development efforts to improve reporting / monitoring in these two issues that are both actively under development. I'm not sure what the timeline is on them, but I think we'll be able to track hub-level usage/costs by the end of Q2 or so.

Our current targets are to calculate "usage and costs" at the hub level, and at the user-level focus on "usage" (memory, CPU, etc) rather than calculate costs per-se. Let's discuss this one in those more specific issues?

@sgibson91
Member

sgibson91 commented Mar 15, 2022

I don't believe this is currently possible in JupyterHub. I looked around in KubeSpawner but didn't find anything about this specifically, so I've opened up the issue below to track and discuss:

I actually think this is possible, it's just not default out-of-the-box and requires custom logic. See @consideRatio's wonderful Discourse post on the topic here: https://discourse.jupyter.org/t/tailoring-spawn-options-and-server-configuration-to-certain-users/8449 (I will add this to the related issue too. Edit: Ah, I see it's already been mentioned over there!)

@choldgraf
Member Author

choldgraf commented Mar 15, 2022

@sgibson91 good point! Indeed @consideRatio provided some helpful comments there as well. I've opened up a 2i2c issue to track this one, since it seems the change wouldn't be in KubeSpawner but instead would be in our config / deployment: #1120

I believe that we have all major parts of this hub worked out, so once #1074 is merged I think we can close this issue and spot-check more feature improvements or issues in support channels + dev issues. Anybody object to that?

@rabernat
Contributor

It was great to read jupyterhub/kubespawner#589 (comment) and @consideRatio's suggestion of how to implement custom spawner logic. It sounds like this is technically feasible for 2i2c today. Based on this I would like to request that 2i2c implement this sort of customized spawner for the LEAP hub.

To begin, we would like two tiers:

| Tier | GitHub team | Privileges |
|---|---|---|
| Public tier | leap-stc/leap-pangeo-users | Access to "Small" and "Medium" machine types |
| Research tier | leap-stc/leap-pangeo-research | Access to all machine types plus GPU option |
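Expressed as logic, the two tiers could look something like the sketch below. The team names are the ones from this request; the profile names other than "Small" and "Medium" are placeholders for whatever machine types the hub ends up offering, and the function itself is hypothetical rather than anything that exists in the deployment today.

```python
# Team names come from the request; "Large" and "GPU" are placeholder profile names.
TIER_PROFILES = {
    "leap-stc/leap-pangeo-users": {"Small", "Medium"},
    "leap-stc/leap-pangeo-research": {"Small", "Medium", "Large", "GPU"},
}


def allowed_profiles(user_teams, tier_profiles=TIER_PROFILES):
    """Return the union of profile names granted by every team the user belongs to."""
    allowed = set()
    for team in user_teams:
        # Teams that are not listed grant nothing.
        allowed |= tier_profiles.get(team, set())
    return sorted(allowed)


public_view = allowed_profiles(["leap-stc/leap-pangeo-users"])
research_view = allowed_profiles(
    ["leap-stc/leap-pangeo-users", "leap-stc/leap-pangeo-research"]
)
```

Taking the union means a user in both teams simply sees the larger set, so adding someone to the research team is enough to upgrade them.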

Having tiered access is very important to the LEAP executive committee. Delivering this feature quickly will be a win for 2i2c in terms of demonstrating ability to be responsive to feature requests, building trust from the LEAP PIs.

@consideRatio
Member

consideRatio commented Mar 24, 2022

It makes me happy that you found what I wrote helpful, @rabernat!

@rabernat are leap-stc/leap-pangeo-users and leap-stc/leap-pangeo-research teams defined in the leap-stc GitHub organization, and should membership in those teams determine which machines are made available to a user?

If so, I think the following issue is highly relevant to address: jupyterhub/oauthenticator#492. It is about retaining the information captured during authentication about GitHub org/team membership for later use. That could, for example, be when a user is about to be presented with spawn options, which happens at a separate time from login even though the two can occur in quick succession.

@choldgraf
Member Author

I've opened up an issue to track this action, since it is complex enough that I think it warrants its own description / implementation discussion, etc:

Also added it to our project backlog so that we can consider it in the context of the other development efforts we're undertaking. Agreed that having a nice story for this will be impactful for many, and it would be extra useful since LEAP could use this feature right now.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 24, 2022
@rabernat
Contributor

are leap-stc/leap-pangeo-users and leap-stc/leap-pangeo-research teams defined in the leap-stc GitHub organization

yes, and both are public:

There is also

@consideRatio
Member

consideRatio commented Mar 24, 2022

@rabernat note that they don't look public to me. I get:
image

image

@rabernat
Contributor

Ok I think I used the wrong term. These groups are "visible"

image

I believe the 2i2c oauth app has the scope to view them and see the membership. But clearly you're correct: they are not "public".

@rabernat
Contributor

rabernat commented Apr 7, 2022

I am checking in to see if there is any progress on the issue of the custom spawner for the LEAP hub? I would like to be able to share an update with the LEAP executive committee.

@yuvipanda
Member

@rabernat I am going to start actively working on it this week, and should have an update on how long this might take soon.

@yuvipanda
Member

@rabernat moving conversations about the github teams based profiles to #1146

@damianavila
Contributor

We agreed at the planning meeting this is completed and any follow-ups already have dedicated issues.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Apr 25, 2022
profile_list is now dynamically generated, based on the GH teams
user is a part of. This list of teams is refreshed only during login -
so user needs to log out and log back in to see new teams! This also
means that users removed from teams on GH will still have access to
the profiles until they are logged out from the admin panel too (to
be fixed)

This approach is taken over customizing options_form to protect
against users just bypassing the options form and using the API
directly to spawn servers.

Deployed to the leap hub, except 'large' & 'huge' are only available to
leap-stc:leap-pangeo-research members, not to leap-stc:leap-pangeo-users
members - based on 2i2c-org#1050 (comment)

Fixes 2i2c-org#1146
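
The approach described in this commit message can be sketched roughly as follows. KubeSpawner accepts a callable for `profile_list`, so the list can be computed per user at spawn time. This is only a sketch under stated assumptions, not the code from the commit: the `auth_state["teams"]` key, the `allowed_teams` field, and the profile definitions are hypothetical, and the fake user and spawner classes exist only so the example runs on its own.

```python
import asyncio

# Hypothetical profiles; "allowed_teams" is an illustrative key, not a KubeSpawner option.
ALL_PROFILES = [
    {
        "display_name": "Small",
        "allowed_teams": {"leap-stc:leap-pangeo-users", "leap-stc:leap-pangeo-research"},
    },
    {"display_name": "Large", "allowed_teams": {"leap-stc:leap-pangeo-research"}},
]


async def dynamic_profile_list(spawner):
    """Return only the profiles whose allowed teams overlap the user's teams."""
    # Assumes the authenticator stored team names under auth_state["teams"] at login.
    auth_state = await spawner.user.get_auth_state() or {}
    teams = set(auth_state.get("teams", []))
    return [
        {key: value for key, value in profile.items() if key != "allowed_teams"}
        for profile in ALL_PROFILES
        if profile["allowed_teams"] & teams
    ]


# In a JupyterHub config file this would be wired up roughly as:
#   c.KubeSpawner.profile_list = dynamic_profile_list


class FakeUser:
    """Stand-in for a JupyterHub user, used only to exercise the sketch."""

    def __init__(self, teams):
        self._teams = teams

    async def get_auth_state(self):
        return {"teams": self._teams}


class FakeSpawner:
    def __init__(self, teams):
        self.user = FakeUser(teams)


demo_result = asyncio.run(
    dynamic_profile_list(FakeSpawner(["leap-stc:leap-pangeo-users"]))
)
```

Because the filtering happens where the spawner builds its list of valid profiles, a request sent straight to the API cannot select a profile the user was never offered, which is the protection the commit message refers to.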