
[EVENT] OpenScapes NASA Cloud Hackathon #810

Closed
7 of 10 tasks
choldgraf opened this issue Nov 5, 2021 · 50 comments

@choldgraf
Member

choldgraf commented Nov 5, 2021

Summary

OpenScapes is running their first cloud hackathon for the NASA community from November 15th through 19th. During that time they will have ~40 people accessing the OpenScapes instance, and the hub will remain available to them for the following 3 months.

Supersedes #767 (this is the same issue, but now using our event template so we can try it out).

Info

  • Event begin: November 15th, 2021
  • Event end: November 19th, 2021
  • Community Representative: @jules32
  • Hub URL: openscapes.2i2c.cloud
  • Hub decommissioned after event?: no

Task list

Before the event

  • Dates confirmed with the community representative
  • Hub is running one week before the event.
  • Confirm with Community Representative that their workflows function as expected.
    • 👉Template message to send to community representative
      Hey {{ COMMUNITY REPRESENTATIVE }}, the date of your event is getting close!
      
      Could you please confirm that your hub environment is ready-to-go, and matches your hub's infrastructure setup, by ensuring the following things:
      - [ ] Log-in and authentication works as-expected
      - [ ] `nbgitpuller` links you intend to use resolve properly
      - [ ] Your notebooks run as-expected
      
      
  • Ensure that everybody on the team can access this cluster and add documentation about it
  • Check and validate the quota for this cluster to make sure we have enough space
  • Check the size of the Kubernetes master nodes to make sure they're big enough

During and after event

  • Confirm event is finished.
  • Nodegroup created for the hub is decommissioned.
  • Hub decommissioned (if needed).
  • Debrief with community representative.
    • 👉Template debrief to send to community representative
      Hey {{ COMMUNITY REPRESENTATIVE }}, your event appears to be over 🎉
      
      We hope that your hub worked out well for you! We are trying to understand where we can improve our hub infrastructure and setup around events, and would love any feedback that you're willing to give. Would you mind answering the following questions? If not, just let us know and that is no problem!
      
      - Did the infrastructure behave as expected?
      - Anything that was confusing or could be improved?
      - Any extra functionality you wish you would have had?
      - Are you willing to share a story about how you used the hub?
      - Any other feedback that you'd like to share?
      
      
@choldgraf
Member Author

Hey @jules32, the date of your event is getting close! Could you please confirm that your hub environment is ready-to-go, and matches your hub's infrastructure setup, by ensuring the following things:

  • Log-in and authentication works as-expected
  • nbgitpuller links you intend to use resolve properly
  • Your notebooks run as-expected

@jules32
Contributor

jules32 commented Nov 5, 2021

Hi @choldgraf thanks for starting this issue!

At the moment I think we can check each box, but I'm looping in @betolink to confirm!

@betolink
Contributor

betolink commented Nov 6, 2021

I think we are OK! Maybe it's just worth mentioning that we'll have a one-day clinic this Tuesday that may have ~30 participants.

@choldgraf
Member Author

choldgraf commented Nov 9, 2021

@betolink - wait, will there be 30 people logging on to the hub at once on Tuesday (tomorrow?)? I thought this event began on the 15th. FYI, the hub might be slow to scale up, as we have not pre-created nodes for the event (we were intending to do this before the 15th, not the 9th).

@betolink
Contributor

betolink commented Nov 9, 2021

The event starts on the 15th, @choldgraf. Tomorrow morning we are doing a preliminary 2-hour clinic to get the participants familiar with the hub. I don't think we'll have more than 25 people. It's OK if the spawning is not as fast.

@jules32
Contributor

jules32 commented Nov 9, 2021

Hi @choldgraf, tomorrow we're holding a pre-hackathon clinic to help folks get familiar with Python, Bash, and GitHub, and also log into 2i2c. They won't be running any code with cloud data there, but it's an opportunity to test more logins at once and for them to get familiar ahead of time. We'll likely have fewer than 30 people attend, so we expect it will work with your timeline for the 15th, which is the real "go time".

Also, over the past weeks we've been having practice dry runs with our team so we've had 10+ people all log in and run code with cloud data and it's been going quite smoothly.

@choldgraf
Member Author

Gotcha - thanks for the heads up. In general, it's helpful for us to know when there are likely going to be "spikes" in users, since if those spikes are associated with a specific hub we should be able to make some modifications to make things happen more quickly.

A quick explanation for context, and so that I can re-use this in our documentation :-)

Something we are trying to do for events is "pre-cooking" the nodes on which the user sessions will run. Each node has a finite number of user sessions that can be active on it at once. When a node is ready and has space, a user's session starts pretty quickly. However, if the nodes are full and a new user starts a session, the hub first has to "scale up" and request a new node, which often takes a bit of time.

Usually this is not a big deal when the hub has off-and-on usage over time - it just means that every now and then a user has to wait a bit longer for their session to start. In the context of an event, however, it can be unnecessarily disruptive, because you often have moments like "OK, now everybody open up a session at once". That creates the awkward situation where everybody starting a session at the same time triggers a "node scale up" event on the hub, and the sessions are delayed by a few minutes while the hub scales up.

So, if we know there's an event happening, we try to give that event a dedicated set of nodes and "pre-cook" some of them so that more slots are available in general. That way user sessions tend to start more quickly. However, we don't want to do this too far in advance, because those nodes cost extra money while they sit idle, so we try to minimize that time. That's why we try to figure out when an event will start and end, so that we can do this at the appropriate time.
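
As a concrete illustration, one common way to "pre-cook" capacity (not necessarily exactly how it is done for this hub) is to run low-priority "placeholder" pods sized like a user session: they keep extra nodes alive, and real user pods evict them instead of waiting for a scale-up. A rough sketch with the Kubernetes Python client, using a hypothetical namespace, image, and sizes:

```python
# Sketch only: reserve capacity with low-priority "placeholder" pods.
# The namespace, replica count, image, and resource sizes are hypothetical.
from kubernetes import client, config

config.load_kube_config()

# A negative-priority class so real user pods can preempt the placeholders.
client.SchedulingV1Api().create_priority_class(
    client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="user-placeholder"),
        value=-10,
        global_default=False,
    )
)

placeholders = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="user-placeholder", namespace="openscapes"),
    spec=client.V1DeploymentSpec(
        replicas=20,  # roughly one "slot" per user expected at kick-off
        selector=client.V1LabelSelector(match_labels={"app": "user-placeholder"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "user-placeholder"}),
            spec=client.V1PodSpec(
                priority_class_name="user-placeholder",
                containers=[
                    client.V1Container(
                        name="pause",
                        image="k8s.gcr.io/pause:3.5",  # does nothing; just holds the reservation
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "1", "memory": "6Gi"}  # sized like a small user session
                        ),
                    )
                ],
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="openscapes", body=placeholders)
```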

If we can't pre-cook the nodes before the session on Tuesday, another thing you can try is to have people log on to the hub and start a user session before you expect them to actually use it. For example, at the beginning of the session have them log on and open a notebook, then do introductions and passive listening; hopefully, by the time you're ready for the "hands on" stuff, their sessions will be ready.

I hope some of that is helpful!

@betolink
Contributor

betolink commented Nov 9, 2021

Thanks for clarifying all this, @choldgraf!! I haven't read the 2i2c documentation extensively and was wondering what kind of nodes you use; for a moment I wondered if you used spot instances. One thing I'm not sure how is set up, but it would be good to confirm: are the instances one per user? Once they select, say, m5.large, is there a guarantee that they are not sharing those 8 GB of RAM with other users?
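
For context, a guarantee like this is usually expressed as Kubernetes resource requests/limits on each user pod. A minimal KubeSpawner-style sketch of what that typically looks like (hypothetical profile name and values, not necessarily this hub's actual configuration):

```python
# jupyterhub_config.py -- illustrative only; the profile name and sizes are hypothetical.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Small (~2 CPU, ~8 GB RAM, m5.large-sized)",
        "kubespawner_override": {
            # The scheduler reserves this much for the pod, so other users on
            # the same node cannot take it away...
            "mem_guarantee": "7G",
            "cpu_guarantee": 1.75,
            # ...and the pod is capped here, so it cannot crowd out its neighbours.
            "mem_limit": "7G",
            "cpu_limit": 2,
        },
    },
]
```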

@erinmr

erinmr commented Nov 9, 2021

@choldgraf - Right now we are doing a pre-hack clinic and I'm trying to start 2i2c. It's taking much longer than usual (~15 minutes so far). It looks like 31 people have servers running. Wondering if we have maxed out the number of users?
[Screenshot: "Screen Shot 2021-11-09 at 9 59 56 AM"]

@choldgraf
Member Author

@erinmr oops, I just saw this one (FYI, the preferred place to make support requests is [email protected]). I suspect that this is just slowness of the cluster in creating a new node. Sometimes this takes many minutes, other times just a few (part of what I was referring to above). I tried creating a new session myself and it took a few minutes, but it was able to start successfully. Were you able to log on in the end?

@erinmr

erinmr commented Nov 10, 2021 via email

@damianavila
Contributor

@erinmr, @betolink, a few more questions for you:

  1. Can you please confirm the number of attendees for the event (it seems it is close to 40 attendees, correct)?
  2. Can you please confirm the profile option you are going to use (small, medium, large, huge, a mix of those)?
  3. Do you have an idea of how "intensive" the event will be? Are you using dask-gateway at all? If so, how many worker nodes do you think you are going to need per user?

Thanks!

@erinmr

erinmr commented Nov 11, 2021 via email

@damianavila
Contributor

@yuvipanda,

  1. I have checked the master node and it is an m5.large instance with ASG min: 1 and max: 3. When I looked into the config file in the repo I see a t3.medium instance (`machineType: "t3.medium"`). Could it be the case that the master was updated to m5.large at some point and that change was not persisted in the repo?
  2. I have checked the available quotas and it seems we have a 512 vCPU limit for "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances", which might be enough depending on the usage during the event.
  3. I currently see 3 m5.large nodes and 1 m5.xlarge node... when I looked into the autoscaling groups, it seems a "Desired capacity" is already configured with those values, but I do not see any "Desired capacity" in the config files for the cluster (I only see a `maxSize`). Could it be the case this was manually configured at some point? If so, I guess we could use the same approach to pre-warm the nodes a few hours before the start of the event, right?
  4. We should deploy the support chart and all the Grafana stuff on this cluster in case we need to diagnose any problems.

cc @GeorgianaElena

@damianavila
Contributor

damianavila commented Nov 11, 2021

About item 3, I just checked again this morning and now I see just 2 nodes, and when I check the ASG, the "Desired capacity" values are now "1" for m5.large and "1" for m5.xlarge as well... so it seems someone is already playing with this config 😉.

Update: this is our autoscaler messing with us 😉

@damianavila
Contributor

damianavila commented Nov 11, 2021

Additionally, I think pre-pulling the image would be useful here as well since it takes more than 1 min to fetch the configured image (configured with the configurator, so we should promote that one into the config file to pre-pull it).

@damianavila
Contributor

damianavila commented Nov 11, 2021

I have successfully tested warming the cluster to ~80 small nodes using the ASG; we should probably do the update using kops...
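
For the record, the warm-up amounts to bumping the desired capacity of the notebook node group. A rough boto3 sketch of the kind of call involved (the ASG name and region are hypothetical placeholders, and the change should eventually be codified in the kops config):

```python
# Sketch only: pre-warm the small-notebook node group ahead of the event.
# Assumes the ASG's MaxSize already allows 80 nodes.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")
asg.set_desired_capacity(
    AutoScalingGroupName="nodes-notebook-small.openscapeshub.k8s.local",
    DesiredCapacity=80,   # "hot" hours capacity agreed for the event
    HonorCooldown=False,  # apply immediately rather than waiting out a cooldown
)
```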

@damianavila
Contributor

Since the event starts at 9 am PST, we can warm the nodes early Monday morning (EST) so we are also optimizing for cost.

@betolink
Contributor

Hi @damianavila, questions for 2i2c.

  1. Since we are trying to be flexible with our base image, if we update it (with a different Docker tag), will the pre-pulling in JupyterHub pick up the change?
  2. It looks like I can't write to shared-readwrite and I'm an admin... which leads me to another question: can we have another shared mount that all the participants can read and write? Maybe promoting this shared-readwrite/ mount to everybody in the hub? We are thinking 100 GB should be enough for now.
  3. We're thinking that setting up an S3 bucket would be super handy to show some concepts about blob storage in the cloud and also to share some results from the hack week. Is this possible within the infrastructure 2i2c can provide?
  4. If a team decides to spin up a Dask cluster, will that use the same cluster allocation we already have, or will the KubeSpawner request extra nodes? (Actually, is this even possible with the current setup?) We are not focusing on Dask, but just in case we need it.

Should I send these questions to [email protected] as well, or is here enough? Thanks!

@betolink
Contributor

I tested changing the image and it looks like it's working as expected; the new tag is pulled.

@damianavila
Contributor

damianavila commented Nov 12, 2021

Should I send these questions to [email protected] as well, or is here enough? Thanks!

We would appreciate it if you could route your questions through support!!

But since I am already here, let me reply to some of your questions now...

  1. Pre-pulling will not catch up if you update the image through the configurator. What you have tested is whether the new tag is pulled once you have configured it, and that will happen regardless of the pre-pulling setting. Pre-pulling takes care of pulling the image onto the node before the pod arrives.
    Since I want to maintain the flexibility you mentioned above, and because the current kops-based cluster supporting your hub was never tested for pre-pulling, I think it is probably a good idea to keep the status quo for now, with pre-pulling deactivated.

  2. We need to check this one; I will investigate and report back on why you cannot write to the shared-readwrite directory.
    I also need to investigate what promoting that shared directory to read+write would look like, since that is not standard practice in our deployments.

  3. You should be able to reach any S3 endpoint from your pod/notebook server, AFAIK. However, we are not doing any automatic auth on kops-based clusters (we do automatic auth on EKS-based clusters, and we are going to migrate your cluster to EKS soon, but not before the event 😉), so you would need to handle the auth yourself.

  4. Your current kops-based cluster is what we call a daskhub, and it will spin up Dask-specific nodes if you start a Dask cluster (provided you have all the Dask-specific pieces in your custom image, such as dask-gateway). We should be OK with a few users spinning up clusters, but I would be worried if all your attendees wanted to start Dask clusters, because our pressure-testing above (quota testing) started small profiles with no Dask clusters involved. See the sketch below for what starting a cluster typically looks like from a notebook.
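
A minimal sketch of what a participant would run in a notebook to request a Dask cluster through dask-gateway, assuming the client-side pieces are in the image as noted above:

```python
# Rough sketch: start a Dask cluster from a user session via dask-gateway.
from dask_gateway import Gateway

gateway = Gateway()              # picks up the hub-provided gateway address and auth
cluster = gateway.new_cluster()  # each cluster gets its own scheduler/worker pods
cluster.scale(4)                 # workers land on the Dask-specific node group
client = cluster.get_client()    # subsequent Dask computations run on those workers
```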

cc @GeorgianaElena who is playing our support steward role during your event.

@damianavila
Contributor

Now, let's agree on some points for the event:

  1. Please, going forward, route any questions/requests to the support email as indicated on this page: https://docs.2i2c.org/en/latest/support.html
  2. We are going to warm up the cluster at 7:45 am PT so it is ready to receive your attendees, and cool it down at 1:00 pm PT. During the "hot" hours we will warm up to 80 concurrent small nodes; if you ever need more, let us know so we can raise that limit. During the "cold" hours the maximum will be 20 concurrent small nodes. We chose this pattern to optimize for cost in line with your event schedule. If you believe the hours or numbers should be adjusted, let us know ASAP.

@betolink
Contributor

Hi @damianavila, we think that 40 small instances would be the minimum needed during the week, even after 1 pm PT. Is this a hard limit, or will instances beyond 40 just take more time to spawn? We'll route future support questions to [email protected].

@damianavila
Contributor

@betolink, do you think there will be up to 80 concurrent users during the "cold" hours at some point?
If your answer is yes, then I would recommend keeping the 80 nodes during the whole week.
If the answer is no, then what would be the maximum number of users during the "cold" hours, given your estimates?
Currently, the maximum number of nodes is an "automated" hard limit, but we can change it manually if we need to (and if you signal that to us).

Btw, the cluster is already pre-warmed with 80 nodes as of now...

Note: in the near future, when we migrate your cluster to EKS, the whole experience will be better: we will just define a min and max number of nodes and the autoscaler will do the magic for you according to the user load.

@betolink
Contributor

It's hard to say if we'll have 80 concurrent users in cold hours; my guess is no, but what if... @erinmr, are we OK with 80 during the whole week?

@jules32
Contributor

jules32 commented Nov 15, 2021 via email

@betolink
Contributor

It seems like 60 is the magic number. We're not seeing more than 50 today, and that will probably be the case for the rest of the week. @damianavila

@damianavila
Contributor

damianavila commented Nov 15, 2021

Yep, I was monitoring the whole day and I can confirm the 50 pods/nodes.
Btw, I also detected some pods (3) on nodes bigger than the small ones; not sure if that was attendees choosing the wrong profile or something else (nothing to be worried about, IMHO, just letting you know).

I guess everything went smoothly today, am I right @betolink?

Btw, I will adjust the number of available small nodes to be 60 later tonight...

I will also try to monitor the nodes during the night to see if it really makes sense to keep 60 nodes live the whole time (my soul cries knowing that some nodes would be there without actually being used 😉 ).

@betolink
Contributor

Everything went smoothly, @damianavila! I bet you're right: probably 30 during cold hours, but maybe, just maybe, at the end of the week everybody will actually be working on their projects and that might change. I'm looking forward to the EKS autoscaler for the next event!

@damianavila
Contributor

Btw, I will adjust the number of available small nodes to be 60 later tonight...

@betolink, I have drained the 80 small nodes so I can set up the new "magic" value of 30 for the "cold" hours (sorry if your pod was restarted, that should not happen in the future even with new adjustments!).

Btw, right now, I can see 3 small nodes being used (one of them is a dask one 😉 ) and I can also see some bigger nodes (xlarge and 2xlarge) being used for more than 8 hours (be aware of those in case you want to terminate them).

Finally, I will raise the "magic" value to 60 tomorrow morning PT for the "hot" hours, as I did earlier today (pre-warm process).

Have a great evening/night!

@damianavila
Contributor

OK, the cluster is already pre-warmed to 60 small nodes.

Btw, I have been following the nodes' occupancy for the last 12-14 hours, and during the night we had at most 2-3 concurrent users. I think we would be totally OK cooling down the cluster to the standard configuration from 6 pm to 8 am, which still gives a fully functional autoscaling process able to support up to 20 concurrent users while saving resources when only a few people are working during the night.

@betolink, I would recommend the above process going forward, re-adjusting if necessary according to the real demand; let me know WDYT.

@betolink
Contributor

I think you're right again, @damianavila; let's use the defaults for the cold hours. Quick question: what's the idle time before an instance gets shut down?

@damianavila
Contributor

Quick question, what's the idle time before an instance gets shut down?

For the default experience, IIRC, the autoscaler terminates the node about 10 minutes after it is vacated.

@betolink
Contributor

Hi @damianavila, do you know if you can add the `--collaborative` flag in the hub configuration, or do I need to do it in the Docker image? This is to enable https://jupyterlab.readthedocs.io/en/stable/user/rtc.html

@damianavila
Contributor

@betolink, we actually have an open issue to add this feature here: #441
But, regrettably, it seems there is currently a blocker for making it happen: jupyterlab-contrib/jupyterlab-link-share#10.

@consideRatio may have more details... but I think it is unlikely we'll have this feature enabled for the event that is already running.

@consideRatio
Member

The link share button is a user interface feature that doesn't work in a JupyterHub setup with typical authentication, but collaborative mode can be enabled in a way that allows JupyterHub admins to collaborate with other admins or with non-admin users on the non-admin users' servers. This is because admins can be granted permission to access other users' servers.

@damianavila you can enable this via https://github.com/2i2c-org/infrastructure/pull/436/files#diff-ca80c8d18c23e271ff0620419ee36c9229c33141accc37f88b6345ecd34bef42R68-R73; the referenced issue is resolved with the latest version of JupyterLab, so I think it may work properly. You would also need to make sure the JupyterHub admins have the ability to access other users' servers, if that isn't already enabled.
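
For anyone following along, that combination boils down to roughly the sketch below in a plain `jupyterhub_config.py`. This is illustrative only; on this hub the equivalent settings live in the 2i2c infrastructure config referenced above, and the exact wiring may differ.

```python
# Illustrative sketch of the two pieces needed for shared editing via JupyterHub:
# pass JupyterLab's RTC flag to the single-user servers, and let hub admins
# access other users' servers. Exact wiring for this deployment may differ.
c.Spawner.args = ["--collaborative"]  # JupyterLab >= 3.1 real-time collaboration flag
c.JupyterHub.admin_access = True      # admins may open other users' servers
```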

@damianavila
Contributor

Thanks for your feedback, @consideRatio.

@betolink, given that we are approaching the end of the event, I would suggest we try to set up this configuration next week, maybe? We need to test some things on our side first, as @consideRatio suggested.

@betolink
Contributor

Just to thank you all, we had a great hack week and your support was key! @damianavila @choldgraf @consideRatio

Our teams will continue to work on their projects but I guess we won't need 60 concurrent instances. I look forward to the migration to EKS and enabling the real time collaboration feature!

@jules32
Contributor

jules32 commented Nov 19, 2021

Echoing Luis, thank you all so much! Excited with how this worked this week and that it's just the beginning!

@choldgraf
Member Author

ahhh these posts warm my heart :-)

THANK YOU to the 2i2c team for being so great

@erinmr

erinmr commented Nov 19, 2021 via email

@damianavila
Contributor

It is wonderful to hear you have a nice experience, folks!
And, as you said, this is just the beginning and there are a lot of interesting things for all of us ahead.
If we can do it together, I have no doubt it will be a path full of success and fun!

Btw, I have already "post-cooled" the cluster and it is now running with the "standard/default" behaviour (the autoscaler "serves" nodes as demand grows, up to a max of 20 nodes per profile option). Currently, there are only 10 concurrent users (7 using the small instances, the rest using bigger ones) and I presume the number will go down as we enter the weekend.

Finally, we will be working soon on the transition to EKS and adding RTC capabilities, so stay tuned! 😉
And have a great weekend and some rest after a long week!!

@damianavila
Contributor

@betolink @erinmr and @jules32, we are trying to understand where we can improve our hub infrastructure and setup around events, and would love any feedback that you're willing to give. Would you mind answering the following questions? If not, just let us know and that is no problem!

  • Did the infrastructure behave as expected?
  • Anything that was confusing or could be improved?
  • Any extra functionality you wish you would have had?
  • Are you willing to share a story about how you used the hub?
  • Any other feedback that you'd like to share?

I know you already replied to some of these questions, but if you want to provide more context and/or answer the ones you did not reply to yet, we would greatly appreciate it!! Thanks!

@betolink
Contributor

Did the infrastructure behave as expected?

  • I think it did; we only had a minor hiccup at the clinic (before the hack week) because we were not expecting that many participants.

Anything that was confusing or could be improved?

  • I'm probably merging this and the next question... 2i2c hubs seem to follow the "right to reproduce" principles (which is great!), but some use cases like ours require more flexibility for the user; i.e. we cannot dictate which libraries people want to use, and a restrictive base environment can be a blocker for some. We noticed that some of the participants wanted to have some control over what they installed. Our solution was to use a Pangeo-based image that could serve multiple environments; the idea was that if they needed to work with their own environments they only had to open a PR to that project and the base image would update accordingly. However, our users are not software engineers; they are more comfortable just doing pip install or conda install and expect that what they do will be there the next time they use their instances. Following the discussion at the beginning of this thread, maybe there is a middle ground: a base image with a base environment plus the ability to let users persist their changes (with warnings that they are altering the base environment, etc.).
  • Another thing we didn't use, and I think it's available via JupyterHub, is letting the user pick a base image. This could also be useful for hack weeks.
  • RTC! This would be a super feature to have for future events; it seems like it is totally doable, so I'm really excited about it.

Are you willing to share a story about how you used the hub?

I think so. One of the cool things about the hack week on 2i2c is that we started to get into Pangeo territory, and just the fact that we don't have to worry about configuring Dask is worth telling. I bet @erinmr and @jules32 will have more stories from the participants!

Any other feedback that you'd like to share?

I like having all the questions on GitHub for visibility and readability rather than emailing 2i2c, but I guess you have your own logistics as well. Thank you all for your support!

@damianavila
Contributor

Thanks for the additional feedback, @betolink! It is very useful for us!
Btw, I think the next steps are actually collected or referenced in other issues already linked here, so closing this issue now.

Thanks all for your contributions here!

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Nov 24, 2021