[EVENT] OpenScapes NASA Cloud Hackathon #810
Comments
Hey @jules32, the date of your event is getting close! Could you please confirm that your hub environment is ready to go and matches your hub's infrastructure setup by ensuring the following things:
|
Hi @choldgraf, thanks for starting this issue! At the moment I think we can check each box, but looping in @betolink to confirm! |
I think we are OK! Maybe it's just worth mentioning that we'll have a one-day clinic this Tuesday that may have ~30 participants. |
@betolink - wait, will there be 30 people logging on to the hub at once on Tuesday (tomorrow)? I thought this event began on the 15th. FYI, the hub might be slow to scale up, as we have not pre-created nodes for the event (we were intending to do this before the 15th, not the 9th). |
The event starts on the 15th, @choldgraf; tomorrow morning we are doing a preliminary 2-hour clinic to get the participants familiar with the hub. I don't think we'll have more than 25 people. It's OK if the spawning is not as fast. |
Hi @choldgraf, tomorrow we're holding a pre-hackathon clinic to help folks get familiar with Python, Bash, GitHub, and logging into 2i2c. They won't be running any code with cloud data there, but it's an opportunity to test more logins at once and for them to get familiar ahead of time. We'll likely have fewer than 30 people attend, so we expect it will be fine with your timeline for the 15th, which is the real "go time". Also, over the past weeks we've been running practice dry runs with our team, so we've had 10+ people all log in and run code with cloud data, and it's been going quite smoothly. |
Gotcha - thanks for the heads up. In general, it's helpful for us to know when there are likely to be "spikes" in users, since if those spikes are associated with a specific hub we can make some modifications to make things happen more quickly. A quick explanation for context, and so that I can re-use this in our documentation :-)

Something we are trying to do for events is "pre-cooking" the nodes on which the user sessions will run. Each node can host a finite number of active user sessions at once. When a node is ready and has space, a user's session starts pretty quickly. However, if all nodes are full and a new user starts a session, the hub has to "scale up" and request a new node first, which often takes a bit of time. Usually that's not a big deal when the hub has off-and-on usage over time - it just means that every now and then a user waits a bit longer for their session to start. In the context of an event, though, this can be unnecessarily disruptive, because you often have moments like "OK, now everybody open up a session at once". That tends to create the awkward moment where everybody starts a session, the hub triggers a "node scale up" event, and the sessions are delayed by a few minutes while the hub scales up.

So, if we know an event is happening, we try to give it a dedicated set of nodes and "pre-cook" some of them so that there are more slots available in general, and user sessions tend to start more quickly. However, we don't want to do this too far in advance because those nodes cost extra money, so we want to minimize that time. That's why we try to figure out when an event will start and end, so we can pre-cook at the appropriate time.

If we can't pre-cook the nodes before the session on Tuesday, another thing you can try is having people log on to the hub and start a user session before you expect them to actually use it. For example, at the beginning of the session have them log on and open a notebook, then do introductions and passive listening; hopefully by the time you're ready for the "hands on" stuff, their sessions will be ready. I hope some of that is helpful! |
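For context, here is a minimal sketch of what this "pre-cooking" can look like on the infrastructure side, assuming the user nodes are backed by an AWS Auto Scaling Group; the group name, region, and node counts are placeholders, not this cluster's real values:

```python
# Hypothetical sketch: pre-warm user nodes before an event by bumping the
# desired capacity of the Auto Scaling Group backing the user node pool.
# Group name, region, and counts are placeholders, not this cluster's values.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")

# Bring the node group up to 20 nodes ahead of the event, instead of waiting
# for the autoscaler to react to user demand node by node.
asg.set_desired_capacity(
    AutoScalingGroupName="openscapes-user-notebook-nodes",  # placeholder name
    DesiredCapacity=20,
    HonorCooldown=False,  # apply immediately, ignoring the ASG cooldown timer
)

# After the event, drop back down so we stop paying for idle nodes.
asg.set_desired_capacity(
    AutoScalingGroupName="openscapes-user-notebook-nodes",
    DesiredCapacity=1,
    HonorCooldown=False,
)
```

Note that an active cluster autoscaler may scale empty pre-warmed nodes back down again, so raising the group's minimum size as well may be needed to keep the warm capacity around.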
Thanks for clarifying all this @choldgraf!! I haven't read the 2i2c documentation extensively and was wondering what kind of nodes you use; for a moment I wondered if you used spot instances. Now, one thing I'm not sure how it is set up, but it would be good to confirm: the instances are 1 per user, right? Once they select, say |
@choldgraf - We are running the pre-hack clinic right now and I'm trying to start a 2i2c session. It's taking much longer than usual (~15 min so far). It looks like 31 people have servers running. Wondering if we have maxed out on users? |
@erinmr oops, I just saw this one (FYI the preferred place to make support requests is [email protected]). I suspect that this is just slowness of the cluster in creating a new node. Sometimes this takes many minutes, other times just a few (part of what I was referring to above). I tried creating a new session myself and it took a few minutes, but it was able to start successfully. Were you able to log on in the end? |
Thanks so much, Chris. It did eventually start successfully. We will use the support@ address for urgent support needs in the future. Thanks!
|
@erinmr, @betolink, a few more questions for you:
1. Can you please confirm the number of attendees for the event (it seems it is close to 40 attendees, correct)?
2. Can you please confirm the profile option you are going to use (small, medium, large, huge, or a mix of those)?
3. Do you have an idea of how "intensive" the event will be? Are you using dask-gateway at all? If so, how many worker nodes do you think you are going to need per user?
Thanks! |
Hi Damian,
Let's plan for 80 attendees: 40 hackers, plus 20 staff and 20 observers that may start an instance.
We plan to have them just using small profiles. There may be a few that tend toward larger instances later in the week, but that should be fewer than 10 instances total.
No dask-gateway is planned for this round.
Here is the material we are planning to teach, to give a sense of scope: https://nasa-openscapes.github.io/2021-Cloud-Hackathon/
Thanks,
E
|
About item 3, I just checked again this morning and now I see just 2 nodes, and when I check the ASG, the "Desired capacity" values are now "1" for m5.large and "1" for m5.xlarge as well... so it seems someone is already playing with this config 😉. Update: this is our autoscaler messing with us 😉 |
Additionally, I think pre-pulling the image would be useful here as well, since it takes more than a minute to fetch the configured image (it is currently set with the configurator, so we should promote it into the config file so it can be pre-pulled). |
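For reference, the Zero to JupyterHub chart has a pre-puller option for images that are set in the config file, which is one reason to promote the configurator value there. As a rough illustration of the underlying idea, here is a hedged sketch using the official `kubernetes` Python client; the image name and namespace are placeholders:

```python
# Hypothetical sketch: cache the user image on every node with a DaemonSet,
# so spawns don't wait on an image pull. Image, namespace, and names are
# placeholders, not this hub's real configuration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster

IMAGE = "openscapes/python-example:2021.11"  # placeholder user image tag
NAMESPACE = "openscapes-staging"             # placeholder namespace

container = client.V1Container(
    name="image-puller",
    image=IMAGE,
    # Do nothing forever; the point is just that the kubelet pulls the image.
    command=["/bin/sh", "-c", "sleep infinity"],
)

daemonset = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="pre-puller", namespace=NAMESPACE),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "pre-puller"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "pre-puller"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

# A DaemonSet schedules one pod per node, forcing the image to be cached everywhere.
client.AppsV1Api().create_namespaced_daemon_set(namespace=NAMESPACE, body=daemonset)
```

In practice, letting the chart's pre-puller manage this is the simpler route; the sketch just shows why it removes the ~1 minute image fetch from each spawn.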
I have successfully tested warming the cluster to ~80 small nodes using the ASG; we should probably do the update using kops... |
Since the event starts at 9 am PST, we can warm the nodes on Monday morning EST, so we are optimizing for cost as well. |
Hi @damianavila, questions for 2i2c.
Should I send these questions to [email protected] as well, or is asking here enough? Thanks |
I tested changing the image, and it looks like it's working as expected; the new tag is pulled. |
We would appreciate it if you could route your questions through support!! But since I am already here, let me reply to some of your questions now...
cc @GeorgianaElena who is playing our support steward role during your event. |
Now, let's agree on some points for the event:
|
Hi @damianavila, we think that 40 small instances would be the minimum during the week, even after 1 PM PT. Is this a hard limit, or will instances beyond 40 just take more time to spawn? We'll route future support questions to [email protected]. |
@betolink, do you think there will be up to 80 concurrent users during the "cold" hours at some point? Btw, the cluster is already pre-warmed with 80 nodes as of now... Note: in the near future, when we migrate your cluster to EKS, the whole experience will be better: we will just define a min and max number of nodes and the autoscaler will do the magic for you according to the user load. |
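To illustrate what that EKS setup could look like, here is a hedged sketch; every name, ARN, subnet, and size below is a placeholder rather than a committed design. A managed node group carries a min/max/desired scaling configuration that the cluster autoscaler then works within:

```python
# Hypothetical sketch: an EKS managed node group with min/max bounds for the
# cluster autoscaler to scale within. All identifiers and sizes are placeholders.
import boto3

eks = boto3.client("eks", region_name="us-west-2")

eks.create_nodegroup(
    clusterName="openscapes-eks",                              # placeholder cluster
    nodegroupName="user-notebook-small",                       # placeholder node group
    subnets=["subnet-aaaa1111", "subnet-bbbb2222"],            # placeholder subnets
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",   # placeholder IAM role
    instanceTypes=["m5.large"],
    scalingConfig={
        "minSize": 1,      # keep at least one warm node around
        "maxSize": 80,     # ceiling for peak event load
        "desiredSize": 1,  # the autoscaler raises this as users log in
    },
)
```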
It's hard to say if we'll have 80 concurrent users in cold hours; my guess is no, but what if... @erinmr, are we OK with 80 during the whole week? |
Yes, 80 concurrent sounds good for the whole week.
|
It seems like 60 is the magic number. Not seeing more than 50 today, and that will probably be the case for the rest of the week. @damianavila |
Yep, I was monitoring the whole day and I can confirm the 50 pods/nodes. I guess everything went smoothly today, am I right @betolink? Btw, I will adjust the number of available small nodes to be 60 later tonight... I will also try to monitor the nodes during the night to see if it really makes sense to keep 60 nodes live the whole time (my soul cries knowing that some nodes would be there without actually being used 😉 ). |
Everything went smoothly, @damianavila! I bet you're right, probably 30 during cold hours, but maybe, just maybe, at the end of the week everybody will actually be working on their projects and that might change. I'm looking forward to the EKS autoscaler for the next event! |
@betolink, I have drained the 80 small nodes so I can set the new "magic" value of 30 for the "cold" hours (sorry if your pod was restarted; that should not happen in the future, even with new adjustments!). Btw, right now I can see 3 small nodes in use (one of them is a dask one 😉) and I can also see some bigger nodes (xlarge and 2xlarge) that have been in use for more than 8 hours (be aware of those in case you want to terminate them). Finally, I will raise the "magic" value back to 60 tomorrow morning PT for the "hot" hours, as I did earlier today (the pre-warm process). Have a great evening/night! |
OK, the cluster is already pre-warmed to 60 small nodes. Btw, I have been following the nodes' occupancy for the last 12-14 hours, and during night time we had at most 2-3 concurrent users. I think we would be totally OK cooling the cluster down to the standard configuration from 6 pm to 8 am, which gives us a fully functional autoscaling process able to support up to 20 concurrent users while saving resources if only a few people are working during the night. @betolink, I would recommend this process going forward, re-adjusting as necessary according to real demand; let me know WDYT. |
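For anyone curious how we follow the occupancy, here is a hedged sketch that counts running user-server pods per node with the `kubernetes` Python client; the namespace is a placeholder and the label selector assumes a typical Zero to JupyterHub deployment:

```python
# Hypothetical sketch: count running single-user servers per node to gauge
# occupancy. Namespace is a placeholder; the label is the usual z2jh one.
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="openscapes-prod",                   # placeholder namespace
    label_selector="component=singleuser-server",  # user-server pods
    field_selector="status.phase=Running",
)

per_node = Counter(pod.spec.node_name for pod in pods.items)
print(f"{len(pods.items)} concurrent users across {len(per_node)} nodes")
for node, count in per_node.most_common():
    print(f"  {node}: {count} user pod(s)")
```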
I think you're right again, @damianavila; let's use the defaults for the cold hours. Quick question: what's the idle time before an instance gets shut down? |
For the default experience, IIRC, the autoscaler terminates the node about 10 minutes after it is vacated. |
Hi @damianavila, do you know if you can add the link share button to our setup? |
@betolink, we actually have an open issue to add this feature: #441. @consideRatio may have more details... but I think it is unlikely that this feature will be enabled for the already-running event. |
The link share button is a user interface feature that doesn't work in a JupyterHub setup with typical authentication, but collaborative mode can be enabled in a way that allows JupyterHub admins to collaborate with other admins or with non-admin users on the non-admin users' servers. This is because admins can be granted permission to access other users' servers. @damianavila, you can enable this via https://github.com/2i2c-org/infrastructure/pull/436/files#diff-ca80c8d18c23e271ff0620419ee36c9229c33141accc37f88b6345ecd34bef42R68-R73; the referenced issue is resolved with the latest version of JupyterLab, so I think it may work properly now. You would also need to make sure the JupyterHub admins have the ability to access other users' servers, if that isn't already enabled. |
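As a rough sketch of the setup being described, expressed in plain `jupyterhub_config.py` terms rather than our Helm config (the exact keys and flag spelling depend on the JupyterHub and JupyterLab versions in use; the linked PR shows the Helm-side equivalent):

```python
# Hypothetical jupyterhub_config.py sketch of the admin-collaboration setup
# described above; 2i2c hubs would set the equivalent values via Helm config.
c = get_config()  # provided when JupyterHub loads this file

# Let JupyterHub admins access (and therefore join) other users' servers.
c.JupyterHub.admin_access = True

# Start each single-user server with JupyterLab's real-time collaboration
# enabled, so an admin who opens a user's server can edit alongside them.
# Flag spelling assumed from JupyterLab 3.1+; adjust for the image's versions.
c.Spawner.args = ["--LabApp.collaborative=True"]
```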
Thanks for your feedback, @consideRatio. @betolink, given that we are approaching the end of the event, I would suggest we try to set up this configuration next week, maybe? We need to test some things on our side first, as @consideRatio suggested. |
Just to thank you all, we had a great hack week and your support was key! @damianavila @choldgraf @consideRatio Our teams will continue to work on their projects but I guess we won't need 60 concurrent instances. I look forward to the migration to EKS and enabling the real time collaboration feature! |
Echoing Luis, thank you all so much! Excited about how this worked this week and that it's just the beginning! |
ahhh these posts warm my heart :-) THANK YOU to the 2i2c team for being so great |
This is an exciting chart that shows our AWS/2i2c usage. We loved it! Thank you especially to Georgiana for the super support.
T
|
It is wonderful to hear you had a nice experience, folks! Btw, I have already cooled the cluster back down and it is now working with the standard/default behaviour (with the autoscaler serving nodes as demand grows, up to a max of 20 nodes per profile option). Currently there are only 10 concurrent users (7 using the small instances, the rest using bigger ones), and I presume the number will go down as we enter the weekend. Finally, we will soon be working on the transition to EKS and on adding RTC capabilities, so stay tuned! 😉 |
@betolink @erinmr and @jules32, we are trying to understand where we can improve our hub infrastructure and setup around events, and would love any feedback that you're willing to give. Would you mind answering the following questions? If not, just let us know and that is no problem!
I know you already replied to some of these questions, but if you want to provide more context and/or answer the ones you did not reply to yet, we would greatly appreciate it!! Thanks! |
Did the infrastructure behave as expected?
Anything that was confusing or could be improved?
Are you willing to share a story about how you used the hub? I think so. One of the cool things about the hack week on 2i2c is that we started to get into Pangeo territory, and just the fact that we don't have to worry about configuring Dask is worth telling. I bet @erinmr and @jules32 will have more stories from the participants!
Any other feedback that you'd like to share? I like having all the questions on GitHub for visibility and readability rather than emailing 2i2c, but I guess you have your logistics as well. Thank you all for your support. |
Thanks for the additional feedback, @betolink! It is very useful for us! Thanks all for your contributions here! |
Summary
OpenScapes is running their first cloud hackathon for the NASA community from November 15th through 19th. During that time they will have ~40 people accessing the OpenScapes instance, and participants will then retain access for the next 3 months.
Supersedes #767 (this is the same issue, but now with our event template so we can try it out).
Info
Task list
Before the event
👉Template message to send to community representative
During and after event
👉Template debrief to send to community representative