CTS-1 program launch testing #15
Looks like the current default MPI when I log in to opal still requires PMGR. I may open a TOSS-specific Jira issue on this unless this is a known issue already?
Part of the problem is that the default mpicc on opal seems to build with rpath :-(
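A quick way to confirm whether an rpath got baked into a binary (a generic sketch; mpi_hello is just a placeholder name):

```sh
# Check the dynamic section for an RPATH/RUNPATH entry left by the wrapper
readelf -d ./mpi_hello | grep -E 'RPATH|RUNPATH'

# MPICH-derived wrappers accept -show to print the underlying compile/link
# command; other MPI wrappers may use a different flag (e.g. --showme)
mpicc -show
```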
Just FYI -- At least I don't plan to scale the overcommit factor that much as part of my testing. The max factor for #14 is 32.
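As an illustration only (the hello binary and the node and core counts below are placeholders, not values from this issue), an overcommit sweep like that might look like:

```sh
# Sweep the overcommit factor by launching more tasks than there are cores.
# nodes and cores_per_node are placeholders for the allocation under test.
nodes=32
cores_per_node=36
for oc in 1 2 4 8 16 32; do
    ntasks=$((nodes * cores_per_node * oc))
    echo "overcommit factor $oc: $ntasks tasks"
    flux wreckrun -n $ntasks ./hello
done
```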
Running into some problems sanity testing
Note above that we get notified of each state transition: reserved->starting->running->complete.
In this run, the script appears to miss all the events up until a certain point. This is further detailed in flux-framework/flux-core#772.
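For context, one way a script can watch these transitions is to subscribe to job events on the Flux event bus. The topic prefix below is an assumption about how the wreck system names its state-change events, not something confirmed in this thread:

```sh
# Print job state-change events as they arrive.  The wreck.state. topic
# prefix is assumed here; substitute whatever topics the installed
# flux-core version actually publishes for job state changes.
flux event sub wreck.state.
```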
Ok, thanks to @SteVwonder we found that missing state transitions are simply due to the fact that However, there appear to be other problems with However, jobs still run to completion, so data for this part of the milestone can still be gathered via use of
I was able to grab 512 nodes on jade this morning, and get some preliminary results of
All of these were run with default stdio commit settings, except lwj 9 and 10, which used
Got time with 2048 nodes, running at 64K
This happened with and without
Btw, here I'm using a new (kludgy)
Before my session exits, here are results up to 32K tasks on 2048 nodes:
Note that
I did verify that a 16K MPI job works
I was able to get some more runs this morning, including a ~43K-task MPI hello job:
This was launched across 2216 nodes. Other jobs run were test runs of
I got up to 28 tasks per node before hitting the issue in the comment above.
_Goals_
Test scalability and usability of Flux program launch on the full system. Identify any bugs or scaling and usability issues.
_Methodology_
Launch and collect timing data for a series of programs, both MPI and non-MPI, and compare with baseline SLURM launch data as collected in #13. Utilize and/or enhance instrumentation already extant in `flux-wreckrun` and record the timing of phases, as well as the entire time to fully execute each parallel program from a hot cache.
Run these tests through a similar scale as the baseline described in #13, with enough samples for statistical validity. Vary the number of tasks per broker rank as well as the number of total tasks for each program. Publish results in this issue.
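A rough sketch of how such a sweep could be scripted is below (all names and counts are placeholders, not the actual harness used for this milestone):

```sh
# Vary the total task count and time each launch from a hot cache.
# mpi_hello, the size list, and the repetition count are placeholders;
# real runs should repeat each point enough times for statistical validity.
for ntasks in 256 1024 4096 16384 65536; do
    for rep in 1 2 3; do
        t0=$(date +%s.%N)
        flux wreckrun -n $ntasks ./mpi_hello
        t1=$(date +%s.%N)
        echo "ntasks=$ntasks rep=$rep wallclock=$(echo "$t1 - $t0" | bc)"
    done
done
```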
Time permitting, include scale testing a program with increasing amounts of stdio and record the impact on runtime (Tcompleted - Trunning).
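For example, one simple way to generate increasing amounts of stdout per task (a sketch only; the line counts and task count are illustrative):

```sh
# Increase the amount of stdout each task produces and record the runtime.
ntasks=1024   # placeholder task count
for lines in 10 100 1000 10000; do
    t0=$(date +%s.%N)
    flux wreckrun -n $ntasks sh -c "seq $lines"
    t1=$(date +%s.%N)
    echo "lines_per_task=$lines wallclock=$(echo "$t1 - $t0" | bc)"
done
```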
_Exit criteria_
flux wreckrun -n $((2304*cores_per_node)) mpi_hello
_Issues to watch out for_
Use of `persist-directory` to ensure the content-store doesn't fill up local tmp or tmpfs during these runs.
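A hedged sketch of what that could look like (the option syntax for setting broker attributes varies across flux-core versions, and the path is a placeholder):

```sh
# Point the broker's content store at a large filesystem rather than
# node-local tmp/tmpfs.  The --setattr spelling and the path are assumptions;
# see flux-broker-attributes(7) for the attribute and your version's syntax.
flux start -o,--setattr=persist-directory=/path/to/large/fs
```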