Add support for real-time events and custom signal handlers #1096

Open
gvoskuilen opened this issue Jun 26, 2024 · 0 comments
Labels: Enhancement, in progress, Major Feature (a new feature that has broad impact on codebase and requires a minimum two week discussion period)

Comments

@gvoskuilen
Contributor

Today, the only real-time (wall-clock) feature SST supports is the --exit-after command-line option. We would like to add support for heartbeats and checkpoints triggered on a real-time interval rather than a simulated-time interval. In addition, we want to make it possible to register both user-supplied signal handlers and custom real-time events. We envision this being useful for debugging, among other use cases.

Our solution is as follows:

  1. Add a real-time management class (RealTimeManager) to unify signal handling and real-time event handling. The class would manage setting SIGALRM for real-time events and would invoke handlers in response to signals (current list: SIGUSR1, SIGUSR2, SIGINT, SIGTERM, SIGALRM). The default handlers would behave the same as today.
  2. Add a base class and default handlers for RealTimeActions that trigger on signals:
     • SIGUSR1, SIGUSR2: the default is the same as today (status output), but the user could replace it with a custom handler.
     • SIGINT, SIGTERM: terminate after generating a checkpoint. Today, shutdown is immediate and causes MPI error messages; the new behavior would delay shutdown to a sync point so that it is graceful.
     • SIGALRM: passed to real-time actions, which could be a mix of default (e.g., checkpoint, heartbeat, exit) and user-supplied actions.
  3. Add ELI types for the new RealTimeAction so that user libraries can register signal handlers and recurring real-time events.
  4. Properly handling signals and real-time events in parallel will require communication among ranks at sync points after a signal is received. Instead of adding additional allreduces to the sync object, we will unify the new checks with the existing checks for end-of-simulation/skip intervals and perform them in a single operation on a custom data structure.
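
To make the proposed flow concrete, here is a minimal sketch of the pattern described in the list: the signal handler only records that a signal arrived, each rank packs what it saw into one small structure, the structures are merged, and the registered actions run at the sync point. Everything here is an assumption for illustration and not the SST API: `RealTimeManagerSketch`, `Action`, `SyncState`, `merge`, and `demoSyncPoint` are hypothetical names, and the cross-rank exchange is simulated with a plain function call instead of an MPI reduction.

```cpp
#include <cassert>
#include <csignal>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for the proposed RealTimeAction base class.
using Action = std::function<void(int signum)>;

// Signals named in the proposal; bit i of the mask corresponds to kWatched[i].
constexpr int kNumWatched = 5;
const int kWatched[kNumWatched] = { SIGUSR1, SIGUSR2, SIGINT, SIGTERM, SIGALRM };
volatile std::sig_atomic_t g_seen[kNumWatched] = {};

// The handler does no real work: it only records that the signal arrived.
void recordSignal(int signum) {
    for (int i = 0; i < kNumWatched; ++i)
        if (kWatched[i] == signum) g_seen[i] = 1;
}

// Hypothetical per-rank state exchanged once per sync point.
struct SyncState {
    std::uint32_t pending_signals = 0;     // one bit per watched signal
    bool          end_simulation  = false; // stands in for the existing checks
};

// Combine two ranks' states. With MPI this logic could live in a custom
// reduction operation; here it is an ordinary function (assumption).
SyncState merge(const SyncState& a, const SyncState& b) {
    SyncState out;
    out.pending_signals = a.pending_signals | b.pending_signals;
    out.end_simulation  = a.end_simulation || b.end_simulation;
    return out;
}

class RealTimeManagerSketch {
public:
    // Install the recording handler and remember the action for later.
    void registerAction(int signum, Action action) {
        actions_[signum] = std::move(action);
        std::signal(signum, recordSignal);
    }

    // Collect and clear the signals this rank has observed so far.
    SyncState localState() {
        SyncState s;
        for (int i = 0; i < kNumWatched; ++i) {
            if (g_seen[i]) {
                g_seen[i] = 0;
                s.pending_signals |= (1u << i);
            }
        }
        return s;
    }

    // Run the registered action for every signal that any rank reported.
    void dispatch(const SyncState& merged) {
        for (int i = 0; i < kNumWatched; ++i) {
            if (merged.pending_signals & (1u << i)) {
                auto it = actions_.find(kWatched[i]);
                if (it != actions_.end()) it->second(kWatched[i]);
            }
        }
    }

private:
    std::map<int, Action> actions_;
};

// Walk through one simulated sync point; returns how often the action ran.
int demoSyncPoint() {
    RealTimeManagerSketch mgr;
    int calls = 0;
    mgr.registerAction(SIGUSR1, [&calls](int) { ++calls; });

    std::raise(SIGUSR1);   // delivered now, but only recorded
    assert(calls == 0);    // the action has not run yet

    SyncState other_rank;  // a second rank that saw nothing
    SyncState merged = merge(mgr.localState(), other_rank);
    mgr.dispatch(merged);  // the action runs here, at the sync point

    std::signal(SIGUSR1, SIG_DFL);
    return calls;
}
```

Calling `demoSyncPoint()` from a `main` function exercises the whole path in a single process. Because the merge is a bitwise OR, a signal seen by any one rank would cause every rank to run the corresponding action at the same sync point, which is the property item 4 relies on.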

Test plan: Add a new test suite to sst-core to test signals and real-time events
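
As one illustration of what a real-time event check might exercise, the sketch below arms a recurring POSIX interval timer and waits until a given number of SIGALRM deliveries has been counted. It uses only standard POSIX calls (`sigaction`, `setitimer`, `pause`) and does not use sst-core or its test framework; `countAlarms` and `onAlarm` are hypothetical names, and the interval is assumed to be positive and below one second.

```cpp
#include <csignal>
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

// Number of SIGALRM deliveries seen so far; written only by the handler.
volatile std::sig_atomic_t g_alarms = 0;

void onAlarm(int) { g_alarms = g_alarms + 1; }

// Arm a recurring wall-clock timer and block until at least `wanted` alarms
// have fired. Returns the number observed, or -1 if setup failed.
// Assumes 0 < interval_usec < 1000000.
int countAlarms(int wanted, long interval_usec) {
    g_alarms = 0;

    struct sigaction sa {};
    sa.sa_handler = onAlarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    struct sigaction old_sa {};
    if (sigaction(SIGALRM, &sa, &old_sa) != 0) return -1;

    itimerval timer {};
    timer.it_interval.tv_sec  = 0;
    timer.it_interval.tv_usec = interval_usec;
    timer.it_value            = timer.it_interval;  // first expiry after one interval
    if (setitimer(ITIMER_REAL, &timer, nullptr) != 0) {
        sigaction(SIGALRM, &old_sa, nullptr);
        return -1;
    }

    // The timer recurs, so a delivery that slips in between the check and
    // pause() only delays the wake-up by one interval.
    while (g_alarms < wanted) pause();

    itimerval stop {};  // an all-zero value disarms the timer
    setitimer(ITIMER_REAL, &stop, nullptr);
    sigaction(SIGALRM, &old_sa, nullptr);
    return g_alarms;
}
```

A check such as `countAlarms(3, 10000) >= 3` would then confirm that the timer fired at least three times at a 10 ms interval. The comparison is `>=` rather than `==` because one more alarm can arrive before the timer is disarmed.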

@gvoskuilen gvoskuilen added Enhancement in progress Major Feature A new feature that has broad impact on codebase and requires a minimum two week discussion period labels Jun 26, 2024
@gvoskuilen gvoskuilen self-assigned this Jun 27, 2024