Mothership/Softswitch barrier logic breaks "simple" GALS applications #285

Open
heliosfa opened this issue Oct 21, 2021 · 2 comments · May be fixed by #297
Labels
bug Report of a bug high-priority High-priority, time-critical issues

Comments

@heliosfa
Contributor

heliosfa commented Oct 21, 2021

This is somewhat related to #247.

The barrier release time for the current Mothership/Softswitch barrier is too short, so some parts of the application can start running and sending packets before other parts have even started. When this happens, packets that arrive at a part still sitting in the barrier are thrown away. For an asynchronous application this is not necessarily a problem, as more packets can be emitted. For GALS applications, the loss of a packet causes local synchronisation to be lost and the entire application to freeze.

An application that illustrates this problem: A 19x19 GALS arithmetic grid works on Ayres while a 20x20 does not. Modifying the Softswitch so that "AA" is printed whenever a packet is thrown away in the barrier shows that the latter throws away packets, causing the application to hang. Running in debug mode exacerbates the problem and makes smaller applications fail.
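The failure mode can be illustrated with a toy host-side model. This is only a sketch under stated assumptions, not Softswitch code: the ring topology, the zero network latency, and the names `simulateStaggeredRelease` and `galsStepCompletes` are all invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (hypothetical, not Softswitch code): each device leaves the
// barrier at its own time step and immediately sends one packet to its
// right-hand neighbour in a ring. A packet that arrives before the receiver
// has been released is discarded, as described in the issue.
struct StaggeredResult {
    std::size_t delivered;
    std::size_t discarded;
};

StaggeredResult simulateStaggeredRelease(const std::vector<int>& releaseTime)
{
    StaggeredResult result = {0, 0};
    const std::size_t n = releaseTime.size();
    for (std::size_t sender = 0; sender < n; ++sender) {
        const std::size_t receiver = (sender + 1) % n;
        // Assumption: zero network latency, so the packet arrives at the
        // moment the sender is released.
        if (releaseTime[sender] >= releaseTime[receiver]) {
            ++result.delivered;
        } else {
            ++result.discarded;
        }
    }
    return result;
}

// A GALS step only completes once every expected packet has arrived, so a
// single discarded packet is enough to stall the ring in this model.
bool galsStepCompletes(const StaggeredResult& r)
{
    return r.discarded == 0;
}
```

With identical release times every packet is delivered; with release times `{0, 1, 2, 3}` three of the four packets reach a neighbour that is still in the barrier, and the step never completes.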

Possible solutions:

  • Increase softswitch_delay(). This is not sustainable in the long run.
  • Buffer packets received during the barrier (rather than discarding them) and then play them back. This is not sustainable, and there are too many open questions over an implementation that works for every application.
  • Implement multicast for barrier release as discussed in #247 (Mothership freezes with some large problem sizes, deadlocking the system).
  • Make the softswitches sit at a tinselIdle() call once released so that they only progress as one. Requires #242 (Softswitch: Implement Hardware Idle) and makes our softswitch/barrier release logic inherently single-application (or at least requires all applications to be launched at the same time).
  • Change the barrier release logic to use the debug UART network rather than the actual network. While it sounds bad, this removes instantiation from the normal network and means that we are not throwing away packets in the barrier. It has the added benefit of falling back on network pushback to stop parts that have already started from running too far ahead.
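To make the buffer-and-replay option concrete, here is a minimal host-side sketch. Everything in it is an assumption for illustration: the `Packet` layout, the `BarrierBuffer` class and its fixed capacity are hypothetical and do not exist in the Softswitch. The capacity limit also shows one of the open questions: once the buffer is full, packets are lost again.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical packet type; real Softswitch packets carry more than this.
struct Packet {
    int source;
    int payload;
};

// Sketch of "buffer during the barrier, replay on release".
class BarrierBuffer {
public:
    explicit BarrierBuffer(std::size_t capacity)
        : capacity_(capacity), inBarrier_(true), dropped_(0) {}

    // Handle one incoming packet. Outside the barrier it is passed straight
    // to `handled`; inside the barrier it is queued, or dropped when the
    // queue is already full. Returns false only when the packet was dropped.
    bool receive(const Packet& p, std::vector<Packet>& handled)
    {
        if (!inBarrier_) {
            handled.push_back(p);
            return true;
        }
        if (pending_.size() >= capacity_) {
            ++dropped_;
            return false;
        }
        pending_.push_back(p);
        return true;
    }

    // Leave the barrier and replay queued packets in arrival order.
    void release(std::vector<Packet>& handled)
    {
        inBarrier_ = false;
        while (!pending_.empty()) {
            handled.push_back(pending_.front());
            pending_.pop_front();
        }
    }

    std::size_t dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    bool inBarrier_;
    std::size_t dropped_;
    std::deque<Packet> pending_;
};
```

How large the buffer must be depends on how far the fastest parts of an application can run ahead, which is application-specific; that is the sizing question this sketch leaves unanswered.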
@heliosfa heliosfa added bug Report of a bug high-priority High-priority, time-critical issues labels Oct 21, 2021
@mvousden
Contributor

So we had a natter about this on 2021-10-22. The way forward:

  • We're leaning towards the tinselIdle()-based solution in the post above.
  • Run this by ADB in the next Orchestrator meeting, in brief. If we're using the tinselIdle solution, follow the stuff below.
  • Review #282 (Mothership-facing changes to support hardware idle), modify as needed, and merge into FEATURE-0242-HardwareIdle.
  • Run GMB's test app (with his dummy binary) using the Orchestrator on the merged branch, making changes as needed to fix this issue.
  • Open a PR for #242 (Softswitch: Implement Hardware Idle), merging FEATURE-0242-HardwareIdle into development (eventually).
  • Modify softswitch logic to sit on the hardware idle barrier, after receiving a barrier-breaking packet from the Mothership.
  • Test, review, merge, profit.
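The intended two-stage start-up (barrier-breaking packet first, hardware idle barrier second) can be modelled on the host roughly as follows. This is a sketch only: `IdleBarrierModel`, `Phase` and the "everyone is idle" check are hypothetical stand-ins for what the hardware idle detection would do, and none of it is taken from the actual branches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Phase { WaitingForRelease, AtIdle, Running };

// Toy model of the proposed start-up sequence for a set of softswitches.
class IdleBarrierModel {
public:
    explicit IdleBarrierModel(std::size_t count)
        : phase_(count, Phase::WaitingForRelease) {}

    // The Mothership's barrier-breaking packet reaches softswitch `i`.
    // Instead of starting the application, the softswitch parks at idle.
    void deliverRelease(std::size_t i)
    {
        if (phase_[i] == Phase::WaitingForRelease) {
            phase_[i] = Phase::AtIdle;
        }
        tryGlobalWake();
    }

    Phase phase(std::size_t i) const { return phase_[i]; }

    std::size_t runningCount() const
    {
        std::size_t n = 0;
        for (Phase p : phase_) {
            if (p == Phase::Running) ++n;
        }
        return n;
    }

private:
    // Stand-in for hardware idle detection: only when every softswitch is
    // parked at idle do they all start running, and they do so together.
    void tryGlobalWake()
    {
        for (Phase p : phase_) {
            if (p != Phase::AtIdle) return;
        }
        for (Phase& p : phase_) {
            p = Phase::Running;
        }
    }

    std::vector<Phase> phase_;
};
```

In this model no softswitch can send application packets until the last one has received its release, which is the property the staggered software barrier lacks.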

@heliosfa
Contributor Author

heliosfa commented Nov 1, 2021

OK, so there is an initial working version that appears to sort this and makes GALS applications work as expected.

It is on the BUGFIX-0285-HardwareIdleBarrier branch but needs FEATURE-0242-HardwareIdle-Mothership to be merged in locally to actually work. This will be consolidated in the next day or so.

The initial version makes the feature configurable by preprocessor macro. If the Softswitch is built with SOFTSWITCH_HWIDLE_BARRIER defined (which can be defined by calling make with SOFTSWITCH_HWIDLE_BARRIER=1 as an argument), the feature is enabled and a call to tinselIdle() is made in softswitch_delay() rather than using an unoptimised spinner loop.
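The shape of that compile-time switch might look like the sketch below. It is a host-side illustration, not the code on the branch: `delaySketch`, `idleStandIn`, `kSpinCount` and the two counters are hypothetical names, and `idleStandIn` merely stands in for the real `tinselIdle()` call so that the example can be built off the hardware.

```cpp
#include <cassert>

// Hypothetical instrumentation so the chosen path can be observed on a host.
int g_idleCalls = 0;
int g_spinIterations = 0;

// Stand-in for tinselIdle(); NOT the Tinsel API.
void idleStandIn() { ++g_idleCalls; }

// Hypothetical loop bound for the fallback spinner.
const int kSpinCount = 1000;

// Sketch of a delay routine whose behaviour is selected at build time.
void delaySketch()
{
#ifdef SOFTSWITCH_HWIDLE_BARRIER
    // Feature enabled: wait on the hardware idle barrier.
    idleStandIn();
#else
    // Feature disabled: fall back to a spinner loop. The volatile sink
    // stops the compiler from removing the loop.
    volatile int sink = 0;
    for (int i = 0; i < kSpinCount; ++i) {
        sink = i;
        ++g_spinIterations;
    }
    (void)sink;
#endif
}
```

Built as-is, the sketch takes the spinner path; compiling it with `-DSOFTSWITCH_HWIDLE_BARRIER` selects the idle path instead, which is the same kind of switch the make argument above is described as controlling.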

Currently, Composer force-enables this feature. I will add Composer commands to configure it from the command line.
