Placement error from "rand" placer #304

Open
m8pple opened this issue Jan 4, 2022 · 4 comments

m8pple commented Jan 4, 2022

When trying to place a graph using the "rand" method, the following error shows up:

POETS> 14:20:53.71: 309(I) Attempting to place graph instance 'blurble' using the 'rand' method...
POETS> 14:20:53.71: 304(W) Unable to place graph instance 'blurble' - we tried, but an integrity check failed. You should shout at MLV (or whoever wrote the algorithm you're trying to use). In the short term, consider resetting the placer, and trying a different algorithm. Details: [ERROR] Use of algorithm 'rand' on application graph instance 'blurble' from file '/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_14x14x14_8192.xml' resulted in some normal devices not being placed correctly. These devices are:
 - ff

This appears to happen only for graphs over a certain size: smaller graphs placed successfully, while larger graphs
failed with the same error. This was approximately the smallest graph size at which it failed.

Context

  • Machine : jennings
  • Orchestrator version : e74e6ee
  • Input xml : stationary_water_14x14x14_8192.xml.gz
  • Commands:
    load /app = "/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_14x14x14_8192.xml"
    tlink /app = *
    place /random = *
    
  • Microlog:
========================================================================================================================
04/01/2022 14:20:53.70 file ../Output/Microlog/Microlog_2022_01_04T14_20_53p1.plog
command [place /random = *]
from console
========================================================================================================================

mvousden commented Jan 5, 2022

I can't reproduce this (hohoho). That said, I have a few thoughts.

The random algorithm works in the following way (with implementationy bits in parentheses):

  1. For each device type, create a set of cores that could be used to house a device of that type. Recall that each core pair cannot house devices of different types. (This map is defined by Placer::define_valid_cores_map)
  2. For each device in the device graph, iterated in the order they are declared in the graph instance:
    a. Ignore that device if it is not a normal device (i.e. continue).
    b. If the set for devices of this type is empty, leave and let the integrity checker clean up (literally return -1).
    c. Choose a core at random from the set.
    d. For each thread in the selected core, if the thread has space and has no other constraint that forbids the selected device from being placed upon it, go to "f". If no threads are legal, go to "e".
    e. Remove the selected core from the set, and go to "b".
    f. Place the selected device on the selected thread.
    g. Remove the selected core from each other set in the device type map we defined in "1".
  3. Redistribute each device in each core such that each thread is evenly loaded (calling Placer::redistribute_devices_in_gi)
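
To make the failure mode concrete, here is a minimal Python sketch of steps 1 and 2 above. It is not the Orchestrator's implementation: `place_random`, its parameters, and the `(name, type, is_normal)` device tuples are hypothetical, the only thread constraint modelled is a per-thread device limit, and step 3 (redistribution) is left out.

```python
import random

def place_random(devices, n_cores, threads_per_core, max_per_thread, rng):
    """Sketch of the random placer. devices is a list of
    (name, device_type, is_normal) tuples in declaration order.
    Returns (status, placement); status is -1 if a device could not be placed."""
    types = {dev_type for _, dev_type, normal in devices if normal}
    # Step 1: every device type starts with every core as a candidate.
    valid_cores = {t: set(range(n_cores)) for t in types}
    load = {(c, t): 0 for c in range(n_cores) for t in range(threads_per_core)}
    placement = {}
    for name, dev_type, is_normal in devices:        # step 2, declaration order
        if not is_normal:                            # 2a: skip non-normal devices
            continue
        candidates = valid_cores[dev_type]
        while True:
            if not candidates:                       # 2b: nothing left for this type
                return -1, placement
            core = rng.choice(sorted(candidates))    # 2c: pick a core at random
            # 2d: look for a thread on that core that still has space
            # (assumption: capacity is the only constraint modelled here).
            thread = next((t for t in range(threads_per_core)
                           if load[(core, t)] < max_per_thread), None)
            if thread is None:
                candidates.discard(core)             # 2e: core is full, try another
                continue
            placement[name] = (core, thread)         # 2f: place the device
            load[(core, thread)] += 1
            for other_type, cores in valid_cores.items():
                if other_type != dev_type:           # 2g: core now belongs to this type
                    cores.discard(core)
            break
    return 0, placement
```

In this model, if a lone device of a second type is declared last, some seeds return -1 because every core has already been claimed by the first type. Declaring that device first gives it a core before the others can claim them all.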

Based on this, I think the ff device is being left until the end, at which point there are no cores available for it: it is of a different device type, and every core has already been claimed by devices of another type (step "g").

Solutions:

  • Allow the algorithm to "reserve" certain cores for certain device types (a bit like what the spread method does). This would make the algorithm "less random" however, so is undesirable.
  • I'm open to ideas.

Workarounds:

  • Put ff at the top of your DeviceInstances element in your application XML.
  • Refactor ff to use the same device type as your other devices.
  • Try a different placement algorithm. Spread filling will usually be better, I think.
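
For the first workaround, the reordering can be scripted rather than done by hand. The following is a rough sketch using Python's standard `xml.etree.ElementTree`. The `DevI` element name, the `id` and `type` attributes, and the tiny inline document are assumptions for illustration only (a real application file also carries a namespace and much more structure), and `move_to_front` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def move_to_front(device_instances, device_id):
    """Move the child whose 'id' attribute equals device_id to the top of
    the given element. Returns True if such a child was found."""
    for child in list(device_instances):
        if child.get("id") == device_id:
            device_instances.remove(child)
            device_instances.insert(0, child)
            return True
    return False

# Stand-in for a DeviceInstances element (assumed shape, not a real input file).
doc = ET.fromstring(
    '<DeviceInstances>'
    '<DevI id="w0" type="water"/>'
    '<DevI id="w1" type="water"/>'
    '<DevI id="ff" type="other"/>'
    '</DeviceInstances>'
)
moved = move_to_front(doc, "ff")
order = [dev.get("id") for dev in doc]
```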

mvousden commented Jan 5, 2022

Also, the error message in this case is pretty unhelpful - need to refactor that.

heliosfa commented Jan 5, 2022

Allow the algorithm to "reserve" certain cores for certain device types (a bit like what the spread method does). This would make the algorithm "less random" however, so is undesirable.

One way I can think of to keep the randomness is to introspect the number of device types that are actually used in the graph (not just declared) to work out how many sets of cores are needed and what the min/max number of cores per set should be, e.g. 1 set per type, with a minimum of 1 core and a maximum of something sensible. Allocate cores randomly to the sets. These sets then replace the sets from step 1.
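
A rough Python sketch of that idea, under stated assumptions: `reserve_cores` and the `(name, type, is_normal)` device tuples are hypothetical, only the minimum is enforced (the "maximum something sensible" is left out), and the leftover cores are handed out uniformly at random rather than weighted by how many devices each type has.

```python
import random

def reserve_cores(devices, n_cores, rng, min_cores=1):
    """Randomly partition cores among the device types actually used.
    Returns a dict mapping device type to a set of core indices."""
    used_types = sorted({t for _, t, normal in devices if normal})
    if len(used_types) * min_cores > n_cores:
        raise ValueError("not enough cores to give every used type its minimum")
    cores = list(range(n_cores))
    rng.shuffle(cores)
    reserved = {t: set() for t in used_types}
    # Guarantee each used type its minimum number of cores first...
    for t in used_types:
        for _ in range(min_cores):
            reserved[t].add(cores.pop())
    # ...then hand out the remaining cores at random.
    for core in cores:
        reserved[rng.choice(used_types)].add(core)
    return reserved
```

The returned sets would stand in for the per-type sets built in step 1, so a type with a single device declared last still has at least one core waiting for it.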

m8pple commented Jan 10, 2022

Put ff at the top of your DeviceInstances element in your application XML.

This application-specific workaround allows random to place correctly.
