Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEST encounters bad_alloc exception when changing tic length #644

Closed
heplesser opened this issue Jan 26, 2017 · 8 comments
Closed

NEST encounters bad_alloc exception when changing tic length #644

heplesser opened this issue Jan 26, 2017 · 8 comments
Assignees
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: High Should be handled next T: Bug Wrong statements in the code or documentation ZC: Kernel DO NOT USE THIS LABEL ZP: PR Created DO NOT USE THIS LABEL
Milestone

Comments

@heplesser
Copy link
Contributor

The following causes a bad_malloc exception (per hash 1cb0529, OSX 10.11 with gcc 6.2):

SLI ] 0 << /tics_per_ms 1. /resolution 1. >> SetStatus
SLI ] 0 [[/tics_per_ms /resolution]] get ==
[1.000000e+00 1.000000e+00]
SLI ] 1 Simulate
nest(23226,0x7fff7cf0a000) malloc: *** mach_vm_map(size=9223372036857856) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

Jan 26 17:08:24 Simulate_d [Error]: C++Exception
    std::bad_alloc

Interestingly, a number of examples and tests, e.g. ArtificialSynchrony.sli and test_iaf_ps_dc_t_accuracy.sli set resolution and tics in this way and work. In those cases, a full network is built before Simulate is called. Creating just a single neuron does not get rid of the exception.

See also #643.

@heplesser heplesser added ZC: Kernel DO NOT USE THIS LABEL I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) ZP: Pending DO NOT USE THIS LABEL S: High Should be handled next T: Bug Wrong statements in the code or documentation labels Jan 26, 2017
@heplesser
Copy link
Contributor Author

@gtrensch This is a slightly mysterious bug that may provide an interesting opportunity to delve into the NEST kernel---interested?

@gtrensch
Copy link
Contributor

@heplesser yes, I like mysterious bugs.

@apeyser
Copy link
Contributor

apeyser commented Feb 20, 2017

So the question is which array that is binned by tic size isn't properly resized on resetting tics?

@gtrensch
Copy link
Contributor

@apeyser It's the spike buffer. Something goes wrong with min_delay calculation in case tics_per_ms is < 1000.0.

@gtrensch
Copy link
Contributor

@heplesser before I create a pull request, can you please have a look at the text below, in particular my remarks at the end, and let me know what you think. Perhaps we can also discuss this in Oslo during the hackathon.

@apeyser many thanks for your hints and the debug session !!!

The problem is the following:
Let's assume that tics_per_ms was set to 100. and it is:

TICS_PER_STEP = 100
MS_PER_STEP = 1.000000 

This will set following max limits (LIM_MAX):

tics (rounded to step boundary) = 1152921504606846900
steps = 11529215046068469 
ms = 11529215046068468.000000 

During the instantiation of the DelayCheckers, the value of tics of the min_delay_ object is preset with the positive infinity value (not intuitive) by calling Time::pos_inf(). Which means:
( ==> LONG_LONG_MAX / 8 + 1 ==> 1152921504606846976 ==> 0x1000000000000000 ==> the infinity value LIM_POS_INF.tics)

DelayChecker::Calibrate destroys this infinity value when TimeConverter is called. The problem is not visible for TICS_PER_STEP >= 1000. A boundary check hides the issue when a new Time-object is instantiated. See the nested macro below: The absolute infinity value contained in t.t won't be smaller than LIM_MAX.ms.

Time
TimeConverter::from_old_tics( tic_t t_old ) const
{
  double ms = t_old / OLD_TICS_PER_MS;
  return Time::ms( ms );     // <-- return Time object
}
Boudary check:

Time( ms t ) : tics( ( time_abs( t.t ) < LIM_MAX.ms ) ? 
    static_cast< tic_t >( t.t * Range::TICS_PER_MS + 0.5 ) : 
    ( t.t < 0 ) ? LIM_NEG_INF.tics : LIM_POS_INF.tics)

For TICS_PER_STEP < 1000 this condition is not given any longer. In the example above, it will be:

t.t = 1152921504606847 this is < ms (11529215046068468.000000)

... causing a new tics value of 115292150460684704 after calibration. This is not an infinity value anymore and the if-statment below in ConnectionManager::update_delay_extrema_ which checks against infinity will fail !

if ( min_delay_ == Time::pos_inf().get_steps() )
    min_delay_ = Time::get_resolution().get_steps();

Based on the invalid tics value an inappropriate steps value is calculated which in turn is used to resize the moduli_ array during simulation start causing the bad_alloc.

Solution:
I go along with @apeyser who came up with the idea to simply preserve the infinity values in the TimeConverter class. This would fix the issue.

Some general remarks:
I take a critical perspective on how infinity values are handeled in the code. The program flow relies on numbers in data fields. This is potentially error-prone. I understand the concept of doing calculations with infinity values, but wouldn't it be better to keep the state of a variable separated from the data?

In my opinion the TimerConverter belongs to the Time class, I would merge them.

The Time class itself is very hard to read and to debug. E.g.: Nested macros in initializers should be better functions. What looks like constants is sometimes overwritten. Some of the names are ambiguous and not descriptive (min_delay, for example, is used for different things).

Will say, I suggest a small conservative refactoring.

@apeyser
Copy link
Contributor

apeyser commented Feb 23, 2017

I think the issue is that their exist a large number of time units: steps (int), ms (float), tics (int), delay (naked int)... there should be a universal unit, and all inputs should immediately be converted into it. It should be a very simple class without all the object oriented complexity.

It needs to be transparent and fast -- the previous iteration was eating 50% of simulation time in constructed and handling time objects.

Unfortunately, this is a lot of work.

@heplesser
Copy link
Contributor Author

@gtrensch @apeyser Thank you very much for your detective work! I agree that the best fix for now is to modify the TimeConverter class so that it returns arguments equal to positive/negative infinity unchanged. It is unfortunate that integer types, in contrast to IEEE754 doubles, do not have properly defined infinity values.

The reason for initializing with positive/negative infinity is that it makes comparisons easy---otherwise, one would always have to check against invalid values.

The Time class is probably ready for an overhaul, but that is a different issue. But we need time in al three representations (ms, steps, tics) for consistency and accuracy.

@gtrensch Would you create a PR?

@heplesser heplesser added this to the NEST 2.12.1 milestone Jun 29, 2017
@gtrensch
Copy link
Contributor

PR has been created. The TimeConverter class has not been merged into the Time class. This would be better done in the course of a refactoring task.
This issue can be closed after merge.

@terhorstd terhorstd added ZP: PR Created DO NOT USE THIS LABEL and removed ZP: Pending DO NOT USE THIS LABEL labels Jul 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: High Should be handled next T: Bug Wrong statements in the code or documentation ZC: Kernel DO NOT USE THIS LABEL ZP: PR Created DO NOT USE THIS LABEL
Projects
None yet
Development

No branches or pull requests

4 participants