NEST encounters bad_alloc exception when changing tic length #644

heplesser · 2017-01-26T16:12:54Z

The following causes a bad_malloc exception (per hash 1cb0529, OSX 10.11 with gcc 6.2):

SLI ] 0 << /tics_per_ms 1. /resolution 1. >> SetStatus
SLI ] 0 [[/tics_per_ms /resolution]] get ==
[1.000000e+00 1.000000e+00]
SLI ] 1 Simulate
nest(23226,0x7fff7cf0a000) malloc: *** mach_vm_map(size=9223372036857856) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

Jan 26 17:08:24 Simulate_d [Error]: C++Exception
    std::bad_alloc

Interestingly, a number of examples and tests, e.g. ArtificialSynchrony.sli and test_iaf_ps_dc_t_accuracy.sli set resolution and tics in this way and work. In those cases, a full network is built before Simulate is called. Creating just a single neuron does not get rid of the exception.

See also #643.

The text was updated successfully, but these errors were encountered:

heplesser · 2017-01-26T16:14:23Z

@gtrensch This is a slightly mysterious bug that may provide an interesting opportunity to delve into the NEST kernel---interested?

gtrensch · 2017-01-30T10:56:17Z

@heplesser yes, I like mysterious bugs.

apeyser · 2017-02-20T08:48:36Z

So the question is which array that is binned by tic size isn't properly resized on resetting tics?

gtrensch · 2017-02-20T09:17:42Z

@apeyser It's the spike buffer. Something goes wrong with min_delay calculation in case tics_per_ms is < 1000.0.

gtrensch · 2017-02-22T18:07:51Z

@heplesser before I create a pull request, can you please have a look at the text below, in particular my remarks at the end, and let me know what you think. Perhaps we can also discuss this in Oslo during the hackathon.

@apeyser many thanks for your hints and the debug session !!!

The problem is the following:
Let's assume that tics_per_ms was set to 100. and it is:

TICS_PER_STEP = 100
MS_PER_STEP = 1.000000

This will set following max limits (LIM_MAX):

tics (rounded to step boundary) = 1152921504606846900
steps = 11529215046068469 
ms = 11529215046068468.000000

During the instantiation of the DelayCheckers, the value of tics of the min_delay_ object is preset with the positive infinity value (not intuitive) by calling Time::pos_inf(). Which means:
( ==> LONG_LONG_MAX / 8 + 1 ==> 1152921504606846976 ==> 0x1000000000000000 ==> the infinity value LIM_POS_INF.tics)

DelayChecker::Calibrate destroys this infinity value when TimeConverter is called. The problem is not visible for TICS_PER_STEP >= 1000. A boundary check hides the issue when a new Time-object is instantiated. See the nested macro below: The absolute infinity value contained in t.t won't be smaller than LIM_MAX.ms.

Time
TimeConverter::from_old_tics( tic_t t_old ) const
{
  double ms = t_old / OLD_TICS_PER_MS;
  return Time::ms( ms );     // <-- return Time object
}

Boudary check:

Time( ms t ) : tics( ( time_abs( t.t ) < LIM_MAX.ms ) ? 
    static_cast< tic_t >( t.t * Range::TICS_PER_MS + 0.5 ) : 
    ( t.t < 0 ) ? LIM_NEG_INF.tics : LIM_POS_INF.tics)

For TICS_PER_STEP < 1000 this condition is not given any longer. In the example above, it will be:

t.t = 1152921504606847 this is < ms (11529215046068468.000000)

... causing a new tics value of 115292150460684704 after calibration. This is not an infinity value anymore and the if-statment below in ConnectionManager::update_delay_extrema_ which checks against infinity will fail !

if ( min_delay_ == Time::pos_inf().get_steps() )
    min_delay_ = Time::get_resolution().get_steps();

Based on the invalid tics value an inappropriate steps value is calculated which in turn is used to resize the moduli_ array during simulation start causing the bad_alloc.

Solution:
I go along with @apeyser who came up with the idea to simply preserve the infinity values in the TimeConverter class. This would fix the issue.

Some general remarks:
I take a critical perspective on how infinity values are handeled in the code. The program flow relies on numbers in data fields. This is potentially error-prone. I understand the concept of doing calculations with infinity values, but wouldn't it be better to keep the state of a variable separated from the data?

In my opinion the TimerConverter belongs to the Time class, I would merge them.

The Time class itself is very hard to read and to debug. E.g.: Nested macros in initializers should be better functions. What looks like constants is sometimes overwritten. Some of the names are ambiguous and not descriptive (min_delay, for example, is used for different things).

Will say, I suggest a small conservative refactoring.

apeyser · 2017-02-23T08:23:40Z

I think the issue is that their exist a large number of time units: steps (int), ms (float), tics (int), delay (naked int)... there should be a universal unit, and all inputs should immediately be converted into it. It should be a very simple class without all the object oriented complexity.

It needs to be transparent and fast -- the previous iteration was eating 50% of simulation time in constructed and handling time objects.

Unfortunately, this is a lot of work.

heplesser · 2017-03-02T10:01:47Z

@gtrensch @apeyser Thank you very much for your detective work! I agree that the best fix for now is to modify the TimeConverter class so that it returns arguments equal to positive/negative infinity unchanged. It is unfortunate that integer types, in contrast to IEEE754 doubles, do not have properly defined infinity values.

The reason for initializing with positive/negative infinity is that it makes comparisons easy---otherwise, one would always have to check against invalid values.

The Time class is probably ready for an overhaul, but that is a different issue. But we need time in al three representations (ms, steps, tics) for consistency and accuracy.

@gtrensch Would you create a PR?

gtrensch · 2017-07-18T11:01:12Z

PR has been created. The TimeConverter class has not been merged into the Time class. This would be better done in the course of a refactoring task.
This issue can be closed after merge.

heplesser assigned gtrensch Jan 26, 2017

heplesser added this to the NEST 2.12.1 milestone Jun 29, 2017

gtrensch mentioned this issue Jul 18, 2017

Fix #644: bad_alloc exception #787

Merged

terhorstd added ZP: PR Created DO NOT USE THIS LABEL and removed ZP: Pending DO NOT USE THIS LABEL labels Jul 31, 2017

terhorstd closed this as completed in 7e25df7 Aug 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NEST encounters bad_alloc exception when changing tic length #644

NEST encounters bad_alloc exception when changing tic length #644

heplesser commented Jan 26, 2017

heplesser commented Jan 26, 2017

gtrensch commented Jan 30, 2017

apeyser commented Feb 20, 2017

gtrensch commented Feb 20, 2017

gtrensch commented Feb 22, 2017

apeyser commented Feb 23, 2017

heplesser commented Mar 2, 2017

gtrensch commented Jul 18, 2017

NEST encounters bad_alloc exception when changing tic length #644

NEST encounters bad_alloc exception when changing tic length #644

Comments

heplesser commented Jan 26, 2017

heplesser commented Jan 26, 2017

gtrensch commented Jan 30, 2017

apeyser commented Feb 20, 2017

gtrensch commented Feb 20, 2017

gtrensch commented Feb 22, 2017

apeyser commented Feb 23, 2017

heplesser commented Mar 2, 2017

gtrensch commented Jul 18, 2017