Random(...,1,2^80) test fails on SPARC Solaris #2292

dimpase · 2018-03-24T09:38:08Z

Observed behaviour

testing: /datapool/dima/Sage/gap/tst/testinstall/intarith.tst
########> Diff in /datapool/dima/Sage/gap/tst/testinstall/intarith.tst:948
# Input is:
Random(mysource, 1, 2^80);
# Expected output:
1071135285340982180054653
# But found:
2461224813218428180

(the other tests there pass)

Expected behaviour

No diff, obviously

Copy and paste GAP banner (to tell us about your setup)

 *********   GAP 4.8.8-5202-gb88ad0f of today
 *  GAP  *   https://www.gap-system.org
 *********   Architecture: sparc-sun-solaris2.11-default64
 Configuration:  gmp 6.1.2
 Loading the library and packages ...
 Packages:   AClib 1.2, Alnuth 3.1.0, AtlasRep 1.5.1, AutPGrp 1.8, CRISP 1.4.4, Cryst 4.1.13, CrystCat 1.1.6, CTblLib 1.2.2, 
             FactInt 1.6.1, FGA 1.3.1, GAPDoc 1.6.1, IRREDSOL 1.4, LAGUNA 3.8.0, Polenta 1.3.8, Polycyclic 2.11, PrimGrp 3.3.0, 
             RadiRoot 2.7, ResClasses 4.7.1, smallgrp 1.2, Sophus 1.23, SpinSym 1.5, TomLib 1.2.6, TransGrp 2.0.2, utils 0.53

Built with vanilla gcc 7.2.

Note that as #2274 is merged, this and #2194 are the only things to be fixed for testinstall pass on this platform.

#116 makes it hard to understand what's going on, precisely...

The text was updated successfully, but these errors were encountered:

ChrisJefferson · 2018-03-24T10:22:50Z

Would it be possible to bisect this issue if it only appeared in the last couple of weeks, or has it been around for a long time (I made some recentish random changes).

dimpase · 2018-03-24T10:59:12Z

I don't recall seeing it on 4.9. I will bisect this.

dimpase · 2018-03-26T14:26:46Z

It's still there on 4.9.0, but it does work on 4.8.6. I will try finding the exact commit.

It's also not in 4.8.10, so one has to chase the 4.9 branch.

dimpase · 2018-03-29T13:31:31Z

Could you give me a reasonable starting point on the 4.9 branch to start bisecting from?

fingolfin · 2018-04-09T12:18:15Z

@dimpase One thing you can try is to start bisecting after the big build system work was settled, that makes it much less painful.
From a quick look, perhaps something like 3e23772 would be a good starting point -- assuming it builds for you, and assuming the error does not occur there.

fingolfin · 2018-04-09T12:20:13Z

BTW, is that machine a big endian machine by chance? That would give an important hint: it would suggest that some of the kernel random code is not sufficiently endian safe (e.g. the mersenne twister). It then probably wasn't in GAP 4.8 either, but we also changed which random sources we use to be more consistent, which could have exposed the problem.

This is just a wild guess, though.

dimpase · 2018-04-09T12:40:51Z

Yes, the endianness is big, as is the case with SPARCs.

fingolfin · 2018-04-09T12:44:24Z

OK. (Note that SPARC v9 and later is bi-endian, so given the information in this issue, I was not able to able to tell the endianess).

fingolfin · 2018-04-09T12:48:25Z

A quick look at the kernel code, specifically FuncRandomIntegerMT, suggests that my guess may be correct: that function does not seem to take endianess into account; even if the underlying mersenne twister code does, still FuncRandomIntegerMT writes data 4 bytes at a time, which could lead to differing results on a 64bit big endian machine. But this is still guessing, I did not try to understand this code properly.

dimpase · 2018-04-09T22:40:24Z

Just to be sure, here is the test:

$ python
Python 2.7.9 (default, Apr 27 2017, 13:50:01) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> from sys import byteorder
>>> print(byteorder)
big

fingolfin · 2018-04-12T10:59:16Z

Excellent!
You could test if the mersenne twister code is involved or not by comparing with this snippet:

gap> state:=InitRandomMT("abc");;
gap> RandomIntegerMT(state, 80);
358386947634599872691919

If it gives identical results, we should look elsewhere; if not we have found the bug (or at least part of it).

dimpase · 2018-04-12T11:12:20Z

test if the mersenne twister code is involved or not

Both GAP 4.8.6 and 4.9.0 on SPARC Solaris 11 return

1539244741327137841491211282754109

I guess it's hidden on 4.8.6 only because tests for this are not present.

fingolfin · 2018-04-12T11:31:54Z

So one needs to look at FuncRandomIntegerMT in src/integer.c, which does this:

Obj FuncRandomIntegerMT(Obj self, Obj mtstr, Obj nrbits)
{
  Obj res;
  Int i, n, q, r, qoff, len;
  UInt4 *mt, rand;
  UInt4 *pt;
...
  n = INT_INTOBJ(nrbits);
...
  else {
     /* large int case */
     q = n / 32;
     r = n - q * 32;
     /* qoff = number of 32 bit words we need */
     qoff = q + (r==0 ? 0:1);
     /* len = number of limbs we need (limbs currently are either 32 or 64 bit wide) */
     len = (qoff*4 +  sizeof(mp_limb_t) - 1) / sizeof(mp_limb_t);
     res = NewBag( T_INTPOS, len*sizeof(mp_limb_t) );
     pt = (UInt4*) ADDR_INT(res);
     mt = (UInt4*) CHARS_STRING(mtstr);
     for (i = 0; i < qoff; i++, pt++) {
       rand = (UInt4) nextrandMT_int32(mt);
       *pt = rand;  // <--- PROBLEM IS AROUND HERE ...
     }
     if (r != 0) {
       /* we generated too many random bits -- chop of the extra bits */
       pt = (UInt4*) ADDR_INT(res);
       pt[qoff-1] = pt[qoff-1] & ((UInt4)(-1) >> (32-r));  // <--- ... AND HERE
     }
     /* shrink bag if necessary */
     res = GMP_NORMALIZE(res);
     /* convert result if small int */
     res = GMP_REDUCE(res);
  }

  return res;
}

I marked the two problematic spots (roughly): this code creates a GMP bigint 32bits at a time. This ensures that we get identical results regardless of whether we are on a 32 or 64 bit system.

I am not sure whether we get identical results on 32 bit LE vs BE systems. Perhaps you could try it? i.e. can you build a 32bit SPARC versions of GAP and see what it outputs? If that already is broken, we need to change the byte order when writing in those two places.

But in addition, for 64bit BE, also the 32bit words need to be shuffled. To do this, in the first loop, one should call nextrandMT_int32 twice and swap the write order. It might be easier, though, to simply use an #ifdef SYS_IS_64_BIT to split the code into a 32bit version, and a 64bit version; and then change pt to be of type mp_limb_t; and then write 64bit at a time as follows:

   rand = (UInt4) nextrandMT_int32(mt);
  *pt = ((UInt8) nextrandMT_int32(mt) << 32) | rand;

(except in the last iteration, if the total number of 32 bit blocks is odd, then of course the high bits should be set to 0 instead.

And of course the final masking must be done carefully, too.

dimpase · 2018-04-12T12:03:04Z

Why does GAP roll its own randoms here, instead of relying on GMP? Is it because GMP does not do a good job here?

ChrisJefferson · 2018-04-12T12:23:49Z

Until a couple of months ago, gmp was an optional dependency. Also, we have to be careful using gmp functionality because we need to be careful about memory manager interactions.

dimpase · 2018-04-12T15:08:45Z

I meant to say that it looks more practical to have this code replaced by an appropriate call to GMP's Mersenne's Twister gmp_randinit_mt, no?

fingolfin · 2018-04-12T15:29:17Z

We understood; Chris' answer still applies. I would add that it is unclear to me whether GMP's MT returns identical results across BE vs LE and 32 vs 64 bit.

dimpase · 2018-04-12T18:29:39Z

From reading GMP test code I gather it must be platform-independent: they test their generators against hard-coded sequences of numbers:
https://gmplib.org/repo/gmp-6.0/file/tip/tests/rand/t-rand.c
https://github.com/wbhart/mpir/blob/master/tests/rand/t-rand.c

Perhaps @wbhart could confirm this...

wbhart · 2018-04-12T18:54:10Z

I'm sorry, I don't know. I can't see any reason why MT should return different results across platforms. But I don't know about any other generators. It's not something I've thought about.

ChrisJefferson · 2018-04-12T18:58:53Z

My reference is that there are many GMP functions we can't use, because they can internally allocate memory, or move buffers around. These (obviously) conflict with GAP's internal memory manager. Therefore we use a very small set of functions which allow us to control the memory management ourselves. At the least we would have to carefully check none of GMP's random functions ever allocate memory, and promise not to in future. I suspect i would be easier to fix our existing code :)

ChrisJefferson · 2018-04-12T19:02:19Z

Update: Checked, GMP allocates in gmp_randinit_mt, so it's no good.

fingolfin · 2018-04-12T19:02:28Z

@ChrisJefferson Actually, we already use a few GMP APIs that allocate memory, though we are careful to ensure they also deallocate afterwards (and I think we discussed this several times by now ;-). But yeah, it still something to be very careful about.

That said: The easiest way to find out is to try to implement it: @dimpase PRs are welcome!

fingolfin · 2018-04-12T19:04:37Z

There is also the low-level API mpn_random which one might be able to use -- it accepts a pre-allocated buffer. It only allows specifying the amount of randomness in limbs, though, hence one has to mask off the extra random bits, but that's easy to do in a way that works on both BE and LE (and on 32 and 64 bit).

fingolfin · 2018-04-12T19:05:25Z

(Of course this will still change the overall randomness output; also, one then also has to rewrite the API for initializing the RNG state, etc., which affects quite a bit of code; it's not rocket science, but also not completely trivial).

dimpase · 2018-04-12T19:05:36Z

GMP allocates in gmp_randinit_mt, so it's no good.

Well, one can call it once per GAP session, before even doing anything with GASMAN, no?

ChrisJefferson · 2018-04-12T19:05:51Z

I don't think we could clean up (easily), as I can make a random number generator, then want it GCing, so you'd have to store your own buffer somehow.

ChrisJefferson · 2018-04-12T19:06:13Z

No, you can (and I do) make many independant random sources.

dimpase · 2018-04-12T19:28:12Z

But GMP surely allows for a custom memory allocator. Naively, all one needs is to hook it up to GASMAN, no?

fingolfin · 2018-04-12T19:37:16Z

In general, no, because it expects a non-moving allocator, which GASMAN is not. It might be OK in the particular case of the random number stuff, but one would have to analyze the GMP code to be sure, and that then of course is a fragile setup, because it might change in future versions of GMP.

All in all, I think it's much easier to just tweak the existing, well-tested RNG code in GAP for 64bit BE, than to fiddle with GMP.

dimpase · 2018-04-14T15:27:01Z

Is there a significant difference between GASMAN and BoehmGC --- the one in HPC-GAP --- in this respect? (Excuse my ignorance).

GMP/MPIR is used in Macaulay2, which also uses BoehmGC; I don't know whether they use GMP's randoms. Perhaps @DanGrayson could enlighten us here.

fingolfin · 2018-04-14T16:25:22Z

Yes, there is a significant difference here: Boehm GC is non moving.

It is really pointless to ask 3rd parties for expertise here. I strongly doubt that anybody knows more about the combination of GASMAN and GMP than Chris and me. We thought about similar problems a lot, and I spent considerable effort rewriting the GAP GMP integration recently. And while in principle it would be possible to use the GMP rng code here, it would not be easier than just fixing the problematic code; it would instead be a much bigger change and as sucgmh more likely to introduce regressions.

dimpase · 2018-04-14T16:37:05Z

Yes, there is a significant difference here: Boehm GC is non moving.

Oh, OK, thanks, I see. So it probably would be reasonable to use GMP for HPC-GAP, but not in "classic" GAP.

fingolfin · 2018-04-14T16:38:08Z

That would be possible, but I see no advantage in doing so, only disadvantages.

ChrisJefferson · 2018-04-14T16:39:14Z

Also, I'd prefer (personally) unless there is a good reason to avoid pulling in too much GMP, as we might want one day to move to another large number library, and it would be nice if that wasn't too hard -- for example I have tried compiling GAP to WebAssembly (nothing releasable yet), and one of the blocking points has been GMP.

dimpase · 2018-04-14T16:41:58Z

Isn't less code (assuming the "classic" one being retired) to maintain always an advantage?

ChrisJefferson · 2018-04-14T16:44:32Z

This is now getting into philosophical issues, but I prefer to use external libraries only when they provide significant value in terms of complexity or just amount of code (so, I wouldn't want to write my own GMP, or my own graphics library). On the other hand, once code is out of your control, then you have to deal with the owners taking it in a direction you don't like -- for example whenever GAP does something which makes Sage's life more painful :)

Also, I expect non-HPC GAP to continue existing for several years yet -- it's not yet clear HPC-GAP will replace traditional GAP, or if they will continue to co-exist.

Fixes gap-system#2292

Fixes #2292

stefnotch · 2020-06-21T12:23:34Z

@ChrisJefferson I apologize for the ping, but I was curious if you have had any success with the WebAssembly version of Gap

ChrisJefferson · 2020-06-22T08:29:03Z

Nothing of practical usefulness. I can make a GAP which works ( https://caj.host.cs.st-andrews.ac.uk/gap/gap.html ), but the interface is truely horrible, because I don't know anything about making Javascript interfaces

dimpase · 2020-06-22T09:07:13Z

It would be nice to have a browser-run console terminal (and there are lots of these to choose from) to run this GAP. I guess it's not something novel, just matter to find a place to copy such a setup from.

stefnotch · 2020-06-22T09:45:11Z

Oh, that's actually really neat. If you need help with making a simple interface, I'd gladly assist you.

ChrisJefferson · 2020-06-22T12:44:41Z

We probably shouldn't be discussing this here :)

I have a docker image here: https://github.com/ChrisJefferson/gap-javascript/ Unfortunately, it has broken but could probably be fixed.

The problem with the javascript is (at least last time I looked) it was suprisingly hard to link the resulting GAP executable to a javascript interface / terminal.

ChrisJefferson mentioned this issue Apr 14, 2018

Test, or block, big-endian CPUs #2362

Closed

fingolfin added a commit to fingolfin/gap that referenced this issue Jul 10, 2018

kernel: fix RandomIntegerMT on 64 bit big endian

1e4d544

Fixes gap-system#2292

fingolfin mentioned this issue Jul 10, 2018

kernel: fix RandomIntegerMT on 64 bit big endian #2638

Merged

fingolfin added a commit to fingolfin/gap that referenced this issue Jul 11, 2018

kernel: fix RandomIntegerMT on 64 bit big endian

06ff727

Fixes gap-system#2292

fingolfin closed this as completed in #2638 Jul 11, 2018

fingolfin added a commit that referenced this issue Jul 11, 2018

kernel: fix RandomIntegerMT on 64 bit big endian

4a3af77

Fixes #2292

stefnotch mentioned this issue Jan 20, 2021

SymPy translation to JavaScript open-source-ideas/ideas#165

Open

6 tasks

Random(...,1,2^80) test fails on SPARC Solaris #2292

Random(...,1,2^80) test fails on SPARC Solaris #2292

Comments

dimpase commented Mar 24, 2018

Observed behaviour

Expected behaviour

Copy and paste GAP banner (to tell us about your setup)

ChrisJefferson commented Mar 24, 2018

dimpase commented Mar 24, 2018

dimpase commented Mar 26, 2018 • edited Loading

dimpase commented Mar 29, 2018

fingolfin commented Apr 9, 2018

fingolfin commented Apr 9, 2018

dimpase commented Apr 9, 2018

fingolfin commented Apr 9, 2018

fingolfin commented Apr 9, 2018

dimpase commented Apr 9, 2018

fingolfin commented Apr 12, 2018

dimpase commented Apr 12, 2018

fingolfin commented Apr 12, 2018 • edited Loading

dimpase commented Apr 12, 2018 • edited Loading

ChrisJefferson commented Apr 12, 2018

dimpase commented Apr 12, 2018

fingolfin commented Apr 12, 2018

dimpase commented Apr 12, 2018

wbhart commented Apr 12, 2018

ChrisJefferson commented Apr 12, 2018

ChrisJefferson commented Apr 12, 2018

fingolfin commented Apr 12, 2018

fingolfin commented Apr 12, 2018

fingolfin commented Apr 12, 2018

dimpase commented Apr 12, 2018

ChrisJefferson commented Apr 12, 2018

ChrisJefferson commented Apr 12, 2018

dimpase commented Apr 12, 2018

fingolfin commented Apr 12, 2018

dimpase commented Apr 14, 2018

fingolfin commented Apr 14, 2018

dimpase commented Apr 14, 2018 • edited Loading

fingolfin commented Apr 14, 2018

ChrisJefferson commented Apr 14, 2018

dimpase commented Apr 14, 2018

ChrisJefferson commented Apr 14, 2018 • edited Loading

stefnotch commented Jun 21, 2020

ChrisJefferson commented Jun 22, 2020

dimpase commented Jun 22, 2020

stefnotch commented Jun 22, 2020 • edited Loading

ChrisJefferson commented Jun 22, 2020

dimpase commented Mar 26, 2018 •

edited

Loading

fingolfin commented Apr 12, 2018 •

edited

Loading

dimpase commented Apr 12, 2018 •

edited

Loading

dimpase commented Apr 14, 2018 •

edited

Loading

ChrisJefferson commented Apr 14, 2018 •

edited

Loading

stefnotch commented Jun 22, 2020 •

edited

Loading