Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random(...,1,2^80) test fails on SPARC Solaris #2292

Closed
dimpase opened this issue Mar 24, 2018 · 41 comments
Closed

Random(...,1,2^80) test fails on SPARC Solaris #2292

dimpase opened this issue Mar 24, 2018 · 41 comments

Comments

@dimpase
Copy link
Member

dimpase commented Mar 24, 2018

Observed behaviour

testing: /datapool/dima/Sage/gap/tst/testinstall/intarith.tst
########> Diff in /datapool/dima/Sage/gap/tst/testinstall/intarith.tst:948
# Input is:
Random(mysource, 1, 2^80);
# Expected output:
1071135285340982180054653
# But found:
2461224813218428180

(the other tests there pass)

Expected behaviour

No diff, obviously

Copy and paste GAP banner (to tell us about your setup)

 *********   GAP 4.8.8-5202-gb88ad0f of today
 *  GAP  *   https://www.gap-system.org
 *********   Architecture: sparc-sun-solaris2.11-default64
 Configuration:  gmp 6.1.2
 Loading the library and packages ...
 Packages:   AClib 1.2, Alnuth 3.1.0, AtlasRep 1.5.1, AutPGrp 1.8, CRISP 1.4.4, Cryst 4.1.13, CrystCat 1.1.6, CTblLib 1.2.2, 
             FactInt 1.6.1, FGA 1.3.1, GAPDoc 1.6.1, IRREDSOL 1.4, LAGUNA 3.8.0, Polenta 1.3.8, Polycyclic 2.11, PrimGrp 3.3.0, 
             RadiRoot 2.7, ResClasses 4.7.1, smallgrp 1.2, Sophus 1.23, SpinSym 1.5, TomLib 1.2.6, TransGrp 2.0.2, utils 0.53

Built with vanilla gcc 7.2.


Note that as #2274 is merged, this and #2194 are the only things to be fixed for testinstall pass on this platform.

#116 makes it hard to understand what's going on, precisely...

@ChrisJefferson
Copy link
Contributor

Would it be possible to bisect this issue if it only appeared in the last couple of weeks, or has it been around for a long time (I made some recentish random changes).

@dimpase
Copy link
Member Author

dimpase commented Mar 24, 2018

I don't recall seeing it on 4.9. I will bisect this.

@dimpase
Copy link
Member Author

dimpase commented Mar 26, 2018

It's still there on 4.9.0, but it does work on 4.8.6. I will try finding the exact commit.

It's also not in 4.8.10, so one has to chase the 4.9 branch.

@dimpase
Copy link
Member Author

dimpase commented Mar 29, 2018

Could you give me a reasonable starting point on the 4.9 branch to start bisecting from?

@fingolfin
Copy link
Member

@dimpase One thing you can try is to start bisecting after the big build system work was settled, that makes it much less painful.
From a quick look, perhaps something like 3e23772 would be a good starting point -- assuming it builds for you, and assuming the error does not occur there.

@fingolfin
Copy link
Member

BTW, is that machine a big endian machine by chance? That would give an important hint: it would suggest that some of the kernel random code is not sufficiently endian safe (e.g. the mersenne twister). It then probably wasn't in GAP 4.8 either, but we also changed which random sources we use to be more consistent, which could have exposed the problem.

This is just a wild guess, though.

@dimpase
Copy link
Member Author

dimpase commented Apr 9, 2018

Yes, the endianness is big, as is the case with SPARCs.

@fingolfin
Copy link
Member

OK. (Note that SPARC v9 and later is bi-endian, so given the information in this issue, I was not able to able to tell the endianess).

@fingolfin
Copy link
Member

A quick look at the kernel code, specifically FuncRandomIntegerMT, suggests that my guess may be correct: that function does not seem to take endianess into account; even if the underlying mersenne twister code does, still FuncRandomIntegerMT writes data 4 bytes at a time, which could lead to differing results on a 64bit big endian machine. But this is still guessing, I did not try to understand this code properly.

@dimpase
Copy link
Member Author

dimpase commented Apr 9, 2018

Just to be sure, here is the test:

$ python
Python 2.7.9 (default, Apr 27 2017, 13:50:01) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> from sys import byteorder
>>> print(byteorder)
big

@fingolfin
Copy link
Member

Excellent!
You could test if the mersenne twister code is involved or not by comparing with this snippet:

gap> state:=InitRandomMT("abc");;
gap> RandomIntegerMT(state, 80);
358386947634599872691919

If it gives identical results, we should look elsewhere; if not we have found the bug (or at least part of it).

@dimpase
Copy link
Member Author

dimpase commented Apr 12, 2018

test if the mersenne twister code is involved or not

Both GAP 4.8.6 and 4.9.0 on SPARC Solaris 11 return

1539244741327137841491211282754109

I guess it's hidden on 4.8.6 only because tests for this are not present.

@fingolfin
Copy link
Member

fingolfin commented Apr 12, 2018

So one needs to look at FuncRandomIntegerMT in src/integer.c, which does this:

Obj FuncRandomIntegerMT(Obj self, Obj mtstr, Obj nrbits)
{
  Obj res;
  Int i, n, q, r, qoff, len;
  UInt4 *mt, rand;
  UInt4 *pt;
...
  n = INT_INTOBJ(nrbits);
...
  else {
     /* large int case */
     q = n / 32;
     r = n - q * 32;
     /* qoff = number of 32 bit words we need */
     qoff = q + (r==0 ? 0:1);
     /* len = number of limbs we need (limbs currently are either 32 or 64 bit wide) */
     len = (qoff*4 +  sizeof(mp_limb_t) - 1) / sizeof(mp_limb_t);
     res = NewBag( T_INTPOS, len*sizeof(mp_limb_t) );
     pt = (UInt4*) ADDR_INT(res);
     mt = (UInt4*) CHARS_STRING(mtstr);
     for (i = 0; i < qoff; i++, pt++) {
       rand = (UInt4) nextrandMT_int32(mt);
       *pt = rand;  // <--- PROBLEM IS AROUND HERE ...
     }
     if (r != 0) {
       /* we generated too many random bits -- chop of the extra bits */
       pt = (UInt4*) ADDR_INT(res);
       pt[qoff-1] = pt[qoff-1] & ((UInt4)(-1) >> (32-r));  // <--- ... AND HERE
     }
     /* shrink bag if necessary */
     res = GMP_NORMALIZE(res);
     /* convert result if small int */
     res = GMP_REDUCE(res);
  }

  return res;
}

I marked the two problematic spots (roughly): this code creates a GMP bigint 32bits at a time. This ensures that we get identical results regardless of whether we are on a 32 or 64 bit system.

I am not sure whether we get identical results on 32 bit LE vs BE systems. Perhaps you could try it? i.e. can you build a 32bit SPARC versions of GAP and see what it outputs? If that already is broken, we need to change the byte order when writing in those two places.

But in addition, for 64bit BE, also the 32bit words need to be shuffled. To do this, in the first loop, one should call nextrandMT_int32 twice and swap the write order. It might be easier, though, to simply use an #ifdef SYS_IS_64_BIT to split the code into a 32bit version, and a 64bit version; and then change pt to be of type mp_limb_t; and then write 64bit at a time as follows:

   rand = (UInt4) nextrandMT_int32(mt);
  *pt = ((UInt8) nextrandMT_int32(mt) << 32) | rand;

(except in the last iteration, if the total number of 32 bit blocks is odd, then of course the high bits should be set to 0 instead.

And of course the final masking must be done carefully, too.

@dimpase
Copy link
Member Author

dimpase commented Apr 12, 2018

Why does GAP roll its own randoms here, instead of relying on GMP? Is it because GMP does not do a good job here?

@ChrisJefferson
Copy link
Contributor

Until a couple of months ago, gmp was an optional dependency. Also, we have to be careful using gmp functionality because we need to be careful about memory manager interactions.

@dimpase
Copy link
Member Author

dimpase commented Apr 12, 2018

I meant to say that it looks more practical to have this code replaced by an appropriate call to GMP's Mersenne's Twister gmp_randinit_mt, no?

@fingolfin
Copy link
Member

We understood; Chris' answer still applies. I would add that it is unclear to me whether GMP's MT returns identical results across BE vs LE and 32 vs 64 bit.

@dimpase
Copy link
Member Author

dimpase commented Apr 12, 2018

From reading GMP test code I gather it must be platform-independent: they test their generators against hard-coded sequences of numbers:
https://gmplib.org/repo/gmp-6.0/file/tip/tests/rand/t-rand.c
https://github.com/wbhart/mpir/blob/master/tests/rand/t-rand.c

Perhaps @wbhart could confirm this...

@wbhart
Copy link

wbhart commented Apr 12, 2018

I'm sorry, I don't know. I can't see any reason why MT should return different results across platforms. But I don't know about any other generators. It's not something I've thought about.

@ChrisJefferson
Copy link
Contributor

My reference is that there are many GMP functions we can't use, because they can internally allocate memory, or move buffers around. These (obviously) conflict with GAP's internal memory manager. Therefore we use a very small set of functions which allow us to control the memory management ourselves. At the least we would have to carefully check none of GMP's random functions ever allocate memory, and promise not to in future. I suspect i would be easier to fix our existing code :)

@ChrisJefferson
Copy link
Contributor

Update: Checked, GMP allocates in gmp_randinit_mt, so it's no good.

@fingolfin
Copy link
Member

@ChrisJefferson Actually, we already use a few GMP APIs that allocate memory, though we are careful to ensure they also deallocate afterwards (and I think we discussed this several times by now ;-). But yeah, it still something to be very careful about.

That said: The easiest way to find out is to try to implement it: @dimpase PRs are welcome!

@fingolfin
Copy link
Member

There is also the low-level API mpn_random which one might be able to use -- it accepts a pre-allocated buffer. It only allows specifying the amount of randomness in limbs, though, hence one has to mask off the extra random bits, but that's easy to do in a way that works on both BE and LE (and on 32 and 64 bit).

@fingolfin
Copy link
Member

(Of course this will still change the overall randomness output; also, one then also has to rewrite the API for initializing the RNG state, etc., which affects quite a bit of code; it's not rocket science, but also not completely trivial).

@dimpase
Copy link
Member Author

dimpase commented Apr 12, 2018

GMP allocates in gmp_randinit_mt, so it's no good.

Well, one can call it once per GAP session, before even doing anything with GASMAN, no?

@ChrisJefferson
Copy link
Contributor

I don't think we could clean up (easily), as I can make a random number generator, then want it GCing, so you'd have to store your own buffer somehow.

@ChrisJefferson
Copy link
Contributor

No, you can (and I do) make many independant random sources.

@dimpase
Copy link
Member Author

dimpase commented Apr 12, 2018

But GMP surely allows for a custom memory allocator. Naively, all one needs is to hook it up to GASMAN, no?

@fingolfin
Copy link
Member

In general, no, because it expects a non-moving allocator, which GASMAN is not. It might be OK in the particular case of the random number stuff, but one would have to analyze the GMP code to be sure, and that then of course is a fragile setup, because it might change in future versions of GMP.

All in all, I think it's much easier to just tweak the existing, well-tested RNG code in GAP for 64bit BE, than to fiddle with GMP.

@dimpase
Copy link
Member Author

dimpase commented Apr 14, 2018

Is there a significant difference between GASMAN and BoehmGC --- the one in HPC-GAP --- in this respect? (Excuse my ignorance).

GMP/MPIR is used in Macaulay2, which also uses BoehmGC; I don't know whether they use GMP's randoms. Perhaps @DanGrayson could enlighten us here.

@fingolfin
Copy link
Member

Yes, there is a significant difference here: Boehm GC is non moving.

It is really pointless to ask 3rd parties for expertise here. I strongly doubt that anybody knows more about the combination of GASMAN and GMP than Chris and me. We thought about similar problems a lot, and I spent considerable effort rewriting the GAP GMP integration recently. And while in principle it would be possible to use the GMP rng code here, it would not be easier than just fixing the problematic code; it would instead be a much bigger change and as sucgmh more likely to introduce regressions.

@dimpase
Copy link
Member Author

dimpase commented Apr 14, 2018

Yes, there is a significant difference here: Boehm GC is non moving.

Oh, OK, thanks, I see. So it probably would be reasonable to use GMP for HPC-GAP, but not in "classic" GAP.

@fingolfin
Copy link
Member

That would be possible, but I see no advantage in doing so, only disadvantages.

@ChrisJefferson
Copy link
Contributor

Also, I'd prefer (personally) unless there is a good reason to avoid pulling in too much GMP, as we might want one day to move to another large number library, and it would be nice if that wasn't too hard -- for example I have tried compiling GAP to WebAssembly (nothing releasable yet), and one of the blocking points has been GMP.

@dimpase
Copy link
Member Author

dimpase commented Apr 14, 2018

Isn't less code (assuming the "classic" one being retired) to maintain always an advantage?

@ChrisJefferson
Copy link
Contributor

ChrisJefferson commented Apr 14, 2018

This is now getting into philosophical issues, but I prefer to use external libraries only when they provide significant value in terms of complexity or just amount of code (so, I wouldn't want to write my own GMP, or my own graphics library). On the other hand, once code is out of your control, then you have to deal with the owners taking it in a direction you don't like -- for example whenever GAP does something which makes Sage's life more painful :)

Also, I expect non-HPC GAP to continue existing for several years yet -- it's not yet clear HPC-GAP will replace traditional GAP, or if they will continue to co-exist.

@stefnotch
Copy link

@ChrisJefferson I apologize for the ping, but I was curious if you have had any success with the WebAssembly version of Gap

@ChrisJefferson
Copy link
Contributor

Nothing of practical usefulness. I can make a GAP which works ( https://caj.host.cs.st-andrews.ac.uk/gap/gap.html ), but the interface is truely horrible, because I don't know anything about making Javascript interfaces

@dimpase
Copy link
Member Author

dimpase commented Jun 22, 2020

It would be nice to have a browser-run console terminal (and there are lots of these to choose from) to run this GAP. I guess it's not something novel, just matter to find a place to copy such a setup from.

@stefnotch
Copy link

stefnotch commented Jun 22, 2020

Oh, that's actually really neat. If you need help with making a simple interface, I'd gladly assist you.

@ChrisJefferson
Copy link
Contributor

We probably shouldn't be discussing this here :)

I have a docker image here: https://github.com/ChrisJefferson/gap-javascript/ Unfortunately, it has broken but could probably be fixed.

The problem with the javascript is (at least last time I looked) it was suprisingly hard to link the resulting GAP executable to a javascript interface / terminal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants