forked from primesearch/Mlucas
-
Notifications
You must be signed in to change notification settings - Fork 0
/
help.txt
642 lines (510 loc) · 37 KB
/
help.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
[***** This file best-viewed with a 4-column tab setting, with line-wrapping *****]
================= Mlucas v20.1.1 command line options listing =================
Note: besides the basic post-build self-test flags covered by the online README page
( http://www.mersenneforum.org/mayer/README.html ),
the most-crucial command-line flags the average user doing GIMPS production work
needs to be familiar with are '-cpu' and (for stage 2 of p-1 work), '-maxalloc'.
===============================================================================
Sections:
[0]: Default mode
[1]: Post-build self-testing for various FFT-length ranges
[2]: FFT-length setting
[3]: FFT radix-set specification
[4]: Mersenne-number primality testing
[5]: Fermat-number primality testing
[6]: Residue shift
[7]: Probable-primality testing mode
[8]: Iteration-number setting
[9]: P-1 Factoring:
[9a]: Setting maximum-percentage of system free-RAM to use per stage 2 instance
[9b]: How to run stage 2 continuation for a given stage 1 bound
[10]: Setting threadcount and CPU core affinity
[11]: User control options in mlucas.ini
[12]: Savefile format and creation
[13]: *** DON'Ts ***
[14]: General troubleshooting
[14a]: How to safely interrupt a running instance
===============================================================================
Symbol and abbreviation key:
<CR>: carriage return
| : separator for one-of-the-following multiple-choice menus
{} : denotes user-supplied numerical arguments of the type noted:
{int} means nonnegative integer, {+int} = positive int, {float} = float. Args not
enclosed in [] are required, in the sense that "if you invoke the command-line
flag in question, you must follow it with a value (or set thereof) as listed.
{arg1|arg2|...} means "pick one value from the set as the required numeric argument".
[] : encloses optional arguments.
The required/optional syntax is nested, e.g. for -cpu {lo[:hi[:incr]]},
all args {int}, 'lo' is required at a minimum. If the user wants a core-range,
':hi' is also required, and for a range-with-constant-stride, ':incr' is also
required.
[] May also be used to include optional chars in a given command-line flag name,
e.g. "-fft[len]" means "-fft" is a short-form alternative to "-fftlen".
Vertical stacking indicates argument short 'nickname' options, e.g. for
-argument : [blah blah]
-arg : [blah blah]
The stacking means that '-arg' can be used in place of '-argument'.
Prefixes for binary multiples: The prefixes K, M and G are used herein in the binary sense,
i.e. represent 2^10, 2^20 and 2^30 of the quantity in question. I eschew the more recently
adopted convention (https://physics.nist.gov/cuu/Units/binary.html) of Ki, Mi and Gi for same,
mainly because the corresponding pronounced prefixes sound silly to me: e.g. MiB stands for a
mebibyte, which to me sounds like a computer-geek joke punchline, "mebbe I'll have a byte to eat,
and mebbe I won't. Ha, ha, a million laughs, thanks for coming, you've been a swell audience, and
don't forget to generously tip the waitstaff."
===============================================================================
Supported command-line arguments:
[0]:
<CR> Default mode: looks for a worktodo.ini file in the local directory;
if none found, prompts for manual keyboard entry
======================
[1]: Post-build self-testing for various FFT-length ranges:
FOR BEST RESULTS, RUN ANY SELF-TESTS UNDER ZERO- OR CONSTANT-LOAD CONDITIONS.
The following self-test options will cause an mlucas.cfg file containing
the optimal FFT radix set for the runlength(s) tested to be created (if one did not
exist previously) or appended (if one did) with new timing data. Such a file-write is
triggered by each complete set of FFT radices available at a given FFT length being
tested, i.e. by a self-test without a user-specified -radset argument.
A user-specific Mersenne exponent may be supplied via the -m flag; if none is specified,
the program will use the largest permissible exponent for the given FFT length, based on
its internal length-setting algorithm. If the user does use -m to specify an exponent,
the -iters argument is also required.
The default number of iterations for the self-test is 100 for <= 4 threads and 1000 for more
than 4 threads. The user may also override the default via the -iters flag; while it is not
required, it is best to use one of the standard timing-test values of -iters = {100|1000|10000},
with the larger values being preferred for multithreaded timing tests, in order to minimize
noise in timing data.
Similarly, it is recommended to not use the -m flag for such tests, unless
roundoff error levels on a given compute platform are such that the default exponent at one or
more FFT lengths of interest prevents a reasonable sampling of available radix sets at same.
If the user lets the program set the exponent and uses one of the aforementioned standard
self-test iteration counts, the resulting best-timing FFT radix set will only be written to the
resulting mlucas.cfg file if the timing-test residues match the internally-stored precomputed
ones for the given default exponent at the iteration count in question, with eligible radix sets
consisting of those for which the roundoff error also remains below an acceptable threshold.
If the user instead specifies the exponent (only allowed for a single-FFT-length timing test)
and an iteration number (required in this case), the resulting best-timing FFT radix set will
only be written to the resulting mlucas.cfg file if the timing-test results match each other.
This is important for tuning code parameters to your particular platform.
Options - again note the user can override the default iteration count based on #threads via
'-iters {+int}', though only 100|1000|10000-iteration cases have precomputed reference residues.
The (very rough) time estimates are for 1000 iterations done using 4 or more cores.
-s {anything other than the mnemonics below}: Self-test, user must also supply exponent (via -m or -f),
iteration number and (optionally) FFT length to use.
-s tiny Runs 100 (<= 4 threads) or 1000-iteration self-tests on a set of 32 Mersenne exponents,
-s t covering FFT lengths 8K-120K. This will take around 1 minute on a fast CPU.
-s small Runs 100 or 1000-iteration self-tests on a set of 32 Mersenne exponents, covering
-s s FFT lengths 120K-1920K. This will take around 10 minutes on a fast CPU..
**************** THIS IS THE ONLY SELF-TEST ORDINARY USERS ARE RECOMMENDED TO DO: *************
* *
* -s medium Runs 100 or 1000-iteration self-tests on a set of 16 Mersenne exponents, covering *
* -s m FFT lengths from 2M to 7.5M. This will take around an hour on a fast CPU. *
* *
***********************************************************************************************
-s large Runs 100 or 1000-iteration self-tests on a set of 24 Mersenne exponents, covering
-s l FFT lengths 8M to 60M. This will take around 10 hours on a fast CPU.
-s all Runs 100 or 1000--iteration self-tests of all the above FFT lengths.
-s a This will take around 12 hours on a fast CPU.
-s xl [The 'h' form is short for 'huge'.] Runs 100 or 1000-iteration self-tests on a set
-s h of 16 M(p), covering FFT lengths 64M-240M. This will take several days on a fast CPU.
-s xxl [The 'e' form is short for 'egregious'.] Runs 100 or 1000-iteration self-tests on a set
-s e of 9 M(p), covering FFT lengths 256M-512M. This will take several days on a fast CPU, and
requires '-shift 0' also be added to the command line, which restriction will be removed
at a later date.
======================
[2]: FFT-length setting:
-fft[len] {+int[K|M]}
If {+int[K|M]} is one of the available FFT lengths, runs all available FFT radices
available at that length, unless the -radset flag is invoked (see below for details).
If -fft is invoked without the -iters flag, it is assumed the user wishes to do a
production run with a non-default FFT length, in this case the program requires a
valid worktodo.ini-file entry with exponent not more than 5% larger than the
default maximum for that FFT length.
Without the optional [K|M] alphabetic suffixes (i.e. with a pure-integer argument),
the code treats the numeric value as representing Kilodoubles. The code also supports
a floating-point numeric argument with either a 'K' or 'M' suffix. For example, the
following are all equivalent: -fftlen 5632, -fft 5632, -fft 5632K, -fft 5.5M, and all
result in an FFT having length 5.5 × 2^20 = 5632 × 2^10 = 5767168 doubles. Any numeric
value must map to a supported FFT length; for Mlucas these are of form k × 2^n, where k
is an odd integer in the set (1,3,5,7,9,11,13,15).
If -fft is invoked with a user-supplied value of -iters but without a
user-supplied exponent, the program will do the specified number of iterations
using the default self-test Mersenne or Fermat-number exponent for that FFT length.
If -fft is invoked with a user-supplied value of -iters and either the
-m or -f flag and a user-supplied exponent, the program will do the specified
number of iterations of either the Lucas-Lehmer test with starting value 4 (-m)
or the Pe'pin test with starting value 3 (-f) on the user-specified modulus.
In either of the latter 2 cases, the program will produce a cfg-file entry based
on the timing results, assuming at least 50% of the available radix sets at the
given FFT length ran the specified #iters to completion without suffering a fatal
error of some kind, e.g. excessive roundoff error, mismatching residues versus-
tabulated, or "No AVX-512 support; Skipping this leading radix" for certain smaller
leading radices.
Use this to find the optimal radix set for a single FFT length on your hardware.
NOTE: If you use other than the default modulus or #iters for such a single-fft-
length timing test, it is up to you to manually verify that the residues output
match for all fft radix combinations and that the roundoff errors are reasonable.
======================
[3]: FFT radix-set specification:
-radset {+int[comma-separated list +int,...,+int]}
Requires a supported value of -fft {+int}[K|M] to also be specified, and a value of
-iters. If this argument is invoked, a single-FFT-length and single-set-of-FFT-radices
timing test is assumed. If a single {+int} argument is supplied, this indicates the
specific index of a set of complex FFT radices to use, based on the big select table
in the function get_fft_radices().
Optionally, the -radset flag can take an actual set of comma-separated FFT radices.
Said radix set must be one of those present in the aforementioned select table for the
FFT length in question.
For example, the following are equivalent, for -fft 6M:
-radset 1
-radset 192,16,32,32
======================
[4]: Mersenne-number primality testing:
-m {+int}
Performs a Lucas-Lehmer primality test of the Mersenne number M(int) = 2^int - 1,
where int must be an odd prime. If -iters is also invoked, this indicates a timing test,
and allows for suitable added arguments (optionally -fft, and if that is invoked,
optionally, -radset) to be supplied.
If the -fft option (and optionally -radset) is also invoked but -iters is not, the
program first checks the first eligible (any line whose first non-whitespace caharcter
is alphabetic is treated as such) line of the worktodo.ini file to see if the assignment
specified there is a Lucas-Lehmer test with the same exponent as specified via the -m
argument. If so, the -fft argument is treated as a user override of the default FFT
length for the exponent. If -radset is also invoked, this is similarly treated as a user-
specified radix set for the user-set FFT length; otherwise the program will use the
mlucas.cfg file to select the radix set to be used for the user-forced FFT length.
If the first eligible worktodo.ini file entry does not specify an LL-test (i.e. does not
begin with "Test=" or "DoubleCheck=") of an exponent matching the -m value, a set of
timing self-tests is run on the user-specified Mersenne number using all sets of FFT
radices available at the specified FFT length.
If the -fft option is not invoked, the self-tests use all sets of FFT radices available at
that exponent's default FFT length. Users can use this to find the optimal radix set for a
given Mersenne number exponent on their hardware, similarly to the -fft option. Performs
as many iterations as specified via the -iters flag (which is required if -m is invoked).
======================
[5]: Fermat-number primality testing:
-f {+int}
Performs a base-3 Pe'pin test on the Fermat number F(num) = 2^(2^num) + 1.
If desired this can be invoked together with the -fft option, as for the Mersenne-number
self-tests (see notes about the -m flag). Note that not all FFT lengths supported for -m
are available for -f; the supported ones are of form k × 2^n, where k is an odd integer
in the set [1,7,15,63].
Optimal radix sets and timings are written to a fermat.cfg file. Performs as many iterations
as specified via the -iters flag (which is required if -f is invoked).
======================
[6]: Residue shift:
-shift {+int}
Number of bits by which to shift the initial seed (= iteration-0 residue). This initial
shift count is doubled (modulo the binary exponent of the modulus being tested) each
iteration; for Fermat-number tests the mod-doubling is further augmented by addition of a
random bit, in order to keep the shift count from landing on 0 after (Fermat-number index)
iterations and remaining 0. (Cf. https://mersenneforum.org/showthread.php?p=582525#post582525)
Savefile residues are rightward-shifted by the current shift count
before being written to the file; thus savefiles contain the unshifted residue, and
separately the current shift count, which the program uses to leftward-shift the
savefile residue when the program is restarted from interrupt.
The shift count is a 32-bit unsigned int; any modulus having > 2^32 bits (thus using an
FFT length of 256M or larger) requires 0 shift.
======================
[7]: Probable-primality testing mode:
-prp [(+int)base]
This flag takes no further arguments, just invokes PRP-test mode for the specified
Mersenne-number self-test(s). This means that instead of running the rigorous Lucas-Lehmer
primality test, a "pure-squarings-modified Fermat-PRP test" (cf. next paragraph) is run
instead, to either a base specified via the optional numeric argument following the -prp
flag, or to a default base of 3.
The standard Fermat-PRP test of a number N consists of selecting a base b coprime to N and
checking whether b^(N-1) == 1 (mod N), in which case N is a probable prime (PRP) to base b.
For a Mersenne number N = M(p) = 2^p-1, N-1 has a binary representation 2^p-2 = 111...110,
(p-1) binary ones followed by a single 0. Starting with initial seed x = b - which must be
coprime to M(p) and further not equal to a power of 2, since due to their binary form all
M(p) pass a Fermat-PRP test to base 2^k, whether M(p) is prime or not - the standard Fermat-
PRP test implemented as a left-to-right binary modular powering consisting of (p-2) iterations
of form x := b*x^2 (a squaring and scalar muliply-by-b, mod M(p)) plus a final mod-squaring
x := x^2 (mod M(p)), with M(p) being a probable-prime to base b if the result == 1.
However, the rigorous-as-far-as-we-know "Gerbicz error check" requires a pure sequence
of mod squarings, thus we replace the standard Fermat-PRP test with a modified one whereby
we check whether b^(N+1) == b^2 (mod N). For N = M(p) we have N+1 = 2^p, thus this variant
requires (p-1) pure mod-squarings. Any self-tests done in PRP mode will do the first (iters)
of these.
======================
[8]: Iteration-number setting:
-iters {+int}
Do {+int} self-test iterations of the type determined by the modulus-related options:
-s/-m means Lucas-Lehmer test iterations with initial seed 4, -f means Pe'pin-test
squarings with initial seed 3.
======================
[9]: P-1 Factoring:
[9a]: Setting maximum-percentage of system free-RAM to use for p-1 stage 2 work per instance:
-maxalloc {+int}
Maximum percentage of available system RAM to use per instance. Must be >= 10. Default = 90;
user-specified values greater than 90 are allowed, but only recommended if the program is
underestimating system free RAM for some reason (use the free-RAM field in the first few
lines of Linux 'top' output to check this), or if on a system whose amount of RAM allows
only a modest number of stage 2 buffers (less than 100, say), the default allows a buffer
count which is just below the next-larger allowable one. The buffer counts usable by Mlucas
(as of this writing, v20.x.y) are multiples of 24 and of 40; if you consult the table titled
"With small-prime relocation" inside the long comment block preceding the pm1_bigstep_size()
function in the pm1.c source file, you will see that the '%modmul' numbers (measured relative
to stage 2 run with the minimum 24-buffers and without small-prime relocation, which was a
late-added optimization to v20) for 24 and 40 buffers differ by 6%, which is appreciable.
Thus if at start of stage 2 you see something like "Available memory allows up to 39
Stage 2 residue-sized memory buffers. Using 24 Stage 2 memory buffers" in the console output,
I suggest "kill -9"ing the process and restarting with '-maxalloc 100', seeing if that gives
you 40 buffers, then just keeping an eye on the 'top' output to check for swapping-to-disc
once the stage 2 inits have finished and prime-processing starts.
On the other hand '% modmul' counts for 40 and 48 buffers differ by less than 1%, so
it's not worth trying to force the higher value to be used. (The next significant gain to
be had is at 72 = 3*24 buffers.)
Under MacOS the default is 50% of available (not free) RAM.
If the system is swapping between RAM and HD/SSD during stage 2, as evidenced by
free-RAM dropping to near-0 and 'kswapd' entries appearing among the CPU-%-sorted
of the 'top' output, the value needs to be lowered. If the default is leaving plenty
of available RAM and the resulting buffer count is under 100, it is worth trying to force
a higher fraction to be used, by invoking -maxalloc [some value > 50].
-pm1_s2_nbuf {+int}
Since available RAM fluctuates depending on current load, this flag alternatively
allows the user to set the maximum number of p-1 stage 2 buffers to use per instance.
Currently, the number of stage 2 buffers must be a multiple of 24 or 40; if the user-
set maximum value is not such, the largest such multiple <= the user-specified value
is used for stage 2 work. For stage 2 restarts there is an added constraint related
to small-prime relocation, namely that if stage 2 was begun with a multiple of 24 or
40 buffers, the restart-value must also be a multiple of the same base-count, 24 or 40.
Said constraint will be automatically enforced. If the resulting buffer count exhausts
available memory, performance will suffer due to system memory-swapping, thus this flag
should only be invoked by users who know what they are doing.
Only one of the 2 flags may be set via the command line.
These 2 flags are only important in the context of stage 2 of p-1 factoring, which
will be done automatically before a Lucas-Lehmer primality or probable-primality-test
if the GIMPS assignment in question indicates that some p-1 effort is warranted.
[9b]: How to run stage 2 continuation for a given stage 1 bound:
On completion of an initial p-1 run with stage bounds B1 and B2 and a contiguous stage 2
(i.e. one which covered the interval [B1,B2,b2]), the program leaves the respective p-1 residues
in a pair of savefiles named p[exp].s1 and p[exp].s2, respectively. The former of these may
be re-used for one or more deeper stage 2 interval runs, in distributed fashion (multiple
program instances, each of which processes a given stage 2 interal so as to cover a desired
expanded prime interval), if desired. Say the original run used a worktodo.ini assignment
Pminus1={aid,}k,b,n,c,B1,B2
where {aid} is a 32-hexit assignment ID (which may be omitted or filled with 'n/a' - the
quotes are for emphasis only, they must not appear in the actual assignment - if the user
wishes to create such an assignment for him-or-herself). Here, k,n,b,c define the desired
modulus (for prime-exponent Mersenne number M(p) = 2^p-1, k = 1, n = 2, b = p, c = -1; for
Fermat number F[m] = 2^(2^m)+1, k = 1, n = 2, b = 2^m, c = +1) and B1 and B2 are the p-1
stage bounds. If the original used a Pfactor assignment:
Pfactor={aid},k,b,n,c,TF_BITS,ll_tests_saved_if_factor_found
then you must read the resulting p-1 stage bounds from the run log file, p[exp].stat .
Once you have the stage original-run stage bounds in hand, for a desired stage 2 continuation
interval with bounds [B2_start, B2], the proper assignment syntax is
Pminus1={aid},k,b,n,c,B1,B2,TF_BITS,B2_start[,known_factors]
where any known factors should be in form of a comma-separated list of known bookended
with "", e.g. known factors f1 and f2 appear as "f1,f2". These will not be reported if
found in any of the GCD steps which test for a factor having been found, and the run
will only exit if a new factor (one not appearing in the known_factors list) is found.
======================
[10]: Setting threadcount and CPU core affinity:
Note: As of this writing (v20.x.y) setting core affinity is not effective when running
on Windows Subsystem for Linux (WSL), presumably due to virtualization. Processes
will core-hop, negatively impacting efficiency. Also, on WSL, core count is limited
to 64 due to MS Windows' "processor group" construct. Effectively though, it is less;
in actuality on a 68-core & x4 HT Xeon Phi 7250 for example, a running Mlucas instance
and the WSL session in which it runs will occupy at most 16 cores x their 4 hyperthreads,
or 4 cores x their 4 hyperthreads, depending on which processor group is allocated.
This limitation should not adversely affect most users, since for FFT lengths repreentative
of the current GIMPS testing wavefront, more than 16 threades per instance is counterproductive
in terms of throughput: I generally recommend at most 4 physical cores per instance, with 1, 2
or - if available - 4 threads per core, depending on which threads-per-core number gives the
best throughput.
==========================
For Mlucas v21 and later, it is urged that users install the hwloc freeware library on OSes
which support it - see the online Mlucas README.html page for how-to. Hwloc maps the hardware
topology in form of a tree, which makes assigning physical cores and threads on them to a
multithreaded instance simple in terms of indexing, and obviates the differing logical-core-
numbering conventions used by the leading x86-CPU manufacturers Intel and AMD.
(Recommended variant of affinity setting:)
-core {lo:hi[:threads_per_core]}
(All args {+int} here.) Set thread/CPU affinity. This flag and the -cpu and -nthread flags
are mutually exclusive! If -core is used, the threadcount is inferred from the numeric-argument
triplet which follows as #threads = (hi - lo + 1)*threads_per_core. Here, in terms of ascending
complexity:
o '-core' not invoked: run 1-threaded with affinity to the first available logical core.
o '-core' invoked, only the 'lo' argument of the triplet supplied: "run 1-threaded with
affinity to the first available thread (= logical core) on physical processor {lo}."
o Both the 'lo' and 'hi' arguments of the triplet are supplied but 'threads_per_core'
is omitted: "run 1 thread on each of physical processors 'lo' through 'hi', for a total
of (hi - lo + 1) threads."
o All three argument are supplied: "run 'threads_per_core' threads on each of physical
processors 'lo' through 'hi', for a total of (hi - lo + 1)*threads_per_core threads."
==========================
(Obsolescent - not recommended, please use the -cpu flag instead:)
-nthread {int}
For multithread-enabled builds, run with this many threads.
If the user does not specify a thread count, the default is to run single-threaded
with that thread's affinity set to logical core 0.
AFFINITY: The code will attempt to set the affinity of the resulting threads
0:n-1 to the same-indexed processor cores - whether this means distinct physical
cores is entirely up to the CPU vendor - E.g. Intel uses such a numbering scheme
but AMD does not. For this reason as of v17 this option is deprecated in favor of
the -cpu flag, whose usage is detailed below, with the online README page providing
guidance for the core-numbering schemes of popular CPU vendors.
If n exceeds the available number of logical processor cores (call it #cpu), the
program will halt with an error message.
For greater control over affinity setting, use the -cpu option, which supports two
distinct core-specification syntaxes (which may be mixed together), as follows:
(Recommended variant of affinity setting:)
-cpu {lo[:hi[:incr]]}
(All args {+int} here.) Set thread/CPU affinity. NOTE: This flag and -nthread are
mutually exclusive! If -cpu is used, the threadcount is inferred from the numeric-argument
triplet which follows. If only the 'lo' argument of the triplet is supplied, it means
"run single-threaded with affinity to logical core {lo}." (Note that in the absence
of hyperthreading, logical and physical cores are the same.)
If the increment (third) argument of the triplet is omitted, it defaults to incr = 1.
The CPU set encoded by the integer-triplet argument to -cpu corresponds to the
values of the integer loop index i in the C-loop for(i = lo; i <= hi; i += incr),
excluding the loop-exit value of i. Thus '-cpu 0:3' and '-cpu 0:3:1' are both
exactly equivalent to '-nthread 4', whereas '-cpu 0:6:2' and '-cpu 0:7:2' both
specify affinity setting to logical cores 0,2,4,6, assuming said cores exist.
Lastly, note that no whitespace is permitted within the colon-separated numeric field.
-cpu {triplet0[,triplet1,...]} This is simply an extended version of the above affinity-
setting syntax in which each of the comma-separated 'triplet' subfields is in the above
form and, analogously to the one-triplet-only version, no whitespace is permitted within
the colon-and-comma-separated numeric field. Thus '-cpu 0:3,8:11' and '-cpu 0:3:1,8:11:1'
both specify an 8-threaded run with affinity set to logical core quartets 0-3 and 8-11,
whereas '-cpu 0:3:2,8:11:2' means run 4-threaded on cores 0,2,8,10. As described for the
-nthread option, it is an error for any core index to exceed the available number of logical
processor cores.
======================
[11]: User control options in mlucas.ini:
Any mlucas.ini file in the user's run directory will be checked at program (re)start
for the supported options listed below. Each such option is assumed to be formatted as
[ws][optname][ws][=][ws][value][ws][comment]
with [ws] denoting whitespace and [value] a numeric value parseable and representable
as an IEEE-754 64-bit double-precision float. The numeric value may be followed by anything,
so long as it does not affect parsing of the preceding numeric entry. For instance, mlucas.ini
on a low_RAM might contain
CheckInterval = 10000 /* I'm using the space right of the value for a C-style comment. */
LowMem = 1 // But one could also use C++ comment format...
InterimGCD = 0 # Or bash-shell and Python-style comment format...
LowMem = 2 Or no comment delimiter at all. Note that the program uses the first entry
matching a given option name, so the second LowMem entry here will be ignored.
This specifies savefile-writes for LL/PRP/p-1 every 10000 iterations, and allows PRP-testing
but excludes p-1 stage 2. The InterimGCD option setting is moot in this case since it only
applies to p-1 stage 2.
Note that the option-name and comment formats in mlucas.ini differ from those in local.ini -
the latter is read and updated by the several primenet.py work-management scripts described
on the online Mlucas README.html file, thus the flag names therein should be all-lowercase
and comments must conform to Python rules (https://docs.python.org/3/library/configparser.html):
"Configuration files may include comments, prefixed by specific characters (# and ;).
Comments may appear on their own in an otherwise empty line, or may be entered in lines
holding values or section names. In the latter case, they need to be preceded by a
whitespace character to be recognized as a comment. (For backwards compatibility,
only ; starts an inline comment, while # does not.)"
o FREQUENCY OF SAVEFILE WRITES: The default frequency is threadcount-dependent: every 10000
iterations for <= 4 threads; every 100,000 iterations for more than 4 threads. (Note that
as of v20, if exponent ratio in the "p[ = ***]/pmax_rec" informational printed at run-start
is > 0.97, the program will set ITERS_BETWEEN_CHECKPOINTS to the smaller of any user-set
value or 10000, irrespective of the threadcount.
At run-start, you will see this captured in the informational terminal output (which gets
piped to nohup.out if you prefix your instance invocation with "nohup", as is recommended):
Set affinity for the following 4 cores: 0.1.2.3.
NTHREADS = 4
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
For p-1 factoring, slightly different terminology is used for .stat-file entries documenting
savefile-writes: "S1 bit = ..." reflects which bit of the p-1 stage 1 small-prime-powers
product (whose bitlength is roughly 1.4x the stage 1 bound B1) for stage 1, and "S2 at q = ..."
reflecting which stage 2 primes have been processed (prime less than and nearest the printed
value). However, in all three cases the underlying savefile frequency is based on the same
metric: the number of mod-M(p) multiplies done during the current task. Thus p-1 stage 1
checkpoints will occur at roughly the same wall-clock frequency as for LL and PRP-test ones;
the stage 2 savefile-update frequency will be perhaps 10-20% slower, reflecting the fact that
while LL/PRP/stage-1 all do chains of in-place mod-M(p) "autosquarings", p-1 stage 2 does
2-input mod-M(p) multiplies, each of which requires 2x more data to stream between the CPU
and the cache+memory subsystem.
The ITERS_BETWEEN_CHECKPOINTS value can be customized by adding a "CheckInterval = [value]"
line to one's mlucas.ini file, but note that there are constraints on the value related to
the Gerbicz-checking done for PRP tests. Specifically, the CheckInterval value must be a
multiple of 1000 and must divide 1 million. Violation of these constraints will trigger an
assertion-exit if a PRP-test is attempted. The only good reason to use a value < 10000 is
on slow devices where 10000 iterations need substantially more than, say 10 minutes, as
checkpoints more frequent than a minute or so will cost 1-2% overall throughput.
o RUNNING ON LOW-MEMORY SYSTEMS: The LowMem option provides two supported low-memory run
modes for low-RAM systems:
LowMem = 1 allows PRP-testing but excludes p-1 stage 2. For a given exponent, this
mode will use 2-4x as much working memory for PRP-tests as it does for LL-tests.
LowMem = 2 excludes both PRP-testing and p-1 stage 2. This is the minimum-memory option.
For QA-testing purposes, this allows self-tests to some modest number of iterations to
be done up to the maximum supported FFT length of 512M on systems with 8GB of memory.
o DISABLING INTERIM GCDs IN P-1 STAGE 2: Set InterimGCD = 0 in mlucas.ini. This will cause
the program to wait until any p-1 stage 2 is finished to take a GCD (check for a factor),
irrespective of the depth of the stage.
======================
[12]: Savefile format and creation:
Mlucas creates savefiles (a.k.a. "checkpoint" files) at regular intervals - cf. section [11]
for how to override the default value of same - to permit safe restart with in case of a run
being interrupted for any reason, such as the host machine going down or a program crash. All
savefile data are stored in bytewise minimum-size and endian-independent form, according to the
schema implemented in the [read|write]_ppm1_savefiles function pair in the Mlucas.c source file.
(Cf. the long comment "Set of functions to Read/Write full-length residue data in bytewise format"
preceding said set of functions.)
Such savefile writes are reflected in the run logfile (.stat file) latest-progress summary lines.
o PRP tests save both a test current-residue value and a Gerbicz error check residue, thus are
roughly twice the size of those for LL-tests and p-1 factoring savefiles.
o LL, PRP-test and p-1 stage 1 residues are stored in redundant savefile pairs. For work on
the Mersenne number M[exponent] = 2^[exponent] - 1, these files are named p[exponent] and
q[exponent]. At the conclusion of a p-1 factoring run, the primary p[exponent] savefile is
renamed p[exponent].s1. The reason for this is twofold:
[1] So an ensuing LL or PRP-test of the same exponent (assuming a factor was not found),
which uses the same-named savefile pair, does not overwrite the p-1 stage 1 data. "Why
might I want to save those?" you ask. Here's why:
[2] The stage result in the .s1 file permits optional later stage 2 continuation runs:
Say you ran p-1 on a low-memory system, such that only stage 1 was run. Later, you install
more RAM, which allows for more efficient stage 2 work. At that point, it may make sense
to rerun some exponents to deeper stage 2 bounds. See section [9b] for how to do this.
o P-1 runs also store a file p[exponent].s1_prod encoding the precomputed small-prime-powers
product used for the stage 1 left-to-right modular binary powering. This can save time on
restart for large stage 1 bounds, say B1 = 5 million or larger, since as of this writing I have
not made a special effort to optimize the computation of said product beyond use of 64-bit
integer hardware multiply where available. (Speeding things by replacing the O(n^2) "grammar
school" multiply with a subquadratic one is a possible future enhancement.) For a given
stage 1 bound B1, the s1_prod file will be roughly 0.2*B1 bytes in size.
======================
[13]: *** DON'Ts ***
o DON'T omit actually *reading* - not 'skimming' - the latest version of the README.html.
This is especially important for new users - Mlucas is not a one-or-two-click "do everything
for me" program. Experienced users will still want to peruse the online readme page,
especially for details about the latest releases.
o DON'T skip the post-build self-test step.
o DON'T run multiple Mlucas instances in a given run directory.
======================
[14]: General troubleshooting:
Please start by looking for posts about your issue in the
mersenneforum.org thread specific to the release you are using in the mersenneforum.org
Mlucas subforum at https://www.mersenneforum.org/forumdisplay.php?f=118.
If you don't find anything, make a post describing your problem.
[14a]: How to safely interrupt a running instance:
Note that halting (as opposed to merely suspending using 'kill -CONT' as described below) an Mlucas
instance should produce a clean "at end of the current iteration, write savefiles and exit" for LL-test,
PRP and p-1 stage 1 processing. For p-1 stage 2, the state machine involved is of sufficient complexity
that I have not (yet) implemented a clean-exit-with-savefile-write: an interrupt will cause you to lose
whatever work has been done since the last stage 2 savefile (p[expo].s2) write.
If you need to halt all Mlucas instances running on a system, use 'killall -STOP Mlucas' to merely suspend
processing and 'killall -CONT Mlucas' to resume; to instead terminate all instances, use 'killall Mlucas'.
To halt just one or a subset of multiple running instances, use 'pidof Mlucas' to find the process ID (pid).
If more than one ID comes up and you only want to interrupt or halt one or a subset thereof, you need
to first figure out which process IDs map to the desired subset. If you used a batch shell script to
start multiple instances, the resulting process IDs should end up being in ascending numeric order.
Once you have identified the desired process IDs:
[1] If you only wish to suspend processing and resume it later, use 'kill -STOP [pid]' to suspend
and 'kill -CONT [pid]' to resume, where [pid] refers to the process ID. Use "man kill" for detailed
info regarding the kill command and the flags it takes - note that it is not Mlucas but rather the
OS which listens for this particular pair of signal types.
[2] If you wish to kill the process(es), use 'Ctrl-c' for a foreground Mlucas process, which sends
a SIGINT signal. For a background process, use 'kill [pid]' -- the default signal for kill is SIGTERM,
which is another one of the signal types the program listens for. To kill all Mlucas instances running
on a given system, use 'killall Mlucas'.
If for any reason you end up with an Mlucas instance which is not responding to the above kinds of
interrupt signal, use the "kill it with fire" option, 'kill -9 [pid]'. If it's a run which you want
to resume at some later time, you can minimize lost runtime by waiting until the current iteration
interval completes, i.e. until the latest savefile update occurs.
For the current list of signal types which the program listens for, see the sig_handler() function near
the top of Mlucas.c . As of this writing (v20.1.1), the program listens for INT,TERM,HUP,ALRM,USR1 and USR2.
======================
Last updated: 28 Nov 2021