# Introduction

`xoroshiro128+`

(XOR/rotate/shift/rotate) is the
successor to `xorshift128+`

. Instead of perpetuating
Marsaglia's tradition of xorshift as a basic operation,
`xoroshiro128+`

uses a carefully handcrafted
shift/rotate-based linear transformation designed in collaboration with
David Blackman. The result is a significant improvement in speed (well
below a nanosecond per integer) and a significant improvement in
statistical quality, as detected by the long-range tests of PractRand.
`xoroshiro128+`

is our current suggestion for replacing
low-quality generators commonly found in programming languages. It is the
default generator in Erlang.

`xorshift*`

generators are fast, high-quality PRNGs (pseudorandom number
generators) obtained by scrambling the output of a Marsaglia `xorshift`

generator with a 64-bit invertible multiplier (as suggested by
Marsaglia in his paper). They are an excellent choice for all
non-cryptographic applications, as they are very fast, have long
periods and their output passes strong statistical test suites.

`xorshift+`

generators are a 64-bit version of Saito and Matsumoto's `XSadd`

generator. Instead of using a multiplication, we return the sum (in **Z**/2^{64}**Z**)
of two consecutive output of a `xorshift`

generator. The reverse of `XSadd`

fails systematically several BigCrush
tests, whereas our `xorshift128+`

generator generates a high-quality 64-bit value
very quickly (see below). Albeit superseded by `xoroshiro128+`

,
`xorshift128+`

is presently used in the JavaScript engines of
Chrome,
Firefox,
Safari,
Microsoft Edge.
If you find its period too short for
large-scale parallel simulations, use `xorshift1024*`

.

All generators, being based on linear recurrences, provide *jump
functions* that make it possible to simulate any number of calls to
the next-state function in constant time, once a suitable *jump
polynomial* has been computed. We provide ready-made jump functions for
a number of calls equal to the square root of the period, to make it easy
generating non-overlapping sequences for parallel computations.

**Warning**: the originally published jump function for `xorshift1024*`

contained a bug that made it work only when `p`

was zero. This has been corrected now: please propagate
the fix to your code if you're using it.

# A PRNG Shootout

This page contains a shootout of a few recent PRNGs (pseudorandom number generators) that are quite widely used. The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from statistical test suites is available for download.

## Speed

The speed reported in this page is the time required to emit 64 random bits. If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators (which will happen, ultimately). The tests were performed on an Intel® Core™ i7-7700 CPU @ 3.60GHz (Kaby Lake).

A few *caveats*:

- Timings are taken running a generator for billions of times in a loop; but this is not the way you use generators.
- There is some looping overhead, which is about 0.12 ns, but subtracting it from the timings is not going to be particularly meaningful due to instruction rescheduling, etc.
- Relative speed might be different on different CPUs and on different scenarios.
- Code has been compiled using
`gcc`

's`-fno-move-loop-invariants`

and`-fno-unroll-loops`

options. These options are*essential*to get a sensible result: without them, the compiler can move outside the testing loop constant loads (e.g., multiplicative constants) and may perform different loop unrolling depending on the generator. For this reason, we cannot provide timings with`clang`

: there are at the time of this writing no such options. If you find timings that are significantly better than those shown here on comparable hardware, they are likely to be unreliable and just due to compiler artifacts.

To ease replicability, I distribute a *harness* performing the measurement. You just
have to define a `next()`

function and include the harness. But the only realistic
suggestion is to try different generators in your application and see what happens.

## Quality

This is probably the more elusive property
of a PRNG. Here quality is measured using the powerful
BigCrush suite of tests. BigCrush is part of TestU01,
a monumental framework for testing PRNGs developed by Pierre L'Ecuyer
and Richard Simard (“TestU01: A C library for empirical testing
of random number generators”, *ACM Trans. Math. Softw.*
33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space
of the generator and collect *failures*—tests in which the
*p*-value statistics is outside the interval [0.001..0.999]. A failure
is *systematic* if it happens at all points.

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values
are passed to the test suite for each generated 64-bit value. Floating point numbers
are generated instead by dividing the unsigned output of the generator by 2^{64}.
Since this implies a bias towards the high bits (which is anyway a known characteristic
of TestU01), we run the test suite also on the *reverse* generator and
add up the resulting failures.

More detail about the whole process can be found in this paper.

PRNG | Period | Failures (std) | Failures (rev) | Overall | Systematic | ns/64 bits |
---|---|---|---|---|---|---|

`xoroshiro128+` | 2^{128} − 1 | 31 | 27 | 58 | — | 0.81 |

`xorshift128+` | 2^{128} − 1 | 38 | 32 | 70 | — | 1.02 |

`xorshift1024*` | 2^{1024} − 1 | 33 | 32 | 65 | — | 1.21 |

`Lehmer128` | 2^{126} | 39 | 38 | 77 | — | 1.31 |

PCG RXS M XS 64 (LCG) | 2^{64} | 48 | 29 | 75 | — | 1.43 |

SplitMix64 | 2^{64} | 35 | 45 | 80 | — | 1.35 |

`Ran` | ≈2^{191} | 32 | 42 | 74 | — | 1.91 |

`xorgens4096` | 2^{4096} − 1 | 42 | 40 | 82 | — | 1.91 |

`XSadd` | 2^{128} − 1 | 38 | 850 | 888 | LinearComp, MatrixRank, MaxOft, Permutation | 1.79 |

`MT19937-64` (Mersenne Twister) | 2^{19937} − 1 | 258 | 258 | 516 | LinearComp | 2.55 |

`WELL1024a` | 2^{1024} − 1 | 441 | 441 | 882 | MatrixRank, LinearComp | 8.96 |

I'll be happy to add to the list any PRNG that can improve significantly on the ones already listed. You can also install TestU01, download my code and start from there to do your own tests.

# Remarks

## A long period does not imply high quality

This is a common misconception. The generator `x++`

has
period \(2^k\), for any \(k\geq0\), provided that `x`

is
represented using \(k\) bits: nonetheless, it is a horrible generator.
The generator returning \(k-1\) zeroes followed by a one has period
\(k\).

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use \(t\) values, you need a generator with period at least \(t^2\). Moreover, if you run \(n\) independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that \(n\) sequences of \(t^2\) elements starting at random positions overlap is very low.

Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space overlap is \[ 1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P, \] assuming that \(P\) is large and \(nL/P\) is close to zero. If your generator has period \(2^{256}\) and you run on \(2^{64}\) cores (you will never have them) a computation using \(2^{64}\) pseudorandom numbers (you will never have the time) the probability of overlap would be less than \(2^{-64}\).

In other words: any generator with a period beyond, say, \(2^{1024}\) (just to stay again on the safe side) has a period is sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a larger state), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state with independent, uniformly distributed random bits).

In case the generator provides a *jump function* that lets you skip through chunks of the output in constant
time, even a period of \(2^{128}\) can be sufficient, as it provides \(2^{64}\) non-overlapping sequences of length \(2^{64}\).

## Equidistribution

A `xorshift*`

generator with an `n`-bit state is `n`/64-dimensionally equidistributed: every
`n`/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the
best possible for 64-bit values).
A `xorshift+`

/`xoroshiro+`

generator is however only (`n`/64 − 1)-dimensionally equidistributed: every
(`n`/64 − 1)-tuple of consecutive 64-bit values appears exactly 2^{64} times in the output, except for
a missing zero tuple.

## Generating uniform doubles in the unit interval

A standard double (64-bit) floating-point number in
IEEE floating point format has 52 bits of
mantissa, plus an implicit one at the left of the mantissa. Thus, even if there are 52 bits of mantissa,
the representation can actually store numbers with *53* significant binary digits.

Because of this fact, in C99 a 64-bit unsigned integer `x`

should be converted to a 64-bit double
using the expression

#include <stdint.h> (x >> 11) * (1. / (UINT64_C(1) << 53))

This conversion guarantees that all dyadic rationals of the form `k` / 2^{−53}
will be equally likely. Note that this conversion prefers the high bits of `x`

, but you can alternatively
use the lowest bits.

An alternative, faster multiplication-free operation is

#include <stdint.h> static inline double to_double(uint64_t x) { const union { uint64_t i; double d; } u = { .i = UINT64_C(0x3FF) << 52 | x >> 12 }; return u.d - 1.0; }

The code above cooks up by bit manipulation
a real number in the interval [1..2), and then subtracts
one to obtain a real number in the interval [0..1). If `x`

is chosen uniformly among 64-bit integers,
`d`

is chosen uniformly among dyadic rationals of the form `k` / 2^{−52}. This
is the same technique used by generators providing directly doubles, such as the
dSFMT.

This technique is extremely fast, but *you will be generating half the values you could actually generate*.
The same problem plagues the dSFMT. All doubles generated will have the lowest mantissa bit set to zero (I must
thank Raimo Niskanen from the Erlang team for making me notice this—a previous version of this site
did not mention this issue).

In Java you can obtain an analogous result using suitable static methods:

Double.longBitsToDouble(0x3FFL << 52 | x >>> 12) - 1.0

On current fast CPUs, in C the bit-based conversion is 10% faster than the divison-based method. In java, instead, it is 50% faster. My current Java implementations use the bit-based conversion, but in C the loss of a bit of precision does not seem to be worth the effort.

Interestingly, these are not the only notions of “uniformity” you can come up with. Another possibility
is that of generating 1074-bit integers, normalize and return the nearest value representable as a
64-bit double (this is the theory—in practice, you will almost never
use more than two integers per double as the remaining bits would not be representable). This approach guarantees that all
representable doubles could be in principle generated, albeit not every
returned double will appear with the same probability. A reference
implementation can be found here. Note that unless your generator has
at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect
(e.g., it might *never* return zero).

## Why just sizes 128 and 1024?

First of all, using powers of two in high dimension is advantageous because the modulo operation used to update the cyclic counter that simulates the block shift of the whole state becomes a logical and.

For 64 bits, we suggest to use a SplitMix64 generator, but 64 bits of state are not enough for any serious purpose. Nonetheless, SplitMix64 is very useful to initialize the state of other generators starting from a 64-bit seed, as research has shown that initialization must be performed with a generator radically different in nature from the one initialized to avoid correlation on similar seeds.

**128** works very well because you can just swap the two halves of the state
instead of keeping a circular counter. This makes for the incredible speed of a `xoroshiro128+`

generator,
which is a very reasonable choice for a general-purpose generator.
Note that is very easy to embed a `xoroshiro128+`

generator in Java without creating an additional Java object,
as the state can be represented by two `long`

variables.

**1024** seems the most reasonable choice for a general-purpose generator for massively
parallel applications: it is large enough, but not uselessly large.

## What about the generator used in C, say by `gcc`

?

Applying the same tests from the C API is impossible: the generator
is not standardized, `rand()`

returns a weird number of bits (usually 31) and there's
not way to manipulate directly the state of the generator.
Traditionally, C standard libraries used a linear congruential
generator similar to `java.util.Random`

. Apparently, recent
versions of `glibc`

use the SIMD version of `MT19937`

.