Introduction

xoroshiro128+ (XOR/rotate/shift/rotate) is the successor to xorshift128+. Instead of perpetuating Marsaglia's tradition of xorshift as a basic operation, xoroshiro128+ uses a carefully handcrafted shift/rotate-based linear transformation designed in collaboration with David Blackman. The result is a significant improvement in speed (well below a nanosecond per integer) and a significant improvement in statistical quality, as detected by the long-range tests of PractRand. xoroshiro128+ is our current suggestion for replacing low-quality generators commonly found in programming languages.

xorshift* generators are fast, high-quality PRNGs (pseudorandom number generators) obtained by scrambling the output of a Marsaglia xorshift generator with a 64-bit invertible multiplier (as suggested by Marsaglia in his paper). They are an excellent choice for all non-cryptographic applications, as they are very fast, have long periods and their output passes strong statistical test suites.

xorshift+ generators are a 64-bit version of Saito and Matsumoto's XSadd generator. Instead of using a multiplication, we return the sum (in Z/264Z) of two consecutive output of a xorshift generator. The reverse of XSadd fails systematically several BigCrush tests, whereas our xorshift128+ generator generates a high-quality 64-bit value very quickly (see below). Albeit superseded by xoroshiro128+, xorshift128+ is presently used in the JavaScript engines of Chrome, Firefox and Safari, and it is the default generator in Erlang. If you find its period too short for large-scale parallel simulations, use xorshift1024*.

All generators, being based on linear recurrences, provide jump functions that make it possible to simulate any number of calls to the next-state function in constant time, once a suitable jump polynomial has been computed. We provide ready-made jump functions for a number of calls equal to the square root of the period, to make it easy generating non-overlapping sequences for parallel computations.

Warning: the originally published jump function for xorshift1024* contained a bug that made it work only when p was zero. This has been corrected now: please propagate the fix to your code if you're using it.

A PRNG Shootout

This page contains a shootout of a few recent PRNGs (pseudorandom number generators) that are quite widely used. The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from statistical test suites is available for download.

Speed

The speed reported in this page is the time required to emit 64 random bits. If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators (which will happen, ultimately). The tests were performed on an Intel® Core™ i7-4770 CPU @3.40GHz (Haswell).

A few caveats:

To ease replicability, I distribute a harness performing the measurement. You just have to define a next() function and include the harness. But the only realistic suggestion is to try different generators in your application and see what happens.

Quality

This is probably the more elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space of the generator and collect failures—tests in which the p-value statistics is outside the interval [0.001..0.999]. A failure is systematic if it happens at all points.

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values are passed to the test suite for each generated 64-bit value. Floating point numbers are generated instead by dividing the unsigned output of the generator by 264. Since this implies a bias towards the high bits (which is anyway a known characteristic of TestU01), we run the test suite also on the reverse generator and add up the resulting failures.

More detail about the whole process can be found in this paper.

PRNG Period Failures (std) Failures (rev) Overall Systematic ns/64 bits
xoroshiro128+ 2128 − 131 27 580.87
xorshift128+ 2128 − 138 32 701.06
xorshift1024* 21024 − 133 32 651.34
Lehmer128 212639 38 771.41
PCG RXS M XS 64 (LCG) 26448 29 751.43
SplitMix64 26435 45 801.93
Ran ≈219132 42 742.06
xorgens4096 24096 − 142 40 822.06
XSadd 2128 − 1 38 850 888LinearComp, MatrixRank, MaxOft, Permutation2.06
MT19937-64 (Mersenne Twister) 219937 − 1 258 258 516LinearComp2.66
WELL1024a 21024 − 1 441 441 882MatrixRank, LinearComp10.56

I'll be happy to add to the list any PRNG that can improve significantly on the ones already listed. You can also install TestU01, download my code and start from there to do your own tests.

Remarks

A long period does not imply high quality

This is a common misconception. The generator x++ has period \(2^k\), for any \(k\geq0\), provided that x is represented using \(k\) bits: nonetheless, it is a horrible generator. The generator returning \(k-1\) zeroes followed by a one has period \(k\).

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use \(t\) values, you need a generator with period at least \(t^2\). Moreover, if you run \(n\) independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that \(n\) sequences of \(t^2\) elements starting at random positions overlap is very low.

Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space overlap is \[ 1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P, \] assuming that \(P\) is large and \(nL/P\) is close to zero. If your generator has period \(2^{256}\) and you run on \(2^{64}\) cores (you will never have them) a computation using \(2^{64}\) pseudorandom numbers (you will never have the time) the probability of overlap would be less than \(2^{-64}\).

In other words: any generator with a period beyond, say, \(2^{1024}\) (just to stay again on the safe side) has a period is sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a larger state), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state with independent, uniformly distributed random bits).

In case the generator provides a jump function that lets you skip through chunks of the output in constant time, even a period of \(2^{128}\) can be sufficient, as it provides \(2^{64}\) non-overlapping sequences of length \(2^{64}\).

Equidistribution

A xorshift* generator with an n-bit state is n/64-dimensionally equidistributed: every n/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the best possible for 64-bit values). A xorshift+/xoroshiro+ generator is however only (n/64 − 1)-dimensionally equidistributed: every (n/64 − 1)-tuple of consecutive 64-bit values appears exactly 264 times in the output, except for a missing zero tuple.

Generating uniform doubles in the unit interval

With the exception of generators designed to provide directly double-precision floating-point numbers, the fastest way to convert in C99 a 64-bit unsigned integer x to a 64-bit double is

    #include <stdint.h>

    static inline double to_double(uint64_t x) {
       const union { uint64_t i; double d; } u = { .i = UINT64_C(0x3FF) << 52 | x >> 12 };
       return u.d - 1.0;
    }

The code above cooks up by bit manipulation a real number in the interval [1..2), and then subtracts one to obtain a real number in the interval [0..1). If x is chosen uniformly among 64-bit integers, d is chosen uniformly among dyadic rationals of the form k / 2−52.

In Java you can obtain an analogous result using suitable static methods:

    Double.longBitsToDouble(0x3FFL << 52 | x >>> 12) - 1.0

Interestingly, this is not the only notion of “uniformity” you can come up with. Another possibility is that of generating 1074-bit integers, normalize and return the nearest value representable as a 64-bit double (this is the theory—in practice, you will almost never use more than two integers per double as the remaining bits would not be representable). This approach guarantees that all representable doubles could be in principle generated, albeit not every returned double will appear with the same probability. A reference implementation can be found here. Note that unless your generator has at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect (e.g., it might never return zero).

Why just sizes 128 and 1024?

First of all, using powers of two in high dimension is advantageous because the modulo operation used to update the cyclic counter that simulates the block shift of the whole state becomes a logical and.

For 64 bits, we suggest to use a SplitMix64 generator, but 64 bits of state are not enough for any serious purpose. Nonetheless, SplitMix64 is very useful to initialize the state of other generators starting from a 64-bit seed, as research has shown that initialization must be performed with a generator radically different in nature from the one initialized to avoid correlation on similar seeds.

128 works very well because you can just swap the two halves of the state instead of keeping a circular counter. This makes for the incredible speed of a xoroshiro128+ generator, which is a very reasonable choice for a general-purpose generator. Note that is very easy to embed a xoroshiro128+ generator in Java without creating an additional Java object, as the state can be represented by two long variables.

1024 seems the most reasonable choice for a general-purpose generator for massively parallel applications: it is large enough, but not uselessly large.

What about the generator used in C, say by gcc?

Applying the same tests from the C API is impossible: the generator is not standardized, rand() returns a weird number of bits (usually 31) and there's not way to manipulate directly the state of the generator. Traditionally, C standard libraries used a linear congruential generator similar to java.util.Random. Apparently, recent versions of glibc use a trinomial-based linear-feedback shift register or even the SIMD version of MT19937.