# Introduction

`xoroshiro128+`

(XOR/rotate/shift/rotate) is the
successor to `xorshift128+`

. Instead of perpetuating
Marsaglia's tradition of xorshift as a basic operation,
`xoroshiro128+`

uses a carefully handcrafted
shift/rotate-based linear transformation designed in collaboration with
David Blackman. The result is a significant improvement in speed (well
below a nanosecond per integer) and a significant improvement in
statistical quality, as detected by the long-range tests of PractRand.
`xoroshiro128+`

is our current suggestion for replacing
low-quality generators commonly found in programming languages. It is the
default generator in Erlang.

`xorshift*`

generators are fast, high-quality PRNGs (pseudorandom number
generators) obtained by scrambling the output of a Marsaglia `xorshift`

generator with a 64-bit invertible multiplier (as suggested by
Marsaglia in his paper). They are an excellent choice for all
non-cryptographic applications, as they are very fast, have long
periods and their output passes BigCrush from TestU01.

`xorshift+`

generators are a 64-bit version of Saito and Matsumoto's `XSadd`

generator. Instead of using a multiplication, we return the sum (in **Z**/2^{64}**Z**)
of two consecutive output of a `xorshift`

generator. The reverse of `XSadd`

fails systematically several BigCrush
tests, whereas our `xorshift128+`

generator generates very quickly
(see below) a 64-bit value passing BigCrush from TestU01.
Albeit superseded by `xoroshiro128+`

,
`xorshift128+`

is presently used in the JavaScript engines of
Chrome,
Firefox,
Safari,
Microsoft Edge.
If you find its period too short for
large-scale parallel simulations, use `xorshift1024*φ`

.

All generators, being based on linear recurrences, provide *jump
functions* that make it possible to simulate any number of calls to
the next-state function in constant time, once a suitable *jump
polynomial* has been computed. We provide ready-made jump functions for
a number of calls equal to the square root of the period, to make it easy
generating non-overlapping sequences for parallel computations.

# A PRNG Shootout

This page contains a shootout of a few recent PRNGs (pseudorandom number generators) that are quite widely used. The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from statistical test suites is available for download.

## Speed

The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns. The tests were performed on an Intel® Core™ i7-7700 CPU @ 3.60GHz (Kaby Lake).

A few *caveats*:

- Timings are taken running a generator for billions of times in a loop; but this is not the way you use generators.
- There is some looping overhead, which is about 0.12 ns, but subtracting it from the timings is not going to be particularly meaningful due to instruction rescheduling, etc.
- Relative speed might be different on different CPUs and on different scenarios.
- Code has been compiled using
`gcc`

's`-fno-move-loop-invariants`

and`-fno-unroll-loops`

options. These options are*essential*to get a sensible result: without them, the compiler can move outside the testing loop constant loads (e.g., multiplicative constants) and may perform different loop unrolling depending on the generator. For this reason, we cannot provide timings with`clang`

: there are at the time of this writing no such options. If you find timings that are significantly better than those shown here on comparable hardware, they are likely to be unreliable and just due to compiler artifacts.

To ease replicability, I distribute a *harness* performing the measurement. You just
have to define a `next()`

function and include the harness. But the only realistic
suggestion is to try different generators in your application and see what happens.

## Quality

This is probably the more elusive property
of a PRNG. Here quality is measured using the powerful
BigCrush suite of tests. BigCrush is part of TestU01,
a monumental framework for testing PRNGs developed by Pierre L'Ecuyer
and Richard Simard (“TestU01: A C library for empirical testing
of random number generators”, *ACM Trans. Math. Softw.*
33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space
of the generator and collect *failures*—tests in which the
*p*-value statistics is outside the interval [0.001..0.999]. A failure
is *systematic* if it happens at all points.

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values
are passed to the test suite for each generated 64-bit value. Floating point numbers
are generated instead by dividing the unsigned output of the generator by 2^{64}.
Since this implies a bias towards the high bits (which is anyway a known characteristic
of TestU01), we run the test suite also on the *reverse* generator and
add up the resulting failures.

More detail about the whole process can be found in this paper.

PRNG | Period | Failures (std) | Failures (rev) | Overall | Systematic | ns/64 bits | cycles/B |
---|---|---|---|---|---|---|---|

`xoroshiro128+` | 2^{128} − 1 | 31 | 27 | 58 | — | 0.81 | 0.36 |

`xorshift128+` | 2^{128} − 1 | 38 | 32 | 70 | — | 1.02 | 0.46 |

`xorshift1024*φ` | 2^{1024} − 1 | 37 | 39 | 76 | — | 1.21 | 0.55 |

SplitMix64 | 2^{64} | 35 | 45 | 80 | — | 1.25 | 0.56 |

`Lehmer128` | 2^{126} | 39 | 38 | 77 | — | 1.31 | 0.60 |

PCG RXS M XS 64 (LCG) | 2^{64} | 48 | 29 | 75 | — | 1.43 | 0.64 |

`Ran` | ≈2^{191} | 32 | 42 | 74 | — | 1.91 | 0.86 |

`xorgens4096` | 2^{4096} − 1 | 42 | 40 | 82 | — | 1.91 | 0.86 |

`MT19937-64` (Mersenne Twister) | 2^{19937} − 1 | 258 | 258 | 516 | LinearComp | 2.55 | 1.15 |

`WELL1024a` | 2^{1024} − 1 | 441 | 441 | 882 | MatrixRank, LinearComp | 8.96 | 4.09 |

I'll be happy to add to the list any PRNG that can improve significantly on the ones already listed. You can also install TestU01, download my code and start from there to do your own tests.

# Remarks

## A long period does not imply high quality

This is a common misconception. The generator `x++`

has
period \(2^k\), for any \(k\geq0\), provided that `x`

is
represented using \(k\) bits: nonetheless, it is a horrible generator.
The generator returning \(k-1\) zeroes followed by a one has period
\(k\).

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use \(t\) values, you need a generator with period at least \(t^2\). Moreover, if you run \(n\) independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that \(n\) sequences of \(t^2\) elements starting at random positions overlap is very low.

Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space overlap is \[ 1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P, \] assuming that \(P\) is large and \(nL/P\) is close to zero. If your generator has period \(2^{256}\) and you run on \(2^{64}\) cores (you will never have them) a computation using \(2^{64}\) pseudorandom numbers (you will never have the time) the probability of overlap would be less than \(2^{-64}\).

In other words: any generator with a period beyond, say, \(2^{1024}\) (just to stay again on the safe side) has a period is sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a larger state), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state with independent, uniformly distributed random bits).

In case the generator provides a *jump function* that lets you skip through chunks of the output in constant
time, even a period of \(2^{128}\) can be sufficient, as it provides \(2^{64}\) non-overlapping sequences of length \(2^{64}\).

## Equidistribution

A `xorshift*`

generator with an `n`-bit state is `n`/64-dimensionally equidistributed: every
`n`/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the
best possible for 64-bit values).
A `xorshift+`

/`xoroshiro+`

generator is however only (`n`/64 − 1)-dimensionally equidistributed: every
(`n`/64 − 1)-tuple of consecutive 64-bit values appears exactly 2^{64} times in the output, except for
a missing zero tuple.

## Generating uniform doubles in the unit interval

A standard double (64-bit) floating-point number in
IEEE floating point format has 52 bits of
mantissa, plus an implicit one at the left of the mantissa. Thus, even if there are 52 bits of mantissa,
the representation can actually store numbers with *53* significant binary digits.

Because of this fact, in C99 a 64-bit unsigned integer `x`

should be converted to a 64-bit double
using the expression

#include <stdint.h> (x >> 11) * (1. / (UINT64_C(1) << 53))

In Java, the same result can be obtained with

(x >>> 11) * 0x1.0p-53

This conversion guarantees that all dyadic rationals of the form `k` / 2^{−53}
will be equally likely. Note that this conversion prefers the high bits of `x`

, but you can alternatively
use the lowest bits.

An alternative, faster multiplication-free operation is

#include <stdint.h> static inline double to_double(uint64_t x) { const union { uint64_t i; double d; } u = { .i = UINT64_C(0x3FF) << 52 | x >> 12 }; return u.d - 1.0; }

The code above cooks up by bit manipulation
a real number in the interval [1..2), and then subtracts
one to obtain a real number in the interval [0..1). If `x`

is chosen uniformly among 64-bit integers,
`d`

is chosen uniformly among dyadic rationals of the form `k` / 2^{−52}. This
is the same technique used by generators providing directly doubles, such as the
dSFMT.

This technique is extremely fast, but *you will be generating half the values you could actually generate*.
The same problem plagues the dSFMT. All doubles generated will have the lowest mantissa bit set to zero (I must
thank Raimo Niskanen from the Erlang team for making me notice this—a previous version of this site
did not mention this issue).

In Java you can obtain an analogous result using suitable static methods:

Double.longBitsToDouble(0x3FFL << 52 | x >>> 12) - 1.0

To adhere to the principle of least surprise, my implementations now use the multiplicative version, everywhere.

Interestingly, these are not the only notions of “uniformity” you can come up with. Another possibility
is that of generating 1074-bit integers, normalize and return the nearest value representable as a
64-bit double (this is the theory—in practice, you will almost never
use more than two integers per double as the remaining bits would not be representable). This approach guarantees that all
representable doubles could be in principle generated, albeit not every
returned double will appear with the same probability. A reference
implementation can be found here. Note that unless your generator has
at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect
(e.g., it might *never* return zero).

## Why just sizes 128 and 1024?

First of all, using powers of two in high dimension is advantageous because the modulo operation used to update the cyclic counter that simulates the block shift of the whole state becomes a logical and.

For 64 bits, we suggest to use a SplitMix64 generator, but 64 bits of state are not enough for any serious purpose. Nonetheless, SplitMix64 is very useful to initialize the state of other generators starting from a 64-bit seed, as research has shown that initialization must be performed with a generator radically different in nature from the one initialized to avoid correlation on similar seeds.

**128** works very well because you can just address directly the two 64-bit halves of the state
instead of keeping a circular counter. This makes for the incredible speed of a `xoroshiro128+`

generator,
which is a very reasonable choice for a general-purpose generator.
Note that is very easy to embed a `xoroshiro128+`

generator in Java without creating an additional Java object,
as the state can be represented by two `long`

variables.

**1024** seems the most reasonable choice for a general-purpose generator for massively
parallel applications: it is large enough, but not uselessly large.

## What about the generator used in C, say by `gcc`

?

Applying the same tests from the C API is impossible: the generator
is not standardized, `rand()`

returns a weird number of bits (usually 31) and there's
not way to manipulate directly the state of the generator.
Traditionally, C standard libraries used a linear congruential
generator similar to `java.util.Random`

. Apparently, recent
versions of `glibc`

use the SIMD version of `MT19937`

.

# FUD

Since the introduction of the generators described in these pages
and the PRNG shootout there has been an enormous amount of FUD (fear,
uncertainty and doubt) spread in the Internet through blogs, forums,
Reddit, and so on. I guess this is the result of having `xorshift128+`

in all browsers
and Javascript engines (something similar happened when we measured
Facebook's degrees or separation). In this page I discuss some technical
issues about false statements, misleading statements and weasel words.

## Speed tests

Some speed tests appeared, showing radically different timings from the ones in my page. After checking, the explanation was usually:

- Different compiler options, for example loop unrolling enabled; since unrolling happens differently for different generators, it introduces noise.
- Too few iterations; some of the generators have sub-nanoseconds timings—it is simply nonsensical to measure their speed with less than a billion iterations, independently of whether you are computing wall time or clock cycles.
- No inlining; again, since some of the generators have sub-nanoseconds timings it is essential that they are measured inlined, or the function call will take a time comparable to that of the generator, flattening out the results.

And, once again: the only real benchmark is putting the generator in *your*
application and see what happens.

## Predictability and “cracking” a non-secure PRNG

Some people is using the work “cracking” for the act of finding the
internal state of a non-secure PRNG given its output, claiming to have “cracked”
`xoroshiro128+`

and other generators.

This is a classic case of a weasel word: you cannot crack a non-secure PRNG more than you can crack a fig. Nobody in the literature ever used that word with such a meaning. You can crack a secure PRNG, though (and at that point it won't be secure anymore). That is sensible: it is like cracking a nut.

Indeed, all non-secure generators currently in use, like the Mersenne Twister, are trivially predictable because they have a clear structure leading to good properties. But anyone claiming to have “cracked the Mersenne Twister” would be immediately considered a crank.

Remember that there are secure PRNGs (like Fortuna) and non-secure PRNGs. The idea that there are “a little bit more secure” or a “a little bit more unpredictable” PRNGs is akin to the idea that you can be a little bit more or less pregnant: no, it does not work like that.

## Alleged failures in BigCrush and other statistical test suites

All generators described in these pages pass BigCrush from TestU01 without systematic failures. Since they are 64-bit generators, and BigCrush is a 32-bit suite, the output of the generator is passed in its entirety to the test suite by splitting into two 32-bit pieces. The two pieces are passed to the test as if they were generated consecutively. It is important to pass all the bits generated to the test suite because some dependency might be visible only if all bits are considered. This protocol is clearly explained in the papers, and the test code is available. Several people have replicated the results without problems.

Nonetheless, as explained in the papers and in the documentation, in
isolation the two lowest bit of all generators will fail binary-rank
tests, and possible linear compression tests, as the mechanism to
eliminate such failures from all other bits does not work for the two
lowest bits. For example, a generator based only on the lowest bit of
`xoroshiro128+`

(using 64 calls to the base generator to
obtain a 64-bit value) will fail binary-rank tests (MatrixRank in
TestU01), as the lowest bit is an LFSR (linear-feeedback shift
register). Note that it won't fail any other test—in fact, LFSRs
are excellent PRNGs.

The binary-rank
test was invented by Marsaglia and Tsay in 1985 to find a statistical
bias in LFSRs. The idea is to fill
a matrix with the bits output by the LFSR, interpreting them as
values in the field **Z**/2**Z**. If the source is random, the distribution
of the rank of the matrix can be computed exactly. The binary-rank
test generates a number of matrices and computes a *p*-value
accordingly. (The linear compression test finds linear dependencies
in a different way, which you can read in the TestU01 documentation.)

Now, every generator purely based on linear transformations on
**Z**/2**Z** will fail such a test on a large enough matrix, because
*every bit* of the generator is an LFSR.
Such generators include classic generators, like the Mersenne Twister,
which `gcc`

and Python use as basic generator, the WELL
family, `xorshift`

generators, and so on. Indeed, these
generators fail binary-rank or linear-compression tests (see above),
which were designed to spot LFSRs. In all papers
describing these generators, the authors (including Marsaglia itself)
consider the failure of the binary-rank test irrelevant from a
practical viewpoint, as the output of the generator is used as a number,
not as a vector on **Z**/2**Z**. While other
statistical tests in test suites are modeled from some behavior
you expect from a random source of numbers (e.g., if you throw a bunch of random numbers
in the unit intervals they should be spaced in a specific way),
the binary-rank test has been designed “backwards” by
looking at LSFRs and trying to find a statistical bias.

Linear generators have the advantage of making it easy to create large-period generators that fully use the state space and output all possible 64-bit values, have strong equidistribution properties, jump functions, and all bits with full period. Getting these properties is quite difficult using any other method.

The linear dependencies causing the failure of the binary-rank test
can be reduced by applying a nonlinear operation
on the output: for example, multiplying by a constant, or adding as
64-bit integers (thus, in the ring **Z**/2^{64}**Z**) two
pieces of the state space. This is what happens in the generators described
here. With more operations it could be possible to completely eliminate
the linear artifacts, but I consider this a waste
of time: as we've seen, some of the most used generators around have linear dependencies
*in every bit*, and this is not considered a problem. Note also
that when generating floating point values you never use the two
lowest bits.

In fact, the remaining 62 bits of `xoroshiro128+`

in isolation pass the
binary-rank test (and all other tests). This is why the
whole generator passes BigCrush: the influence of two bits over
64 is not sufficient to trigger a statistical deviation in the
rank of the matrices on **Z**/2**Z** of the size chosen by TestU01.
It will be sufficient, however, if you throw away enough of
the other bits. But you're not testing the generator—you're testing
the lowest bits of the generator; and we already know that the lowest
two bits have linear dependencies.