# Introduction

xoroshiro128+ (XOR/rotate/shift/rotate) is the successor to xorshift128+. Instead of perpetuating Marsaglia's tradition of xorshift as a basic operation, xoroshiro128+ uses a carefully handcrafted shift/rotate-based linear transformation designed in collaboration with David Blackman. The result is a significant improvement in speed (well below a nanosecond per integer) and a significant improvement in statistical quality, as detected by the long-range tests of PractRand. xoroshiro128+ is our current suggestion for replacing low-quality generators commonly found in programming languages. It is the default generator in Erlang.

xorshift* generators are fast, high-quality PRNGs (pseudorandom number generators) obtained by scrambling the output of a Marsaglia xorshift generator with a 64-bit invertible multiplier (as suggested by Marsaglia in his paper). They are an excellent choice for all non-cryptographic applications, as they are very fast, have long periods and their output passes BigCrush from TestU01.

xorshift+ generators are a 64-bit version of Saito and Matsumoto's XSadd generator. Instead of using a multiplication, we return the sum (in Z/264Z) of two consecutive output of a xorshift generator. The reverse of XSadd fails systematically several BigCrush tests, whereas our xorshift128+ generator generates very quickly (see below) a 64-bit value passing BigCrush from TestU01. Albeit superseded by xoroshiro128+, xorshift128+ is presently used in the JavaScript engines of Chrome, Firefox, Safari, Microsoft Edge. If you find its period too short for large-scale parallel simulations, use xorshift1024*φ.

All generators, being based on linear recurrences, provide jump functions that make it possible to simulate any number of calls to the next-state function in constant time, once a suitable jump polynomial has been computed. We provide ready-made jump functions for a number of calls equal to the square root of the period, to make it easy generating non-overlapping sequences for parallel computations.

# ﻿A PRNG Shootout

This page contains a shootout of a few recent PRNGs (pseudorandom number generators) that are quite widely used. The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from statistical test suites is available for download.

## ﻿Speed

The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns. The tests were performed on an Intel® Core™ i7-7700 CPU @ 3.60GHz (Kaby Lake).

A few caveats:

• Timings are taken running a generator for billions of times in a loop; but this is not the way you use generators.
• There is some looping overhead, which is about 0.12 ns, but subtracting it from the timings is not going to be particularly meaningful due to instruction rescheduling, etc.
• Relative speed might be different on different CPUs and on different scenarios.
• Code has been compiled using gcc's -fno-move-loop-invariants and -fno-unroll-loops options. These options are essential to get a sensible result: without them, the compiler can move outside the testing loop constant loads (e.g., multiplicative constants) and may perform different loop unrolling depending on the generator. For this reason, we cannot provide timings with clang: there are at the time of this writing no such options. If you find timings that are significantly better than those shown here on comparable hardware, they are likely to be unreliable and just due to compiler artifacts.

To ease replicability, I distribute a harness performing the measurement. You just have to define a next() function and include the harness. But the only realistic suggestion is to try different generators in your application and see what happens.

## ﻿Quality

This is probably the more elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space of the generator and collect failures—tests in which the p-value statistics is outside the interval [0.001..0.999]. A failure is systematic if it happens at all points.

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values are passed to the test suite for each generated 64-bit value. Floating point numbers are generated instead by dividing the unsigned output of the generator by 264. Since this implies a bias towards the high bits (which is anyway a known characteristic of TestU01), we run the test suite also on the reverse generator and add up the resulting failures.

More detail about the whole process can be found in this paper.

PRNG Period Failures (std) Failures (rev) Overall Systematic ns/64 bits cycles/B
xoroshiro128+ 2128 − 131 27 580.810.36
xorshift128+ 2128 − 138 32 701.020.46
xorshift1024*φ 21024 − 137 39 761.210.55
SplitMix64 26435 45 801.250.56
Lehmer128 212639 38 771.310.60
PCG RXS M XS 64 (LCG) 26448 29 751.430.64
Ran ≈219132 42 741.910.86
xorgens4096 24096 − 142 40 821.910.86
MT19937-64 (Mersenne Twister) 219937 − 1 258 258 516LinearComp2.551.15
WELL1024a 21024 − 1 441 441 882MatrixRank, LinearComp8.964.09

I'll be happy to add to the list any PRNG that can improve significantly on the ones already listed. You can also install TestU01, download my code and start from there to do your own tests.

# ﻿Remarks

## A long period does not imply high quality

This is a common misconception. The generator x++ has period $$2^k$$, for any $$k\geq0$$, provided that x is represented using $$k$$ bits: nonetheless, it is a horrible generator. The generator returning $$k-1$$ zeroes followed by a one has period $$k$$.

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use $$t$$ values, you need a generator with period at least $$t^2$$. Moreover, if you run $$n$$ independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that $$n$$ sequences of $$t^2$$ elements starting at random positions overlap is very low.

Now, given a generator with period $$P$$, the probability that $$n$$ subsequences of length $$L$$ starting at random points in the state space overlap is $1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P,$ assuming that $$P$$ is large and $$nL/P$$ is close to zero. If your generator has period $$2^{256}$$ and you run on $$2^{64}$$ cores (you will never have them) a computation using $$2^{64}$$ pseudorandom numbers (you will never have the time) the probability of overlap would be less than $$2^{-64}$$.

In other words: any generator with a period beyond, say, $$2^{1024}$$ (just to stay again on the safe side) has a period is sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a larger state), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state with independent, uniformly distributed random bits).

In case the generator provides a jump function that lets you skip through chunks of the output in constant time, even a period of $$2^{128}$$ can be sufficient, as it provides $$2^{64}$$ non-overlapping sequences of length $$2^{64}$$.

## Equidistribution

A xorshift* generator with an n-bit state is n/64-dimensionally equidistributed: every n/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the best possible for 64-bit values). A xorshift+/xoroshiro+ generator is however only (n/64 − 1)-dimensionally equidistributed: every (n/64 − 1)-tuple of consecutive 64-bit values appears exactly 264 times in the output, except for a missing zero tuple.

## Generating uniform doubles in the unit interval

A standard double (64-bit) floating-point number in IEEE floating point format has 52 bits of mantissa, plus an implicit one at the left of the mantissa. Thus, even if there are 52 bits of mantissa, the representation can actually store numbers with 53 significant binary digits.

Because of this fact, in C99 a 64-bit unsigned integer x should be converted to a 64-bit double using the expression

    #include <stdint.h>

(x >> 11) * (1. / (UINT64_C(1) << 53))


In Java, the same result can be obtained with

    (x >>> 11) * 0x1.0p-53


This conversion guarantees that all dyadic rationals of the form k / 2−53 will be equally likely. Note that this conversion prefers the high bits of x, but you can alternatively use the lowest bits.

An alternative, faster multiplication-free operation is

    #include <stdint.h>

static inline double to_double(uint64_t x) {
const union { uint64_t i; double d; } u = { .i = UINT64_C(0x3FF) << 52 | x >> 12 };
return u.d - 1.0;
}


The code above cooks up by bit manipulation a real number in the interval [1..2), and then subtracts one to obtain a real number in the interval [0..1). If x is chosen uniformly among 64-bit integers, d is chosen uniformly among dyadic rationals of the form k / 2−52. This is the same technique used by generators providing directly doubles, such as the dSFMT.

This technique is extremely fast, but you will be generating half the values you could actually generate. The same problem plagues the dSFMT. All doubles generated will have the lowest mantissa bit set to zero (I must thank Raimo Niskanen from the Erlang team for making me notice this—a previous version of this site did not mention this issue).

In Java you can obtain an analogous result using suitable static methods:

    Double.longBitsToDouble(0x3FFL << 52 | x >>> 12) - 1.0


To adhere to the principle of least surprise, my implementations now use the multiplicative version, everywhere.

Interestingly, these are not the only notions of “uniformity” you can come up with. Another possibility is that of generating 1074-bit integers, normalize and return the nearest value representable as a 64-bit double (this is the theory—in practice, you will almost never use more than two integers per double as the remaining bits would not be representable). This approach guarantees that all representable doubles could be in principle generated, albeit not every returned double will appear with the same probability. A reference implementation can be found here. Note that unless your generator has at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect (e.g., it might never return zero).

## Why just sizes 128 and 1024?

First of all, using powers of two in high dimension is advantageous because the modulo operation used to update the cyclic counter that simulates the block shift of the whole state becomes a logical and.

For 64 bits, we suggest to use a SplitMix64 generator, but 64 bits of state are not enough for any serious purpose. Nonetheless, SplitMix64 is very useful to initialize the state of other generators starting from a 64-bit seed, as research has shown that initialization must be performed with a generator radically different in nature from the one initialized to avoid correlation on similar seeds.

128 works very well because you can just address directly the two 64-bit halves of the state instead of keeping a circular counter. This makes for the incredible speed of a xoroshiro128+ generator, which is a very reasonable choice for a general-purpose generator. Note that is very easy to embed a xoroshiro128+ generator in Java without creating an additional Java object, as the state can be represented by two long variables.

1024 seems the most reasonable choice for a general-purpose generator for massively parallel applications: it is large enough, but not uselessly large.

## What about the generator used in C, say by gcc?

Applying the same tests from the C API is impossible: the generator is not standardized, rand() returns a weird number of bits (usually 31) and there's not way to manipulate directly the state of the generator. Traditionally, C standard libraries used a linear congruential generator similar to java.util.Random. Apparently, recent versions of glibc use the SIMD version of MT19937.

# ﻿FUD

Since the introduction of the generators described in these pages and the PRNG shootout there has been an enormous amount of FUD (fear, uncertainty and doubt) spread in the Internet through blogs, forums, Reddit, and so on. I guess this is the result of having xorshift128+ in all browsers and Javascript engines (something similar happened when we measured Facebook's degrees or separation). In this page I discuss some technical issues about false statements, misleading statements and weasel words.

## Speed tests

Some speed tests appeared, showing radically different timings from the ones in my page. After checking, the explanation was usually:

• Different compiler options, for example loop unrolling enabled; since unrolling happens differently for different generators, it introduces noise.
• Too few iterations; some of the generators have sub-nanoseconds timings—it is simply nonsensical to measure their speed with less than a billion iterations, independently of whether you are computing wall time or clock cycles.
• No inlining; again, since some of the generators have sub-nanoseconds timings it is essential that they are measured inlined, or the function call will take a time comparable to that of the generator, flattening out the results.

And, once again: the only real benchmark is putting the generator in your application and see what happens.

## Predictability and “cracking” a non-secure PRNG

Some people is using the work “cracking” for the act of finding the internal state of a non-secure PRNG given its output, claiming to have “cracked” xoroshiro128+ and other generators.

This is a classic case of a weasel word: you cannot crack a non-secure PRNG more than you can crack a fig. Nobody in the literature ever used that word with such a meaning. You can crack a secure PRNG, though (and at that point it won't be secure anymore). That is sensible: it is like cracking a nut.

Indeed, all non-secure generators currently in use, like the Mersenne Twister, are trivially predictable because they have a clear structure leading to good properties. But anyone claiming to have “cracked the Mersenne Twister” would be immediately considered a crank.

Remember that there are secure PRNGs (like Fortuna) and non-secure PRNGs. The idea that there are “a little bit more secure” or a “a little bit more unpredictable” PRNGs is akin to the idea that you can be a little bit more or less pregnant: no, it does not work like that.

## Alleged failures in BigCrush and other statistical test suites

All generators described in these pages pass BigCrush from TestU01 without systematic failures. Since they are 64-bit generators, and BigCrush is a 32-bit suite, the output of the generator is passed in its entirety to the test suite by splitting into two 32-bit pieces. The two pieces are passed to the test as if they were generated consecutively. It is important to pass all the bits generated to the test suite because some dependency might be visible only if all bits are considered. This protocol is clearly explained in the papers, and the test code is available. Several people have replicated the results without problems.

Nonetheless, as explained in the papers and in the documentation, in isolation the two lowest bit of all generators will fail linearity tests (such as binary rank and linear compression), as the mechanism to eliminate such failures from all other bits does not work for the two lowest bits. For example, a generator based only on the lowest bit of xoroshiro128+ (using 64 calls to the base generator to obtain a 64-bit value) will fail linearity tests (MatrixRank in TestU01), as the lowest bit is an LFSR (linear-feeedback shift register) of degree 128. Note that it will not fail any other test—in fact, LFSRs are excellent PRNGs. The next bit has degree 8256, and in the long run will fail linearity tests, too.

The binary-rank test was invented by Marsaglia and Tsay in 1985 to find a statistical bias in LFSRs. The idea is to fill a matrix with the bits output by the LFSR, interpreting them as values in the field Z/2Z. If the source is random, the distribution of the rank of the matrix can be computed exactly. The binary-rank test generates a number of matrices and computes a p-value accordingly. (The linear compression test finds linear dependencies in a different way, which you can read in the TestU01 documentation.)

Now, every generator purely based on linear transformations on Z/2Z will fail such a test on a large enough matrix, because every bit of the generator is an LFSR. Such generators include classic generators, like the Mersenne Twister, which gcc and Python use as basic generator, the WELL family, xorshift generators, and so on. Indeed, these generators fail binary-rank or linear-compression tests (see above), which were designed to spot LFSRs. In all papers describing these generators, the authors (including Marsaglia itself) consider the failure of the binary-rank test irrelevant from a practical viewpoint, as the output of the generator is used as a number, not as a vector on Z/2Z. While other statistical tests in test suites are modeled from some behavior you expect from a random source of numbers (e.g., if you throw a bunch of random numbers in the unit intervals they should be spaced in a specific way), the binary-rank test has been designed “backwards” by looking at LSFRs and trying to find a statistical bias.

Linear generators have the advantage of making it easy to create large-period generators that fully use the state space and output all possible 64-bit values, have strong equidistribution properties, jump functions, and all bits with full period. Getting these properties is quite difficult using any other method.

The linear dependencies causing the failure of the binary-rank test can be reduced by applying a nonlinear operation on the output: for example, multiplying by a constant, or adding as 64-bit integers (thus, in the ring Z/264Z) two pieces of the state space. This is what happens in the generators described here. With more operations it could be possible to completely eliminate the linear artifacts, but I consider this a waste of time: as we've seen, some of the most used generators around have linear dependencies in every bit, and this is not considered a problem. Note also that when generating floating point values you never use the two lowest bits.

In fact, the remaining 62 bits of xoroshiro128+ in isolation pass the binary-rank test (and all other tests). This is why the whole generator passes BigCrush: the influence of two bits over 64 is not sufficient to trigger a statistical deviation in the rank of the matrices on Z/2Z of the size chosen by TestU01. It will be sufficient, however, if you throw away enough of the other bits. But you're not testing the generator—you're testing the lowest bits of the generator; and we already know that the lowest two bits have linear dependencies.