
How to generate a vector of random numbers on a GPU

Modern hardware for an old problem

Photo by Timothy Dykes on Unsplash

Random numbers are an essential part of the life of a data scientist. You must generate them, for instance, to initialise the weight matrices of neural networks, to perform data analysis using null models, or to run Monte Carlo simulations.

Imagine you have to generate many independent sequences of random numbers, say a set of Markov Chain Monte Carlo (MCMC) runs. In this kind of problem, given a starting point and a rule that generates state n+1 from state n, it is possible to sample many states of a given distribution. MCMC can also be used in Natural Language Processing [1].

The ideal hardware for performing the same task on multiple independent objects is a Graphics Processing Unit (GPU). GPUs can indeed be programmed to run multiple independent threads that perform the same operation concurrently.

Here I will describe a brief example of how to generate sequences of random numbers using the Apple Metal framework to run the operations on GPU hardware.

A vector of random numbers

The idea here is to generate a vector of independent, uniformly distributed random numbers. Each number could represent, for instance, the final state of a Markov chain.

The code is composed of three parts:

  • a vector of initial states (seeds) is prepared
  • the vector is passed to the GPU, and each thread runs a chain and outputs a single number (its final state)
  • the states are copied from the GPU back to the CPU and then to a file on disk to be analysed later

Running on GPU

I chose the Apple Metal framework to perform the operations described above on the GPU. The whole code can easily be translated to other frameworks such as OpenCL or CUDA.


To execute the pipeline against our data on Metal, we need to create a command buffer, write commands into it, and commit the buffer to a command queue. Metal will send the commands to the GPU to be executed.

In this example we will use the ComputeCommandBuffer, since our task is a compute task rather than a graphics task.

Our program, written in C++ and wrapping [2] the Metal code, will run efficiently on the GPU built into the Mac.

GPU usage statistic during computation. Image by Author.

Algorithms

The generation of (pseudo-)random numbers starts with an integer value called a seed. From there, different algorithms generate the subsequent numbers; the Mersenne Twister and xorshift are two examples.

Mersenne Twister algorithm

The Mersenne Twister is a pseudorandom number generator and is by far the most widely used general-purpose PRNG. It was developed in 1997 by Makoto Matsumoto and Takuji Nishimura and was designed specifically to rectify most of the flaws found in older generators.

The most commonly used version of the Mersenne Twister algorithm is based on the Mersenne prime 2¹⁹⁹³⁷ − 1. Its standard implementation, MT19937, uses a 32-bit word length.

In this example I will use a Metal implementation of Mersenne Twister.

Marsaglia’s xorshift

Another easy-to-implement algorithm for generating random numbers is xorshift [3]. It is a very fast type of generator. Its creator also suggested an improved variant, the xorwow generator, in which the output of a xorshift generator is combined with a Weyl sequence. The xorwow generator is the default generator in the cuRAND library of NVIDIA's CUDA application programming interface for graphics processing units.

A GPU implementation of xorshift is the following:

uint rand_xorshift(uint rng_state){
    // Marsaglia's 32-bit xorshift: three shift-XOR steps per call.
    rng_state ^= (rng_state << 13);
    rng_state ^= (rng_state >> 17);
    rng_state ^= (rng_state << 5);
    return rng_state;
}

This can easily be ported to other GPU programming languages such as OpenCL.

Examples and runs

In the example described below, the threads generate random numbers drawn from a uniform distribution with bounds 0 and 1000.

The Metal code below shows an example of generating 100 random numbers on each GPU thread and storing the last one in an array (vOut), which will later be accessed by the CPU code.

Each thread has its own seed, passed by the main CPU thread through the vIn array. Each thread then creates its own generator.

#include <metal_stdlib>
#include "mersenne.metal"

using namespace metal;

kernel void storyMersenne(
    const device uint *vIn [[ buffer(0) ]],
    device int *vOut       [[ buffer(1) ]],
    uint id                [[ thread_position_in_grid ]]) {

    mt19937 generator;
    generator(vIn[id]);          // seed this thread's generator

    uint r;
    for (uint l = 0; l < 100; l++) r = generator.rand();
    vOut[id] = r;                // keep only the final state
}

When each thread of the GPU has generated its random numbers, it is possible to look at their distribution and verify whether they are uniformly distributed as expected.

The figure shows that they are indeed uniformly distributed.

What about multiple runs?

Another criterion one can use to evaluate a random number generator is that there should be a high probability that independently generated sequences of random numbers differ from each other.

I tested this by generating multiple sequences of random numbers and studying the correlations between them. The figure shows that points generated in different runs are effectively uncorrelated.

Correlation of different runs. Image by Author.

Conclusions

There are many applications in which having a good source of random numbers is necessary. Generating them fast would be an advantage.

When your computations or Markov Chain Monte Carlo runs involve sequences of independent random numbers, and given that GPU hardware is optimised to perform many operations in parallel, generating them with algorithms implemented in Metal or OpenCL can be a good option.

A simple example shows that it is possible to generate uniformly distributed, uncorrelated random numbers easily by taking advantage of a MacBook's GPU.

References

[1] Gerlach, M., Peixoto, T. P., Altmann, E. G., A network approach to topic models (2018).

[2] Blevins, J. R., C++ Mersenne Twister wrapper class (2006).

[3] Marsaglia, G., Random number generators (2003), Journal of Modern Applied Statistical Methods.

