Shuffle

A DNA sequence shuffling program written by Edward H. Trager • Copyright (c) 2005 by the Regents of the University of Michigan, All Rights Reserved • Released under the GNU GPL.

Introduction

Shuffle is a utility written in optimized C++ to randomly shuffle a DNA sequence any number of times. The program uses a high-quality portable random number generator presented in Press et al.'s Numerical Recipes in C1 with a period greater than 2 x 1018. In addition, the program obtains seeds for the random number generator from the /dev/random or /dev/urandom entropy pools present on modern Linux and Unix machines. The user can decide from which device he or she wishes to obtain a starting seed.

The program accepts as input a file containing a sequence in the commonly-used FASTA format. The program shuffles this sequence as many times as desired. Output is produced in FASTA format. An ordinal is appended to the sequence identifier to identify each shuffled sequence in the output.

>TestSequence
CGTAAAAGCTAGCAATTAGGCAAGTCCATTGGATGCATTAGACTTAGACA
GTCAGCAGTCCCGGCAACACTCTCTCTCTCTCAGGACTTCTACTCATGCA
TGACGATCGCAGATCAACAGATCGGCAACCCCTTTGGTGTGTGTCCAAAT
CCTCCGGACTTCGGCGACCGCAGCAGGCGCACGCCGGCGATTACATATNN
    

Fig. 1. An example input file in FASTA format.

~> shuffle test.seq 3 TRUE_RANDOM
Entropy source is TRUE_RANDOM ...
Please enter some random keystrokes to gather entropy ...
Thank you! That's enough.
The SEED is: -1550659743
>TestSequence-1
CTTAAAAATTCTGCGCCCAGACACCCAGCCCGGGTACCACTACGGCCAGG
CCTATATGTTGTTCAGGTAGGAATAACCTGGTAACAATTCGGTTATTTTG
CACCTCCCTTAGCGGCGATCGGGGAAGACCCAACCTGATACAANCTTTCC
ACCCCCACCGGGCGATACCAGGGATNCTCTGCTTAGACAATCTGGGGTAA

>TestSequence-2
TGCGACATCGATTTTGTGCTCAGCACTGAGGTCACAATCCAAACCATACA
AGCGAACCCTCGTCATTTGCTAAGGCCCTCACTTAATAAGCCAACGGCCA
CCCAGTGNGAGGCTATCGAAGATTCCAAGTCTCGAGCGNCCGGCAATATG
TAAGTGGAAGAGGACTCCGCGCATCACCCTTCGCTGGATCTTGCTCTTCG

>TestSequence-3
CCCATTGCAAAGACACTCCTCGCTGCGACTACAGGTAATTATGGACGATT
ACTTTGCAGGCTGCGCTTCGTTTAGGGGCGCTTTGCACAATTTCCCGAAN
ACTCACGCGGCTCACACGTCGGGCATCGTGCGGAAATCTACTATAACAGA
CAGCAAAACCTGATATACTACCCAGGCTCTGACCTGCACNGAGTGGCACC
    

Fig. 2. Running shuffle, with resulting output.

Program Design

We wrote shuffle after discovering that the "shuffleseq" program in the EMBOSS suite was returning the same "randomized" sequence 3 or 4 times in a row when called from a shell script! This led us to suspect that the procedure used for getting a starting seed for the random number generator was based on a call to a system "time" function. Indeed, this is the case: In EMBOSS 2.1.0, which we had been using, a call to the system's time() function, which only has a resolution to seconds, is used in the ajRandomSeed() function. In the latest version of EMBOSS (version 2.9.0), this has been replaced with a call to gettimeofday() which has a resolution of microseconds. But even that is not good enough, or at the very least should make you raise your eyebrow in an expression of suspicion and doubt, because on today's fast servers a shell script might call shuffleseq repeatedly within the span of a microsecond or two.

Since we were already suspicious about the seeding procedure, we also began to wonder how far we could trust the randomization algorithm in EMBOSS's shuffleseq. We also looked at some other sequence shuffling programs, one which used a system-supplied rand(): system-supplied rand() functions are historically notorious for being botched.

Since we were concerned about botched seeding and randomization in other's utilities, we decided to write our own DNA shuffling utility.

To generate random numbers, we decided to implement the ran2 routine presented in Press, Flannery, Teukolsky, and Vetterling's Numerical Recipes in C: The Art of Scientific Computing (NR)1. This routine uses the random number generator of L'Ecuyer2 with Bays-Durham shuffle3 and added safeguards. Schrage's method4 is used to prevent overflows. We reimplemented the routine as a class in C++.

How good is NR's ran2 routine? The authors say:

We think that, within the limits of its floating-point precision, ran2 provides perfect random numbers; a practical definition of "perfect" is that we will pay $1000 to the first reader who convinces us otherwise (by finding a statistical test that ran2 fails in a nontrivial way, excluding the ordinary limitations of a machine's floating-point representation).5

In order to obtain random starting seeds for the generator, our class includes an entropy-gathering method which grabs entropy from either the /dev/random or /dev/urandom devices. /dev/random is an entropy pool which collects entropy from sources such as the interval between keystrokes. If entropy is not available, reading from /dev/random will block until more entropy becomes available on the system. In order to avoid this blocking behaviour, one can choose instead to read from /dev/urandom which returns entropy when available, or a pseudo-random number when the entropy pool has been exhausted.

In addition to providing good randomization, we have also optimized the code. Our first implementation of this program required approximately 58 seconds on our primary compute server to create 100,000 randomized shuffles of a 200 base pair sequence. Our second implementation required approximately 11 seconds to perform the same task. The current version requires just 9.6 seconds. The results that you obtain will depend on the speed of your server, the number of simulations, and the sequence length.


1. Press, Flannery, Teukolsky, & Vetterling. 1992. Numerical Recipes in C: The Art of Scientific Computing (NR). Cambridge University Press, ISBN 0-521-43108-5.

2. L'Ecuyer, P. 1988. Communications of the ACM, vol 31, pp.742-774.

3. Bays, Carter, & Durham, S.D. 1976. ACM Transactions on Mathematical Software, Vol. 2, p. 59.

4. Schrage, L. 1979. ACM Transactions on Mathematical Software, vol. 5, pp. 132-138.

5. Press et. al., NR, p. 281.

Tested Platforms

We have successfully compiled shuffle using both the older GCC g++ version 2.95, the newer version 3.3 and the latest version 3.4.3. We have successfully compiled shuffle on SuSE Linux 7.3, Cygwin on Windows XP on Intel machines, and on Sun Solaris 8 on the Sparc architecture.

Usage

Sequence Identifier

In the current version, the sequence identifier cannot contain spaces but must be one continuous string of letters, digits, or underscore characters.

Sequences

At present, the program will only process the first sequence present in a FASTA file. You can of course run the program on as many different files as you want. The sequence must begin on the line immediately following the FASTA sequence identifier line.

Output

At present, output is sent to standard out where it can be redirected into a file as necessary.

Output Sequences

Each shuffled sequence is identified using the original sequence identifier with an ordinal appended to the end. For example, if the original identifier was ">myoc", the program will generate ">myoc-1", ">myoc-2", ">myoc-3", etc.

Program Options

Arguments to the program are as follows:

shuffle <file_name> <number_of_simulations> <entropy_source> <starting_ordinal>
    

Fig. 3. Program arguments.

... where:

Download

The latest source code and documentation for shuffle can always be downloaded from http://eyegene.ophthy.med.umich.edu/shuffle/download/