Introduction
Shuffle is a utility written in optimized C++ to randomly shuffle a DNA sequence
any number of times. The program uses a high-quality portable random number generator
presented in Press et al.'s
Numerical Recipes in C1 with a period
greater than 2 x 1018. In addition, the program obtains seeds for the random
number generator from the /dev/random or /dev/urandom entropy pools present
on modern Linux and Unix machines. The user can decide from which device he or she wishes to obtain a
starting seed.
The program accepts as input a file containing a sequence in the commonly-used FASTA format. The program shuffles this sequence as many times as desired. Output is produced in FASTA format. An ordinal is appended to the sequence identifier to identify each shuffled sequence in the output.
>TestSequence
CGTAAAAGCTAGCAATTAGGCAAGTCCATTGGATGCATTAGACTTAGACA
GTCAGCAGTCCCGGCAACACTCTCTCTCTCTCAGGACTTCTACTCATGCA
TGACGATCGCAGATCAACAGATCGGCAACCCCTTTGGTGTGTGTCCAAAT
CCTCCGGACTTCGGCGACCGCAGCAGGCGCACGCCGGCGATTACATATNN
Fig. 1. An example input file in FASTA format.
~> shuffle test.seq 3 TRUE_RANDOM
Entropy source is TRUE_RANDOM ...
Please enter some random keystrokes to gather entropy ...
Thank you! That's enough.
The SEED is: -1550659743
>TestSequence-1
CTTAAAAATTCTGCGCCCAGACACCCAGCCCGGGTACCACTACGGCCAGG
CCTATATGTTGTTCAGGTAGGAATAACCTGGTAACAATTCGGTTATTTTG
CACCTCCCTTAGCGGCGATCGGGGAAGACCCAACCTGATACAANCTTTCC
ACCCCCACCGGGCGATACCAGGGATNCTCTGCTTAGACAATCTGGGGTAA
>TestSequence-2
TGCGACATCGATTTTGTGCTCAGCACTGAGGTCACAATCCAAACCATACA
AGCGAACCCTCGTCATTTGCTAAGGCCCTCACTTAATAAGCCAACGGCCA
CCCAGTGNGAGGCTATCGAAGATTCCAAGTCTCGAGCGNCCGGCAATATG
TAAGTGGAAGAGGACTCCGCGCATCACCCTTCGCTGGATCTTGCTCTTCG
>TestSequence-3
CCCATTGCAAAGACACTCCTCGCTGCGACTACAGGTAATTATGGACGATT
ACTTTGCAGGCTGCGCTTCGTTTAGGGGCGCTTTGCACAATTTCCCGAAN
ACTCACGCGGCTCACACGTCGGGCATCGTGCGGAAATCTACTATAACAGA
CAGCAAAACCTGATATACTACCCAGGCTCTGACCTGCACNGAGTGGCACC
Fig. 2. Running shuffle, with resulting output.
Program Design
We wrote shuffle after discovering that the "shuffleseq" program in the
EMBOSS suite was returning the
same "randomized" sequence 3 or 4 times in a row when called from a shell script! This led us to
suspect that the procedure used for getting a starting seed for the random number generator was based
on a call to a system "time" function. Indeed, this is the case: In EMBOSS 2.1.0, which we had
been using, a call to the
system's time() function, which only has a resolution to seconds, is used in the
ajRandomSeed() function.
In the latest version of EMBOSS (version 2.9.0), this has been replaced with a call to
gettimeofday() which has a resolution of microseconds. But even that is not good enough, or at the very
least should make you raise your eyebrow in an expression of suspicion and doubt,
because on today's fast servers
a shell script might call shuffleseq repeatedly within the span of a microsecond or
two.
Since we were already suspicious about the seeding procedure, we also began to wonder how far we could
trust the randomization algorithm in EMBOSS's shuffleseq. We also looked at some other
sequence shuffling programs, one which used a system-supplied rand(): system-supplied
rand() functions are historically notorious for being botched.
Since we were concerned about botched seeding and randomization in other's utilities, we decided to write our own DNA shuffling utility.
To generate random numbers, we decided to implement the ran2 routine presented in Press,
Flannery, Teukolsky, and Vetterling's Numerical Recipes in C: The Art
of Scientific Computing (NR)1. This routine uses
the random number generator of L'Ecuyer2 with Bays-Durham shuffle3
and added safeguards. Schrage's method4 is used to prevent overflows. We reimplemented
the routine as a class in C++.
How good is NR's ran2 routine? The authors say:
We think that, within the limits of its floating-point precision, ran2
provides perfect random numbers; a practical definition of "perfect" is that we will pay
$1000 to the first reader who convinces us otherwise (by finding a statistical test that ran2 fails
in a nontrivial way, excluding the ordinary limitations of a machine's floating-point representation).5
In order to obtain random starting seeds for the generator, our class
includes an entropy-gathering method which grabs entropy from either the /dev/random or
/dev/urandom devices. /dev/random is an entropy pool which collects entropy
from sources such as the interval between keystrokes. If entropy is not available,
reading from /dev/random will block until more entropy becomes available on the system.
In order to avoid this blocking behaviour, one can choose instead to read from /dev/urandom
which returns entropy when available, or a pseudo-random number when the entropy pool has been
exhausted.
In addition to providing good randomization, we have also optimized the code. Our first implementation of this program required approximately 58 seconds on our primary compute server to create 100,000 randomized shuffles of a 200 base pair sequence. Our second implementation required approximately 11 seconds to perform the same task. The current version requires just 9.6 seconds. The results that you obtain will depend on the speed of your server, the number of simulations, and the sequence length.
1. Press, Flannery, Teukolsky, & Vetterling. 1992. Numerical Recipes in C: The Art of Scientific Computing (NR). Cambridge University Press, ISBN 0-521-43108-5.
2. L'Ecuyer, P. 1988. Communications of the ACM, vol 31, pp.742-774.
3. Bays, Carter, & Durham, S.D. 1976. ACM Transactions on Mathematical Software, Vol. 2, p. 59.
4. Schrage, L. 1979. ACM Transactions on Mathematical Software, vol. 5, pp. 132-138.
5. Press et. al., NR, p. 281.
Tested Platforms
We have successfully compiled shuffle using both the older GCC g++ version 2.95, the newer
version 3.3 and
the latest version 3.4.3.
We have successfully compiled shuffle on SuSE Linux 7.3, Cygwin on Windows XP on Intel
machines, and
on Sun Solaris 8 on the Sparc architecture.
Usage
Sequence Identifier
In the current version, the sequence identifier cannot contain spaces but must be one continuous string of letters, digits, or underscore characters.
Sequences
At present, the program will only process the first sequence present in a FASTA file. You can of course run the program on as many different files as you want. The sequence must begin on the line immediately following the FASTA sequence identifier line.
Output
At present, output is sent to standard out where it can be redirected into a file as necessary.
Output Sequences
Each shuffled sequence is identified using the original sequence identifier with an
ordinal appended to the end. For example, if the original identifier was ">myoc",
the program will generate ">myoc-1", ">myoc-2", ">myoc-3", etc.
Program Options
Arguments to the program are as follows:
shuffle <file_name> <number_of_simulations> <entropy_source> <starting_ordinal>
Fig. 3. Program arguments.
... where:
-
<file_name>is the name of a file containing sequence in FASTA format. -
<number_of_simulations>indicates the number of shuffled output sequences you want. Defaults to 100. -
<entropy_source>is eitherTRUE_RANDOMif you want to use the blocking/dev/randomdevice, orPSEUDO_RANDOMif you want to use the non-blocking/dev/urandomdevice. Defaults toPSEUDO_RANDOM. -
<starting_ordinal>: Shuffled sequences are labeled starting at<starting_ordinal>plus one. Defaults to zero (0).
Download
The latest source code and documentation for shuffle can always be downloaded from http://eyegene.ophthy.med.umich.edu/shuffle/download/