dieharder

Language: en

Version: Copyright 2004 Robert G. Brown (debian - 07/07/09)

Section: 1 (User commands)

NAME

dieharder - A testing and benchmarking tool for GSL random number generators

SYNOPSIS

dieharder [-a] [-b bits] [-d diehard test number] [-f filename]
          [-g generator number] [-h] [-i iterations] [-l]
          [-n ntuple] [-p number of p samples] [-q] [-N]
          [-r rgb test number] [-s sts test number]
          [-t number of test samples] [-u user test number]
          [-v verbose flag] [-x xvalue] [-y yvalue] [-z zvalue]

DieHarder OPTIONS

-a Runs all the tests with standard/default options to create a report
-b bits - sets the number of bits to be used in tests that act on a bit
string of variable length, e.g. the rgb bitdist test.
-d test number - selects specific diehard test.
-f filename - when used with -g for raw or (ascii formatted) input,
allows random numbers to be read in from a file. Note that the stdin (raw) interface does not require a -f entry, only the appropriate -g.
-g generator number - selects a specific generator for testing. A
generator number of -1 (or dieharder entered alone with no arguments) causes all known generators to be printed out to the display.
-h prints context-sensitive help -- usually Usage (this message) or a
test synopsis if entered as e.g. dieharder -d 3 -h.
-i iterations - sets iteration count for timing runs (should not be
needed).
-l lists all known tests.
-n ntuple - set ntuple length for tests on short bit strings that permit
the length to be varied (e.g. rgb bitdist).
-p count - sets the number of p-value samples per test (default 100).
-q selects quiet operation: only the final result is printed, on a single line per test (where applicable). This is a way of getting a very compact
report.
-N selects non-overlapping samples for overlapping tests where the
non-overlapping target statistic is known. Note that overlapping sample tests generally have internal covariance that is quite difficult to remove. Consequently they on average carry LESS information per sample than non-overlapping samples, whose statistics are more tightly distributed. With built-in generators, non-overlapping tests take longer and may have a different sensitivity.
-r test number - selects specific rgb test.
-s test number - selects specific sts test.
-t count - sets the number of random entities used in each test, where
possible. Be warned -- some tests have fixed sample sizes; others are variable but have practical minimum sizes. It is suggested you begin with the values used in -a and experiment carefully on a test by test basis.
-u test number - selects specific user-developed test, if you've added
one or more to this tool. dieharder provides this interface to make it easy to add your own tests.
-v verbose flag -- controls the verbosity of the output for debugging
only. Probably of little use to non-developers; developers can read the enum(s) in dieharder.h and the test sources to see which flag values turn on output in which routines. A value of 1 results in a fairly detailed trace of program activity.
-x,-y,-z number - Some tests have parameters that can safely be varied
from their default value. For example, in the diehard birthdays test, one can vary the number of birthdays drawn as well as the number of bits in the range from which they are drawn. -x 2048 -y 30 alters these two values but should still run fine.
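
To run a single test against one particular built-in generator, pick the test number from -l and the generator number from the table printed by dieharder with no arguments; both numbers below are merely illustrative and may differ on your build:


  dieharder -g 13 -d 3 -p 100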

NOTE WELL: The "bogomegarates" (number of rands generated per second) returned by this tool are BOGUS and may not be even approximately correct in your context. Also, the quality assessment(s) for the rngs may, in fact, be completely incorrect (due to errors in the tests) or misleading. Learn what p-values mean (see below)! Use them at your Own Risk! Be Warned!

DESCRIPTION

This is the current snapshot of the dieharder random number tester. It encapsulates all of the Gnu Scientific Library random number generators as well as /dev/random and /dev/urandom in a single harness that can time them and subject them to various tests for randomness. These tests are variously drawn from George Marsaglia's "Diehard battery of random number tests" and the NIST Statistical Test Suite, and include a few useful tests (e.g. binning or spectral tests) that I've found in the course of research into random number testing, that I have made up myself, or that are improvements on tests from other sources.

The primary point of this tester is to make it easy to time and test (pseudo)random number generators, OR hardware or other generators. Three examples are provided of wrapping a random number generator and inserting it so that it can be called via the GSL interface. An interface is also provided that allows random numbers to be read in from a file, so the tool can be used with any generator (wrappable or not) that can generate a table of random numbers.

Another important motivation for writing dieharder is to have the entire test code and documentation be fully Gnu Public Licensed and hence openly available for adaptation, testing, and modification, including the addition of new tests. My primary objection to diehard and sts is not that they are or are not adequate and complete; it is that their code is obscure and not explicitly published for reuse. It is my hope that by providing this tool in autodocumenting source it becomes easy to add new tests, critique older tests, and improve the suite in general.

P-VALUES AND THE NULL HYPOTHESIS

DieHarder returns p-values. To understand what a p-value is and how to use it, it is essential to understand the null hypothesis.

The null hypothesis for random number generator testing is "This generator is a perfect random number generator, and for any choice of seed produces an infinitely long, unique sequence of numbers that have all the expected statistical properties of random numbers, to all orders". Note well that we know that this hypothesis is technically false for all software generators, as they are periodic and do not have the correct entropy content for this statement to be strictly true. Many hardware generators fail a priori as well, as they contain subtle bias or correlations due to the deterministic physics that underlies them.

It can be practically true, however. Both software and hardware generators can be "random" enough that their sequences cannot be distinguished from random ones. Hence the null hypothesis as a practical, not a theoretically pure, statement.

To test it, one uses the rng in question to generate a sequence of presumably random numbers. Using these numbers one can compute any one of a wide range of test statistics -- empirically computed numbers, drawn from a known distribution, that are treated as random samples which may or may not be independent (depending on whether overlapping sequences of random numbers are used to generate successive samples while generating the statistic(s)). From a knowledge of the target distribution of the statistic(s) and the associated cumulative distribution function (CDF), and from the empirical value of the randomly generated statistic(s), one can read off the probability of obtaining the empirical result if the sequence were truly random, that is, if the null hypothesis is true! This probability is the "p-value" for the particular test run.

The usual dogma is that if this p-value is low -- typically less than 0.05 -- one "rejects" the null hypothesis. In other words, it is improbable that one would have gotten the result obtained if the generator were a good one. If it is any other value, one does not "accept" the generator as good; one merely "fails to reject" the generator as bad for this particular test. A "good random number generator" is hence one that we haven't been able to make fail yet!

This criterion is, of course, naive in the extreme and cannot be used with dieharder! It makes just as much sense to reject a generator that has p-values of 0.95 or more! Both of these p-value ranges are equally unlikely on any given test run, and each should be returned in (on average) 5% of all test runs by a perfect random number generator. A generator that fails to produce p-values less than 0.05 roughly 5% of the time it is tested with different seeds is itself a bad random number generator, one that fails the test of the null hypothesis.

p-values themselves, as it turns out, are test statistics. They should be uniformly distributed on the range 0-1. In 100 test runs with independent seeds, one should not be surprised to obtain 0, 1, 2, or even 3 p-values less than 0.01. On the other hand obtaining 7 p-values in the range 0.54-0.55 should make the generator highly suspect! One can in fact convert a set of p-values into a p-value by comparing their distribution to the expected one, using a Kolmogorov-Smirnov test, usually either Anderson-Darling or Kuiper KS (where for a variety of reasons we use Kuiper instead of Anderson-Darling although in the limit of many samples it should not matter).

The p-values obtained should in turn be uniformly distributed and can be subjected to still more KS tests. The distribution of p-values for a good generator should be idempotent, even across different test statistics and multiple runs.

A failure of the distribution of p-values at any level of aggregation signals trouble. In fact, if the p-values of any given test are subjected to a KS test, and those p-values are then subjected to a KS test, as we add more p-values to either level we will either observe idempotence of the resulting distribution of p to uniformity, or we will observe idempotence to a single p-value of zero! That is, a good generator will produce a roughly uniform distribution of p-values, in the specific sense that the p-values of the distributions of p-values are themselves roughly uniform and so on ad infinitum, while a bad generator will produce a non-uniform distribution of p-values, and as more p-values drawn from the non-uniform distribution are added to its KS test, at some point the failure will be absolutely unmistakeable as the resulting p-value approaches 0 in the limit. Trouble indeed!

The question is, trouble with what? Random number tests are themselves complex computational objects, and there is a probability that their code is incorrectly framed or that roundoff or other numerical -- not methodical -- errors are contributing to a distortion of the distribution of some of the p-values obtained. This is not an idle observation; when one works on writing random number generator testing programs, one is always testing the tests themselves with "good" (we hope) random number generators so that egregious failures of the null hypothesis signal not a bad generator but an error in the test code. The null hypothesis above is correctly framed from a theoretical point of view, but from a real and practical point of view it should read: "This generator is a perfect random number generator, and for any choice of seed produces an infinitely long, unique sequence of numbers that have all the expected statistical properties of random numbers, to all orders, and this test is a perfect test and returns precisely correct p-values from the test computation." Observed "failure" of this null hypothesis can come from failure of either or both of these disjoint components, and comes from the second as often or more often than the first during the development process. When one cranks up the "resolution" of the test (discussed next) to where a generator starts to fail some test, one realizes, or should realize, that development never ends and that new test regimes will always reveal new failures not only of the generators but of the code.

With that said, one of dieharder's most significant advantages is the control that it gives you over a critical test parameter. From the remarks above, we can see that we should feel very uncomfortable about "failing" any given random number generator on the basis of a 5%, or even a 1%, criterion, especially when we apply a test suite like dieharder that returns 76 distinct test p-values as of the last snapshot. We want failure to be unambiguous and reproducible!

To accomplish this, one can simply crank up its resolution. If we ran any given test against a random number generator and it returned a p-value of (say) 0.007328, we'd be perfectly justified in wondering if it is really a good generator. However, the probability of getting this result isn't really all that small -- when one uses dieharder for hours at a time numbers like this will definitely happen and mean nothing. If one runs the same test again (with a different seed or part of the random sequence) and gets a p-value of 0.009122, and a third time and gets 0.002669 -- well, that's three 1% (or less) shots in a row and that should happen only one in a million times. One way to clearly resolve failures, then, is to increase the number of p-values generated in a test run. If the actual distribution of p being returned by the test is not uniform, a KS test will eventually return a p-value that is not some ambiguous 0.035517 but is instead 0.000001, with the latter produced time after time as we rerun.

dieharder permits one to increase the number of p-values generated for any test, subject only to the availability of enough random numbers (for file based tests) and time, to make failures unambiguous. However, because it is a research tool it is strongly suggested that one always consider the alternative null hypothesis -- that the failure is a failure of the test code in some limit of large numbers -- and take at least some steps to ensure that the failure is in the generator and not the dieharder code.
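
As a concrete sketch of this procedure (the test number below is again only an example), one can simply repeat a suspicious run with ten times as many p-samples:


  dieharder -d 3 -p 100
  dieharder -d 3 -p 1000

If the generator is genuinely bad, the final KS p-value of the second run should collapse toward zero rather than hover at some ambiguous intermediate value; if it drifts back into the unremarkable range, the earlier low p-value was most likely just chance.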

The approach I myself take during development is to rely heavily on several "presumed good" random number generators. These are generators that we have theoretical reasons to expect are extraordinarily good and to lack correlations out to some known underlying dimensionality, and that also test out extremely well. By using several generators and not just one, one can hope that those generators have (at the very least) different correlations and should not all uniformly fail a test in the same way and with the same number of p-values. When all of these generators consistently fail a test, I tend to suspect that the problem is in the test code, not the generators. One advantage of dieharder is that it has a number of these "good generators" immediately available for comparison runs, courtesy of the Gnu Scientific Library. I use mt19937_1999, gfsr4, ranlxd2 and taus2 (as well as "true random" numbers from random.org) for this purpose, and I try to ensure that dieharder will "pass" in particular the Mersenne Twister generator at any reasonable p-value resolution out to as much as -p 1000.

Tests (such as the diehard operm5 test as of 3/27/08) that consistently fail at sufficiently high resolution are often flagged as being "suspect" -- possible failures of the alternative null hypothesis -- and their results should not be used to test random number generators pending agreement in the statistics and random number community that those tests are in fact valid and correct so that observed failures can indeed safely be attributed to a failure of the intended null hypothesis.

dieharder is community supported. Bear in mind that I (rgb) am professionally a physicist and that I have taken precisely one course in statistics and that one was in high school. That does not mean I am incompetent to be the primary keeper of the dieharder code as I am quite competent in programming and have actively studied statistics on my own to a level of at least enough competence to explain the above in a hopefully extremely clear way (something that has eluded many better-trained workers in this arena). It does, however, mean that my knowledge of statistics contains holes and that I have to work very hard to understand the theory underlying the more difficult tests. I am not proud about this and would cherish instruction and correction by real experts.

I therefore openly invite experts in statistics and programming to examine the source code of those tests and help me as I attempt to determine whether or not the tests are in fact valid or whether the failure is due to a bug in the code, and to help me fix the code or algorithms being implemented. I would like to see this test suite ultimately be validated by the general statistics community in hard use in an open environment, where every possible failure of the testing mechanism is subject to scrutiny and eventual correction. In this way we will eventually achieve a very powerful suite of tools indeed, ones that may well give us very specific information not just about failure but of the mode of failure as well, just how the sequence tested deviates from randomness.

This will be a lot of work (but perhaps work suitable for directed research by statistics undergrad or grad students who also are competent programmers). Note well -- many of the tests that use overlapping samples must deal with covariance by means of e.g. weak inverses of a covariance matrix in order to convert a vector of test statistics into a multinomial form suitable for a chi square test returning a p-value. The code for doing so is nontrivial and requires that the covariance matrix (or its suitable decomposition) be either computed in situ or entered as data.

Checking the code for overlapping sample tests in particular is not easy, in other words. The theory is difficult, the programming is difficult, and the numerics are difficult. This is why it is (in my opinion) essential that the testing tool be fully open source and why most sophisticated users of the tool will wish to work using a copy of dieharder built from source rather than "just" installed from a binary package. The alternative null hypothesis is there for all random number generator testers, but only with an open source tester integrated with some known-excellent generators to test against and compare to is it possible to consider it and take steps to validate not only the generator but the tool used to test it at the source code level. "Black box" commercial testers are to be eschewed at all costs, although it is perfectly reasonable to consider employing skilled consultants using an open product to test e.g. a new hardware or software generator for its suitability in some given application.

Thus far, dieharder has benefitted tremendously from the community. Individuals have openly contributed tests, new generators to be tested, and fixes for existing tests that were revealed by their own work with the testing instrument. Efforts are underway to make dieharder more portable so that it will build on more platforms and faster so that more thorough testing can be done. Please feel free to participate.

FILE INPUT

To test your own random number generator or to test the randomness of some file of presumably random numbers, two approaches are possible. First of all, there are five generators that are wrapped up in GSL compatible clothes and linked to the GSL so that the standard GSL interface works for them. Using these as prototypes and working with the fully GPL dieharder sources, any software or hardware random number generator that one wishes to test can be added for testing, and can likely be submitted to the GSL for inclusion if it passes the tests as well as or better than the generators that are already there. Dieharder is designed to (ultimately) be a very convenient tool for testing new software RNGs.

However, the last two non-gsl generators are "universal" generators in the sense that they permit you to input a random number stream from a file (but not from /dev/random or /dev/urandom, be warned).

The file_input generator requires a file of "cooked" (ascii readable) random numbers, one per line, with a header that describes the format to dieharder. This interface is still somewhat experimental -- not all ascii formats have been tested. However, it has been tested and should work for 32 bit unsigned integers represented directly in ascii or as 32 bits of binary. Example files for a couple of possible input formats are included in the sources.

Finally, the type file_input_raw accepts a file of raw bits as input, such as might be generated by:


 dd if=/dev/urandom of=testrands.raw bs=4 count=1000000

(to generate 1,000,000 four-byte ints directly from the software-augmented kernel entropy generator). That is, running the tests from such a file should be approximately the same as testing /dev/urandom directly.

The main (important!) difference is that some of the tests require a lot of random numbers -- far more than were needed by diehard. Indeed, dieharder typically runs many of the diehard tests 100 independent times, generating a p-value for each, and plots a histogram of the p-values and generates a p-value for the (presumed uniform) distribution of p-values! This approach mimics the histogram presented in the STS suite but augments it with a hard number.

This protects one from the "p happens" problem described by Marsaglia (and described in even more detail above) where it is remarkably common to see very low or high p-values returned from any given test of even a "perfect" random number generator, but where over time (or many test runs) a good generator will generate a uniform, idempotent distribution of p-values.

File input rands are delivered to the tests on demand, but if the test needs more than are available it simply rewinds the file and cycles through it again, and again, and again as needed. Obviously this significantly reduces the sample space and can lead to completely incorrect results for the p-value histograms unless there are enough rands to run EACH test without repetition (it is harmless to reuse the sequence for different tests). Let the user beware!
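
One crude but effective way to avoid rewinding is simply to generate a much larger file of raw rands before testing; the count below is purely illustrative, as the size actually required depends on which tests are run and on the -p and -t values chosen:


 dd if=/dev/urandom of=testrands.raw bs=4 count=100000000

At four bytes per unsigned integer this yields roughly 400 megabytes of input, which still may or may not be enough for a full -a run at high -p.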

One last warning for those who are testing files of random numbers. dieharder is a tool that tests random number generators, not files of random numbers! It is extremely inappropriate to try to "certify" a file of random numbers as being random just because it fails to "fail" any of the dieharder tests in e.g. a dieharder -a run. To put it bluntly, if one rejects all such files that fail any test at the 0.05 level (or any other), the one thing one can be certain of is that the files in question are not random, as a truly random sequence would fail any given test at the 0.05 level 5% of the time!

To put it another way, any file of numbers produced by a generator that "fails to fail" the dieharder suite should be considered "random", even if it contains sequences that might well "fail" any given test at some specific cutoff. One has to presume that, in the broader testing of the generator itself, it was determined that the p-values for the tests involved were globally correctly distributed, so that e.g. failure at the 0.01 level occurs neither more nor less than 1% of the time, on average, over many many tests. If one particular file generates a failure at this level, one can therefore safely presume that it is a random file pulled from many thousands of similar files the generator might create that have the correct distribution of p-values at all levels of testing and aggregation.

To sum up, use dieharder to validate your generator (via input from files or an embedded stream). Then by all means use your generator to produce files or streams of random numbers. Do not use dieharder as an accept/reject tool to validate the files themselves.

EXAMPLES

To demonstrate all tests, run on the default GSL rng, enter:


  dieharder -a

To demonstrate a test of an external generator of a raw binary stream of bits, use the stdin (raw) interface:


  cat /dev/urandom | dieharder -g 75 -a

(be sure to run dieharder by itself on a line to verify that the number of the stdin generator is still 75).
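
Equivalently, per the -g option described above, the full table of generator numbers can be printed with:


  dieharder -g -1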

To use it with an ascii formatted file:


  dieharder -g 65 -f testrands.txt -a

(testrands.txt should consist of a header such as:


 #==================================================================
 # generator mt19937_1999  seed = 1274511046
 #==================================================================
 type: d
 count: 100000
 numbit: 32
 3129711816
   85411969
 2545911541

etc.).

To use it with a binary file:


  dieharder -g 66 -f testrands.bin -a

(or cat testrands.bin | dieharder -g 75 -a).

PUBLICATION RULES

DieHarder is entirely original code and can be modified and used at will by any user, provided that:


  a) The original copyright notices are maintained and that the source, including all modifications, is made publicly available at the time of any derived publication. This is open source software according to the precepts and spirit of the Gnu Public License. See the accompanying file COPYING, which also must accompany any redistribution.


  b) The author of the code (Robert G. Brown) is appropriately acknowledged and referenced in any derived publication. It is strongly suggested that George Marsaglia and the Diehard suite and the various authors of the Statistical Test Suite be similarly acknowledged, although this suite shares no actual code with these random number test suites.


  c) Full responsibility for the accuracy, suitability, and effectiveness of the program rests with the users and/or modifiers. As is clearly stated in the accompanying copyright.h:

THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

ACKNOWLEDGEMENTS

The author of this suite gratefully acknowledges George Marsaglia (the author of the diehard test suite) and the various authors of NIST Special Publication 800-22 (which describes the Statistical Test Suite for testing pseudorandom number generators for cryptographic applications), for excellent descriptions of the tests therein. These descriptions enabled this suite to be developed with a clean copyright, licensable under the GPL.

The author also wishes to reiterate that the academic correctness and accuracy of the implementation of these tests is his sole responsibility and not that of the authors of the Diehard or STS suites. This is especially true where he has seen fit to modify those tests from their strict original descriptions.

COPYRIGHT

GPL 2b; see the file COPYING that accompanies the source of this program. This is the "standard Gnu General Public License version 2 or any later version", with the one minor (humorous) "Beverage" modification listed below. Note that this modification is probably not legally defensible and can be followed really pretty much according to the honor rule.

As to my personal preferences in beverages, red wine is great, beer is delightful, and Coca Cola or coffee or tea or even milk acceptable to those who for religious or personal reasons wish to avoid stressing my liver.

The Beverage Modification to the GPL:

Any satisfied user of this software shall, upon meeting the primary author(s) of this software for the first time under the appropriate circumstances, offer to buy him or her or them a beverage. This beverage may or may not be alcoholic, depending on the personal ethical and moral views of the offerer. The beverage cost need not exceed one U.S. dollar (although it certainly may at the whim of the offerer:-) and may be accepted or declined with no further obligation on the part of the offerer. It is not necessary to repeat the offer after the first meeting, but it can't hurt...