RANDOM NUMBER GENERATORS IN SAS AND R                SPH 7460 notes.001

     Statisticians make extensive use of pseudo-random number generators (PRNGs)
in simulation studies of statistical tests, estimation procedures, and design
of experiments.  The reason is that summary and sufficient statistics are often
nonlinear combinations of data from a large number of observations, and in
general there are no closed-form analytic expressions for the resulting
parameter estimates or probabilities.  Even when analytic expressions are
available, they often apply only asymptotically, and the number of observations
required for them to be acceptably close to the truth is often not known.

     SAS and R both have a number of pseudo-random number generators which
reliably produce sequences of numbers whose empirical distributions are very
close to the claimed theoretical distributions.

     Below is a program which generates 100 random numbers from each of eight
SAS random number generators: binomial, Cauchy, Poisson, exponential, gamma,
normal, uniform, and triangular.  SAS 8 and earlier versions do not have random
number generators for the negative binomial, hypergeometric, Weibull,
chi-square, beta, F, or other commonly used distributions.

     The function RAND in SAS 9.1.3+ can generate pseudorandom numbers from any one of the
following distributions: bernoulli, beta, binomial, cauchy, chisquare, erlang,
exponential, F, gamma, geometric, hypergeometric, lognormal, negbinomial,
normal, poisson, t, triangle, uniform, and weibull.

     For example: to generate a pseudorandom number x with a Weibull
distribution:

     x = rand('weibull', a, b) ;

where 

     f(x) = (a/b^a) * x^(a - 1) * exp(-(x/b)^a),

and a is the 'shape' parameter, and b is the 'scale' parameter.
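
     For comparison, R's rweibull and dweibull functions use the same
shape/scale parameterization.  The short R sketch below is only an illustrative
check (the values a = 2 and b = 3 are arbitrary choices, not from the text): it
verifies that the density formula above matches dweibull, and that simulated
draws have approximately the theoretical mean b * Gamma(1 + 1/a).

     a  <- 2                                      # shape parameter (illustrative)
     b  <- 3                                      # scale parameter (illustrative)
     xx <- seq(0.1, 10, by = 0.1)
     f1 <- (a / b^a) * xx^(a - 1) * exp(-(xx / b)^a)   # density formula above
     f2 <- dweibull(xx, shape = a, scale = b)          # R's built-in Weibull density
     max(abs(f1 - f2))                                 # should be essentially zero

     w  <- rweibull(10000, shape = a, scale = b)       # simulated Weibull draws
     mean(w)                                           # compare with the theoretical
     b * gamma(1 + 1/a)                                # mean, E(W) = b * Gamma(1 + 1/a)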

* ============================================================================== ;

*  ~john-c/randgens.sas
*  SAS random number generators.                                                 ;

OPTIONS  LINESIZE = 120 ERRORS = 1  MPRINT ;
%let n = 100 ;

footnote "&sysdate &systime ~john-c/5421/randgens.sas" ;

data rands ;

     seed = today() ;                     * seed the generators with today's date ;

     do i = 1 to &n ;

       rbinom  = ranbin(seed, 100, .3) ;  * binomial, n = 100, p = 0.3        ;
       rcauchy = rancau(seed) ;           * standard Cauchy                   ;
       rpoiss  = ranpoi(seed, 30) ;       * Poisson, mean 30                  ;
       rexpon  = ranexp(seed) ;           * exponential, mean 1               ;
       rgamma  = rangam(seed, 1) ;        * gamma, shape parameter 1          ;
       rnormal = rannor(seed) ;           * standard normal, N(0, 1)          ;
       runif   = ranuni(seed) ;           * uniform on (0, 1)                 ;
       rtri    = rantri(seed, .5) ;       * triangular on (0, 1), peak at 0.5 ;
       output ;

     end ;

run ;

proc print data = rands ;
title "Output from SAS random number generators ... &n observations."  ;
run ;

* ================================================================== ;

     Note that this program makes use of a macro variable (&n).  This
will be explained later.

     Here is a list of R functions that generate random samples from common
distributions: runif, rnorm, rpois, rbinom, rnbinom, rbeta, rchisq, rexp,
rgamma, rlogis, rt, rgeom, rhyper, rwilcox, and rweibull.  (Multivariate normal
samples require an add-on package such as mvtnorm, which provides rmvnorm;
rstab is an S-PLUS function with no base-R counterpart.)
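
     For example, each of these functions takes the number of draws as its
first argument, followed by the distribution's parameters.  The calls below are
illustrative only; the seed, sample sizes, and parameter values are arbitrary
choices, not taken from the exercises.

     set.seed(5421)                 # any seed; fixes the random sequence
     z <- rnorm(100)                # 100 draws from N(0, 1)
     u <- runif(100)                # 100 draws from Uniform(0, 1)
     k <- rpois(100, lambda = 30)   # 100 draws from Poisson with mean 30
     g <- rgamma(100, shape = 1)    # 100 draws from Gamma(shape 1), i.e. Exponential(1)
     summary(z)                     # quick numerical check of the simulated values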

     Both SAS and R have advantages in describing data.  For graphics, it
is often easier to use R than SAS.  It is a good idea to know how to use
SAS and R together.  The SAS program below generates 1000 numbers (x) from
a chi-square distribution with 3 degrees of freedom.  It also computes an
approximate pdf for a chi-square variate with 3 DF, as a numerical derivative
of the cumulative distribution function F:  xpdf = (F(t + h) - F(t)) / h for a
small step h.  It writes these data out to a SAS system file, runs
PROC UNIVARIATE to illustrate the distribution of the random variable x, and
runs PROC PLOT to show the approximate pdf for a chi-square with 3 DF.

     The second program is an R program which (1) reads the CSV output
created by the SAS program, (2) creates a histogram of the simulated
chi-square data, and (3) overlays a plot of the approximate pdf on the
histogram.  In this case the plot is a postscript file called chi3.ps.  You can
view and print this file using ghostview on an X-terminal.

* ============================================================================== ;

* This SAS program simulates 1000 points from a chi-square distribution with  ;
* 3 degrees of freedom.  It also computes an approximate pdf for a chi-square ;
* with 3 DF.  It writes out the simulated data to a SAS system file called    ;
* "examp".  The actual UNIX name of this file is "examp.ssd01".  It is in a   ;
* SAS library called 'sasplus'.                                               ;
*                                                                             ;
* The program also uses PROC UNIVARIATE to illustrate the distribution of the ;
* simulated data, and  PROC PLOT to show the approximate pdf.                 ;
*                                                                             ;

options linesize = 80 ;
footnote "~john-c/5421/csv.output.sas &sysdate &systime" ;

data rands ;

  n = 1000 ;
  seed = today() ;
  low = 0 ;
  high = +12 ;
  h = (high - low) / n ;

  df =  3 ;

  do i = 1 to n ;

     tph  = i * h + low ;                * right endpoint of the i-th interval ;
     t    = tph - h ;                    * left endpoint                       ;
     xpdf = (probchi(tph, df) - probchi(t, df)) / h ;  * approximate pdf: (F(t+h) - F(t)) / h ;
     z1 = rannor(seed) ;
     z2 = rannor(seed) ;
     z3 = rannor(seed) ;
     x  = z1 * z1 + z2 * z2 + z3 * z3 ;  * sum of squares of 3 independent N(0,1): chi-square, 3 DF ;

     output ;

   end ;

run ;

ods csv file = 'sas.output.csv' ;

proc print data = rands ;
run ;
ods csv close ;

* Note: the 'sasplus' library must be assigned with a LIBNAME statement       ;
* (not shown here) before this step will run.                                 ;

data sasplus.examp ;
     set rands ;
     keep i t x xpdf ;
run ;

proc univariate plot normal ;
     var x xpdf ;
run ;

proc plot data = rands ;
     plot xpdf * t ;
run ;

* ============================================================================== ;

# R program example ... histogram of a distribution.
#
# To run this program:
#
# 1.  First run the SAS program above (csv.output.sas).
#     The output is on a file called sas.output.csv
#
# 2.  Get into R by typing, at the Unix command prompt,
#     R
#
# 3.  Run the program below.  Note particularly the step:
#     sasoutput <- read.csv(file = "sas.output.csv", header = TRUE, sep = ",")
#
# 4.  The output is a postscript file, chi3.ps, which you can
#     examine using ghostview.
#

postscript("chi3.ps", horizontal = F, onefile = T)

par(mfrow = c(2,1), oma = c(2, 2, 2, 2))

sasoutput <- read.csv(file="sas.output.csv", head=TRUE, sep = ",")

x <- sasoutput$x
y <- sasoutput$y

hist(x, nclass = 20, xlim = c(0, 12),
     cex = .7, main = "SPH 5421 Histogram example: Chi-Square (DF = 3)")

par(new = T, xaxs = "d")

plot(t, y, axes = F, type = "l")

# rm(x, t, xpdf, nobs, y)

dev.off()

* ============================================================================== ;

PROBLEM SET 1

1.  Use the eight SAS random number generators to generate 1000 numbers
    from each of the 8 distributions in the first SAS example.

2.  Find and give examples of the use of the R random number
    generators for the same distributions as in the SAS program
    example.

3.  Use SAS/GRAPH or R to graph histograms for each of the above.

4.  Superimpose the pdfs for the random variables on each histogram.

5.  Use the random number generator for the gamma distribution
    to generate random numbers from a chi-square distribution
    with 3 degrees of freedom.  How can you test whether your
    method produces observations from the chi-square distribution
    with df = 3 ?

----------------------------------------------------------------------

It is possible to use a uniform random number generator to generate
random numbers from the standard normal N(0, 1) distribution.  The
method depends on the 'Box-Muller Transformation'.

Basically, if X1 and X2 have independent U[0, 1] distributions,
then

         Y = sqrt(-2 * log(X1)) * cos(2 * pi * X2)

has an N(0, 1) distribution.  Here 'log' is the natural logarithm,
and pi is the usual pi = 3.141592....
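
As a quick illustration, here is a minimal R sketch of the transformation
(the seed and sample size are arbitrary choices; replacing cos with sin in the
formula yields a second, independent N(0, 1) deviate from the same pair X1, X2):

     set.seed(2013)                                 # arbitrary seed
     n  <- 10000                                    # arbitrary sample size
     x1 <- runif(n)                                 # U(0, 1) draws
     x2 <- runif(n)                                 # independent U(0, 1) draws
     y  <- sqrt(-2 * log(x1)) * cos(2 * pi * x2)    # Box-Muller N(0, 1) deviates
     c(mean(y), sd(y))                              # should be close to 0 and 1
     qqnorm(y) ; qqline(y)                          # normal quantile-quantile check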

The Box-Muller transformation was first described in the following short note:

Box GEP, Muller ME.  A note on the generation of random normal deviates.
Annals of Math. Stat. 29: 610-611, 1958.

Some languages, such as BASIC, have only a uniform random number
generator.  Below is a BASIC program which uses the Box-Muller
transformation to generate a random N(0, 1) variate.

======================================================================

10 CLS : KEY OFF : SCREEN 9
20 '
30 ' This program implements the Box-Muller method of generating
40 ' normal random deviates.
50 '
60   DIM Z(640)
70   RANDOMIZE : PRINT
80   TWOPI = 8 * ATN(1)               ' 8 * ATN(1) = 2 * pi
90 '
100  PRINT "This program generates N(0, 1) random deviates, using the"
110  PRINT "Box-Muller approach, and graphs a histogram ... "
120  PRINT
130  INPUT; "Enter N : ", N     : PRINT : PRINT
140 '
150  CLS
160 '
170  C = 1
180  FOR I = 1 TO N
190 '
200    X1 = RND
210    X2 = RND
220    IF X1 = 0 THEN 200
230    W = -2 * LOG(X1)
240    Y = (SQR(W)) * COS(TWOPI * X2)
250 '
260 '  Note: y has a standard normal distribution, N(0, 1).
270 '
280    YSTAR = Y + 3.2
290    J = INT(YSTAR * 100)
300    IF YSTAR < 0 THEN J = 1
310    IF YSTAR > 6.4 THEN J = 640
320    Z(J) = Z(J) + 1
330    PSET(J, 349 - Z(J)), 15
340    IF Z(J) > 348 THEN 410
350 '
360  NEXT I
370 '
380  C = C + 1
390  IF C > 15 THEN C = 1
400  GOTO 180
410  GOTO 410

======================================================================

PROJECT ASSIGNMENT 2

1.  Write a SAS version of the BASIC program.  Don't worry about doing
    the graphics part (lines 280-340).

2.  Generate 1000 numbers using your program and create a histogram.
    Use PROC UNIVARIATE to check that the distribution is approximately
    normal.

3.  Read the Box-Muller paper and explain in your own words why the
    transformation works.

----------------------------------------------------------------------

Suppose you had access to only a uniform pseudorandom number generator.

How might you generate random numbers from chi-square, t, and F distributions?

If you find a way to do this, how might you check that it is
working correctly?

----------------------------------------------------------------------

PROJECT ASSIGNMENT 3

1.  Find a way in SAS to generate random numbers for chi-square,
    t-, and F-distributions.

2.  Use a Pearson chi-square approach to check that your proposed
    methods actually work.

3.  The 'birthday problem' refers to the fact that if about N = 30
    or more people are chosen at random, it is likely that at least two will have
    the same birthday (that is, the same day and month).  Simulate the birthday
    problem with N = 30.

    Carry out 1000 simulations, and count in how many of them at least 2 of
    the N people have the same birthday.


~john-c/5421/notes.001  Revised Sep. 2, 2013.