SIMULATION FROM ADDITIONAL DISTRIBUTIONS                 SPH 5421 notes.012

SAS includes seven pseudo-random number generators which are quite useful
for simulating from the corresponding distributions. However, there are
many other distributions that occur frequently in applications. An example
is the Weibull distribution, a flexible univariate distribution that is
often used in survival analysis. The CDF for the Weibull distribution is

[1]      F(t) = 1 - exp(-a * t^b),

where a is called the 'scale' parameter and b is the 'shape' parameter.
Because the Weibull is usually used in survival analysis, the random
variable t can be thought of as time to an event. The exponential
distribution is a special case of the Weibull distribution (when b = 1).

Note that F(t), like all values of CDFs, is between 0 and 1. It is
interpreted as the probability that a person has an event at or before
time t. Of course t >= 0.

The inverse function for F can be computed in this case. Just solve
equation [1] for t:

[2]      t = [-log(1 - F(t)) / a]^(1/b).

This provides a way to generate a random survival time from the Weibull
distribution. First, generate a random probability p from the uniform
U[0, 1] distribution. Insert that value into equation [2] in place of
F(t), and compute t. For example, suppose a = 10 and b = 1.2, and the
randomly generated p is p = .350. Then

         t = [-log(1 - .35) / 10]^(1/1.2) = 0.0728 .

--------------------------------------------------------------------------------
PROBLEM 13

1. Write a program to generate random samples from the Weibull
   distribution. Both a and b should be parameters in the program which
   can be easily changed.

2. What is the PDF for the Weibull? Graph the PDF for the following
   values of a and b:

      a = 0.5,  b = 0.5, 1.0, 1.5
      a = 1.0,  b = 0.5, 1.0, 1.5
      a = 1.5,  b = 0.5, 1.0, 1.5.

3. For a = 1.5 and b = 1.5, generate a sample of size 1000 from the
   Weibull and graph the histogram.
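As an aside, the inverse-CDF recipe above takes only a few lines of code.
The sketch below is in Python rather than SAS, purely for illustration; the
function name weibull_sample is ours, and the parameter values a = 10,
b = 1.2 match the worked example.

```python
import math
import random

def weibull_sample(a, b):
    """Draw one Weibull(a, b) variate by inverting the CDF
    F(t) = 1 - exp(-a * t**b), i.e. equation [2]."""
    p = random.random()                # uniform U[0, 1] draw
    return (-math.log(1 - p) / a) ** (1 / b)

# Reproduce the worked example: a = 10, b = 1.2, p = .350
t = (-math.log(1 - 0.35) / 10) ** (1 / 1.2)
print(round(t, 4))                     # 0.0728, as in the notes
```

The same two lines, with a and b set at the top of a DATA step, are all
that Problem 13 part 1 requires in SAS.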
--------------------------------------------------------------------------------
Some distributions have complicated expressions for their CDFs, and it is
difficult or impossible to find a nice expression for the inverse of the
CDF. One such example is the beta distribution. The beta distribution is a
two-parameter family of distributions for random variables which are
restricted to the interval [0, 1].

Let B(x; a, b) be the CDF for the beta distribution, in which x is the
random variate and a and b are the parameters. In SAS, B(x; a, b) is
represented by the function PROBBETA(x, a, b), and the SAS Language manual
gives an expression for B(x; a, b) in terms of an integral.

As it happens, SAS also provides the inverse function for the beta
distribution. It is called BETAINV(p, a, b). So you can use this as
explained above for the Weibull distribution to generate random numbers
from the beta distribution.

But what might you do if no such inverse function were available? Assume
you are given a continuous CDF, F(x), which can be computed. Perform the
following steps:

1) Generate a random number p between 0 and 1 from the U[0, 1]
   distribution. Find two numbers, w1 and w2, such that
   F(w1) < p < F(w2).

2) Define w' = average of w1 and w2. Compute F(w'). There are two cases
   to consider:

      1. F(w1) < F(w') < p, or
      2. p < F(w') < F(w2).

   In case 1., redefine w1 = w' and repeat step 2). In case 2., redefine
   w2 = w' and repeat step 2).

3) Stop the process when abs(w1 - w2) < .0001. The value that you want
   is w'.

This is essentially a binary search algorithm. The process is guaranteed
to stop because the CDF is a monotone continuous function. [Note: this
does not work for discontinuous CDFs (like the binomial).] In practice it
will usually stop in fewer than 20 iterations.

Choosing w1 and w2 in the first place can be difficult. For example, if
p = 0 and F(x) is the CDF for the normal distribution, the corresponding
w' would be negative infinity.
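The binary search in steps 1)-3) can be sketched compactly. The Python
fragment below is an illustration, not SAS: the names norm_cdf and
invert_cdf are ours, and the standard normal CDF (written with math.erf)
stands in for a CDF, like PROBBETA, whose inverse we pretend not to have.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def invert_cdf(cdf, p, w1, w2, tol=1e-4):
    """Binary search for w such that cdf(w) = p,
    given starting points with cdf(w1) < p < cdf(w2)."""
    while abs(w2 - w1) >= tol:
        w = 0.5 * (w1 + w2)
        if cdf(w) < p:
            w1 = w          # case 1: the solution lies above w'
        else:
            w2 = w          # case 2: the solution lies below w'
    return 0.5 * (w1 + w2)

# One random normal variate by the inverse-CDF method:
p = random.random()
if 0 < p < 1:               # skip the (vanishingly rare) p = 0 or p = 1
    x = invert_cdf(norm_cdf, p, -10, 10)

# Sanity check against a known quantile (97.5th percentile ~ 1.95996):
print(round(invert_cdf(norm_cdf, 0.975, -10, 10), 3))   # 1.96
```

With a starting interval of width 20 and tolerance .0001, the loop runs
about 18 times, consistent with the "fewer than 20 iterations" claim above.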
It is possible, but extremely unlikely, that a uniform random number
generator will produce p = 0. If it does, you will have to skip that value
and continue on to the next one; and similarly if p = 1. If p = .0001 and
F is normal, it is sufficient to choose w1 = -5.0.

This algorithm can be sped up considerably by the use of linear
interpolation. Suppose you want to find w such that F(w) = p, and suppose
you know that w is somewhere between w0 and w1. Let

      R = (p - F(w0)) / (F(w1) - F(w0)),

and let

      w' = w0 + R * (w1 - w0).

If w0 was your initial guess, your new guess is w'. If F(w') < p, then
replace w0 by w' and repeat the process. If p < F(w'), then replace w1 by
w' and repeat the process. Stop when F(w') is as close to p as you desire.

The following program generates random observations from a bimodal normal
distribution, with the first mean at mu1 = 1 and the second at mu2 = 4,
sigma = .5, and probability of being in the first 'bump' of .4. The
bimodality of the resulting distribution was verified using PROC
UNIVARIATE:

=======================================================================

options linesize = 80 ;
footnote "~john-c/5421/bimodal.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;
LIBNAME loc '' ;
OPTIONS LINESIZE = 80 MPRINT ;
GOPTIONS RESET = GLOBAL ROTATE = PORTRAIT FTEXT = SWISSB
         DEVICE = PSCOLOR GACCESS = SASGASTD GSFNAME = GRAPH
         GSFMODE = REPLACE GUNIT = PCT BORDER CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;

data bimodal ;

   mu1 = 1 ;
   mu2 = 4 ;
   sigma = .5 ;
   p = .4 ;
   n = 1000 ;
   seed = 864131530 ;
   eps = .001 ;

   do i = 1 to n ;

      x = ranuni(seed) ;
      t1 = -10 ;
      t2 = 10 ;
      pt1 = 0 ;
      pt2 = 1 ;
      diff = 20 ;
      j = 1 ;

      do while (j < 30 and diff > eps) ;
         j = j + 1 ;
         tave = .5*(t1 + t2) ;
         ptave = p * probnorm((tave - mu1)/sigma)
               + (1 - p) * probnorm((tave - mu2)/sigma) ;
         if pt1 < x < ptave then do ;
            t2 = tave ;
            pt2 = ptave ;
            diff = t2 - t1 ;
            goto jump1 ;
         end ;
         if ptave < x < pt2 then do ;
            t1 = tave
 ;
            pt1 = ptave ;
            diff = t2 - t1 ;
         end ;
      jump1: end ;

      output ;

   end ;

run ;

proc univariate plot normal ;
   var tave ;
title 'Simulated observations from bimodal normal' ;

=======================================================================
          Simulated observations from bimodal normal
                         18:43 Monday, October 4, 2004

                        Univariate Procedure
Variable=TAVE

                              Moments

          N              1000     Sum Wgts      1000
          Mean       2.793487     Sum       2793.487
          Std Dev    1.556346     Variance  2.422213
          Skewness   -0.34675     Kurtosis  -1.49797
          USS        10223.36     CSS       2419.791
          CV         55.71339     Std Mean  0.049216
          T:Mean=0   56.75974     Pr>|T|      0.0001
          Num ^= 0       1000     Num > 0        991
          M(Sign)         491     Pr>=|M|     0.0001
          Sgn Rank   249933.5     Pr>=|S|     0.0001
          W:Normal   0.852066     Pr < W      0.0001

                         Quantiles(Def=5)

          100% Max   5.575562     99%       5.039673
           75% Q3    4.100494     95%       4.659116
           50% Med   3.517227     90%       4.493561
           25% Q1    1.112671     10%       0.664368
            0% Min   -0.70862      5%         0.4562
                                   1%       0.048447
          Range       6.28418
          Q3-Q1      2.987823
          Mode       1.245728

                              Extremes

             Lowest     Obs     Highest     Obs
           -0.70862(    35)    5.224304(   540)
           -0.70007(    23)    5.269623(   537)
           -0.41779(   536)    5.285492(   870)
           -0.33752(   877)    5.291138(   693)
           -0.16052(   311)    5.575562(    68)

                              Histogram                       #
      5.75+*                                                  1
          .***                                               11
          .*****************                                 85
          .*****************************************        201
          .******************************************       208
          .*****************                                 84
          .**                                                 9
          .**                                                 6
          .*********                                         44
          .******************************                   148
          .*****************************                    145
          .**********                                        49
          .**                                                 7
     -0.75+*                                                  2
                    * may represent up to 5 counts

   [The accompanying box plot and normal probability plot, not reproduced
   here, likewise show two clusters of observations and a pronounced
   departure from normality.]

========================================================================

PROBLEM 13a

1. Use the algorithm above to generate 1000 random variates from the
   beta distribution with parameters a = 2, b = 4.

2.
   Make a histogram of your data from part 1., and superimpose on it the
   PDF for Beta(x, 2, 4).

3. Use something like the method described above to solve the equation
   cos(x) = x, where x is in radians.

4. Write a program which implements the linear interpolation algorithm.
   Suppose X has a beta distribution with parameters (2, 3). Compare the
   number of steps required by the binary search algorithm with the
   number required by the linear interpolation algorithm for generating
   random observations of X.

========================================================================

The same methodology that is used here for generating random observations
from a distribution, given an expression for its CDF, can also be used to
solve nonlinear equations of the form g(x) = c. It is again a binary
search method. It will work provided g(x) is a continuous function and
monotone increasing or monotone decreasing in some interval containing
the unknown solution.

For example: let g(x) = sin(sin(x)). Suppose you want to solve
g(x) = .5. The function g(x) is monotonic in the interval [0, pi/2], and
g(0) = 0 and g(pi/2) = 0.84147. Therefore there is some value x between
0 and pi/2 such that g(x) = .5.
The following program provides the solution:

------------------------------------------------------------------------------
options linesize = 80 MPRINT ;
footnote "~john-c/5421/binarysolver.sas &sysdate &systime" ;

%let function = sin(sin(x)) ;
%let target = .5 ;

data function ;

*  Solve the equation &function = &target ;
*  Note &function is increasing in the interval [0, pi/2] ;

   pi = 4 * atan(1) ;
   xlow = 0 ;
   xhigh = pi / 2 ;
   diff = xhigh - xlow ;
   epsilon = 1e-6 ;

   do i = 1 to 100 while (diff > epsilon) ;
      x = .5 * (xlow + xhigh) ;
      y = &function ;
      if y gt &target then xhigh = x ;
      if y le &target then xlow = x ;
      diff = xhigh - xlow ;
      output ;
   end ;

run ;

proc print data = function ;
title1 "Solution to &function = &target" ;
------------------------------------------------------------------------------

                    Solution to sin(sin(x)) = .5
                         18:13 Monday, October 9, 2006

 Obs      pi       xlow     xhigh      diff    epsilon    i      x         y

   1   3.14159   0.00000   0.78540   0.78540   .000001    1   0.78540   0.64964
   2   3.14159   0.39270   0.78540   0.39270   .000001    2   0.39270   0.37341
   3   3.14159   0.39270   0.58905   0.19635   .000001    3   0.58905   0.52743
   4   3.14159   0.49087   0.58905   0.09817   .000001    4   0.49087   0.45413
   5   3.14159   0.53996   0.58905   0.04909   .000001    5   0.53996   0.49175
   6   3.14159   0.53996   0.56450   0.02454   .000001    6   0.56450   0.50984
   7   3.14159   0.53996   0.55223   0.01227   .000001    7   0.55223   0.50086
   8   3.14159   0.54610   0.55223   0.00614   .000001    8   0.54610   0.49632
   9   3.14159   0.54917   0.55223   0.00307   .000001    9   0.54917   0.49859
  10   3.14159   0.55070   0.55223   0.00153   .000001   10   0.55070   0.49973
  11   3.14159   0.55070   0.55147   0.00077   .000001   11   0.55147   0.50029
  12   3.14159   0.55070   0.55108   0.00038   .000001   12   0.55108   0.50001
  13   3.14159   0.55089   0.55108   0.00019   .000001   13   0.55089   0.49987
  14   3.14159   0.55099   0.55108   0.00010   .000001   14   0.55099   0.49994
  15   3.14159   0.55103   0.55108   0.00005   .000001   15   0.55103   0.49997
  16   3.14159   0.55106   0.55108   0.00002   .000001   16   0.55106   0.49999
  17   3.14159   0.55106   0.55107   0.00001   .000001   17   0.55107   0.50000
  18   3.14159   0.55106   0.55107   0.00001   .000001   18   0.55106   0.50000
  19   3.14159   0.55107   0.55107   0.00000   .000001   19   0.55107   0.50000
  20   3.14159   0.55107   0.55107   0.00000   .000001   20   0.55107   0.50000
  21   3.14159   0.55107   0.55107   0.00000   .000001   21   0.55107   0.50000

               ~john-c/5421/binarysolver.sas   09OCT06   18:13
------------------------------------------------------------------------------

In this case 21 steps were needed to find the solution, approximately
.55107, to within an accuracy of epsilon = .000001. Also note the use of
the macro variables &function and &target in this program.

==============================================================================

The first part of notes.012 shows how to generate pseudorandom numbers
from a distribution by inverting the CDF for that distribution and then
using a uniform random number generator. There is a theorem behind this:

Theorem.  Let X be a random variable with continuous CDF F. If
          U = F(X), then U has a U[0, 1] distribution.

Proof.  Let G be the CDF of U; that is, G(U) = prob(u < U), where u is a
random value of F(X). The statement u < U is equivalent to the statement
FINV(u) < FINV(U), where FINV is the function-inverse of F. Now let
X = FINV(U) and x = FINV(u). Then

      prob(u < U) = prob(x < X) = F(X) = F(FINV(U)) = U.

That is, G(U) = U. By definition, this means that U has a U[0, 1]
distribution.

/home/walleye/john-c/5421/notes.012    Last update: September 23, 2009.
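Postscript: the theorem is easy to check numerically. The sketch below
(in Python, for illustration only; it is not part of the original notes)
draws exponential(1) variates X, whose CDF is F(x) = 1 - exp(-x),
transforms each one by its own CDF, and confirms that U = F(X) has the
moments of a U[0, 1] variable (mean 1/2, variance 1/12).

```python
import math
import random

random.seed(5421)          # seed chosen arbitrarily, for repeatability

# Draw exponential(1) variates X, then apply their own CDF: U = F(X).
n = 100_000
u = [1 - math.exp(-random.expovariate(1.0)) for _ in range(n)]

# If the theorem holds, U should behave like U[0, 1]:
mean = sum(u) / n
var = sum((v - mean) ** 2 for v in u) / n
print(round(mean, 2), round(var, 3))   # expect about 0.5 and 0.083 (= 1/12)
```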