SIMULATION FROM ADDITIONAL DISTRIBUTIONS                 SPH 5421 notes.012

SAS includes seven pseudo-random number generators which are quite useful
for simulating from the corresponding distributions. However, there are
many other distributions that occur frequently in applications. An example
is the Weibull distribution, a flexible univariate distribution that is
often used in survival analysis. The CDF for the Weibull distribution is

[1]      F(t) = 1 - exp(-a * t^b),

where a is called the 'scale' parameter and b is the 'shape' parameter.
Because the Weibull is usually used in survival analysis, the random
variable t can be thought of as time to an event. The exponential
distribution is a special case of the Weibull distribution (when b = 1).

Note that F(t), like all values of CDFs, is between 0 and 1. It is
interpreted as the probability that a person has an event at or before
time t. Of course t >= 0.

The inverse function for F can be computed in this case. Just solve
equation [1] for t:

[2]      t = [-log(1 - F(t)) / a]^(1/b).

This provides a way to generate a random survival time from the Weibull
distribution. First, generate a random probability p from the uniform
U[0, 1] distribution. Insert that value into equation [2] in place of
F(t), and compute t. For example, suppose a = 10 and b = 1.2, and the
randomly generated p is p = .350. Then

         t = [-log(1 - .35) / 10]^(1/1.2) = 0.0728 .

--------------------------------------------------------------------------------
PROBLEM 13

1. Write a program to generate random samples from the Weibull
   distribution. Both a and b should be parameters in the program which
   can be easily changed.

2. What is the PDF for the Weibull? Graph the PDF for the following
   values of a and b:

      a = 0.5,  b = 0.5, 1.0, 1.5
      a = 1.0,  b = 0.5, 1.0, 1.5
      a = 1.5,  b = 0.5, 1.0, 1.5.

3. For a = 1.5 and b = 1.5, generate a sample of size 1000 from the
   Weibull and graph the histogram.
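As an aside, the inverse-CDF recipe above takes only a few lines of code.
The sketch below is in Python rather than SAS, purely for illustration; the
function name weibull_sample is ours, and the parameter values a = 10,
b = 1.2 match the worked example.

```python
import math
import random

def weibull_sample(a, b):
    """Draw one Weibull(a, b) variate by inverting the CDF
    F(t) = 1 - exp(-a * t**b), i.e. equation [2]."""
    p = random.random()                # uniform U[0, 1] draw
    return (-math.log(1 - p) / a) ** (1 / b)

# Reproduce the worked example: a = 10, b = 1.2, p = .350
t = (-math.log(1 - 0.35) / 10) ** (1 / 1.2)
print(round(t, 4))                     # 0.0728, as in the notes
```

The same two lines, with a and b set at the top of a DATA step, are all
that Problem 13 part 1 requires in SAS.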
--------------------------------------------------------------------------------
Some distributions have complicated expressions for their CDFs, and it is
difficult or impossible to find a nice expression for the inverse of the
CDF. One such example is the beta distribution. The beta distribution is a
two-parameter family of distributions for random variables which are
restricted to the interval [0, 1].

Let B(x; a, b) be the CDF for the beta distribution, in which x is the
random variate and a and b are the parameters. In SAS, B(x; a, b) is
represented by the function PROBBETA(x, a, b), and the SAS Language manual
gives an expression for B(x; a, b) in terms of an integral.

As it happens, SAS also provides the inverse function for the beta
distribution. It is called BETAINV(p, a, b). So you can use this as
explained above for the Weibull distribution to generate random numbers
from the beta distribution.

But what might you do if no such inverse function were available? Assume
you are given a continuous CDF, F(x), which can be computed. Perform the
following steps:

1) Generate a random number p between 0 and 1 from the U[0, 1]
   distribution. Find two numbers, w1 and w2, such that
   F(w1) < p < F(w2).

2) Define w' = average of w1 and w2. Compute F(w'). There are two cases
   to consider:

      1. F(w1) < F(w') < p, or
      2. p < F(w') < F(w2).

   In case 1., redefine w1 = w' and repeat step 2). In case 2., redefine
   w2 = w' and repeat step 2).

3) Stop the process when abs(w1 - w2) < .0001. The value that you want
   is w'.

This is essentially a binary search algorithm. The process is guaranteed
to stop because the CDF is a monotone continuous function. [Note: this
does not work for discontinuous CDFs (like the binomial).] In practice it
will usually stop in fewer than 20 iterations.

Choosing w1 and w2 in the first place can be difficult. For example, if
p = 0 and F(x) is the CDF for the normal distribution, the corresponding
w' would be negative infinity.
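The binary search in steps 1)-3) can be sketched compactly. The Python
fragment below is an illustration, not SAS: the names norm_cdf and
invert_cdf are ours, and the standard normal CDF (written with math.erf)
stands in for a CDF, like PROBBETA, whose inverse we pretend not to have.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def invert_cdf(cdf, p, w1, w2, tol=1e-4):
    """Binary search for w such that cdf(w) = p,
    given starting points with cdf(w1) < p < cdf(w2)."""
    while abs(w2 - w1) >= tol:
        w = 0.5 * (w1 + w2)
        if cdf(w) < p:
            w1 = w          # case 1: the solution lies above w'
        else:
            w2 = w          # case 2: the solution lies below w'
    return 0.5 * (w1 + w2)

# One random normal variate by the inverse-CDF method:
p = random.random()
if 0 < p < 1:               # skip the (vanishingly rare) p = 0 or p = 1
    x = invert_cdf(norm_cdf, p, -10, 10)

# Sanity check against a known quantile (97.5th percentile ~ 1.95996):
print(round(invert_cdf(norm_cdf, 0.975, -10, 10), 3))   # 1.96
```

With a starting interval of width 20 and tolerance .0001, the loop runs
about 18 times, consistent with the "fewer than 20 iterations" claim above.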
It is possible, but extremely unlikely, that a uniform random number
generator will produce p = 0. If it does, you will have to skip that value
and continue on to the next one; and similarly if p = 1. If p = .0001 and
F is normal, it is sufficient to choose w1 = -5.0.

This algorithm can be sped up considerably by the use of linear
interpolation. Suppose you want to find w such that F(w) = p, and suppose
you know that w is somewhere between w0 and w1. Let

      R = (p - F(w0)) / (F(w1) - F(w0)),

and let

      w' = w0 + R * (w1 - w0).

If w0 was your initial guess, your new guess is w'. If F(w') < p, then
replace w0 by w' and repeat the process. If p < F(w'), then replace w1 by
w' and repeat the process. Stop when F(w') is as close to p as you desire.

The following program generates random observations from a bimodal normal
distribution, with the first mean at mu1 = 1 and the second at mu2 = 4,
sigma = .5, and probability of being in the first 'bump' of .4. The
bimodality of the resulting distribution was verified using PROC
UNIVARIATE:

=======================================================================

options linesize = 80 ;
footnote "~john-c/5421/bimodal.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;
LIBNAME loc '' ;
OPTIONS LINESIZE = 80 MPRINT ;
GOPTIONS RESET = GLOBAL ROTATE = PORTRAIT FTEXT = SWISSB
         DEVICE = PSCOLOR GACCESS = SASGASTD GSFNAME = GRAPH
         GSFMODE = REPLACE GUNIT = PCT BORDER CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;

data bimodal ;

   mu1 = 1 ;
   mu2 = 4 ;
   sigma = .5 ;
   p = .4 ;
   n = 1000 ;
   seed = 864131530 ;
   eps = .001 ;

   do i = 1 to n ;

      x = ranuni(seed) ;
      t1 = -10 ;
      t2 = 10 ;
      pt1 = 0 ;
      pt2 = 1 ;
      diff = 20 ;
      j = 1 ;

      do while (j < 30 and diff > eps) ;
         j = j + 1 ;
         tave = .5*(t1 + t2) ;
         ptave = p * probnorm((tave - mu1)/sigma)
               + (1 - p) * probnorm((tave - mu2)/sigma) ;
         if pt1 < x < ptave then do ;
            t2 = tave ;
            pt2 = ptave ;
            diff = t2 - t1 ;
            goto jump1 ;
         end ;
         if ptave < x < pt2 then do ;
            t1 = tave
 ;
            pt1 = ptave ;
            diff = t2 - t1 ;
         end ;
      jump1: end ;

      output ;

   end ;

run ;

proc univariate plot normal ;
   var tave ;
title 'Simulated observations from bimodal normal' ;

=======================================================================
          Simulated observations from bimodal normal
                         18:43 Monday, October 4, 2004

                        Univariate Procedure
Variable=TAVE

                              Moments

          N              1000     Sum Wgts      1000
          Mean       2.793487     Sum       2793.487
          Std Dev    1.556346     Variance  2.422213
          Skewness   -0.34675     Kurtosis  -1.49797
          USS        10223.36     CSS       2419.791
          CV         55.71339     Std Mean  0.049216
          T:Mean=0   56.75974     Pr>|T|      0.0001
          Num ^= 0       1000     Num > 0        991
          M(Sign)         491     Pr>=|M|     0.0001
          Sgn Rank   249933.5     Pr>=|S|     0.0001
          W:Normal   0.852066     Pr < W      0.0001

                         Quantiles(Def=5)

          100% Max   5.575562     99%       5.039673
           75% Q3    4.100494     95%       4.659116
           50% Med   3.517227     90%       4.493561
           25% Q1    1.112671     10%       0.664368
            0% Min   -0.70862      5%         0.4562
                                   1%       0.048447
          Range       6.28418
          Q3-Q1      2.987823
          Mode       1.245728

                              Extremes

             Lowest     Obs     Highest     Obs
           -0.70862(    35)    5.224304(   540)
           -0.70007(    23)    5.269623(   537)
           -0.41779(   536)    5.285492(   870)
           -0.33752(   877)    5.291138(   693)
           -0.16052(   311)    5.575562(    68)

                              Histogram                       #
      5.75+*                                                  1
          .***                                               11
          .*****************                                 85
          .*****************************************        201
          .******************************************       208
          .*****************                                 84
          .**                                                 9
          .**                                                 6
          .*********                                         44
          .******************************                   148
          .*****************************                    145
          .**********                                        49
          .**                                                 7
     -0.75+*                                                  2
                    * may represent up to 5 counts

   [The accompanying box plot and normal probability plot, not reproduced
   here, likewise show two clusters of observations and a pronounced
   departure from normality.]

========================================================================

PROBLEM 13a

1. Use the algorithm above to generate 1000 random variates from the
   beta distribution with parameters a = 2, b = 4.

2.
   Make a histogram of your data from part 1., and superimpose on it the
   PDF for Beta(x, 2, 4).

3. Use something like the method described above to solve the equation
   cos(x) = x, where x is in radians.

4. Write a program which implements the linear interpolation algorithm.
   Suppose X has a beta distribution with parameters (2, 3). Compare the
   number of steps required by the binary search algorithm with the
   number required by the linear interpolation algorithm for generating
   random observations of X.

========================================================================

The same methodology that is used here for generating random observations
from a distribution, given an expression for its CDF, can also be used to
solve nonlinear equations of the form g(x) = c. It is again a binary
search method. It will work provided g(x) is a continuous function and
monotone increasing or monotone decreasing in some interval containing
the unknown solution.

For example: let g(x) = sin(sin(x)). Suppose you want to solve
g(x) = .5. The function g(x) is monotonic in the interval [0, pi/2], and
g(0) = 0 and g(pi/2) = 0.84147. Therefore there is some value x between
0 and pi/2 such that g(x) = .5.
The following program provides the solution:

------------------------------------------------------------------------------
options linesize = 80 MPRINT ;
footnote "~john-c/5421/binarysolver.sas &sysdate &systime" ;

%let function = sin(sin(x)) ;
%let target = .5 ;

data function ;

*  Solve the equation &function = &target ;
*  Note &function is increasing in the interval [0, pi/2] ;

   pi = 4 * atan(1) ;
   xlow = 0 ;
   xhigh = pi / 2 ;
   diff = xhigh - xlow ;
   epsilon = 1e-6 ;

   do i = 1 to 100 while (diff > epsilon) ;
      x = .5 * (xlow + xhigh) ;
      y = &function ;
      if y gt &target then xhigh = x ;
      if y le &target then xlow = x ;
      diff = xhigh - xlow ;
      output ;
   end ;

run ;

proc print data = function ;
title1 "Solution to &function = &target" ;
------------------------------------------------------------------------------

                    Solution to sin(sin(x)) = .5
                         18:13 Monday, October 9, 2006

 Obs      pi       xlow     xhigh      diff    epsilon    i      x         y

   1   3.14159   0.00000   0.78540   0.78540   .000001    1   0.78540   0.64964
   2   3.14159   0.39270   0.78540   0.39270   .000001    2   0.39270   0.37341
   3   3.14159   0.39270   0.58905   0.19635   .000001    3   0.58905   0.52743
   4   3.14159   0.49087   0.58905   0.09817   .000001    4   0.49087   0.45413
   5   3.14159   0.53996   0.58905   0.04909   .000001    5   0.53996   0.49175
   6   3.14159   0.53996   0.56450   0.02454   .000001    6   0.56450   0.50984
   7   3.14159   0.53996   0.55223   0.01227   .000001    7   0.55223   0.50086
   8   3.14159   0.54610   0.55223   0.00614   .000001    8   0.54610   0.49632
   9   3.14159   0.54917   0.55223   0.00307   .000001    9   0.54917   0.49859
  10   3.14159   0.55070   0.55223   0.00153   .000001   10   0.55070   0.49973
  11   3.14159   0.55070   0.55147   0.00077   .000001   11   0.55147   0.50029
  12   3.14159   0.55070   0.55108   0.00038   .000001   12   0.55108   0.50001
  13   3.14159   0.55089   0.55108   0.00019   .000001   13   0.55089   0.49987
  14   3.14159   0.55099   0.55108   0.00010   .000001   14   0.55099   0.49994
  15   3.14159   0.55103   0.55108   0.00005   .000001   15   0.55103   0.49997
  16   3.14159   0.55106   0.55108   0.00002   .000001   16   0.55106   0.49999
  17   3.14159   0.55106   0.55107   0.00001   .000001   17   0.55107   0.50000
  18   3.14159   0.55106   0.55107   0.00001   .000001   18   0.55106   0.50000
  19   3.14159   0.55107   0.55107   0.00000   .000001   19   0.55107   0.50000
  20   3.14159   0.55107   0.55107   0.00000   .000001   20   0.55107   0.50000
  21   3.14159   0.55107   0.55107   0.00000   .000001   21   0.55107   0.50000

               ~john-c/5421/binarysolver.sas   09OCT06   18:13
------------------------------------------------------------------------------

In this case 21 steps were needed to find the solution, approximately
.55107, to within an accuracy of epsilon = .000001. Also note the use of
the macro variables &function and &target in this program.

==============================================================================

The first part of notes.012 shows how to generate pseudorandom numbers
from a distribution by inverting the CDF for that distribution and then
using a uniform random number generator. There is a theorem behind this:

Theorem.  Let X be a random variable with continuous CDF F. If
          U = F(X), then U has a U[0, 1] distribution.

Proof.  Let G be the CDF of U; that is, G(U) = prob(u < U), where u is a
random value of F(X). The statement u < U is equivalent to the statement
FINV(u) < FINV(U), where FINV is the function-inverse of F. Now let
X = FINV(U) and x = FINV(u). Then

      prob(u < U) = prob(x < X) = F(X) = F(FINV(U)) = U.

That is, G(U) = U. By definition, this means that U has a U[0, 1]
distribution.

/home/walleye/john-c/5421/notes.012    Last update: September 23, 2009.
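Postscript: the theorem is easy to check numerically. The sketch below
(in Python, for illustration only; it is not part of the original notes)
draws exponential(1) variates X, whose CDF is F(x) = 1 - exp(-x),
transforms each one by its own CDF, and confirms that U = F(X) has the
moments of a U[0, 1] variable (mean 1/2, variance 1/12).

```python
import math
import random

random.seed(5421)          # seed chosen arbitrarily, for repeatability

# Draw exponential(1) variates X, then apply their own CDF: U = F(X).
n = 100_000
u = [1 - math.exp(-random.expovariate(1.0)) for _ in range(n)]

# If the theorem holds, U should behave like U[0, 1]:
mean = sum(u) / n
var = sum((v - mean) ** 2 for v in u) / n
print(round(mean, 2), round(var, 3))   # expect about 0.5 and 0.083 (= 1/12)
```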