SIMULATION FROM ADDITIONAL DISTRIBUTIONS SPH 5421 notes.012
SAS includes seven pseudo-random number generators which are useful for
simulating from the corresponding distributions. However, there are
many other distributions that occur frequently in applications. An example is the
Weibull distribution, a flexible univariate distribution that is often used in
survival analysis. The CDF for the Weibull distribution is
[1] F(t) = 1 - exp(-a * t^b),
where a is called the 'scale' parameter and b is the 'shape' parameter. Because the
Weibull is usually used in survival analysis, the random variable t can be thought
of as time to an event. The exponential distribution is a special case of the
Weibull distribution (when b = 1).
Note that F(t), like all values of CDFs, is between 0 and 1. It is
interpreted as the probability that a person has an event at or before time t.
Of course t >= 0.
The inverse function for F can be computed in this case. Just solve equation
[1] for t:
[2] t = [-log(1 - F(t)) / a]^(1/b).
This provides a way to generate a random survival time from the Weibull
distribution. First, generate a random probability p from the uniform U[0, 1]
distribution. Insert that value into equation [2] instead of F(t), and compute t.
For example, suppose a = 10 and b = 1.2, and the randomly generated p is
p = .350. Then
t = [-log(1 - .35) / 10]^(1/1.2) = 0.0728
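As a check on the arithmetic, the first data step below plugs p = .35 into
equation [2] and then puts the resulting t back into equation [1]; the log line
should show t = 0.0728 and recover .35. The second data step is a minimal
sketch of the sampling recipe, assuming RANUNI for the uniform draw. The data
set names, the seed and the sample size n = 5 are arbitrary choices made for
illustration.

```sas
data weibull_check ;
  a = 10 ; b = 1.2 ;                    /* values from the example above      */
  p = .35 ;
  t = (-log(1 - p) / a) ** (1 / b) ;    /* equation [2]                       */
  pcheck = 1 - exp(-a * t**b) ;         /* equation [1] applied to t          */
  put t= 8.4 pcheck= 8.4 ;              /* written to the SAS log             */
run ;

data weibull_draws ;
  a = 10 ; b = 1.2 ;
  n = 5 ;                               /* illustrative sample size           */
  seed = 5421 ;                         /* arbitrary seed                     */
  do i = 1 to n ;
    p = ranuni(seed) ;                  /* p is strictly between 0 and 1      */
    t = (-log(1 - p) / a) ** (1 / b) ;  /* equation [2] with F(t) set to p    */
    output ;
  end ;
run ;

proc print data = weibull_draws ;
  var i p t ;
run ;
```

Because RANUNI returns values strictly inside (0, 1), log(1 - p) is always
defined here.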
--------------------------------------------------------------------------------
PROBLEM 13
1. Write a program to generate random samples from the Weibull distribution.
Both a and b should be parameters in the program which can be easily
changed.
2. What is the PDF for the Weibull? Graph the PDF for the following values
of a and b:
a = 0.5, b = 0.5, 1.0, 1.5
a = 1.0, b = 0.5, 1.0, 1.5
a = 1.5, b = 0.5, 1.0, 1.5.
3. For a = 1.5 and b = 1.5, generate a sample of size 1000 from the Weibull
and graph the histogram.
--------------------------------------------------------------------------------
Some distributions have complicated expressions for their CDFs and it is
difficult or impossible to find a nice expression for the inverse of the CDF. One
such example is the beta distribution. The beta distribution is a two-parameter
family of distributions for random variables which are restricted to the interval
[0, 1]. Let B(x; a, b) be the CDF for the beta distribution, in which x is the
random variate and a and b are the parameters. In SAS, B(x; a, b) is represented by
the function PROBBETA(x, a, b), and the SAS Language manual gives an expression for
B(x; a, b) in terms of an integral.
As it happens, SAS also provides the inverse function for the beta
distribution. It is called BETAINV(p, a, b). So you can use this as explained
above for the Weibull distribution to generate random numbers from the beta
distribution.
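A minimal sketch of that idea, assuming RANUNI for the uniform draw. The
parameter values a = 2 and b = 4, the seed, the sample size and the data set
name are illustrative choices, not part of the notes.

```sas
data beta_draws ;
  a = 2 ; b = 4 ;                 /* illustrative parameter values        */
  n = 1000 ;
  seed = 5421 ;                   /* arbitrary seed                       */
  do i = 1 to n ;
    p = ranuni(seed) ;            /* uniform draw on (0, 1)               */
    x = betainv(p, a, b) ;        /* x satisfies PROBBETA(x, a, b) = p    */
    output ;
  end ;
run ;

proc means data = beta_draws n mean min max ;
  var x ;
run ;
```

The sample mean should come out near a / (a + b), which is 1/3 for these
values, and every x should lie between 0 and 1.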
But what might you do if no such inverse function were available?
Assume you are given a continuous CDF, F(x), which can be computed. Perform the
following steps:
1) Generate a random number p between 0 and 1, from the U[0, 1]
distribution. Find two numbers, w1 and w2, such that
F(w1) < p < F(w2).
2) Define w' = average of w1 and w2. Compute F(w'). There are two cases
to consider:
1. F(w1) < F(w') <= p, or
2. p < F(w') < F(w2).
In case 1., redefine w1 = w' and repeat step 2.
In case 2., redefine w2 = w' and repeat step 2.
(Do not go back to step 1: p stays fixed while the interval shrinks.)
3) Stop the process when abs(w1 - w2) < .0001. The value that you want is w'.
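This is essentially a binary search algorithm. The process is guaranteed to
stop because each pass halves the interval [w1, w2], and it closes in on the
right value because the CDF is a monotone continuous function. [Note: this does
not work for discontinuous CDFs (like the binomial).] The number of iterations
is about log2((w2 - w1) / .0001); for a starting interval of length 20, for
example, that is 18.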
Choosing w1 and w2 in the first place can be difficult. For example, if p = 0
and F(x) is the CDF for the normal distribution, the corresponding w' would be
negative infinity. It is possible, but extremely unlikely, that a uniform random
number generator will produce p = 0. If it does, you will have to skip that value
and continue on to the next one; and similarly if p = 1. If p = .0001 and F is
normal, it is sufficient to choose w1 = -5.0.
This algorithm can often be sped up considerably by the use of
linear interpolation. Suppose you want to find w such that
F(w) = p. Suppose you know that w is somewhere between w0 and w1.
Let
R = (p - F(w0)) / (F(w1) - F(w0)),
and let
w' = w0 + R * (w1 - w0).
If w0 was your initial guess, your new guess is w'. If
F(w') < p, then replace w0 by w' and repeat the process. If
p < F(w'), then replace w1 by w' and repeat the process. Stop when
F(w') is as close to p as you desire.
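A minimal sketch of the interpolation step, assuming the unit exponential CDF
F(w) = 1 - exp(-w) as a stand-in because its inverse, -log(1 - p), is known
exactly and can serve as a check. The starting bracket [0, 10], the tolerance
tol, the cap of 100 passes and all variable names are illustrative assumptions.

```sas
data interp ;
  p = .5 ;                              /* target probability                */
  w0 = 0 ; w1 = 10 ;                    /* bracket with F(w0) < p < F(w1)    */
  f0 = 1 - exp(-w0) ;
  f1 = 1 - exp(-w1) ;
  tol = 1e-6 ;                          /* stop when abs(F(wnew) - p) < tol  */
  err = 1 ;
  do step = 1 to 100 while (err >= tol) ;
    r = (p - f0) / (f1 - f0) ;
    wnew = w0 + r * (w1 - w0) ;         /* the interpolated guess            */
    fnew = 1 - exp(-wnew) ;
    if fnew < p then do ; w0 = wnew ; f0 = fnew ; end ;
    else do ; w1 = wnew ; f1 = fnew ; end ;
    err = abs(fnew - p) ;
    output ;                            /* one row per pass                  */
  end ;
  exact = -log(1 - p) ;                 /* closed-form inverse, for checking */
  put wnew= 10.6 exact= 10.6 ;
run ;

proc print data = interp ;
  var step w0 w1 wnew fnew err ;
run ;
```

The printed rows show how the bracket moves on each pass, and the log line lets
you compare the final guess with the exact value log(2). Swapping in another
CDF only requires changing the three lines that evaluate F.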
The following program generates random observations from
a bimodal normal distribution, with the first mean at mu1 = 1
and the second at mu2 = 4, sigma = .5, and probability of
being in the first 'bump' of .4. The bimodality of the
resulting distribution was verified using PROC UNIVARIATE:
=======================================================================
options linesize = 80 ;
footnote "~john-c/5421/bimodal.sas &sysdate &systime" ;
FILENAME GRAPH 'gsas.grf' ;
LIBNAME loc '' ;
OPTIONS LINESIZE = 80 MPRINT ;
GOPTIONS
RESET = GLOBAL
ROTATE = PORTRAIT
FTEXT = SWISSB
DEVICE = PSCOLOR
GACCESS = SASGASTD
GSFNAME = GRAPH
GSFMODE = REPLACE
GUNIT = PCT BORDER
CBACK = WHITE
HTITLE = 2 HTEXT = 1 ;
*===================================================================== ;
data bimodal ;
  * Mixture of N(mu1, sigma) and N(mu2, sigma) with weights p and 1 - p ;
  mu1 = 1 ; mu2 = 4 ; sigma = .5 ;
  p = .4 ;
  n = 1000 ;
  seed = 864131530 ;
  eps = .001 ;
  do i = 1 to n ;
    x = ranuni(seed) ;                /* uniform draw to be inverted        */
    t1 = -10 ; t2 = 10 ;              /* starting bracket for the search    */
    pt1 = 0 ;
    pt2 = 1 ;
    diff = t2 - t1 ;
    j = 1 ;
    do while (j < 30 and diff > eps) ;
      j = j + 1 ;
      tave = .5*(t1 + t2) ;
      /* CDF of the mixture, evaluated at the midpoint */
      ptave = p * probnorm((tave - mu1)/sigma)
            + (1 - p) * probnorm((tave - mu2)/sigma) ;
      if x < ptave then do ;          /* solution lies in the lower half    */
        t2 = tave ;
        pt2 = ptave ;
      end ;
      else do ;                       /* solution lies in the upper half    */
        t1 = tave ;
        pt1 = ptave ;
      end ;
      diff = t2 - t1 ;                /* updated on every pass              */
    end ;
    output ;
  end ;
run ;

proc univariate data = bimodal plot normal ;
  var tave ;
  title 'Simulated observations from bimodal normal' ;
run ;
Simulated observations from bimodal normal 1
18:43 Monday, October 4, 2004
Univariate Procedure
Variable=TAVE
Moments
N               1000    Sum Wgts       1000
Mean        2.793487    Sum        2793.487
Std Dev     1.556346    Variance   2.422213
Skewness    -0.34675    Kurtosis   -1.49797
USS         10223.36    CSS        2419.791
CV          55.71339    Std Mean   0.049216
T:Mean=0    56.75974    Pr>|T|       0.0001
Num ^= 0        1000    Num > 0         991
M(Sign)          491    Pr>=|M|      0.0001
Sgn Rank    249933.5    Pr>=|S|      0.0001
W:Normal    0.852066    Pr < W       0.0001

Quantiles(Def=5)
100% Max    5.575562    99%    5.039673
 75% Q3     4.100494    95%    4.659116
 50% Med    3.517227    90%    4.493561
 25% Q1     1.112671    10%    0.664368
  0% Min    -0.70862     5%      0.4562
                         1%    0.048447
Range        6.28418
Q3-Q1       2.987823
Mode        1.245728

Extremes
  Lowest     Obs      Highest     Obs
-0.70862(     35)    5.224304(    540)
-0.70007(     23)    5.269623(    537)
-0.41779(    536)    5.285492(    870)
-0.33752(    877)    5.291138(    693)
-0.16052(    311)    5.575562(     68)
Simulated observations from bimodal normal 2
18:43 Monday, October 4, 2004
Univariate Procedure
Variable=TAVE
      Histogram                                    #  Boxplot
 5.75+*                                            1     |
     .***                                         11     |
     .*****************                           85     |
     .*****************************************  201  +-----+
     .****************************************** 208  *-----*
     .*****************                           84  |     |
     .**                                           9  |  +  |
     .**                                           6  |     |
     .*********                                   44  |     |
     .******************************             148  +-----+
     .*****************************              145     |
     .**********                                  49     |
     .**                                           7     |
-0.75+*                                            2     |
      ----+----+----+----+----+----+----+----+--
      * may represent up to 5 counts
Normal Probability Plot
5.75+ +++ *
| +++ ***
| ++**********
| *********
| ******++
| *** +++
| *+++
| ++*+
| +++ **
| +++*****
| *********
| ********+
|** ++++
-0.75+*+++
+----+----+----+----+----+----+----+----+----+----+
-2 -1 0 +1 +2
========================================================================
PROBLEM 13a
1. Use the algorithm above to generate 1000 random variates from the beta
distribution with parameters a = 2, b = 4.
2. Make a histogram of your data from part 1., and superimpose on it the
PDF for Beta(x, 2, 4).
3. Use something like the method described above to solve the equation
cos(x) = x,
where x is in radians.
4. Write a program which implements the linear interpolation
algorithm. Suppose X has a beta distribution with parameters
(2, 3). Carry out comparisons of the number of steps required
for the binary search algorithm with the number required for the
linear interpolation algorithm for generating random observations
for X.
========================================================================
The same methodology as is used here for generating random observations
from a distribution given an expression for its CDF can also be used to
solve nonlinear equations of the form g(x) = c. It is again a binary search
method. It will work provided g(x) is a continuous function and monotone
increasing or monotone decreasing in some interval containing the unknown
solution.
For example: let g(x) = sin(sin(x)). Suppose you want to solve
g(x) = .5. The function g(x) is monotonic in the interval [0, pi/2], and
g(0) = 0 and g(pi/2) = 0.84147. Therefore there is some value x between
0 and pi/2 such that g(x) = .5. The following program provides the solution:
------------------------------------------------------------------------------
options linesize = 80 MPRINT ;
footnote "~john-c/5421/binarysolver.sas &sysdate &systime" ;

%let function = sin(sin(x)) ;
%let target = .5 ;

data function ;
  * Solve the equation &function = &target ;
  * Note &function is increasing in the interval [0, pi/2] ;
  pi = 4 * atan(1) ;
  xlow = 0 ;
  xhigh = pi / 2 ;
  diff = xhigh - xlow ;
  epsilon = 1e-6 ;
  do i = 1 to 100 while (diff > epsilon) ;
    x = .5 * (xlow + xhigh) ;
    y = &function ;
    if y gt &target then xhigh = x ;
    if y le &target then xlow = x ;
    diff = xhigh - xlow ;
    output ;
  end ;
run ;

proc print data = function ;
  title1 "Solution to &function = &target" ;
run ;
------------------------------------------------------------------------------
Solution to sin(sin(x)) = .5 1
18:13 Monday, October 9, 2006
Obs pi xlow xhigh diff epsilon i x y
1 3.14159 0.00000 0.78540 0.78540 .000001 1 0.78540 0.64964
2 3.14159 0.39270 0.78540 0.39270 .000001 2 0.39270 0.37341
3 3.14159 0.39270 0.58905 0.19635 .000001 3 0.58905 0.52743
4 3.14159 0.49087 0.58905 0.09817 .000001 4 0.49087 0.45413
5 3.14159 0.53996 0.58905 0.04909 .000001 5 0.53996 0.49175
6 3.14159 0.53996 0.56450 0.02454 .000001 6 0.56450 0.50984
7 3.14159 0.53996 0.55223 0.01227 .000001 7 0.55223 0.50086
8 3.14159 0.54610 0.55223 0.00614 .000001 8 0.54610 0.49632
9 3.14159 0.54917 0.55223 0.00307 .000001 9 0.54917 0.49859
10 3.14159 0.55070 0.55223 0.00153 .000001 10 0.55070 0.49973
11 3.14159 0.55070 0.55147 0.00077 .000001 11 0.55147 0.50029
12 3.14159 0.55070 0.55108 0.00038 .000001 12 0.55108 0.50001
13 3.14159 0.55089 0.55108 0.00019 .000001 13 0.55089 0.49987
14 3.14159 0.55099 0.55108 0.00010 .000001 14 0.55099 0.49994
15 3.14159 0.55103 0.55108 0.00005 .000001 15 0.55103 0.49997
16 3.14159 0.55106 0.55108 0.00002 .000001 16 0.55106 0.49999
17 3.14159 0.55106 0.55107 0.00001 .000001 17 0.55107 0.50000
18 3.14159 0.55106 0.55107 0.00001 .000001 18 0.55106 0.50000
19 3.14159 0.55107 0.55107 0.00000 .000001 19 0.55107 0.50000
20 3.14159 0.55107 0.55107 0.00000 .000001 20 0.55107 0.50000
21 3.14159 0.55107 0.55107 0.00000 .000001 21 0.55107 0.50000
~john-c/5421/binarysolver.sas 09OCT06 18:13
------------------------------------------------------------------------------
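In this case 21 steps were needed to find the solution,
approximately .55107, to within an accuracy of epsilon = .000001. Each step
halves the interval, and the starting width pi/2 = 1.5708 first falls below
epsilon after 21 halvings (1.5708 / 2**21 = 7.5e-7).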
Also note the use of macro variables &function and &target in this program.
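This particular equation also has a closed-form solution, which gives an
independent check on the binary search: since the sine function is invertible
on [0, pi/2], x = arcsin(arcsin(.5)). The short data step below is a sketch of
that check using the SAS arcsine function ARSIN; the data set and variable
names are arbitrary.

```sas
data check ;
  x = arsin(arsin(.5)) ;      /* invert the outer sine, then the inner one */
  y = sin(sin(x)) ;           /* should reproduce the target .5            */
  put x= 8.5 y= 8.5 ;         /* written to the SAS log                    */
run ;
```

The value of x written to the log should agree with the last rows of the
listing above, about .55107.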
==============================================================================
The first part of notes.012 shows how to generate pseudorandom numbers from
a distribution by inverting the CDF for that distribution and then using a uniform
random number generator. There is a theorem behind this:
Theorem. Let X be a random variable with a continuous, strictly increasing
CDF F. If U = F(X), then U has a U[0, 1] distribution.
Proof. Let G be the CDF of U. That is, G(u) = prob(U <= u), where u is
any fixed number between 0 and 1. Because F is increasing, the statement
F(X) <= u
is equivalent to the statement
X <= FINV(u)
where FINV is the function-inverse of F. Therefore
G(u) = prob(U <= u) = prob(X <= FINV(u)) = F(FINV(u)) = u.
That is, G(u) = u. By definition, this means that U has a
uniform distribution.
The simulation method uses the same argument in the other direction: if U has
a U[0, 1] distribution and X = FINV(U), then
prob(X <= x) = prob(U <= F(x)) = F(x),
so X has CDF F.
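The theorem can be checked empirically. The sketch below assumes a standard
normal X, generated with RANNOR, and applies its CDF with PROBNORM; the seed,
the sample size of 10000 and the data set name are arbitrary choices.

```sas
data pit ;
  seed = 5421 ;                   /* arbitrary seed                    */
  do i = 1 to 10000 ;
    x = rannor(seed) ;            /* X is standard normal              */
    u = probnorm(x) ;             /* U = F(X)                          */
    output ;
  end ;
run ;

proc means data = pit n mean var min max ;
  var u ;
run ;
```

If U is uniform on [0, 1], its sample mean should be near 1/2 and its sample
variance near 1/12, or about .0833.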
/home/walleye/john-c/5421/notes.012 Last update: September 23, 2009.