SPH 5421     First Exam     October 26, 2004     page 1 of 5

SOLUTION KEY

Name: _________________________________________
=====================================================================================

1.  Mr. Smith goes to a gambling casino and plays the slot machine. Every time
    he pulls the lever on the slot machine, he has a probability p = .01 of
    winning. Let N be the number of times Mr. Smith pulls the lever until he
    finally wins, counting the winning pull. Assume that the outcomes of the
    pulls are independent.

1.1 What is the probability that N = 3 ? What is the most probable value of N ?

    Prob(N = 3) = .99 * .99 * .01 = .009801                                  {8}

    Most probable: N = 1, with prob(N = 1) = .01. Each additional losing pull
    multiplies the probability by .99, so prob(N = k) decreases as k increases.

1.2 Write a SAS program which produces 1000 simulated random values of N and
    also computes an estimated mean and standard deviation of N.            {12}

    data simslot ;
      m = 1000 ;
      p = .01 ;
      nsum = 0 ;
      nsum2 = 0 ;
      do i = 1 to m ;
        lose = 1 ;
        n = 0 ;
        /* pull the lever until the first win */
        do while (lose eq 1) ;
          n = n + 1 ;
          r = ranuni(-1) ;
          if r lt p then lose = 0 ;
        end ;
        nsum = nsum + n ;
        nsum2 = nsum2 + n*n ;
        output ;
      end ;
      /* summary row: i = m + 1 here, and n still holds the last simulated value */
      nave = nsum / m ;
      nvar = (nsum2 - nsum*nsum/m)/(m - 1) ;
      nsdev = sqrt(nvar) ;
      output ;
    run ;

    proc print data = simslot ;
      where i ge 1000 ;
      var i n nave nvar nsdev ;
    run ;

    /* exclude the summary row so that the last n is not counted twice */
    proc means n mean var stddev data = simslot ;
      where i le 1000 ;
      var n ;
    run ;

2.  Values of weight, blood pressure, and cholesterol are on three different
    data files for a set of 10 people. Here are the three data files:

    Data File 1                Data File 2              Data File 3
    ------------------------   ----------------------   ----------------------
    ID     Date     Bldpress   ID     Date     Weight   ID     Date     Chol
    ----   ------   --------   ----   ------   ------   ----   ------   ----
    0001   040228      88      0004   030401    144     0004   030824    225
    0002   040301       .      0010   030531    207     0012   030829    256
    0003   040327     102      0001   030605    208     0007   030829    169
    0004   040415      92      0006   040714    141     0006   031011    196
    0005   040704      66      0005   040722    130     0005   031017    289
    0006   040901      68      0009   040819     95     0011   031105    144
    0007   041012      70      0003   040820      .     0003   031111      .
    0008   041126     104      0007   040909    130     0008   031207    121
    0009   041225      94      0008   041016    125     0009   031217    361
    0010   041231      80      0002   041030    144     0001   031224    324

2.1 Write a program which produces a file that has blood pressure, weight, and
    cholesterol for a given ID all on the same line in the file.            {12}

    data file1 ;
      infile 'file1' ;
      input id date1 bldpress ;
    run ;

    data file2 ;
      infile 'file2' ;
      input id date2 weight ;
    run ;

    data file3 ;
      infile 'file3' ;
      input id date3 chol ;
    run ;

    /* each file must be sorted by id before the merge */
    proc sort data = file1 ; by id ; run ;
    proc sort data = file2 ; by id ; run ;
    proc sort data = file3 ; by id ; run ;

    data allfiles ;
      merge file1 file2 file3 ;
      by id ;
    run ;

2.2 How many lines will the new file include ? Explain.                      {5}

    12. Files 1 and 2 both contain IDs 0001 to 0010, and file 3 has 2 IDs
    (0011 and 0012) that the other two files do not have. The merge produces
    one line for each distinct ID.

2.3 Show the first three lines of the output file.                           {6}

    obs   ID   date1    date2    date3    bldpress   weight   chol
    ---   --   ------   ------   ------   --------   ------   ----
      1    1   040228   030605   031224       88       208     324
      2    2   040301   041030        .        .       144       .
      3    3   040327   040820   031111      102         .       .

3.  Randomization in clinical trials is sometimes done in such a way as to
    achieve PROBABLE balance between the two treatment groups A and B without
    using permuted blocks.

3.1 After the i-th treatment assignment, let S(i) be the proportion who are
    assigned to treatment group A. We specify that the probability that the
    (i + 1)-st treatment assignment is to group A is

        pA = .1 * S(i) + .9 * (1 - S(i)).

    (Note S(0) is defined to be 0.5. The probability of assignment to B is
    1 - pA.)
    Suppose the first 9 treatment assignments are:  B B A B B B A B B.
    What is the probability that the 10th person will be assigned to group A ?

    Two of the first 9 assignments are to A, so S(9) = 2/9 and

        pA = .1 * (2/9) + .9 * (7/9) = .0222 + .7000 = .7222.                {5}

3.2 Write a randomization program which produces 1000 randomized treatment
    assignments with the probabilities of assignment as specified in 3.1.   {12}

    data probrand ;
      n = 1000 ;
      /* first assignment: S(0) = 0.5, so pA = 0.5 */
      i = 1 ;
      m = 0 ;
      r = ranuni(-1) ;
      if r < .5 then m = 1 ;
      assign = m ;
      s = assign ;
      output ;
      do i = 2 to n ;
        assign = 0 ;
        r = ranuni(-1) ;
        pA = .1 * s + .9 * (1 - s) ;
        if r < pA then do ;
          assign = 1 ;
          m = m + 1 ;
        end ;
        /* update the proportion assigned to A after every assignment, A or B */
        s = m / i ;
        output ;
      end ;
    run ;

3.3 What is an advantage of this kind of randomization schedule over a
    permuted-blocks randomization schedule? What is a disadvantage ?         {9}

    Advantages:
      1. The more out of balance you are, the more likely you are to return
         to balance.
      2. The schedule is not completely predictable at any point.

    Disadvantages:
      1. It does not absolutely prevent a severe imbalance.
      2. It does not prevent long runs of the same assignment.
      3. If the formula for pA is known, you can compute exactly the
         probability that the next assignment will be A.

3.4 Suppose the formula for pA is changed to

        pA = .4 * S(i) + .6 * (1 - S(i)).

    What kind of difference will that make in the randomization schedule?
    (Hint: repeat the computation in 3.1 using this modified formula for pA.)

        pA = .4 * (2/9) + .6 * (7/9) = .08889 + .46667 = .55556.             {9}

    In general, it will result in a schedule which is closer to being
    completely random at any point. Long runs are more likely, and
    imbalances are more likely.

4.
    In simple linear regression, the model is Y = b0 + b1*X + e, where
    expectation(e) = 0 and variance(e) = sigma^2. An unbiased estimate of
    sigma^2 can be found by computing the sum of squared residuals and
    dividing by (n - 2), where n is the number of observations. The residual
    for a given observation is the difference between the observed value of Y
    corresponding to a given value of X and the predicted value based on the
    least-squares estimates of the slope and intercept.

    Write a program which performs this computation. You can assume that the
    input data file has values of X and Y for each of n = 100 observations.
                                                                            {22}

    /* regress keeps the raw data; stats keeps only the estimates, so that   */
    /* the merge below does not overwrite x and y with missing values        */
    data regress (keep = nobs x y)
         stats   (keep = nobs kobs slope intcpt) ;
      retain xsum 0 ysum 0 xysum 0 x2sum 0 nobs 0 ;
      infile 'xy' eof = finish ;
      input x y ;
      /* accumulate sums over complete observations only */
      if x ne . and y ne . then do ;
        nobs = nobs + 1 ;
        xsum = xsum + x ;
        ysum = ysum + y ;
        x2sum = x2sum + x * x ;
        xysum = xysum + x * y ;
        output regress ;
      end ;
      return ;
    finish:
      /* least-squares slope and intercept */
      top = xysum - xsum * ysum / nobs ;
      bot = x2sum - xsum * xsum / nobs ;
      slope = top / bot ;
      yave = ysum / nobs ;
      xave = xsum / nobs ;
      intcpt = yave - slope * xave ;
      kobs = nobs ;
      /* one copy of the estimates per observation, for the merge below */
      do nobs = 1 to kobs ;
        output stats ;
      end ;
      stop ;
    run ;

    data regress ;
      merge regress stats ;
      by nobs ;
      retain sumres2 0 ;
      predy = intcpt + slope * x ;
      resid = y - predy ;
      /* running sum of squared residuals; complete on the last observation */
      sumres2 = sumres2 + resid**2 ;
      s2 = sumres2 / (kobs - 2) ;
    run ;

    proc print data = regress ;
      where nobs eq kobs ;
      var s2 ;
    run ;
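
    As a cross-check on the hand computation, the following is a minimal
    sketch, assuming the same input file 'xy' with X and Y on each line. PROC
    REG fits the same model by least squares. In its analysis-of-variance
    table, the Mean Square in the Error row is the sum of squared residuals
    divided by the error degrees of freedom, which is n - 2 for this model, so
    it should agree with the value of s2 printed by the program. The data set
    name xycheck is chosen here only for illustration.

```sas
/* read the same x, y pairs into a separate data set (name is illustrative) */
data xycheck ;
  infile 'xy' ;
  input x y ;
run ;

/* fit Y = b0 + b1*X; the Error row's Mean Square estimates sigma^2 */
proc reg data = xycheck ;
  model y = x ;
run ;
quit ;
```

    Root MSE in the same output is the square root of that Mean Square, so it
    estimates sigma rather than sigma^2.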