1. (a) You have a stick which is 1 unit long. You make two marks
on the stick at points X and Y, where both X and Y have a
uniform[0, 1] distribution (the left end of the stick is at 0).
This divides the stick into three pieces of lengths a, b, and c.
Use a pseudo-random number generator to estimate the probability
that a, b, and c can form the sides of a triangle.
(b) You have a stick which is 1 unit long. You make two marks
on the stick at points X and Y, where X has a uniform[0, 1]
distribution and Y has a uniform distribution on the interval [x, 1].
This divides the stick into three pieces of lengths a, b, and c.
Again use a pseudo-random number generator to estimate the probability
that a, b, and c can form the sides of a triangle.
(c) You have a stick which is 1 unit long. You make two marks
on the stick at points X and Y, where X has a uniform[0, 1]
distribution Y has a uniform distribution on either [0, x] or on [x, 1],
depending on which of these two subintervals is longer.
This divides the stick into three pieces of lengths a, b, and c.
Again use a pseudo-random number generator to estimate the probability
that a, b, and c can form the sides of a triangle.
What do you think the true answer is for part (a)? Can you prove it?
2. The random number generator rexpon in SAS generates observations having
an exponential distribution with expectation 1.0. Use rexpon to generate
observations with an exponential distributions which have expectation 0.5
and 2.0 respectively.
3. Use the random number generator rannor in SAS to generate two random variables
X and Y such that (1) X and Y both have normal distributions, and (2) Corr(X, Y) = .5.
Generate 1000 X Y pairs and find the estimated correlation of the two variables.
1. A clinical trial is planned of two blood pressure drugs, A and B.
Each person will be randomly assigned to take one of these two
for 30 days. At the end, the systolic blood pressure (SBP) will be
be measured. Here are some relevant assumptions:
1) The standard deviation of SBP is 10 mmHg.
2) The significance level is alpha = 0.05, two-sided.
3) Null hypothesis: mean(A) = mean(B).
4) Alternative hypothesis: mean(B) - mean(A) = 2 mmHg.
5) N = 60 people in each group.
6) The test statistic will be an unpaired t-test.
Estimate the power of this clinical trial. Do this in three ways:
a. By the use of a simulation study with at least 10,000 simulated
trials.
b. By the use of a formula for power (cite your source).
c. By the use of PROC POWER in SAS or an equivalent routine in R.
2. You need to write a program for a randomization schedule for a
clinical trial with 3 groups, A, B, and C. Assume you will use
a randomized block design with block sizes 3 and 6. Assume there
are two clinics and that your program allows for at least 120
treatment assignments for each clinic.
1. (a) Write a program to compute sample size for a clinical trial
with two groups, where the endpoint is time-to-event (i.e.,
survival). The sample size computation should be based on the
the description in Biostatistical Methods, by John Lachin,
pages 409-412. A copy is included in the class website, right
after notes.004.1. The test statistic is the logrank test. Constant
exponential hazards are assumed. You can assume that the sample
sizes in the two groups will be equal. Input parameters should
include the following:
==============================================================================
* alpha = two-sided signif level
* power = 1 - beta
*
* f = Maximal follow-up time
* a = Accrual time (assuming uniform accrual)
*
* r1 = proportion having event in group 1 at time = 1
* r2 = proportion having event in group 2 at time = 1
*
==============================================================================
Output from the program should look like the following:
==============================================================================
Logrank sample size program: {program name} 27AUG07 17:26
Computation based on Biostatistical Methods, John Lachin (2000)
Two groups with exponential hazard in each group
Two-sided alpha = 0.05
Power = 0.85
Maximal follow-up time f = 2.5
Accrual time = 1.5 (uniform accrual assumed)
Expected proportion of events in Group 1 in time = 1 : 0.55
Expected proportion of events in Group 2 in time = 1 : 0.44
Expected number of events in Group 1 : 189
Expected number of events in Group 2 : 161
Proportion of patients in Group 1: 0.5
Proportion of patients in Group 2: 0.5
Hazard in Group 1: 0.799
Hazard in Group 2: 0.580
Average hazard : 0.689
Relative hazard (Group 2 relative to Group1) : 0.726
Required total sample size : 513
===============================================================================
1(b). Check that your program is giving approximately the
right values by comparing the results to those you can obtain
from PROC POWER in SAS version 9.
2. Use Newton's Method to solve the following equation for x:
u = CDF(x),
where u = .1, .3, .5, .7, .9 and CDF(x) is the cumulative
distribution function for the lognormal distribution function
with theta = 0 and lambda = 1. Note that the CDF for the lognormal
distribution is available in SAS 9.3: Look in the SAS 9.3
documentation under "SAS(R) 9.3 Functions and CALL Routines: Reference"
1. A permutation test problem: the datafile asthma.out is on the
course website. It includes three variables: group (= 1 or 2),
number of events, and follow-up time. There are two ways to
summarize the data: the average rate of events per follow-up
time, where you compute the fraction
rate = (number of events / follow-up time)
for each person. You can then compare the two groups using
either a t-test or a nonparametric test such as the Wilcoxon.
The other way of summarizing the data is to compute
rateg = (total events ) / (total follow-up time)
where this is done separately for each group. You can then
use a permutation test to compare the groups.
Do both of these analyses and compare the results. For the
permutation test, base your results on at least 1000 permutations
of the group assignments.
For this problem use the same dataset as that given for Homework #4.
1. Compute the bootstrap estimate of the variance of the median of the
rate of events, where a separate rate is computed for each person.
Do this for each group separately.
2. Let the number of bootstrap samples be 1000. Graph a histogram
of the bootstrap medians.
3. Use the estimate of variance in part 1. to estimate 95%
confidence intervals for the medians of each group and for
the difference of the medians.
4. Use the bootstrap to compute a direct estimate of the 95%
confidence intervals for the medians of each group and for
the difference of the medians.
5. Also compute 95% confidence limits for the medians of each
group using the method described in:
https://epilab.ich.ucl.ac.uk/coursematerial/statistics/non_parametric/
confidence_interval.html
1. The function f(x) is defined as
f(x) = log(x) + x.
It is defined only for x > 0.
Use Newton's method to find an approximate solution to
the equation f(x) = 0. The solution should be correct to
within 10^(-6).
2. The function G(x) is defined as
G(x, y) = (xy - 1)^2 + cos(x - y) + sin(x^2 + y^2).
Use Newton's method to find a vector (x, y) such that
dG(x, y)/dx = 0 and dG(x, y)/dy = 0. Draw a contour
plot of the function to see if your answer looks
reasonable.
3. The random variable X has the following distribution:
Pr(X = n) = (1 - r) * r^n, where n is a nonnegative integer,
that is, n = 0, 1, 2, 3, ...
If y is not a nonnegative integer, Pr(X = y) = 0.
You don't know what r is, but it must be between 0 and 1.
Assume you have the following observations for X:
1, 2, 0, 3, 5, 2, 3, 8.
a) Find the maximum likelihood estimate of r.
b) Verify that the solution provides a maximum of the (log)likelihood,
not a minimum.
c) How would you generate random observations from this distribution?
d) Generate 100 pseudo-random observations from this distribution
for r = 0.2 and find the mean, variance, and standard deviation of these
values.
1. Problem 18A, notes.019: Do this using proc nlp and a similar routine in R. 2. Problem 19.1, notes.020: Do this using both proc iml, and proc nlp. 3. Problem 19.2, notes.020: Do this also using both proc iml and proc nlp.
Problem 20, notes.021. Use SAS PROC IML.
Problem 21, notes.021
Problem 21a, notes.021
See notes.021a. Assume the model
S2FEVPOS = b0 + b1*age + b2*bodymass + error, E(error) = 0, Var(error) = w^2.
Define two other variables U and V, linear combinations of age and bodymass, such that:
(1) U and V are uncorrelated
(2) The regression of S2FEVPOS on U and V has the same sum of squared residuals
as the regression of S2FEVPOS on age and bodymass
(3) The regression sum of squares for the model in part (2) is the sum of the
regression sum of squares for the regression of S2FEVPOS on U and the regression
sum of squares for the regression of S2FEVPOS on V.
This problem should be done using both R and SAS PROC IML.
1. Let A be the matrix {5 2, 2 8}. Let V be the vector {1, 0}.
Perform the following steps:
1. V = (1/w) * V where w = sqrt(V`*V).
2. V = A*V.
Repeat these steps 50 times. What is the result?
2. You have two datasets, one of which has the heights and weights
of some men, and the other of which has the heights and weights
of some women. Write a macro which does the following:
1. Computes the body mass index, BMI, for each person in the two datasets.
BMI = weight / (height * height), where weight is in kilograms
and height is in meters.
(If the body mass index is already nonblank on the file,
the above computation is skipped.)
2. For each man in dataset 1, computes the number of women
in dataset 2 who have a smaller BMI than that man. The percent
of women who have a smaller BMI than that man is also computed.
3. Compute the average of the percents for the men in 2.
4. Print the results in a table.
Apply this macro to the Lung Health Study dataset, lhs.data.
Note that the variable 'gender' on that file is coded as 0 = men,
1 = women. Also note that BMI is already computed there,
with the label 'bodymass'.
Web address: http://www.biostat.umn.edu/~john-c/assign7460.f2013.html Most recent update: November 29, 2013.