PubH 7460 homework

PubH 7460 Biostatistical Computing - Fall 2013

Main Page: http://www.biostat.umn.edu/~john-c/ph7460.f2006.html

Homework #1 - due September 10, 2013.

1.  (a) You have a stick which is 1 unit long.  You make two marks
        on the stick at points X and Y, where both X and Y have a
        uniform[0, 1] distribution (the left end of the stick is at 0).
        This divides the stick into three pieces of lengths a, b, and c.
        Use a pseudo-random number generator to estimate the probability
        that a, b, and c can form the sides of a triangle.

    (b) You have a stick which is 1 unit long.  You make two marks
        on the stick at points X and Y, where X has a uniform[0, 1]
        distribution and Y has a uniform distribution on the interval [x, 1].
        This divides the stick into three pieces of lengths a, b, and c.
        Again use a pseudo-random number generator to estimate the probability
        that a, b, and c can form the sides of a triangle.

    (c) You have a stick which is 1 unit long.  You make two marks
        on the stick at points X and Y, where X has a uniform[0, 1]
        distribution Y has a uniform distribution on either [0, x] or on [x, 1],
        depending on which of these two subintervals is longer.
        This divides the stick into three pieces of lengths a, b, and c.
        Again use a pseudo-random number generator to estimate the probability
        that a, b, and c can form the sides of a triangle.

What do you think the true answer is for part (a)?  Can you prove it?

2.  The random number generator rexpon in SAS generates observations having
    an exponential distribution with expectation 1.0.  Use rexpon to generate
    observations with an exponential distributions which have expectation 0.5
    and 2.0 respectively.

3.  Use the random number generator rannor in SAS to generate two random variables
    X and Y such that (1) X and Y both have normal distributions, and (2) Corr(X, Y) = .5.

    Generate 1000 X Y pairs and find the estimated correlation of the two variables.

Homework #2 - due September 24, 2013


1.  A clinical trial is planned of two blood pressure drugs, A and B.
    Each person will be randomly assigned to take one of these two
    for 30 days.  At the end, the systolic blood pressure (SBP) will be
    be measured.  Here are some relevant assumptions:

       1)  The standard deviation of SBP is 10 mmHg.

       2)  The significance level is alpha = 0.05, two-sided.

       3)  Null hypothesis: mean(A) = mean(B).

       4)  Alternative hypothesis: mean(B) - mean(A) = 2 mmHg.

       5)  N = 60 people in each group.

       6)  The test statistic will be an unpaired t-test.

    Estimate the power of this clinical trial.  Do this in three ways:

    a.  By the use of a simulation study with at least 10,000 simulated
        trials.

    b.  By the use of a formula for power (cite your source).

    c.  By the use of PROC POWER in SAS or an equivalent routine in R.

2.  You need to write a program for a randomization schedule for a
    clinical trial with 3 groups, A, B, and C.  Assume you will use
    a randomized block design with block sizes 3 and 6.  Assume there
    are two clinics and that your program allows for at least 120
    treatment assignments for each clinic.

Homework #3 - due October 3, 2013

1. (a) Write a program to compute sample size for a clinical trial
with two groups, where the endpoint is time-to-event (i.e.,
survival).  The sample size computation should be based on the
the description in Biostatistical Methods, by John Lachin,
pages 409-412.  A copy is included in the class website, right
after notes.004.1.  The test statistic is the logrank test.  Constant 
exponential hazards are assumed.  You can assume that the sample 
sizes in the two groups will be equal.  Input parameters should 
include the following:

==============================================================================

*  alpha = two-sided signif level
*  power = 1 - beta
*
*  f = Maximal follow-up time
*  a = Accrual time (assuming uniform accrual)
*
*  r1 = proportion having event in group 1 at time = 1
*  r2 = proportion having event in group 2 at time = 1
*

==============================================================================

 Output from the program should look like the following:

==============================================================================

Logrank sample size program: {program name} 27AUG07 17:26
 
Computation based on Biostatistical Methods, John Lachin (2000)
 
Two groups with exponential hazard in each group
 
Two-sided alpha = 0.05
Power           = 0.85
 
Maximal follow-up time f = 2.5
Accrual time             = 1.5  (uniform accrual assumed)
 
Expected proportion of events in Group 1 in time = 1 : 0.55
Expected proportion of events in Group 2 in time = 1 : 0.44
 
Expected number of events in Group 1 :    189
Expected number of events in Group 2 :    161
 
Proportion of patients in Group 1: 0.5
Proportion of patients in Group 2: 0.5
 
Hazard in Group 1:     0.799
Hazard in Group 2:     0.580
Average hazard   :     0.689
 
Relative hazard (Group 2 relative to Group1) :     0.726
 
Required total sample size         :    513

===============================================================================
1(b).  Check that your program is giving approximately the
right values by comparing the results to those you can obtain
from PROC POWER in SAS version 9.

2.  Use Newton's Method to solve the following equation for x:

        u = CDF(x),

    where u = .1, .3, .5, .7, .9 and CDF(x) is the cumulative
    distribution function for the lognormal distribution function
    with theta = 0 and lambda = 1.  Note that the CDF for the lognormal
    distribution is available in SAS 9.3: Look in the SAS 9.3
    documentation under "SAS(R) 9.3 Functions and CALL Routines: Reference"

Homework #4 - due October 10, 2013


1.  A permutation test problem: the datafile asthma.out is on the
    course website.  It includes three variables: group (= 1 or 2),
    number of events, and follow-up time.  There are two ways to
    summarize the data: the average rate of events per follow-up
    time, where you compute the fraction

          rate = (number of events / follow-up time)

    for each person.  You can then compare the two groups using
    either a t-test or a nonparametric test such as the Wilcoxon.

    The other way of summarizing the data is to compute

          rateg = (total events ) / (total follow-up time)

    where this is done separately for each group.  You can then
    use a permutation test to compare the groups.

    Do both of these analyses and compare the results.  For the
    permutation test, base your results on at least 1000 permutations 
    of the group assignments.

Homework #5 - due October 24, 2013


For this problem use the same dataset as that given for Homework #4.

1.  Compute the bootstrap estimate of the variance of the median of the
    rate of events, where a separate rate is computed for each person.

    Do this for each group separately.

2.  Let the number of bootstrap samples be 1000.  Graph a histogram
    of the bootstrap medians.

3.  Use the estimate of variance in part 1. to estimate 95%
    confidence intervals for the medians of each group and for
    the difference of the medians.

4.  Use the bootstrap to compute a direct estimate of the 95%
    confidence intervals for the medians of each group and for
    the difference of the medians.

5.  Also compute 95% confidence limits for the medians of each
    group using the method described in:

    https://epilab.ich.ucl.ac.uk/coursematerial/statistics/non_parametric/
    confidence_interval.html

Homework #6 - due Tuesday, November 5, 2013


  1.  The function f(x) is defined as

          f(x) = log(x) + x.

      It is defined only for x > 0.

      Use Newton's method to find an approximate solution to
      the equation f(x) = 0.  The solution should be correct to
      within 10^(-6).

  2.  The function G(x) is defined as

          G(x, y) = (xy - 1)^2 + cos(x - y) + sin(x^2 + y^2).

      Use Newton's method to find a vector (x, y) such that
      dG(x, y)/dx = 0 and dG(x, y)/dy = 0.  Draw a contour
      plot of the function to see if your answer looks
      reasonable.

  3.  The random variable X has the following distribution:

      Pr(X = n) = (1 - r) * r^n, where n is a nonnegative integer,
      that is, n = 0, 1, 2, 3, ...

      If y is not a nonnegative integer, Pr(X = y) = 0.

      You don't know what r is, but it must be between 0 and 1.

      Assume you have the following observations for X:

             1, 2, 0, 3, 5, 2, 3, 8.

      a) Find the maximum likelihood estimate of r.  

      b) Verify that the solution provides a maximum of the (log)likelihood, 
         not a minimum.

      c) How would you generate random observations from this distribution?

      d) Generate 100 pseudo-random observations from this distribution
         for r = 0.2 and find the mean, variance, and standard deviation of these
         values.

Homework #7 - due Thursday, November 14, 2013


1.  Problem 18A, notes.019:  Do this using proc nlp and a similar routine in R.

2.  Problem 19.1, notes.020: Do this using both proc iml, and proc nlp.

3.  Problem 19.2, notes.020: Do this also using both proc iml and proc nlp.

Homework #8 - due Tuesday, November 26, 2013


Problem 20, notes.021.  Use SAS PROC IML.

Problem 21, notes.021

Problem 21a, notes.021

See notes.021a.  Assume the model

    S2FEVPOS = b0 + b1*age + b2*bodymass + error,     E(error) = 0, Var(error) = w^2.

    Define two other variables U and V, linear combinations of age and bodymass, such that:
      (1) U and V are uncorrelated

      (2) The regression of S2FEVPOS on U and V has the same sum of squared residuals
          as the regression of S2FEVPOS on age and bodymass

      (3) The regression sum of squares for the model in part (2) is the sum of the
          regression sum of squares for the regression of S2FEVPOS on U and the regression
          sum of squares for the regression of S2FEVPOS on V.

   This problem should be done using both R and SAS PROC IML.

Homework #9 - due Thursday, December 5, 2013


1.  Let A be the matrix {5 2, 2 8}.  Let V be the vector {1, 0}.
    Perform the following steps:

    1.  V = (1/w) * V where w = sqrt(V`*V).

    2.  V = A*V.

    Repeat these steps 50 times.  What is the result?

2.  You have two datasets, one of which has the heights and weights
    of some men, and the other of which has the heights and weights
    of some women.  Write a macro which does the following:

    1.  Computes the body mass index, BMI, for each person in the two datasets.

        BMI = weight / (height * height), where weight is in kilograms
              and height is in meters.

        (If the body mass index is already nonblank on the file,
         the above computation is skipped.)

    2.  For each man in dataset 1, computes the number of women
        in dataset 2 who have a smaller BMI than that man.  The percent
        of women who have a smaller BMI than that man is also computed.

    3.  Compute the average of the percents for the men in 2.

    4.  Print the results in a table.

    Apply this macro to the Lung Health Study dataset, lhs.data.
    Note that the variable 'gender' on that file is coded as 0 = men,
    1 = women.  Also note that BMI is already computed there,
    with the label 'bodymass'.

Web address: http://www.biostat.umn.edu/~john-c/assign7460.f2013.html Most recent update: November 29, 2013.