PubH 7460 Biostatistical Computing - Fall 2013


  1. Homework #1 - due September 10, 2013.

    1.  (a) You have a stick which is 1 unit long.  You make two marks
            on the stick at points X and Y, where both X and Y have a
            uniform[0, 1] distribution (the left end of the stick is at 0).
            This divides the stick into three pieces of lengths a, b, and c.
            Use a pseudo-random number generator to estimate the probability
            that a, b, and c can form the sides of a triangle.
    
        (b) You have a stick which is 1 unit long.  You make two marks
            on the stick at points X and Y, where X has a uniform[0, 1]
            distribution and Y has a uniform distribution on the interval [x, 1].
            This divides the stick into three pieces of lengths a, b, and c.
            Again use a pseudo-random number generator to estimate the probability
            that a, b, and c can form the sides of a triangle.
    
        (c) You have a stick which is 1 unit long.  You make two marks
            on the stick at points X and Y, where X has a uniform[0, 1]
            distribution Y has a uniform distribution on either [0, x] or on [x, 1],
            depending on which of these two subintervals is longer.
            This divides the stick into three pieces of lengths a, b, and c.
            Again use a pseudo-random number generator to estimate the probability
            that a, b, and c can form the sides of a triangle.
    
    What do you think the true answer is for part (a)?  Can you prove it?
    
    2.  The random number generator rexpon in SAS generates observations having
        an exponential distribution with expectation 1.0.  Use rexpon to generate
        observations with an exponential distributions which have expectation 0.5
        and 2.0 respectively.
    
    3.  Use the random number generator rannor in SAS to generate two random variables
        X and Y such that (1) X and Y both have normal distributions, and (2) Corr(X, Y) = .5.
    
        Generate 1000 X Y pairs and find the estimated correlation of the two variables.
    
    
  2. Homework #2 - due September 24, 2013
    
    1.  A clinical trial is planned of two blood pressure drugs, A and B.
        Each person will be randomly assigned to take one of these two
        for 30 days.  At the end, the systolic blood pressure (SBP) will be
        be measured.  Here are some relevant assumptions:
    
           1)  The standard deviation of SBP is 10 mmHg.
    
           2)  The significance level is alpha = 0.05, two-sided.
    
           3)  Null hypothesis: mean(A) = mean(B).
    
           4)  Alternative hypothesis: mean(B) - mean(A) = 2 mmHg.
    
           5)  N = 60 people in each group.
    
           6)  The test statistic will be an unpaired t-test.
    
        Estimate the power of this clinical trial.  Do this in three ways:
    
        a.  By the use of a simulation study with at least 10,000 simulated
            trials.
    
        b.  By the use of a formula for power (cite your source).
    
        c.  By the use of PROC POWER in SAS or an equivalent routine in R.
    
    2.  You need to write a program for a randomization schedule for a
        clinical trial with 3 groups, A, B, and C.  Assume you will use
        a randomized block design with block sizes 3 and 6.  Assume there
        are two clinics and that your program allows for at least 120
        treatment assignments for each clinic.
    
    
  3. Homework #3 - due October 3, 2013
    1. (a) Write a program to compute sample size for a clinical trial
    with two groups, where the endpoint is time-to-event (i.e.,
    survival).  The sample size computation should be based on the
    the description in Biostatistical Methods, by John Lachin,
    pages 409-412.  A copy is included in the class website, right
    after notes.004.1.  The test statistic is the logrank test.  Constant 
    exponential hazards are assumed.  You can assume that the sample 
    sizes in the two groups will be equal.  Input parameters should 
    include the following:
    
    ==============================================================================
    
    *  alpha = two-sided signif level
    *  power = 1 - beta
    *
    *  f = Maximal follow-up time
    *  a = Accrual time (assuming uniform accrual)
    *
    *  r1 = proportion having event in group 1 at time = 1
    *  r2 = proportion having event in group 2 at time = 1
    *
    
    ==============================================================================
    
     Output from the program should look like the following:
    
    ==============================================================================
    
    Logrank sample size program: {program name} 27AUG07 17:26
     
    Computation based on Biostatistical Methods, John Lachin (2000)
     
    Two groups with exponential hazard in each group
     
    Two-sided alpha = 0.05
    Power           = 0.85
     
    Maximal follow-up time f = 2.5
    Accrual time             = 1.5  (uniform accrual assumed)
     
    Expected proportion of events in Group 1 in time = 1 : 0.55
    Expected proportion of events in Group 2 in time = 1 : 0.44
     
    Expected number of events in Group 1 :    189
    Expected number of events in Group 2 :    161
     
    Proportion of patients in Group 1: 0.5
    Proportion of patients in Group 2: 0.5
     
    Hazard in Group 1:     0.799
    Hazard in Group 2:     0.580
    Average hazard   :     0.689
     
    Relative hazard (Group 2 relative to Group1) :     0.726
     
    Required total sample size         :    513
    
    ===============================================================================
    1(b).  Check that your program is giving approximately the
    right values by comparing the results to those you can obtain
    from PROC POWER in SAS version 9.
    
    2.  Use Newton's Method to solve the following equation for x:
    
            u = CDF(x),
    
        where u = .1, .3, .5, .7, .9 and CDF(x) is the cumulative
        distribution function for the lognormal distribution function
        with theta = 0 and lambda = 1.  Note that the CDF for the lognormal
        distribution is available in SAS 9.3: Look in the SAS 9.3
        documentation under "SAS(R) 9.3 Functions and CALL Routines: Reference"
    
    
  4. Homework #4 - due October 10, 2013
    
    1.  A permutation test problem: the datafile asthma.out is on the
        course website.  It includes three variables: group (= 1 or 2),
        number of events, and follow-up time.  There are two ways to
        summarize the data: the average rate of events per follow-up
        time, where you compute the fraction
    
              rate = (number of events / follow-up time)
    
        for each person.  You can then compare the two groups using
        either a t-test or a nonparametric test such as the Wilcoxon.
    
        The other way of summarizing the data is to compute
    
              rateg = (total events ) / (total follow-up time)
    
        where this is done separately for each group.  You can then
        use a permutation test to compare the groups.
    
        Do both of these analyses and compare the results.  For the
        permutation test, base your results on at least 1000 permutations 
        of the group assignments.
    
    
  5. Homework #5 - due October 24, 2013
    
    For this problem use the same dataset as that given for Homework #4.
    
    1.  Compute the bootstrap estimate of the variance of the median of the
        rate of events, where a separate rate is computed for each person.
    
        Do this for each group separately.
    
    2.  Let the number of bootstrap samples be 1000.  Graph a histogram
        of the bootstrap medians.
    
    3.  Use the estimate of variance in part 1. to estimate 95%
        confidence intervals for the medians of each group and for
        the difference of the medians.
    
    4.  Use the bootstrap to compute a direct estimate of the 95%
        confidence intervals for the medians of each group and for
        the difference of the medians.
    
    5.  Also compute 95% confidence limits for the medians of each
        group using the method described in:
    
        https://epilab.ich.ucl.ac.uk/coursematerial/statistics/non_parametric/
        confidence_interval.html
    
    
  6. Homework #6 - due Tuesday, November 5, 2013
    
      1.  The function f(x) is defined as
    
              f(x) = log(x) + x.
    
          It is defined only for x > 0.
    
          Use Newton's method to find an approximate solution to
          the equation f(x) = 0.  The solution should be correct to
          within 10^(-6).
    
      2.  The function G(x) is defined as
    
              G(x, y) = (xy - 1)^2 + cos(x - y) + sin(x^2 + y^2).
    
          Use Newton's method to find a vector (x, y) such that
          dG(x, y)/dx = 0 and dG(x, y)/dy = 0.  Draw a contour
          plot of the function to see if your answer looks
          reasonable.
    
      3.  The random variable X has the following distribution:
    
          Pr(X = n) = (1 - r) * r^n, where n is a nonnegative integer,
          that is, n = 0, 1, 2, 3, ...
    
          If y is not a nonnegative integer, Pr(X = y) = 0.
    
          You don't know what r is, but it must be between 0 and 1.
    
          Assume you have the following observations for X:
    
                 1, 2, 0, 3, 5, 2, 3, 8.
    
          a) Find the maximum likelihood estimate of r.  
    
          b) Verify that the solution provides a maximum of the (log)likelihood, 
             not a minimum.
    
          c) How would you generate random observations from this distribution?
    
          d) Generate 100 pseudo-random observations from this distribution
             for r = 0.2 and find the mean, variance, and standard deviation of these
             values.
    
    
  7. Homework #7 - due Thursday, November 14, 2013
    
    1.  Problem 18A, notes.019:  Do this using proc nlp and a similar routine in R.
    
    2.  Problem 19.1, notes.020: Do this using both proc iml, and proc nlp.
    
    3.  Problem 19.2, notes.020: Do this also using both proc iml and proc nlp.
    
    
  8. Homework #8 - due Tuesday, November 26, 2013

    
    Problem 20, notes.021.  Use SAS PROC IML.
    
    Problem 21, notes.021
    
    Problem 21a, notes.021
    
    See notes.021a.  Assume the model
    
        S2FEVPOS = b0 + b1*age + b2*bodymass + error,     E(error) = 0, Var(error) = w^2.
    
        Define two other variables U and V, linear combinations of age and bodymass, such that:
          (1) U and V are uncorrelated
    
          (2) The regression of S2FEVPOS on U and V has the same sum of squared residuals
              as the regression of S2FEVPOS on age and bodymass
    
          (3) The regression sum of squares for the model in part (2) is the sum of the
              regression sum of squares for the regression of S2FEVPOS on U and the regression
              sum of squares for the regression of S2FEVPOS on V.
    
       This problem should be done using both R and SAS PROC IML.
    
    
  9. Homework #9 - due Thursday, December 5, 2013

    
    1.  Let A be the matrix {5 2, 2 8}.  Let V be the vector {1, 0}.
        Perform the following steps:
    
        1.  V = (1/w) * V where w = sqrt(V`*V).
    
        2.  V = A*V.
    
        Repeat these steps 50 times.  What is the result?
    
    2.  You have two datasets, one of which has the heights and weights
        of some men, and the other of which has the heights and weights
        of some women.  Write a macro which does the following:
    
        1.  Computes the body mass index, BMI, for each person in the two datasets.
    
            BMI = weight / (height * height), where weight is in kilograms
                  and height is in meters.
    
            (If the body mass index is already nonblank on the file,
             the above computation is skipped.)
    
        2.  For each man in dataset 1, computes the number of women
            in dataset 2 who have a smaller BMI than that man.  The percent
            of women who have a smaller BMI than that man is also computed.
    
        3.  Compute the average of the percents for the men in 2.
    
        4.  Print the results in a table.
    
        Apply this macro to the Lung Health Study dataset, lhs.data.
        Note that the variable 'gender' on that file is coded as 0 = men,
        1 = women.  Also note that BMI is already computed there,
        with the label 'bodymass'.
    
    

    Web address: http://www.biostat.umn.edu/~john-c/assign7460.f2013.html Most recent update: November 29, 2013.