PubH 5421 SOME REVIEW QUESTIONS


 August 19, 2011

 Some Sample Questions for Stat Computing II (SPH 5421)

 1.  Given a dataset of observations of the form (Xi, Yi), what
     computations would you do to decide between a linear model with
     an intercept and a linear model with no intercept?

 2.  How would you generate pseudorandom observations from a lognormal
     distribution?
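
     One possible approach, sketched in Python rather than SAS or Splus:
     if Z is standard normal, then exp(mu + sigma * Z) is lognormal.  The
     function name rlognormal and the parameter values below are
     illustrative assumptions, not part of the question.

```python
import math
import random

def rlognormal(n, mu=0.0, sigma=1.0, seed=None):
    """Return n pseudorandom lognormal draws.

    mu and sigma are the mean and standard deviation of log(X),
    not of X itself.
    """
    rng = random.Random(seed)
    # Exponentiate normal draws: X = exp(mu + sigma * Z).
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]

# Illustrative parameter values (assumed for this sketch).
sample = rlognormal(20000, mu=1.0, sigma=0.5, seed=5421)
log_mean = sum(math.log(x) for x in sample) / len(sample)
print(f"mean of log(X) = {log_mean:.3f}")
```

     The mean of the logged sample should be close to mu, which gives a
     quick check on the generator.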

 3.  You can estimate the number pi by (1) randomly generating points
     within the square which has vertices at (-1, -1), (-1, 1), (1, 1),
     and (1, -1), and then (2) computing what proportion of these
     points lie within a circle of radius 1 centered at the origin.

     Write SAS or Splus code which accomplishes this and produces an
     estimate of pi.  Execute your program with 10000 simulated points
     and print the answer.

     Can you think of a way to assign a standard error to the estimate
     of pi that you obtain by this method (assuming you didn't already
     know the true value)?
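
     The simulation described above can be sketched as follows, in Python
     rather than SAS or Splus.  The standard error shown is one possible
     choice: it treats the count of points inside the circle as binomial,
     so it uses only the observed proportion.  The function name
     estimate_pi and the seed are assumptions made for illustration.

```python
import math
import random

def estimate_pi(n_points, seed=None):
    """Estimate pi from uniform points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:   # point lies within the unit circle
            inside += 1
    p_hat = inside / n_points      # estimates (circle area) / (square area) = pi / 4
    pi_hat = 4.0 * p_hat
    # Binomial standard error of p_hat, scaled by 4.
    se = 4.0 * math.sqrt(p_hat * (1.0 - p_hat) / n_points)
    return pi_hat, se

pi_hat, se = estimate_pi(10000, seed=5421)
print(f"estimate of pi = {pi_hat:.4f}, standard error = {se:.4f}")
```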

 4.  Given a list of 120 people, find an efficient way to randomly assign them to
     three groups so that there are exactly 40 in each group.
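
     One simple approach is to shuffle the list once and cut it into
     consecutive blocks of equal size.  A minimal sketch in Python (not
     SAS or Splus); the function name assign_groups and the ID numbers
     1 to 120 are assumptions for illustration.

```python
import random

def assign_groups(people, n_groups=3, seed=None):
    """Randomly split people into n_groups groups of equal size."""
    shuffled = list(people)
    if len(shuffled) % n_groups != 0:
        raise ValueError("number of people must be divisible by n_groups")
    rng = random.Random(seed)
    rng.shuffle(shuffled)          # one random permutation of the whole list
    size = len(shuffled) // n_groups
    # Consecutive blocks of the shuffled list form the groups.
    return [shuffled[g * size:(g + 1) * size] for g in range(n_groups)]

groups = assign_groups(range(1, 121), n_groups=3, seed=5421)
print([len(g) for g in groups])
```

     Every person lands in exactly one group, and each group has exactly
     40 members because the block size is fixed.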

 5.  An investigator gives drug A to 100 people and drug B to another
     100 people for four weeks.  Both drugs are supposed to lower blood
     pressure.  The outcome of the study is the change in blood pressure
     from baseline to the last day on the drug.

     He finds that the mean blood pressure change in the A group is
     -12.4, and in the B group it is -15.8.  The difference between
     groups A and B is -12.4 - (-15.8) = 3.4.

     He wants to know how likely it is that this might have occurred
     by chance.  He decides to test the null hypothesis of no difference
     between the drugs by means of a permutation test.  To do this he
     takes the list of 200 people and their changes in blood pressure
     and randomly assigns them to a pseudo-A group and a pseudo-B
     group.  He then computes the difference in mean change in blood
     pressure between these pseudo-groups.  He repeats this process
     10,000 times.

     a) What do you expect the mean difference over the 10,000 pseudo-
        trials to be?

     b) How can he estimate the probability that he would obtain
        by chance a difference greater than that which he observed?

     c) Does this seem to be a valid way to carry out a statistical
        test?

     d) Write a SAS or Splus program which carries out his plan.
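
     The investigator's plan can be sketched as follows, in Python rather
     than SAS or Splus.  The individual blood pressure changes are not
     given in the question, so the data below are simulated: the standard
     deviation of 10 and the seeds are assumptions made only to have
     something to run.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(changes_a, changes_b, n_perm=10000, seed=None):
    """Permutation test for a difference in mean change between two groups."""
    rng = random.Random(seed)
    observed = mean(changes_a) - mean(changes_b)
    pooled = list(changes_a) + list(changes_b)
    n_a = len(changes_a)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # random relabelling
        pseudo_a, pseudo_b = pooled[:n_a], pooled[n_a:]
        diffs.append(mean(pseudo_a) - mean(pseudo_b))
    # Proportion of pseudo-trials at least as large as the observed difference.
    p_upper = sum(d >= observed for d in diffs) / n_perm
    # Two-sided version, using absolute differences.
    p_two_sided = sum(abs(d) >= abs(observed) for d in diffs) / n_perm
    return observed, mean(diffs), p_upper, p_two_sided

# Hypothetical data: group means taken from the question, SD of 10 assumed.
data_rng = random.Random(2011)
group_a = [data_rng.gauss(-12.4, 10.0) for _ in range(100)]
group_b = [data_rng.gauss(-15.8, 10.0) for _ in range(100)]

observed, mean_diff, p_upper, p_two_sided = permutation_test(
    group_a, group_b, n_perm=10000, seed=5421)
print(f"observed difference       = {observed:.2f}")
print(f"mean over pseudo-trials   = {mean_diff:.3f}")
print(f"one-sided p, two-sided p  = {p_upper:.4f}, {p_two_sided:.4f}")
```

     Because the pseudo-group labels are assigned at random, the mean
     difference over the pseudo-trials should be close to zero.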


 6.  Assume a data file with the following structure:

     Obs    X     Y
     ---   ---   ---
      1     12    3
      2     16    2
      .      .    .
      .      .    .
      .      .    .
      N     29    7


      Write an IML program which computes regression coefficients
      for the model

             E(Y) = b0 + b1 * X + b2 * X^2.

      Include a computation of R-square.

7.  White blood cell counts may be related to age and weight.  Suppose N
    people are sampled, with ages A1, A2, ..., AN, weights W1, W2, ..., WN,
    and white cell counts Y1, Y2, ..., YN.  The white cell count Yi for
    a given person can be assumed to have a Poisson distribution with
    expectation equal to

                 E(Yi) = exp(a*Wi + b*Ai),

    where Ai is the person's age and Wi is the person's weight.

    Write a PROC IML program which computes maximum likelihood estimates
    of a and b and the approximate covariance matrix of these estimates.

8.  Given a 2 x 2 table as follows,

                     Disease
                ----------------
                    0       1
                ----------------
                |       |       |
            0   |   a   |   b   |
                |       |       |
 Exposure        ----------------
                |       |       |
            1   |   c   |   d   |
                |       |       |
                ----------------


     Define the estimated ODDS RATIO of disease for exposed versus
     nonexposed people as


                OR = (d / c) / (b / a).


     Write a macro which will compute and print the estimated odds ratio
     and a 95% confidence interval for the true odds ratio.  The call to
     the macro should look like:

               %oddsmac (a, b, c, d, title);
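
     The computation behind such a macro might look like the following
     sketch, written in Python rather than as a SAS macro.  The question
     does not say which interval to use; this sketch assumes the usual
     large-sample interval on the log scale, with standard error
     sqrt(1/a + 1/b + 1/c + 1/d).  The function name odds_ratio_ci and
     the cell counts are made up for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (d/c)/(b/a) with an approximate 95% confidence interval.

    Assumes all four cell counts are positive; the interval is computed
    on the log scale and then exponentiated.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    or_hat = (d / c) / (b / a)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Hypothetical cell counts, not taken from the question.
or_hat, lower, upper = odds_ratio_ci(a=10, b=20, c=30, d=40)
print(f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```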


9.  Write a program to generate random samples from the negative binomial
    distribution; from the geometric distribution; from the hypergeometric
    distribution.

10. Write a macro which shows the image of the unit circle (x^2 + y^2 = 1)
    under a linear transformation A: R^2 --> R^2, specified by the matrix


                         | a   b |
                     A = |       |
                         | c   d |

    The call to the macro should look like

             %circimg (a, b, c, d, title) ;

11. Describe why SYMPUT is a useful command.

12. List 6 reasons why macros are useful.

13. List some possible disadvantages of macros.

14. What is a good reason to avoid using the 'goto' statement in SAS (or any
    other language)?  What should you do instead?

15. Why is it a good idea to include 'mprint' on an options card?

16. Which is better: SAS or Splus?

17. What things can go wrong in programs which are designed to minimize
    functions iteratively?

18. Write a program that (1) reads a dataset A which includes diastolic
    blood pressure (DBP) as a variable, (2) reads a second dataset B which
    also includes DBP as a variable, (3) computes the percentages of
    people in dataset B that are above and below the median DBP of
    dataset A, and (4) tests whether these percentages are significantly
    different from 50%.

19. Write a macro for doing sample size calculations for a two-group
    clinical trial with a dichotomous endpoint.  The call to the macro
    should look like

          %sampsize(p1, p2, alpha, beta, s) ;

    where p1    = the event probability in group 1,
          p2    = the event probability in group 2,
          alpha = significance level (two-sided),
          beta  = probability of Type II error (= 1 - power),
          s     = the required total sample size (computed by the macro).

20. An observation X in a dataset is called an OUTLIER if

          X > Q3 + 1.5 * (Q3 - Q1)  or

          X < Q1 - 1.5 * (Q3 - Q1),

    where Q3 is the third quartile of the dataset (i.e., 3/4 of the
    observations are less than Q3) and Q1 is the first quartile (1/4 of
    the observations are less than Q1).

    Write a macro which lists all the outliers for a given variable
    in a dataset.  The call to the macro should be of the form

          %outliers (indata, xvar, outdata),

    where indata is the input dataset, xvar is the variable of
    interest, and outdata is the output dataset of outlying values.
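
    The outlier rule above can be sketched as follows, in Python rather
    than as a SAS macro.  Quartiles can be computed in several slightly
    different ways; this sketch assumes the default method of Python's
    statistics.quantiles, which may not match the SAS default.  The
    function name find_outliers and the sample values are made up for
    illustration.

```python
from statistics import quantiles

def find_outliers(values):
    """Return the values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4)   # three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical diastolic blood pressure readings.
dbp = [62, 70, 72, 75, 78, 80, 81, 84, 88, 140]
print(find_outliers(dbp))
```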


  • Main Page: http://www.biostat.umn.edu/~john-c/7460.f2006.html

    Web address of this page: http://www.biostat.umn.edu/~john-c/review.s2003.html

    Most recent update: August 19, 2011.