August 19, 2011
Some Sample Questions for Stat Computing II (SPH 5421)
1. Given a dataset of observations of the form (Xi, Yi), what
computations would you do to decide between a linear model with
an intercept and a linear model with no intercept?
2. How would you generate pseudorandom observations from a lognormal
distribution?
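One possible approach to question 2, sketched in Python rather than SAS or S-Plus: if Z is standard normal, then exp(mu + sigma * Z) is lognormal. The function name and the parameters mu, sigma, and seed are illustrative choices, not part of the question.

```python
import math
import random


def lognormal_sample(n, mu=0.0, sigma=1.0, seed=None):
    """Return n pseudorandom lognormal draws, each exp(mu + sigma * Z), Z ~ N(0, 1)."""
    rng = random.Random(seed)  # private generator so results are reproducible
    return [math.exp(mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]


if __name__ == "__main__":
    print(lognormal_sample(5, mu=0.0, sigma=1.0, seed=1))
```

The logarithms of the generated values should look like a normal sample with mean mu and standard deviation sigma, which gives a quick check on the generator.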
3. You can estimate the number pi by (1) randomly generating points
within the square which has vertices at (-1, -1), (-1, 1), (1, 1),
and (1, -1), and then (2) computing what proportion of these
points lie within a circle of radius 1 centered at the origin.
Write SAS or S-Plus code which accomplishes this and produces an
estimate of pi. Execute your program with 10000 simulated points
and print the answer.
Can you think of a way to assign a standard error to the estimate
of pi that you obtain by this method (assuming you didn't already
know the true value)?
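A minimal sketch of the simulation in question 3, written in Python instead of the SAS or S-Plus the question asks for. It assumes one reasonable answer to the standard-error part: the proportion of points inside the circle is a binomial proportion p-hat estimating pi/4, so the standard error of 4 * p-hat is 4 * sqrt(p-hat * (1 - p-hat) / n). The function name and the seed argument are illustrative.

```python
import math
import random


def estimate_pi(n_points=10000, seed=None):
    """Estimate pi from uniform points in the square [-1, 1] x [-1, 1].

    Returns (pi_hat, se), where se is a binomial standard error that
    does not use the true value of pi.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # point lies within the unit circle
            inside += 1
    p_hat = inside / n_points  # estimates (circle area) / (square area) = pi / 4
    pi_hat = 4.0 * p_hat
    se = 4.0 * math.sqrt(p_hat * (1.0 - p_hat) / n_points)
    return pi_hat, se


if __name__ == "__main__":
    pi_hat, se = estimate_pi(10000, seed=2011)
    print("estimate of pi:", pi_hat, "standard error:", se)
```

With 10000 points the standard error comes out near 0.016, so the estimate is typically good to about one decimal place.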
4. Given a list of 120 people, find an efficient way to randomly assign them to
three groups so that there are exactly 40 in each group.
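One efficient scheme for question 4 is to shuffle the list once and then cut it into three consecutive blocks of 40. The Python sketch below illustrates this; the function name and arguments are hypothetical, and the same idea carries over to SAS (sort on a uniform random number) or S-Plus.

```python
import random


def assign_groups(ids, n_groups=3, seed=None):
    """Randomly split ids into n_groups groups of equal size."""
    rng = random.Random(seed)
    shuffled = list(ids)
    if len(shuffled) % n_groups != 0:
        raise ValueError("number of people must be divisible by n_groups")
    rng.shuffle(shuffled)  # every ordering is equally likely
    size = len(shuffled) // n_groups
    # Consecutive blocks of the shuffled list form the groups.
    return [shuffled[g * size:(g + 1) * size] for g in range(n_groups)]


if __name__ == "__main__":
    groups = assign_groups(range(1, 121), n_groups=3, seed=5421)
    for number, members in enumerate(groups, start=1):
        print("group", number, "size", len(members))
```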
5. An investigator gives drug A to 100 people and drug B to another
100 people for four weeks. Both drugs are supposed to lower blood
pressure, so the outcome of the study is the change in blood
pressure from baseline to the last day on the drug.
He finds that the mean blood pressure change in the A group is
-12.4, and in the B group it is -15.8. The difference between
groups A and B is -12.4 - (-15.8) = 3.4.
He wants to know how likely it is that this might have occurred
by chance. He decides to test the null hypothesis of no difference
between the drugs by means of a permutation test. To do this he
takes the list of 200 people and their changes in blood pressure
and randomly assigns them to a pseudo-A group and a pseudo-B
group. He then computes the difference in mean change in blood
pressure between these pseudo-groups. He repeats this process
10,000 times.
a) What do you expect the mean difference over the 10,000 pseudo-
trials to be?
b) How can he estimate the probability that he would obtain
by chance a difference greater than that which he observed?
c) Does this seem to be a valid way to carry out a statistical
test?
d) Write a SAS or S-Plus program which carries out his plan.
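The investigator's plan in question 5 can be sketched as follows, in Python rather than SAS or S-Plus. The function name is hypothetical and the data in the usage example are made up for illustration; they are not the study data. The function returns the observed difference, the mean permuted difference (relevant to part a), and one-sided and two-sided permutation p-values (relevant to part b).

```python
import random
from statistics import fmean


def permutation_test(a, b, n_perm=10000, seed=None):
    """Permutation test for a difference in means between groups a and b."""
    rng = random.Random(seed)
    observed = fmean(a) - fmean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling into pseudo-A and pseudo-B
        diffs.append(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
    mean_diff = sum(diffs) / n_perm
    # Proportion of pseudo-trials at least as extreme as the observed difference.
    p_one_sided = sum(d >= observed for d in diffs) / n_perm
    p_two_sided = sum(abs(d) >= abs(observed) for d in diffs) / n_perm
    return observed, mean_diff, p_one_sided, p_two_sided


if __name__ == "__main__":
    # Made-up blood pressure changes, for illustration only.
    gen = random.Random(1)
    group_a = [gen.gauss(-12.0, 10.0) for _ in range(100)]
    group_b = [gen.gauss(-16.0, 10.0) for _ in range(100)]
    print(permutation_test(group_a, group_b, n_perm=10000, seed=2))
```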
6. Assume a data file with the following structure:
Obs X Y
--- --- ---
1 12 3
2 16 2
. . .
. . .
. . .
N 29 7
Write an IML program which computes regression coefficients
for the model
E(Y) = b0 + b1 * X + b2 * X^2.
Include a computation of R-square.
7. White blood cell counts may be related to age and weight. Suppose N people are
sampled with ages A1, A2, ..., AN, weights W1, W2, ..., WN,
and white cell counts Y1, Y2, ..., YN. The white cell count Yi for
a given person can be assumed to have a Poisson distribution with
expectation equal to
E(Yi) = exp(a*Wi + b*Ai),
where Ai is the person's age and Wi is the person's weight.
Write a PROC IML program which computes maximum likelihood estimates
of a and b and the approximate covariance matrix of these estimates.
8. Given a 2 x 2 table as follows,
                     Disease
                    0       1
                -----------------
              0 |   a   |   b   |
    Exposure    -----------------
              1 |   c   |   d   |
                -----------------
Define the estimated ODDS RATIO of disease for exposed versus nonexposed people as
OR = (d / c) / (b / a).
Write a macro which will compute and print the estimated odds ratio and a 95% confidence
interval for the true odds ratio. The call to the macro should look like:
%oddsmac (a, b, c, d, title);
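The computation behind question 8 might look like the Python sketch below (the question itself asks for a SAS macro). The interval shown is the large-sample interval on the log scale, log(OR) +/- 1.96 * sqrt(1/a + 1/b + 1/c + 1/d), often called Woolf's method; this is an assumed choice, since the question does not specify a method. The function name is hypothetical and the cell counts in the example are made up.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Return (OR, lower, upper) for a 2 x 2 table with cells a, b, c, d.

    OR = (d / c) / (b / a); the interval is symmetric on the log scale.
    Requires all four cell counts to be positive.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    or_hat = (d / c) / (b / a)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(odds_ratio_ci(10, 20, 30, 40))
```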
9. Write a program to generate random samples from the negative binomial
distribution; from the geometric distribution; from the hypergeometric
distribution.
10. Write a macro which shows the image of the unit circle (x^2 + y^2 = 1)
under a linear transformation A: R^2 --> R^2, specified by the matrix
        | a  b |
    A = |      |
        | c  d |
The call to the macro should look like
%circimg (a, b, c, d, title) ;
11. Describe why SYMPUT is a useful command.
12. List 6 reasons why macros are useful.
13. List some possible disadvantages of macros.
14. What is a good reason to avoid using the 'goto' statement in SAS (or any
other language)? What should you do instead?
15. Why is it a good idea to include 'mprint' on an options card?
16. Which is better: SAS or S-Plus?
17. What things can go wrong in programs which are designed to minimize
functions iteratively?
18. Write a program that (1) reads a dataset A which includes diastolic
blood pressure (DBP) as a variable, and (2) reads a second dataset B which
also includes DBP as a variable, and computes the percentages of
people in dataset B that are above and below the median DBP of
dataset A, and tests for whether these percentages are significantly
different from 50%.
19. Write a macro for doing sample size calculations for a two-group
clinical trial with a dichotomous endpoint. The call to the macro
should look like
%sampsize(p1, p2, alpha, beta, s) ;
where p1 = the event probability in group 1,
p2 = the event probability in group 2,
alpha = significance level (two-sided),
beta = probability of Type II error (= 1 - power),
s = the desired total sample size.
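A possible calculation for question 19, sketched in Python rather than as a SAS macro. It assumes the usual normal-approximation formula without continuity correction, equal allocation to the two groups, and that s is the quantity to be computed; the question fixes none of these. With pbar = (p1 + p2) / 2, the per-group size is

    n = [ z(1 - alpha/2) * sqrt(2 * pbar * (1 - pbar))
          + z(1 - beta) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)) ]^2 / (p1 - p2)^2.

```python
import math
from statistics import NormalDist


def sampsize(p1, p2, alpha, beta):
    """Total sample size for comparing two proportions (two-sided test)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    p_bar = (p1 + p2) / 2.0
    numerator = (z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    n_per_group = math.ceil(numerator / (p1 - p2) ** 2)  # round up to be safe
    return 2 * n_per_group


if __name__ == "__main__":
    print("total sample size:", sampsize(0.3, 0.2, alpha=0.05, beta=0.2))
```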
20. An observation X in a dataset is called an OUTLIER if
X > Q3 + 1.5 * (Q3 - Q1) or
X < Q1 - 1.5 * (Q3 - Q1),
where Q3 is the third quartile of the dataset (i.e., 3/4 of the
observations are less than Q3) and Q1 is the first quartile.
Write a macro which lists all the outliers for a given variable
in a dataset. The call to the macro should be of the form
%outliers (indata, xvar, outdata),
where indata is the input dataset, xvar is the variable of
interest, and outdata is the output dataset of outlying values.
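The outlier rule in question 20 can be sketched in Python as below (the question asks for a SAS macro with datasets as arguments; here a plain list stands in for the variable xvar). Quartile definitions differ between packages; this sketch uses the default method of Python's statistics.quantiles, which is an assumption and may not match PROC UNIVARIATE exactly. The function name is hypothetical.

```python
from statistics import quantiles


def find_outliers(values):
    """Return the values beyond 1.5 * IQR below Q1 or above Q3."""
    q1, _median, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


if __name__ == "__main__":
    # Made-up data, for illustration only.
    print(find_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```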
Web address of this page: http://www.biostat.umn.edu/~john-c/review.s2003.html
Most recent update: August 19, 2011.