April 28, 2004
1. A datafile for a program is on an external file with the following
address:
/home/gnome/bobw/data/datafil1
Data on the file looks like the following:
ID Group Date Randomized Gender Birthdate Cholesterol
------- ------ ----------------- ------- ----------- -------------
A-003-4 No-Drug 2004/07/04 M 1976/11/08 230
A-005-8 Drug-A 2004/08/14 F 1970/02/28 -99
A-011-3 Drug-B 2004/10/11 M . 208
A-013-1 Drug-A 2004/10/18 M 1975/04/30 244
A-022-5 No-Drug 2005/03/12 F 1969/05/21 282
A-105-3 Drug-A 2003/12/21 -99 . 180
[more observations]
Write a program which reads this datafile and computes the person's
age in years. Note that values of gender and cholesterol which are missing
are denoted by '-99'. Missing dates are denoted by '.' .
Write a procedure to test the hypothesis that the serum cholesterol
level is not related to drug assignment (controlling for age and gender).
Answers:
----------------------------------------------------------------------------------
options linesize = 80 ;
footnote "/home/gnome/john-c/5421/review1.sas &sysdate &systime" ;
data chol ;
infile 'prob1.data' ;
length id $7 group $7 gender $3 ;
input id
group
@23 daterand yymmdd10.
gender
@49 birtdate yymmdd10.
chol ;
if gender eq '-99' then gender = '' ;
if chol le '-99' then chol = '.' ;
run ;
proc print data = chol ;
run ;
proc glm data = chol ;
class group gender ;
model chol = age gender group / solution ;
title 'Analysis of chol vs. drug group, controlling for age, gender' ;
run ;
----------------------------------------------------------------------------------
proc glm data = chol ;
class group gender ;
model chol = age gender group / solution ;
title 'Analysis of chol vs. drug group, controlling for age, gender' ;
run ;
==================================================================================
2. Dr. Smith performs a regression of diastolic blood pressure on dietary
intake of sodium. He plots the residuals and obtains the following
graph: x
x x x
15 | x x
|
10 | x
| x x x
5 | x x x x
| x x
0 ---------------x-------x-----------------------------------------------
| x x x x x
-5 | x x
| x
-10 | x x
| x x x
-15 | x
_______________________________________________________________________
100 200 300 400 500 600 700 800 1000
Sodium intake (mg/day)
a) What is the mean residual ?
Zero.
b) If you regress the residuals on sodium intake, what will be the
result ?
---> 0 intercept, 0 slope (or coefficient for sodium) where-ever it occurs in the
analysis section.
c) How can you use the residuals to test for outliers ?
---> Output the residuals from proc reg onto a SAS dataset, then use proc
univariate on them.
d) What does the above pattern of residuals suggest regarding the
error structure of the relationship between DBP and sodium intake?
---> Suggests that the variance of the measurements is not constant with
respect to sodium intake, and presumably also not constant with
respect to the response variable, chol. However you would need to
confirm the latter with a graph. If that turns out to be true,
then you should consider a transformation. One possible transformation
is:
transchol = sgn(chol - 80)*sqrt(abs(chol - 80)) ;
==================================================================================
3. The dataset below shows dose-levels of toxic chemical TCHEM, number of
mice exposed to that dose-level (NMICE), and number of mice that died
(NDIED):
TCHEM NMICE NDIED
------- ------- -------
0 16 1
1 20 3
2 21 4
4 18 8
8 22 14
16 23 16
32 19 19
It is believe that the probability of death is related to the log of
the dose of TCHEM.
a) Write a program to analyze these data.
b) How would you use output from a program to estimate the dose which
would be expected to result in the death of 50% of the mice ?
--------------------------------------------------------------------------------
a) data toxic ;
input dose nmice ndied ;
logdose = log(dose + 1) ;
nlived = nmice - ndied ;
do i = 1 to nlived ;
died = 0 ;
output ;
end ;
do j = 1 to ndied ;
died = 1 ;
output ;
end ;
run ;
proc logistic descending data = toxic ;
model died = logdose ;
run ;
--------------------------------------------------------------------------------
b) Let b0 = estimate of intercept, b1 = estimate of logdose1 from the previous
regression. You want to find dose such that
.5 = 1 / (1 + exp(-b0 - b1*logdose1))
This is simply a matter of solving this equation for logdose1:
2 = 1 + exp(-b0 - b1*logdose1), or
1 = exp(-b0 - b1*logdose1).
Take logs of both sides. Note that log(1) = 0. Therefore
0 = -b0 - b1*logdose1 , or
logdose1 = -b0/b1. This means that
log(dose + 1) = -b0/b1, or
dose + 1 = exp(-b0/b1), or
dose = exp(-b0/b1) - 1.
==================================================================================
4. The tables below show results of an experiment by a veterinary researcher
on the calming effects of a drug on dogs and cats. The objectives are to
see whether dogs and cats have similar reactions to the drug, and whether
the drug has a calming effect in either dogs or cats or both animals:
Dogs Cats
--------------------- ---------------------
Drug Placebo Drug Placebo
--------------------- ---------------------
| | | | | |
Agitated | 31 | 43 | | 16 | 50 |
| | | | | |
--------------------- ---------------------
| | | | | |
Calm | 69 | 57 | | 84 | 50 |
| | | | | |
--------------------- ---------------------
100 100 100 100
OR = OR =
a) Write a complete SAS program to read in these data.
data dogscats ;
input animal drug response count ;
cards ;
1 1 1 31
1 1 0 69
1 0 1 43
1 0 0 57
0 1 1 16
0 1 0 84
0 0 1 50
0 0 0 50
;
run ;
b) Write a PROC FREQ procedure to carry out an appropriate analysis,
and discuss how you would use the output.
---> proc freq data = dogscats ;
weight count ;
tables animal * response * drug / cmh measures ;
run ;
c) Write a PROC LOGISTIC procedures to carry out an appropriate analysis,
and again discuss how you would use the output.
---> 1. You need to re-write the datafile so that there is one observation
per animal.
2. proc logistic descending data = dogscats ;
class animal response drug ;
model response = animal drug animal*drug / clodds = pl ;
run ;
==================================================================================
5. a) Why are CLASS variables useful in PROC GLM ?
---> PROC GLM automatically creates indicator variables from the
CLASS variables, and correctly codes interactions if interactions
are specified.
b) Suppose CITY is a categorical variable with 5 possible values, and
GENDER is a categorical variable with 2 values, and WEIGHT is a
continuous (quantitative) variable. You want to analyze a collection
of 100 observations, studying the relationship of WEIGHT to CITY and
GENDER. You want to test for influential observations and outliers.
Finally, you want to test for whether there is an interaction of
CITY and GENDER. Describe how you carry out the analysis using PROC REG.
--------------------------------------------------------------------------------
1. Create indicator variables for each of the cities: e.g., if
one of the cities is St. Paul, create a variable called STPAUL
which equals 0 if CITY is not equal to StPaul, and which equals 1
if CITY equals St. Paul. Similarly create an indicator
variable for GENDER.
2. proc reg data = dataset ;
model WEIGHT = STPAUL CHICAGO NEWYORK BEIJING TOKYO GENDER ;
output out = regout rstudent = wstudent dffits = wdffits ;
run ;
proc print data = regout ;
var weight wstudent wdffits ;
run ;
--------------------------------------------------------------------------------
Write PROC GLM code to analyze the same data.
--------------------------------------------------------------------------------
proc glm data = dataset ;
class city gender ;
model weight = city | gender / solution ;
run ;
[note: GLM does not provide statistics on influential points or
outliers]
--------------------------------------------------------------------------------
c) Why are multiple comparisons tests (Bonferroni and others) sometimes
used in one-way analysis of variance ?
---> To decrease the probability that you will erroneously conclude that a factor
is a significant predictor when in fact it is one of several factors
that were considered, and the chance that one of them appears to be
significant exceeds the probability that it alone is significant
if the null hypothesis is true (i.e., actually none of the factors
is related to the outcome).
==================================================================================
6. The following datafile is for a case-control study of skin-cancer incidence
versus type of hat worn by farmers:
Case-Control Case or
Set # Control? Type of Hat Eye Color
------------ -------- ----------- ---------
1 case Strap Brown
1 control Cowboy Blue
1 control Strap Brown
2 case Strap Blue
2 control Cowboy Blue
2 control Cowboy Brown
3 case Cowboy Blue
3 control Cowboy Brown
3 control Strap Blue
[more observations]
Write a program which reads in the data and carries out an appropriate
case-control analysis to determine whether the type of hat is related
to the incidence of skin cancer, controlling for eye color.
--------------------------------------------------------------------------------
data casecon ;
input ccset status hat eyecolor ;
nstatus = . ;
if status eq 'case' then nstatus = 1 ;
if status eq 'control' then nstatus = 0 ;
nhat = . ;
if hat = 'Strap' then nhat = 1 ;
if hat = 'Cowboy' then nhat = 2 ;
ncolor = . ;
if eyecolor = 'Brown' then ncolor = 1 ;
if eyecolor = 'Blue' then ncolor = 2 ;
time = 1 - nstatus ;
cards ;
1 case Strap Brown
1 control Cowboy Blue
1 control Strap Brown
2 case Strap Blue
2 control Cowboy Blue
2 control Cowboy Brown
3 case Cowboy Blue
3 control Cowboy Brown
3 control Strap Blue
[more observations]
;
run ;
proc phreg data = casecon ;
model time * nstatus(0) = hat eyecolor / rl ;
run ;
==================================================================================
7. You conduct a random survey of people living in Minneapolis, MN,
recording their weight, gender, age, and carbohydrate intake (in
calories). You want to construct a table like the following:
Carbohydrate Intake(g) Weight
Carbohydrate --------------------- Percent ------------
Quintile Mean Min Max Mean Age Women Mean StdDev
------------ ---- ----- ----- -------- -------- ---- ------
1 201 50 300 35 67 56 20
2 233 302 390 34 70 52 14
[etc.]
How do you do this ?
--------------------------------------------------------------------------------
proc rank data = dataset group = 5 out = outrank ;
var carbohyd ;
ranks carbquin ;
run ;
proc means data = outrank n mean std min max ;
class carbquin ;
var carbohyd age gender weight ;
title 'Mean carbo intake, age, gender, and weight by carbo quintile.' ;
==================================================================================
8. The following is a summary of data from an eye study. Each person
in the study had one eye randomized to treatment and one eye randomized
to control. Each eye was evaluated as having a successful outcome or
a failure outcome.
Number of
Outcome Category People
---------------------------------------- ----------
Treated eye success, control eye success 100
Treated eye success, control eye failure 50
Treated eye failure, control eye success 30
Treated eye failure, control eye failure 80
Write a SAS program which reads in this data, and which carries out an
appropriate analysis using (1) PROC FREQ, and (2) PROC LOGISTIC.
--------------------------------------------------------------------------------
PROC FREQ approach:
data eyes ;
input trtsucc contsucc count ;
cards ;
1 1 100
1 0 50
0 1 30
0 0 80
;
run ;
proc freq data = eyes ;
weight count ;
tables trtsucc * contsucc / agree ;
run ;
--------------------------------------------------------------------------------
PROC LOGISTIC approach
data eyes ;
input trtsucc contsucc count ;
one = 1 ;
diff = trtsucc - contsucc ;
do i = 1 to count ;
output ;
end ;
cards ;
1 1 100
1 0 50
0 1 30
0 0 80
;
run ;
proc logistic data = eyes ;
model one = diff / noint ;
run ;
--------------------------------------------------------------------------------
In PROC FREQ, what statistic tells you the relative effect of treatment
versus control ?
---> Odds ratio
In PROC LOGISTIC, what is the relationship of coefficient estimates
to the relative effect of treatment versus control ?
---> exp(coeff) = Odds ratio
Show how this same data can be analyzed using PROC PHREG.
--------------------------------------------------------------------------------
PROC PHREG approach
options linesize = 80 ;
data eyes ;
retain set 0 ;
input trtsucc contsucc count ;
one = 1 ;
do i = 1 to count ;
set = set + 1 ;
group = 1 ;
time = 1 - trtsucc ;
case = trtsucc ;
output ;
group = 0 ;
time = 1 - contsucc ;
case = contsucc ;
output ;
end ;
cards ;
1 1 100
1 0 50
0 1 30
0 0 80
;
run ;
proc print data = eyes ;
proc phreg data = eyes ;
*** model time * case(0) = group / ties = discrete ;
strata set ;
run ;
*** Note: the 'ties = discrete' option was omitted in the example in
notes n54703.017. It is essential in this procedure.
==================================================================================
9. The following dataset shows followup time and event or censoring status
for a clinical trial of Drug 1 versus Drug 2:
Event or
Obs Drug Followup Time Censored? Age
----- ------ --------------- ----------- -----
1 1 73 C 24
2 1 45 E 77
3 2 12 E 40
4 1 56 C 29
5 2 65 E 64
[more observations]
Write a program to analyze this data using PROC LIFETEST (Ignore effect
of age).
--------------------------------------------------------------------------------
data drugs ;
infile 'drugfile' ;
input drug follmons censored age ;
ncensor = 1 ;
if censored = 'E' then ncensor = 0 ;
run ;
proc lifetest data = drugs notable outsurv = surcurve ;
time follmons * ncensor(0);
strata drug ;
run ;
--------------------------------------------------------------------------------
What plots might you want to examine ?
---> Survival, log(survival), hazard.
How do you decide which drug is better ?
---> Look at the two survival curves (drug = 1 vs drug = 2). The curve
with the higher survival rate corresponds to the better drug.
Write a program to analyze this data, estimating the effect of drug
while controlling for age.
--------------------------------------------------------------------------------
proc phreg data = drugs ;
model follmons * ncensor(0) = age drug ;
run ;
==================================================================================
10. Write SAS code to find p-values corresponding to the following statistics:
T-statistic, two-sided, df = 25, t = -1.96
Chi-square statistic, df = 10, X2 = 25.56
F-statistic, df = (5, 100), F = 15.11
Z-statistic, one-sided, z = 2.16.
--------------------------------------------------------------------------------
tprob = 2 * probt(-1.96, 25) ;
cprob = 1 - probchi(25.56, 10) ;
fprob = 1 - probf(15.11, 5, 100) ;
zprob = 1 - probnorm(2.16) ;
==================================================================================
11. Random number generators:
a) Write a SAS program which randomly chooses a day in the year 2004.
--------------------------------------------------------------------------------
data day ;
seed = -1 ;
rday = 1 + int(366*ranuni(seed)) ;
dec3103 = mdy(12, 31, 2003) ;
randday = rday + dec3103 ;
output ;
run ;
--------------------------------------------------------------------------------
b) Write a SAS program which randomly assigns 200 people to three drug
groups, with the following probabilities:
Group 1: prob = .36
Group 2: prob = .24
Group 3: prob = .40
--------------------------------------------------------------------------------
data drugs ;
p1 = .36 ;
p2 = .24 ;
p3 = .40 ;
p12 = p1 + p2 ;
n = 200 ;
do i = 1 to n ;
r = ranuni(-1) ;
group = 1 ;
if r > p1 then group = 2 ;
if r > p12 then group = 3 ;
output ;
end ;
run ;
--------------------------------------------------------------------------------
c) Write a SAS program which randomly generates 100 numbers from a normal
distribution with mean 5 and standard deviation 2.
--------------------------------------------------------------------------------
data x100 ;
do i = 1 to 100 ;
x = 5 + 2 * rannor(-1) ;
output ;
end ;
run ;
--------------------------------------------------------------------------------
d) Write a SAS program to generate a randomization schedule with 300
people, with exactly 100 of them assigned to Drug A and 200 of them
assigned to Drug B.
--------------------------------------------------------------------------------
data randschd ;
p = .3333333333 ;
seed = 12985634 ;
do i = 1 to 300 ;
r = ranuni(seed) ;
assign = 'A' ;
if i gt 100 then assign = 'B' ;
output ;
end ;
run ;
proc sort data = randschd ; by r ;
run ;
--------------------------------------------------------------------------------
e) If you randomize in blocks to guarantee approximate balance between
treatment groups at any point in the randomization schedule, what
desirable characteristic of the schedule is impaired ?
---> Unpredictability.
==================================================================================
12. A datafile has 100 observations for husband-wife pairs as follows:
Family Husband's Husband's Husband's Wife's Wife's Wife's
Income Age DBP Insomnia Age DBP Insomnia
------- --------- --------- --------- ------ ------ --------
$35000 28 88 0 25 70 1
$29300 53 100 1 54 94 1
$67250 41 76 1 41 98 0
[more observations]
a) Write a SAS program to read the datafile and write out a new
file which has one observation for the husband and one for the
wife.
--------------------------------------------------------------------------------
data insomnia ;
infile 'insomnia.data' ;
input income husbage husbdbp husbinsm
wifeage wifedbp wifeinsm ;
husb = 1 ;
age = husbage ;
dbp = husbdbp ;
insom = husbinsm ;
output ;
husb = 0 ;
age = wifeage ;
dbp = wifedbp ;
insom = wifeinsm ;
output ;
run ;
--------------------------------------------------------------------------------
b) The problem is to determine whether DBP predicts insomnia. You
would want to control for income and age, and you want to analyze
the data as paired. Write the appropriate SAS code.
---> It makes no sense to analyze this as paired data. It would make sense
if the predictor of interest were gender.
Most recent update: May 8, 2004.