April 28, 2004 1. A datafile for a program is on an external file with the following address: /home/gnome/bobw/data/datafil1 Data on the file looks like the following: ID Group Date Randomized Gender Birthdate Cholesterol ------- ------ ----------------- ------- ----------- ------------- A-003-4 No-Drug 2004/07/04 M 1976/11/08 230 A-005-8 Drug-A 2004/08/14 F 1970/02/28 -99 A-011-3 Drug-B 2004/10/11 M . 208 A-013-1 Drug-A 2004/10/18 M 1975/04/30 244 A-022-5 No-Drug 2005/03/12 F 1969/05/21 282 A-105-3 Drug-A 2003/12/21 -99 . 180 [more observations] Write a program which reads this datafile and computes the person's age in years. Note that values of gender and cholesterol which are missing are denoted by '-99'. Missing dates are denoted by '.' . Write a procedure to test the hypothesis that the serum cholesterol level is not related to drug assignment (controlling for age and gender). Answers: ---------------------------------------------------------------------------------- options linesize = 80 ; footnote "/home/gnome/john-c/5421/review1.sas &sysdate &systime" ; data chol ; infile 'prob1.data' ; length id $7 group $7 gender $3 ; input id group @23 daterand yymmdd10. gender @49 birtdate yymmdd10. chol ; if gender eq '-99' then gender = '' ; if chol le '-99' then chol = '.' ; run ; proc print data = chol ; run ; proc glm data = chol ; class group gender ; model chol = age gender group / solution ; title 'Analysis of chol vs. drug group, controlling for age, gender' ; run ; ---------------------------------------------------------------------------------- proc glm data = chol ; class group gender ; model chol = age gender group / solution ; title 'Analysis of chol vs. drug group, controlling for age, gender' ; run ; ================================================================================== 2. Dr. Smith performs a regression of diastolic blood pressure on dietary intake of sodium. He plots the residuals and obtains the following graph: x x x x 15 | x x | 10 | x | x x x 5 | x x x x | x x 0 ---------------x-------x----------------------------------------------- | x x x x x -5 | x x | x -10 | x x | x x x -15 | x _______________________________________________________________________ 100 200 300 400 500 600 700 800 1000 Sodium intake (mg/day) a) What is the mean residual ? Zero. b) If you regress the residuals on sodium intake, what will be the result ? ---> 0 intercept, 0 slope (or coefficient for sodium) where-ever it occurs in the analysis section. c) How can you use the residuals to test for outliers ? ---> Output the residuals from proc reg onto a SAS dataset, then use proc univariate on them. d) What does the above pattern of residuals suggest regarding the error structure of the relationship between DBP and sodium intake? ---> Suggests that the variance of the measurements is not constant with respect to sodium intake, and presumably also not constant with respect to the response variable, chol. However you would need to confirm the latter with a graph. If that turns out to be true, then you should consider a transformation. One possible transformation is: transchol = sgn(chol - 80)*sqrt(abs(chol - 80)) ; ================================================================================== 3. The dataset below shows dose-levels of toxic chemical TCHEM, number of mice exposed to that dose-level (NMICE), and number of mice that died (NDIED): TCHEM NMICE NDIED ------- ------- ------- 0 16 1 1 20 3 2 21 4 4 18 8 8 22 14 16 23 16 32 19 19 It is believe that the probability of death is related to the log of the dose of TCHEM. a) Write a program to analyze these data. b) How would you use output from a program to estimate the dose which would be expected to result in the death of 50% of the mice ? -------------------------------------------------------------------------------- a) data toxic ; input dose nmice ndied ; logdose = log(dose + 1) ; nlived = nmice - ndied ; do i = 1 to nlived ; died = 0 ; output ; end ; do j = 1 to ndied ; died = 1 ; output ; end ; run ; proc logistic descending data = toxic ; model died = logdose ; run ; -------------------------------------------------------------------------------- b) Let b0 = estimate of intercept, b1 = estimate of logdose1 from the previous regression. You want to find dose such that .5 = 1 / (1 + exp(-b0 - b1*logdose1)) This is simply a matter of solving this equation for logdose1: 2 = 1 + exp(-b0 - b1*logdose1), or 1 = exp(-b0 - b1*logdose1). Take logs of both sides. Note that log(1) = 0. Therefore 0 = -b0 - b1*logdose1 , or logdose1 = -b0/b1. This means that log(dose + 1) = -b0/b1, or dose + 1 = exp(-b0/b1), or dose = exp(-b0/b1) - 1. ================================================================================== 4. The tables below show results of an experiment by a veterinary researcher on the calming effects of a drug on dogs and cats. The objectives are to see whether dogs and cats have similar reactions to the drug, and whether the drug has a calming effect in either dogs or cats or both animals: Dogs Cats --------------------- --------------------- Drug Placebo Drug Placebo --------------------- --------------------- | | | | | | Agitated | 31 | 43 | | 16 | 50 | | | | | | | --------------------- --------------------- | | | | | | Calm | 69 | 57 | | 84 | 50 | | | | | | | --------------------- --------------------- 100 100 100 100 OR = OR = a) Write a complete SAS program to read in these data. data dogscats ; input animal drug response count ; cards ; 1 1 1 31 1 1 0 69 1 0 1 43 1 0 0 57 0 1 1 16 0 1 0 84 0 0 1 50 0 0 0 50 ; run ; b) Write a PROC FREQ procedure to carry out an appropriate analysis, and discuss how you would use the output. ---> proc freq data = dogscats ; weight count ; tables animal * response * drug / cmh measures ; run ; c) Write a PROC LOGISTIC procedures to carry out an appropriate analysis, and again discuss how you would use the output. ---> 1. You need to re-write the datafile so that there is one observation per animal. 2. proc logistic descending data = dogscats ; class animal response drug ; model response = animal drug animal*drug / clodds = pl ; run ; ================================================================================== 5. a) Why are CLASS variables useful in PROC GLM ? ---> PROC GLM automatically creates indicator variables from the CLASS variables, and correctly codes interactions if interactions are specified. b) Suppose CITY is a categorical variable with 5 possible values, and GENDER is a categorical variable with 2 values, and WEIGHT is a continuous (quantitative) variable. You want to analyze a collection of 100 observations, studying the relationship of WEIGHT to CITY and GENDER. You want to test for influential observations and outliers. Finally, you want to test for whether there is an interaction of CITY and GENDER. Describe how you carry out the analysis using PROC REG. -------------------------------------------------------------------------------- 1. Create indicator variables for each of the cities: e.g., if one of the cities is St. Paul, create a variable called STPAUL which equals 0 if CITY is not equal to StPaul, and which equals 1 if CITY equals St. Paul. Similarly create an indicator variable for GENDER. 2. proc reg data = dataset ; model WEIGHT = STPAUL CHICAGO NEWYORK BEIJING TOKYO GENDER ; output out = regout rstudent = wstudent dffits = wdffits ; run ; proc print data = regout ; var weight wstudent wdffits ; run ; -------------------------------------------------------------------------------- Write PROC GLM code to analyze the same data. -------------------------------------------------------------------------------- proc glm data = dataset ; class city gender ; model weight = city | gender / solution ; run ; [note: GLM does not provide statistics on influential points or outliers] -------------------------------------------------------------------------------- c) Why are multiple comparisons tests (Bonferroni and others) sometimes used in one-way analysis of variance ? ---> To decrease the probability that you will erroneously conclude that a factor is a significant predictor when in fact it is one of several factors that were considered, and the chance that one of them appears to be significant exceeds the probability that it alone is significant if the null hypothesis is true (i.e., actually none of the factors is related to the outcome). ================================================================================== 6. The following datafile is for a case-control study of skin-cancer incidence versus type of hat worn by farmers: Case-Control Case or Set # Control? Type of Hat Eye Color ------------ -------- ----------- --------- 1 case Strap Brown 1 control Cowboy Blue 1 control Strap Brown 2 case Strap Blue 2 control Cowboy Blue 2 control Cowboy Brown 3 case Cowboy Blue 3 control Cowboy Brown 3 control Strap Blue [more observations] Write a program which reads in the data and carries out an appropriate case-control analysis to determine whether the type of hat is related to the incidence of skin cancer, controlling for eye color. -------------------------------------------------------------------------------- data casecon ; input ccset status hat eyecolor ; nstatus = . ; if status eq 'case' then nstatus = 1 ; if status eq 'control' then nstatus = 0 ; nhat = . ; if hat = 'Strap' then nhat = 1 ; if hat = 'Cowboy' then nhat = 2 ; ncolor = . ; if eyecolor = 'Brown' then ncolor = 1 ; if eyecolor = 'Blue' then ncolor = 2 ; time = 1 - nstatus ; cards ; 1 case Strap Brown 1 control Cowboy Blue 1 control Strap Brown 2 case Strap Blue 2 control Cowboy Blue 2 control Cowboy Brown 3 case Cowboy Blue 3 control Cowboy Brown 3 control Strap Blue [more observations] ; run ; proc phreg data = casecon ; model time * nstatus(0) = hat eyecolor / rl ; run ; ================================================================================== 7. You conduct a random survey of people living in Minneapolis, MN, recording their weight, gender, age, and carbohydrate intake (in calories). You want to construct a table like the following: Carbohydrate Intake(g) Weight Carbohydrate --------------------- Percent ------------ Quintile Mean Min Max Mean Age Women Mean StdDev ------------ ---- ----- ----- -------- -------- ---- ------ 1 201 50 300 35 67 56 20 2 233 302 390 34 70 52 14 [etc.] How do you do this ? -------------------------------------------------------------------------------- proc rank data = dataset group = 5 out = outrank ; var carbohyd ; ranks carbquin ; run ; proc means data = outrank n mean std min max ; class carbquin ; var carbohyd age gender weight ; title 'Mean carbo intake, age, gender, and weight by carbo quintile.' ; ================================================================================== 8. The following is a summary of data from an eye study. Each person in the study had one eye randomized to treatment and one eye randomized to control. Each eye was evaluated as having a successful outcome or a failure outcome. Number of Outcome Category People ---------------------------------------- ---------- Treated eye success, control eye success 100 Treated eye success, control eye failure 50 Treated eye failure, control eye success 30 Treated eye failure, control eye failure 80 Write a SAS program which reads in this data, and which carries out an appropriate analysis using (1) PROC FREQ, and (2) PROC LOGISTIC. -------------------------------------------------------------------------------- PROC FREQ approach: data eyes ; input trtsucc contsucc count ; cards ; 1 1 100 1 0 50 0 1 30 0 0 80 ; run ; proc freq data = eyes ; weight count ; tables trtsucc * contsucc / agree ; run ; -------------------------------------------------------------------------------- PROC LOGISTIC approach data eyes ; input trtsucc contsucc count ; one = 1 ; diff = trtsucc - contsucc ; do i = 1 to count ; output ; end ; cards ; 1 1 100 1 0 50 0 1 30 0 0 80 ; run ; proc logistic data = eyes ; model one = diff / noint ; run ; -------------------------------------------------------------------------------- In PROC FREQ, what statistic tells you the relative effect of treatment versus control ? ---> Odds ratio In PROC LOGISTIC, what is the relationship of coefficient estimates to the relative effect of treatment versus control ? ---> exp(coeff) = Odds ratio Show how this same data can be analyzed using PROC PHREG. -------------------------------------------------------------------------------- PROC PHREG approach options linesize = 80 ; data eyes ; retain set 0 ; input trtsucc contsucc count ; one = 1 ; do i = 1 to count ; set = set + 1 ; group = 1 ; time = 1 - trtsucc ; case = trtsucc ; output ; group = 0 ; time = 1 - contsucc ; case = contsucc ; output ; end ; cards ; 1 1 100 1 0 50 0 1 30 0 0 80 ; run ; proc print data = eyes ; proc phreg data = eyes ; *** model time * case(0) = group / ties = discrete ; strata set ; run ; *** Note: the 'ties = discrete' option was omitted in the example in notes n54703.017. It is essential in this procedure. ================================================================================== 9. The following dataset shows followup time and event or censoring status for a clinical trial of Drug 1 versus Drug 2: Event or Obs Drug Followup Time Censored? Age ----- ------ --------------- ----------- ----- 1 1 73 C 24 2 1 45 E 77 3 2 12 E 40 4 1 56 C 29 5 2 65 E 64 [more observations] Write a program to analyze this data using PROC LIFETEST (Ignore effect of age). -------------------------------------------------------------------------------- data drugs ; infile 'drugfile' ; input drug follmons censored age ; ncensor = 1 ; if censored = 'E' then ncensor = 0 ; run ; proc lifetest data = drugs notable outsurv = surcurve ; time follmons * ncensor(0); strata drug ; run ; -------------------------------------------------------------------------------- What plots might you want to examine ? ---> Survival, log(survival), hazard. How do you decide which drug is better ? ---> Look at the two survival curves (drug = 1 vs drug = 2). The curve with the higher survival rate corresponds to the better drug. Write a program to analyze this data, estimating the effect of drug while controlling for age. -------------------------------------------------------------------------------- proc phreg data = drugs ; model follmons * ncensor(0) = age drug ; run ; ================================================================================== 10. Write SAS code to find p-values corresponding to the following statistics: T-statistic, two-sided, df = 25, t = -1.96 Chi-square statistic, df = 10, X2 = 25.56 F-statistic, df = (5, 100), F = 15.11 Z-statistic, one-sided, z = 2.16. -------------------------------------------------------------------------------- tprob = 2 * probt(-1.96, 25) ; cprob = 1 - probchi(25.56, 10) ; fprob = 1 - probf(15.11, 5, 100) ; zprob = 1 - probnorm(2.16) ; ================================================================================== 11. Random number generators: a) Write a SAS program which randomly chooses a day in the year 2004. -------------------------------------------------------------------------------- data day ; seed = -1 ; rday = 1 + int(366*ranuni(seed)) ; dec3103 = mdy(12, 31, 2003) ; randday = rday + dec3103 ; output ; run ; -------------------------------------------------------------------------------- b) Write a SAS program which randomly assigns 200 people to three drug groups, with the following probabilities: Group 1: prob = .36 Group 2: prob = .24 Group 3: prob = .40 -------------------------------------------------------------------------------- data drugs ; p1 = .36 ; p2 = .24 ; p3 = .40 ; p12 = p1 + p2 ; n = 200 ; do i = 1 to n ; r = ranuni(-1) ; group = 1 ; if r > p1 then group = 2 ; if r > p12 then group = 3 ; output ; end ; run ; -------------------------------------------------------------------------------- c) Write a SAS program which randomly generates 100 numbers from a normal distribution with mean 5 and standard deviation 2. -------------------------------------------------------------------------------- data x100 ; do i = 1 to 100 ; x = 5 + 2 * rannor(-1) ; output ; end ; run ; -------------------------------------------------------------------------------- d) Write a SAS program to generate a randomization schedule with 300 people, with exactly 100 of them assigned to Drug A and 200 of them assigned to Drug B. -------------------------------------------------------------------------------- data randschd ; p = .3333333333 ; seed = 12985634 ; do i = 1 to 300 ; r = ranuni(seed) ; assign = 'A' ; if i gt 100 then assign = 'B' ; output ; end ; run ; proc sort data = randschd ; by r ; run ; -------------------------------------------------------------------------------- e) If you randomize in blocks to guarantee approximate balance between treatment groups at any point in the randomization schedule, what desirable characteristic of the schedule is impaired ? ---> Unpredictability. ================================================================================== 12. A datafile has 100 observations for husband-wife pairs as follows: Family Husband's Husband's Husband's Wife's Wife's Wife's Income Age DBP Insomnia Age DBP Insomnia ------- --------- --------- --------- ------ ------ -------- $35000 28 88 0 25 70 1 $29300 53 100 1 54 94 1 $67250 41 76 1 41 98 0 [more observations] a) Write a SAS program to read the datafile and write out a new file which has one observation for the husband and one for the wife. -------------------------------------------------------------------------------- data insomnia ; infile 'insomnia.data' ; input income husbage husbdbp husbinsm wifeage wifedbp wifeinsm ; husb = 1 ; age = husbage ; dbp = husbdbp ; insom = husbinsm ; output ; husb = 0 ; age = wifeage ; dbp = wifedbp ; insom = wifeinsm ; output ; run ; -------------------------------------------------------------------------------- b) The problem is to determine whether DBP predicts insomnia. You would want to control for income and age, and you want to analyze the data as paired. Write the appropriate SAS code. ---> It makes no sense to analyze this as paired data. It would make sense if the predictor of interest were gender.Most recent update: May 8, 2004.